Extract clean article content from raw HTML

HTML → JSON (title + main content). Remove nav, menus, ads, and clutter.

What you get

Send raw HTML and receive clean, structured JSON with extracted content:

{
  "title": "Article Title",
  "content": "Clean article content as plain text...",
  "excerpt": "Brief summary",
  "author": "Author Name",
  "publishedAt": "2024-01-01T00:00:00Z",
  "language": "en"
}

title

Extracted article title

content

Clean plain text of the main content only
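For typed access to the response, the JSON above can be mapped onto a small data class. This is a minimal sketch, not part of the API: the names `ExtractedArticle` and `parse_article` are hypothetical, and it assumes only `title` and `content` are always present, treating the other fields from the example response as optional.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedArticle:
    """Hypothetical container mirroring the example response fields."""
    title: str
    content: str
    # Assumption: these fields may be absent for some pages.
    excerpt: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[str] = None
    language: Optional[str] = None


def parse_article(raw_json: str) -> ExtractedArticle:
    """Parse a response body into an ExtractedArticle."""
    data = json.loads(raw_json)
    return ExtractedArticle(
        title=data["title"],      # KeyError here means the response was unexpected
        content=data["content"],
        excerpt=data.get("excerpt"),
        author=data.get("author"),
        published_at=data.get("publishedAt"),
        language=data.get("language"),
    )
```

Keeping `publishedAt` as a string avoids guessing at the timestamp format; convert it to a `datetime` only once you have confirmed what your target pages return.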

Why this exists

• LLM/RAG pipelines need clean text

Language models and RAG systems work best with clean, focused content without navigation, ads, or boilerplate.

• DOM contains noise

Raw HTML includes navigation menus, footers, sidebars, ads, and other elements that aren't part of the main content.

• Reliable extraction for many pages

This API extracts main content reliably for many pages using enhanced Readability algorithms and AI processing. Results vary by site structure.

How it works

Send raw HTML

POST raw HTML to our API endpoint with your API token

We extract main content

Our API extracts main article content + title, removing navigation and boilerplate

Receive JSON

Get structured JSON with title and clean content ready for processing

Code examples

cURL

curl -X POST https://api.content-extractor.devstools.net/v1/extract \
  -H "Authorization: Bearer ce_your_token_here" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<!doctype html><html><head><title>Article</title></head><body><article><h1>Title</h1><p>Content</p></article></body></html>"
  }'

Node.js

const response = await fetch('https://api.content-extractor.devstools.net/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ce_your_token_here',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    html: '<html><body><article><h1>Title</h1><p>Content</p></article></body></html>'
  })
});

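// Fail fast on HTTP errors before trying to parse the body.
if (!response.ok) {
  throw new Error(`Extraction request failed: HTTP ${response.status}`);
}

const data = await response.json();
console.log(data.title, data.content);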

Python

import requests

response = requests.post(
    'https://api.content-extractor.devstools.net/v1/extract',
    headers={
        'Authorization': 'Bearer ce_your_token_here',
        'Content-Type': 'application/json'
    },
    json={
        'html': '<html><body><article><h1>Title</h1><p>Content</p></article></body></html>'
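    },
    timeout=30  # avoid hanging indefinitely on a slow response
)

response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
data = response.json()
print(data['title'], data['content'])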
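The examples above inline a short HTML string. Real pages contain quotes, newlines, and non-ASCII characters, so it is usually safer to let a JSON library build the request body than to paste HTML into a string by hand. The sketch below shows one way to do that; `build_payload` and `build_payload_from_file` are hypothetical helper names, and the file is assumed to be UTF-8 encoded.

```python
import json
from pathlib import Path


def build_payload(html: str) -> str:
    """Serialize raw HTML into the JSON request body ({"html": ...})."""
    # json.dumps escapes quotes, backslashes, and control characters.
    return json.dumps({"html": html})


def build_payload_from_file(path: str) -> str:
    """Read a saved page from disk and build the request body from it."""
    # Assumption: the saved page is UTF-8; adjust the encoding if it is not.
    html = Path(path).read_text(encoding="utf-8")
    return build_payload(html)
```

The resulting string can be sent as the request body with the same headers as in the examples above.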

Use cases

RAG / Indexing

Clean content for retrieval-augmented generation and search indexing

Summarization Pipelines

Extract clean content for AI summarization workflows

Newsletters / Republishing

Extract content for newsletters and compliant content republishing

Market Monitoring

Monitor competitor content and market trends (compliant usage)
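For the RAG and indexing use case, the extracted `content` is typically split into smaller pieces before embedding. The following is a minimal sketch of fixed-size chunking with overlap. It is not something the API does for you; the function name `chunk_text` and the default sizes are illustrative assumptions.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most max_chars, overlapping by `overlap` chars."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")

    chunks: List[str] = []
    step = max_chars - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

Character-based windows are the simplest option; splitting on paragraph or sentence boundaries usually gives better retrieval quality but needs more logic.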

Simple, transparent pricing

Free tier: 100 requests/day. Pay-as-you-go credits never expire.

Frequently asked questions

What HTML should I send?

Send the complete HTML source of the page containing the article. Include the full <html> document with <head> and <body> sections.

Does it work with all websites?

The API works well with many article-based websites, but extraction quality varies by site structure. We recommend testing with your target sites.

What does the API return?

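The API returns JSON with the extracted title and the clean article content as plain text, plus the metadata fields shown in the example response (excerpt, author, publication date, and language). Navigation, ads, and boilerplate are removed.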

How accurate is the extraction?

Extraction accuracy depends on the HTML structure. Well-structured articles with semantic HTML typically yield better results.
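Because results vary by site, it can help to run a quick sanity check on each response when testing your target sites. The sketch below flags results with an empty title or very short content. The helper name `looks_like_article` and the 200-character threshold are arbitrary assumptions, not values defined by the API.

```python
def looks_like_article(result: dict, min_chars: int = 200) -> bool:
    """Heuristic check that a parsed response contains a usable article."""
    # Treat missing or null fields as empty strings.
    title = (result.get("title") or "").strip()
    content = (result.get("content") or "").strip()
    # Assumption: real articles have a title and a reasonable amount of text.
    return bool(title) and len(content) >= min_chars
```

Run it over a sample of pages from each site you plan to process and review the ones that fail by hand before relying on the output.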

Compliance & Usage

Use only on content you have rights to process. Respect site terms and robots policies where applicable. This service extracts main content from HTML and does not guarantee extraction for every site.
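If you fetch pages yourself before sending them for extraction, one way to respect robots policies is to check the site's robots.txt first. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text you have already downloaded. The helper name `is_allowed` and the user agent string are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything is allowed except paths under /private/.
robots = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(robots, "my-bot", "https://example.com/articles/1"))
print(is_allowed(robots, "my-bot", "https://example.com/private/page"))
```

A robots.txt check covers only crawling rules; site terms and content rights still need to be reviewed separately.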