What you get
Send raw HTML and receive clean, structured JSON with extracted content:
{
  "title": "Article Title",
  "content": "Clean article content as plain text...",
  "excerpt": "Brief summary",
  "author": "Author Name",
  "publishedAt": "2024-01-01T00:00:00Z",
  "language": "en"
}
title
Extracted article title
content
Clean plain text of the main content only
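As a minimal sketch, the sample response above could be loaded into a typed object before further processing. The class name `ExtractedArticle` and the helper `parse_article` are hypothetical, and treating every field except `title` and `content` as optional is an assumption, since the page does not say which fields are always present.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedArticle:
    """Hypothetical container mirroring the fields of the sample response."""
    title: str
    content: str
    excerpt: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[str] = None  # "publishedAt" in the JSON
    language: Optional[str] = None


def parse_article(raw_json: str) -> ExtractedArticle:
    """Parse a JSON response body into an ExtractedArticle.

    Assumption: fields other than title and content may be missing,
    so they default to None instead of raising.
    """
    data = json.loads(raw_json)
    return ExtractedArticle(
        title=data.get("title", ""),
        content=data.get("content", ""),
        excerpt=data.get("excerpt"),
        author=data.get("author"),
        published_at=data.get("publishedAt"),
        language=data.get("language"),
    )


# The sample response shown above, used here as offline test data.
sample = """{
  "title": "Article Title",
  "content": "Clean article content as plain text...",
  "excerpt": "Brief summary",
  "author": "Author Name",
  "publishedAt": "2024-01-01T00:00:00Z",
  "language": "en"
}"""

article = parse_article(sample)
print(article.title, article.language)
```

Keeping the mapping in one place means a renamed or missing field only has to be handled once.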
Why this exists
LLM/RAG pipelines need clean text
Language models and RAG systems work best with clean, focused content without navigation, ads, or boilerplate.
DOM contains noise
Raw HTML includes navigation menus, footers, sidebars, ads, and other elements that aren't part of the main content.
Reliable extraction for many pages
This API extracts main content reliably for many pages using enhanced Readability algorithms and AI processing. Results vary by site structure.
How it works
Send raw HTML
POST raw HTML to the /v1/extract endpoint, passing your API token as a Bearer token in the Authorization header
We extract main content
Our API extracts the main article content and title, removing navigation and boilerplate
Receive JSON
Get structured JSON with title and clean content ready for processing
Code examples
cURL
curl -X POST https://api.content-extractor.devstools.net/v1/extract \
  -H "Authorization: Bearer ce_your_token_here" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<!doctype html><html><head><title>Article</title></head><body><article><h1>Title</h1><p>Content</p></article></body></html>"
  }'
Node.js
const response = await fetch('https://api.content-extractor.devstools.net/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ce_your_token_here',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    html: '<html><body><article><h1>Title</h1><p>Content</p></article></body></html>'
  })
});
const data = await response.json();
console.log(data.title, data.content);
Python
import requests

response = requests.post(
    'https://api.content-extractor.devstools.net/v1/extract',
    headers={
        'Authorization': 'Bearer ce_your_token_here',
        'Content-Type': 'application/json'
    },
    json={
        'html': '<html><body><article><h1>Title</h1><p>Content</p></article></body></html>'
    }
)
data = response.json()
print(data['title'], data['content'])
Use cases
RAG / Indexing
Clean content for retrieval-augmented generation and search indexing
Summarization Pipelines
Extract clean content for AI summarization workflows
Newsletters / Republishing
Extract content for newsletters and compliant content republishing
Market Monitoring
Monitor competitor content and market trends (compliant usage)
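For the RAG / Indexing use case, the extracted `content` string is usually split into chunks before embedding or indexing. The sketch below shows one simple approach, fixed-size character windows with overlap. The function name `chunk_text` and the default sizes are assumptions for illustration, not part of the API.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper: the sizes are illustrative and should be tuned
    to the embedding model or index in use.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    chunks: List[str] = []
    step = max_chars - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


# Stand-in for the "content" field of an extraction response.
content = "Clean article content as plain text. " * 40
chunks = chunk_text(content, max_chars=500, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.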
Frequently asked questions
What HTML should I send?
Send the complete HTML source of the page containing the article. Include the full <html> document with <head> and <body> sections.
Does it work with all websites?
The API works well with many article-based websites, but extraction quality varies by site structure. We recommend testing with your target sites.
What does the API return?
The API returns JSON with the extracted title and the clean article content as plain text, with navigation, ads, and boilerplate removed. The sample response above also includes excerpt, author, publishedAt, and language fields.
How accurate is the extraction?
Extraction accuracy depends on the HTML structure. Well-structured articles with semantic HTML typically yield better results.
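Because extraction quality varies by site, it can help to run a quick sanity check on each response while testing target sites. The sketch below is one possible heuristic, not something the API provides: the function name `looks_usable` and the 200-character threshold are assumptions to adjust per use case.

```python
def looks_usable(result: dict, min_chars: int = 200) -> bool:
    """Return True if a response has a title and a reasonably long body.

    Hypothetical heuristic: a missing title or very short content often
    signals that extraction did not find the main article.
    """
    title = (result.get("title") or "").strip()
    content = (result.get("content") or "").strip()
    return bool(title) and len(content) >= min_chars


# Example responses, written by hand for illustration.
good = {"title": "Article Title", "content": "Body text. " * 50}
empty = {"title": "", "content": ""}

print(looks_usable(good), looks_usable(empty))
```

Pages that fail the check can be set aside for manual review instead of being passed on to indexing or summarization.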
Compliance & Usage
Use only on content you have rights to process. Respect site terms and robots policies where applicable. This service extracts main content from HTML and does not guarantee extraction for every site.
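If you fetch pages yourself before sending their HTML, one way to respect robots policies is to check the site's robots.txt first. The sketch below uses Python's standard `urllib.robotparser` on a robots.txt body supplied as a string, so it runs offline. The helper name `is_allowed`, the user agent `my-crawler`, and the example rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules given as text.

    Hypothetical helper: in practice the robots.txt body would be
    downloaded from the target site before calling this.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules, written by hand for illustration.
robots = """User-agent: *
Disallow: /private/
"""

print(is_allowed(robots, "my-crawler", "https://example.com/blog/post"))
print(is_allowed(robots, "my-crawler", "https://example.com/private/page"))
```

A robots.txt check covers only crawling rules. Site terms and content rights still need to be reviewed separately.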