API Reference
Extract Content
Extract clean content from HTML:
POST /v1/extract
Request Headers
Authorization: Bearer ce_your_token_here
Content-Type: application/json
Request Body
{
"html": "<!doctype html><html>...</html>", // Required: Raw HTML (max 2MB)
"url": "https://example.com/article", // Optional: URL for labeling
"contentTypeHint": "article", // Optional: "article" | "blog" | "news"
"languageHint": "en" // Optional: Language code
}
Response
{
"title": "Article Title", // Extracted title
"content": "Clean article content...", // Main content as plain text
"excerpt": "Brief summary", // Optional: Article excerpt
"author": "Author Name", // Optional: Author name
"publishedAt": "2024-01-01T00:00:00Z", // Optional: Publication date (ISO 8601)
"language": "en" // Optional: Detected language code
}
What "Main Content Extraction" Means
Main content extraction identifies and returns only the primary article or post content, removing:
- Navigation menus and headers
- Footers and sidebars
- Advertisements and promotional content
- Cookie notices and consent banners
- Share buttons and social widgets
- Related posts and recommendations
- Comment sections
The result is clean, focused content ready for LLM processing, indexing, summarization, or other content workflows.
Input Requirements
- Raw HTML: Send complete HTML source, not just fragments
- Include <title>: Helps with title extraction
- Avoid truncated DOM: Send full HTML for best results
- Max size: 2MB per request
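The requirements above can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: the host in BASE_URL is hypothetical (this reference does not name one), the helper names check_html and build_extract_request are illustrative, and it assumes the 2MB limit is measured in UTF-8 encoded bytes. Only the size limit is a stated hard requirement; the other checks are heuristics for the "complete HTML" advice.

```python
import json
import urllib.request
from typing import List, Optional

# Assumption: "2MB" means 2 * 1024 * 1024 bytes of UTF-8 encoded HTML.
MAX_HTML_BYTES = 2 * 1024 * 1024
# Hypothetical host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def check_html(html: str) -> List[str]:
    """Return a list of likely problems; an empty list means nothing was flagged."""
    problems = []
    if len(html.encode("utf-8")) > MAX_HTML_BYTES:
        problems.append("html exceeds 2MB")
    lowered = html.lower()
    if "<title" not in lowered:
        problems.append("missing <title>")
    if "</html>" not in lowered:
        # Heuristic only: a missing closing tag often means the DOM was cut off.
        problems.append("possibly truncated: no closing </html>")
    return problems


def build_extract_request(
    html: str, token: str, url: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/extract request."""
    body = {"html": html}
    if url is not None:
        body["url"] = url  # optional field, used for labeling
    return urllib.request.Request(
        BASE_URL + "/v1/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request object can then be passed to urllib.request.urlopen once check_html reports no size problem. Keeping validation separate from request construction lets a caller decide whether a missing title should block the call or only be logged.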
Using the Result
Once you receive the JSON response, you can:
- Store: Save title and content to your database
- Summarize: Send content to an LLM for summarization
- Embed: Create embeddings for RAG systems
- Index: Add to search indexes
- Process: Use in content pipelines and workflows
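Before storing or forwarding a result, it can help to parse the response into a typed structure so the optional fields are handled in one place. This is a sketch under assumptions: ExtractResult and parse_extract_response are illustrative names, and it treats title and content as always present because the response example marks only the other fields as optional.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractResult:
    title: str
    content: str
    # Fields marked "Optional" in the response example may be absent.
    excerpt: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[str] = None  # ISO 8601 string, left unparsed here
    language: Optional[str] = None


def parse_extract_response(raw: str) -> ExtractResult:
    """Parse the JSON response body into an ExtractResult."""
    data = json.loads(raw)
    return ExtractResult(
        title=data["title"],      # assumed always present
        content=data["content"],  # assumed always present
        excerpt=data.get("excerpt"),
        author=data.get("author"),
        published_at=data.get("publishedAt"),
        language=data.get("language"),
    )
```

For example, parse_extract_response(body).content yields the plain-text article that would be passed to a summarizer, an embedding step, or a search index, while missing optional fields arrive as None instead of raising a KeyError.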