API Reference

Name: Content Extractor

Extract Content

Extract clean content from HTML:

POST /v1/extract

Request Headers

Authorization: Bearer ce_your_token_here
Content-Type: application/json

Request Body

{
  "html": "<!doctype html><html>...</html>",  // Required: Raw HTML (max 2MB)
  "url": "https://example.com/article",        // Optional: URL for labeling
  "contentTypeHint": "article",                // Optional: "article" | "blog" | "news"
  "languageHint": "en"                         // Optional: Language code
}
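
As an illustration, the request above could be assembled in Python roughly as follows. This is a minimal sketch, not an official client: the base URL https://api.example.com and the function name build_extract_request are assumptions, since this reference gives only the path /v1/extract. The // annotations in the request body above are explanatory only and must not appear in the JSON you send.

```python
import json
import urllib.request

# Hypothetical base URL: the reference documents only the path /v1/extract.
BASE_URL = "https://api.example.com"


def build_extract_request(token, html, url=None,
                          content_type_hint=None, language_hint=None):
    """Build (but do not send) a POST /v1/extract request."""
    payload = {"html": html}  # "html" is the only required field
    # Optional fields are added only when the caller supplies them.
    if url is not None:
        payload["url"] = url
    if content_type_hint is not None:
        payload["contentTypeHint"] = content_type_hint
    if language_hint is not None:
        payload["languageHint"] = language_hint
    return urllib.request.Request(
        BASE_URL + "/v1/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and decode the response body as JSON.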

Response

{
  "title": "Article Title",                    // Extracted title
  "content": "Clean article content...",       // Main content as plain text
  "excerpt": "Brief summary",                  // Optional: Article excerpt
  "author": "Author Name",                     // Optional: Author name
  "publishedAt": "2024-01-01T00:00:00Z",       // Optional: Publication date (ISO 8601)
  "language": "en"                             // Optional: Detected language code
}
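
One way to read the response defensively is sketched below. It assumes that title and content are always present (they are the only fields not marked Optional above) and that optional fields may be missing from the JSON entirely. The ExtractResult class and parse_extract_response function are illustrative names, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractResult:
    title: str
    content: str
    excerpt: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[str] = None  # kept as the raw ISO 8601 string
    language: Optional[str] = None


def parse_extract_response(body):
    """Parse a JSON response body into an ExtractResult."""
    data = json.loads(body)
    return ExtractResult(
        title=data["title"],      # assumed always present
        content=data["content"],  # assumed always present
        excerpt=data.get("excerpt"),
        author=data.get("author"),
        published_at=data.get("publishedAt"),
        language=data.get("language"),
    )
```

Fields absent from the response come back as None, so downstream code can test for them explicitly.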

What "Main Content Extraction" Means

Main content extraction identifies and returns only the primary article or post content, removing:

Navigation menus and headers
Footers and sidebars
Advertisements and promotional content
Cookie notices and consent banners
Share buttons and social widgets
Related posts and recommendations
Comment sections

The result is clean, focused content ready for LLM processing, indexing, summarization, or other content workflows.
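
To make the idea concrete, here is a deliberately naive illustration that drops text inside a few boilerplate tags. It is a toy, not the algorithm the service uses: the tag list and the names ToyMainText and toy_extract are assumptions chosen for the example, and real pages need far more than tag names to separate content from clutter.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this toy example (an assumption,
# not the service's actual rules).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ToyMainText(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many boilerplate elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def toy_extract(html):
    """Return the text left after skipping boilerplate tags."""
    parser = ToyMainText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page whose body holds a nav element, an article paragraph, and a footer, toy_extract returns only the paragraph text.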

Input Requirements

Raw HTML: Send complete HTML source, not just fragments
Include <title>: Helps with title extraction
Avoid truncated DOM: Do not cut the HTML off partway; truncated markup gives poorer results
Max size: 2MB per request
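
These requirements can be checked on the client before sending a request. The sketch below rests on several assumptions: it reads "2MB" as 2 * 1024 * 1024 bytes of UTF-8 (the exact definition is not stated here), it treats a missing closing </html> tag as a sign of a fragment or truncated document, and check_html is a hypothetical helper name.

```python
# Assumption: "2MB" means 2 * 1024 * 1024 bytes of UTF-8 encoded HTML.
MAX_HTML_BYTES = 2 * 1024 * 1024


def check_html(html):
    """Return a list of warnings; an empty list means nothing was flagged."""
    problems = []
    size = len(html.encode("utf-8"))
    if size > MAX_HTML_BYTES:
        problems.append("html is %d bytes, over the 2MB limit" % size)
    lowered = html.lower()
    if "<title" not in lowered:
        problems.append("no <title> element found")
    # Heuristic only: a missing closing tag suggests a fragment or truncation.
    if "</html>" not in lowered:
        problems.append("no closing </html> tag; input may be truncated or a fragment")
    return problems
```

Only the size check corresponds to a hard limit; the other two are heuristics for the recommendations above and can be treated as warnings.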

Using the Result

Once you receive the JSON response, you can:

Store: Save title and content to your database
Summarize: Send content to an LLM for summarization
Embed: Create embeddings for RAG systems
Index: Add to search indexes
Process: Use in content pipelines and workflows
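
As an example of the Store step, the sketch below saves title and content to SQLite using Python's standard sqlite3 module. The table name articles, its columns, and the store_result helper are illustrative assumptions; any database would work the same way.

```python
import sqlite3


def store_result(conn, result, source_url=None):
    """Insert one extraction result (a dict parsed from the JSON response)."""
    # Hypothetical schema for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, url TEXT, "
        "title TEXT NOT NULL, content TEXT NOT NULL)"
    )
    cur = conn.execute(
        "INSERT INTO articles (url, title, content) VALUES (?, ?, ?)",
        (source_url, result["title"], result["content"]),
    )
    conn.commit()
    return cur.lastrowid  # row id of the stored article
```

The stored content column can then feed the summarization, embedding, or indexing steps listed above.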