API Reference

Extract Content

Extract clean content from HTML:

POST /v1/extract

Request Headers

Authorization: Bearer ce_your_token_here
Content-Type: application/json

Request Body

{
  "html": "<!doctype html><html>...</html>",  // Required: Raw HTML (max 2MB)
  "url": "https://example.com/article",        // Optional: URL for labeling
  "contentTypeHint": "article",                // Optional: "article" | "blog" | "news"
  "languageHint": "en"                         // Optional: Language code
}
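
As a rough illustration, the request above could be assembled in Python with only the standard library. The `//` comments in the body above are annotations; JSON itself has no comment syntax, so they must not appear in the payload you send. The base URL `https://api.example.com` and the helper name `build_extract_request` are assumptions made for this sketch, since this reference gives only the path `/v1/extract`.

```python
import json
import urllib.request

# Hypothetical base URL: the reference documents only the path /v1/extract.
BASE_URL = "https://api.example.com"


def build_extract_request(html, token, url=None, content_type_hint=None, language_hint=None):
    """Build (but do not send) a POST /v1/extract request."""
    body = {"html": html}  # "html" is the only required field
    # Optional fields are included only when a value is provided.
    if url is not None:
        body["url"] = url
    if content_type_hint is not None:
        body["contentTypeHint"] = content_type_hint
    if language_hint is not None:
        body["languageHint"] = language_hint
    return urllib.request.Request(
        BASE_URL + "/v1/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would then be a matter of passing the returned object to `urllib.request.urlopen` and decoding the JSON body of the reply.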

Response

{
  "title": "Article Title",                    // Extracted title
  "content": "Clean article content...",       // Main content as plain text
  "excerpt": "Brief summary",                  // Optional: Article excerpt
  "author": "Author Name",                     // Optional: Author name
  "publishedAt": "2024-01-01T00:00:00Z",       // Optional: Publication date (ISO 8601)
  "language": "en"                             // Optional: Detected language code
}
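
One possible way to consume this response is to map it onto a small typed record, treating the fields marked Optional as possibly absent. The sketch below assumes that `title` and `content` are always present (the reference does not mark them optional); `ExtractResult` and `parse_extract_response` are hypothetical names, not part of the API.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ExtractResult:
    title: str
    content: str
    excerpt: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[datetime] = None
    language: Optional[str] = None


def parse_extract_response(raw):
    """Parse the JSON text of a /v1/extract response into an ExtractResult."""
    data = json.loads(raw)
    published = data.get("publishedAt")
    if published is not None:
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset first.
        published = datetime.fromisoformat(published.replace("Z", "+00:00"))
    return ExtractResult(
        title=data["title"],      # assumed always present
        content=data["content"],  # assumed always present
        excerpt=data.get("excerpt"),
        author=data.get("author"),
        published_at=published,
        language=data.get("language"),
    )
```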

What "Main Content Extraction" Means

Main content extraction identifies and returns only the primary article or post content, removing:

  • Navigation menus and headers
  • Footers and sidebars
  • Advertisements and promotional content
  • Cookie notices and consent banners
  • Share buttons and social widgets
  • Related posts and recommendations
  • Comment sections

The result is clean, focused content ready for LLM processing, indexing, summarization, or other content workflows.

Input Requirements

  • Raw HTML: Send the complete HTML source of the page, not a fragment
  • Include <title>: Helps with title extraction
  • Avoid truncated DOM: Markup that is cut off partway gives worse results, so send the full document
  • Max size: 2MB per request
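
The requirements above can be checked on the client before a request is made. The following is a minimal sketch, not the service's own validation: it assumes the 2MB limit applies to the UTF-8 encoded HTML and uses the conservative reading of 2,000,000 bytes, and it uses simple substring checks as heuristics for fragments and truncation. The name `check_html_input` is hypothetical.

```python
# Assumption: "2MB" is read conservatively as 2,000,000 bytes of UTF-8 HTML.
MAX_HTML_BYTES = 2_000_000


def check_html_input(html):
    """Return a list of problems found; an empty list means the input looks usable."""
    problems = []
    lowered = html.lower()
    if len(html.encode("utf-8")) > MAX_HTML_BYTES:
        problems.append("exceeds 2MB limit")
    if "<html" not in lowered:
        problems.append("looks like a fragment (no <html> element)")
    if "<title" not in lowered:
        problems.append("no <title> element")
    if "</html>" not in lowered:
        # Heuristic only: a missing closing tag often indicates a cut-off document.
        problems.append("possibly truncated (no closing </html>)")
    return problems
```

A caller might log these problems as warnings rather than reject the input outright, since only the size limit is stated as a hard constraint.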

Using the Result

Once you receive the JSON response, you can:

  • Store: Save title and content to your database
  • Summarize: Send content to an LLM for summarization
  • Embed: Create embeddings for RAG systems
  • Index: Add to search indexes
  • Process: Use in content pipelines and workflows
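
As an example of the Store step, the title and content could be saved to a local SQLite database using Python's built-in `sqlite3` module. This is only a sketch: the table name `articles`, its columns, and the helper `store_result` are assumptions chosen for illustration, and `result` stands for the already decoded JSON response.

```python
import sqlite3


def store_result(conn, source_url, result):
    """Save the title and content of one extraction result, keyed by source URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, content TEXT)"
    )
    # INSERT OR REPLACE keeps a single row per URL when a page is re-extracted.
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, content) VALUES (?, ?, ?)",
        (source_url, result["title"], result["content"]),
    )
    conn.commit()


# Usage with an in-memory database and a response shaped like the one documented above.
conn = sqlite3.connect(":memory:")
store_result(
    conn,
    "https://example.com/article",
    {"title": "Article Title", "content": "Clean article content..."},
)
```

Keying rows by URL is one design choice among several; a pipeline that needs to keep a history of extractions would add a timestamp column instead of replacing rows.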