RAG / Indexing
Problem
RAG systems need clean, focused content for embedding and retrieval. Raw HTML contains navigation, ads, footers, and other noise that dilutes embeddings and reduces retrieval accuracy.
Solution
Use the Content Extractor API to extract clean article content before creating embeddings. Your RAG system then indexes only the main content, which improves retrieval quality.
Example Workflow
- Fetch HTML from your content sources
- Extract clean content using the Content Extractor API
- Create embeddings from the clean content
- Store embeddings in your vector database
- Retrieve relevant content for LLM context
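The final retrieval step might look something like the sketch below. It assumes embeddings are plain numeric arrays and uses a small in-memory store in place of a real vector database. `InMemoryVectorStore`, `cosineSimilarity`, and `buildContext` are hypothetical names chosen for illustration; they are not part of the Content Extractor API or of any vector database client.

```javascript
// Hypothetical in-memory stand-in for a vector database (illustration only).
class InMemoryVectorStore {
  constructor() {
    this.records = new Map();
  }

  // Insert or overwrite a record by id.
  upsert({ id, embedding, metadata }) {
    this.records.set(id, { id, embedding, metadata });
  }

  // Return the topK records ranked by cosine similarity to the query embedding.
  query(queryEmbedding, topK = 3) {
    return [...this.records.values()]
      .map((record) => ({
        ...record,
        score: cosineSimilarity(queryEmbedding, record.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Embeddings must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Join the retrieved articles into a single context string for an LLM prompt.
function buildContext(hits) {
  return hits
    .map((hit) => `${hit.metadata.title}\n${hit.metadata.content}`)
    .join('\n\n');
}
```

In a real system the store would be your vector database and the query embedding would come from the same embedding model used at indexing time. Because only extracted article text was indexed, the context built from the top hits contains no navigation or footer text.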
Example Request
const response = await fetch('https://api.content-extractor.devstools.net/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ce_your_token_here',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ html: rawHtml })
});
// Stop early if the request failed, so an error body is never embedded
if (!response.ok) {
  throw new Error(`Extraction failed with status ${response.status}`);
}
const { title, content } = await response.json();
// Create embeddings from clean content
const embedding = await createEmbedding(content);
// Store in vector database
await vectorDB.upsert({
  id: articleId,
  embedding,
  metadata: { title, content }
});
Benefits
- Cleaner embeddings focused on main content
- Better retrieval accuracy
- Reduced noise in vector database
- More relevant context for LLM generation