Best Practices for Extract
Learn best practices for extracting web content.
Extracting web content using Tavily
Efficiently extracting content from web pages is crucial for AI-powered applications. Tavily provides two main approaches to content extraction, each suited for different use cases.
1. One-step extraction: directly retrieve raw_content
You can extract web content by setting include_raw_content = true when making a Tavily Search API call. This retrieves both search results and extracted content in a single step.
However, this can increase latency because raw content may be extracted from sources that turn out to be irrelevant. We recommend splitting the process into two steps: first run multiple sub-queries to expand the pool of sources, then curate the most relevant documents based on content snippets or source scores. Extracting raw content only from the most relevant sources yields higher-quality RAG documents.
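A minimal sketch of the one-step approach, assuming the `tavily-python` SDK, whose `TavilyClient.search` accepts `include_raw_content=True` and returns a dict with a `results` list. The helper name `search_with_raw_content` is hypothetical, and the client is passed in as a parameter so the function can be exercised with a stub.

```python
from typing import Any


def search_with_raw_content(client: Any, query: str, max_results: int = 5) -> list[dict]:
    """Run a single search call and keep only results that include page content."""
    response = client.search(
        query=query,
        include_raw_content=True,  # ask Search to also return extracted page text
        max_results=max_results,
    )
    pages = []
    for result in response.get("results", []):
        raw = result.get("raw_content")
        if raw:  # assumption: raw_content can be empty or None when a page yields no text
            pages.append(
                {"url": result["url"], "score": result.get("score"), "raw_content": raw}
            )
    return pages
```

With the real SDK, the client would typically be created with `from tavily import TavilyClient` and `TavilyClient(api_key="YOUR_API_KEY")`. Note that every returned source is extracted here, whether or not it ends up being useful.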
2. Two-step process: search, then extract
For better accuracy and customization, we recommend a two-step process:
Step 1: Search
Use the Tavily Search API to retrieve relevant web pages; each result includes the page URL.
Step 2: Extract
Use the Tavily Extract API to fetch the full content from the most relevant URLs.
Example:
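A minimal sketch of the two steps, assuming the `tavily-python` SDK, where `TavilyClient.search` returns scored results and `TavilyClient.extract` takes a list of URLs. The helper name `search_then_extract`, the `min_score` threshold of 0.5, and the `max_urls` cap are illustrative assumptions, not recommended values.

```python
from typing import Any


def search_then_extract(
    client: Any,
    sub_queries: list[str],
    min_score: float = 0.5,  # illustrative threshold, tune for your data
    max_urls: int = 5,
) -> dict:
    """Step 1: pool results from several sub-queries. Step 2: extract the best URLs."""
    best_score: dict[str, float] = {}
    for query in sub_queries:
        response = client.search(query=query, max_results=5)
        for result in response.get("results", []):
            url, score = result["url"], result.get("score", 0.0)
            # Deduplicate by URL, keeping the highest score seen across sub-queries.
            if score > best_score.get(url, -1.0):
                best_score[url] = score

    # Curate: rank by score, drop low-scoring sources, keep the top few.
    ranked = sorted(best_score, key=best_score.get, reverse=True)
    selected = [url for url in ranked if best_score[url] >= min_score][:max_urls]
    if not selected:
        return {"results": [], "failed_results": []}

    # Step 2: fetch full content only for the curated URLs.
    return client.extract(urls=selected)
```

Calling `search_then_extract(client, ["query one", "query two"])` with a real `TavilyClient` would issue one search per sub-query and a single extract call for the selected URLs.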
Pros of two-step extraction
✅ More control – Extract only from selected URLs.
✅ Higher accuracy – Filter out irrelevant results before extraction.
✅ Advanced extraction capabilities – Use extract_depth = "advanced".
Cons of two-step extraction
❌ Slightly more expensive.
Using advanced extraction
Setting extract_depth = "advanced" in the Extract API allows for more comprehensive content retrieval. This mode is particularly useful when dealing with:
- Complex web pages with dynamic content, embedded media, or structured data.
- Tables and structured information that require accurate parsing.
- Use cases that need higher extraction success rates.
If precision and depth are priorities for your application, extract_depth = "advanced" is the recommended choice.
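As a rough illustration, advanced extraction might be requested as below. This assumes the `tavily-python` SDK, where `TavilyClient.extract` accepts `extract_depth` and returns a dict with `results` and `failed_results` lists; the helper name `extract_advanced` is hypothetical.

```python
from typing import Any


def extract_advanced(client: Any, urls: list[str]) -> tuple[dict[str, str], list[str]]:
    """Extract with extract_depth="advanced"; return (content by URL, failed URLs)."""
    response = client.extract(urls=urls, extract_depth="advanced")
    # Map each successfully extracted URL to its page text.
    content = {r["url"]: r["raw_content"] for r in response.get("results", [])}
    # Collect URLs the API reported as failed so the caller can log or retry them.
    failed = [f["url"] for f in response.get("failed_results", [])]
    return content, failed
```

Returning the failed URLs separately keeps extraction failures visible instead of silently dropping them.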