Documentation Index
Fetch the complete documentation index at: https://docs.tavily.com/llms.txt
Use this file to discover all available pages before exploring further.
Crawl vs Map
Understanding when to use each API:
| Feature | Crawl | Map |
|---|
| Content extraction | Full content | URLs only |
| Use case | Deep content analysis | Site structure discovery |
| Speed | Slower (extracts content) | Faster (URLs only) |
| Best for | RAG, analysis, documentation | Sitemap generation |
Use Crawl when you need:
- Full content extraction from pages
- Deep content analysis
- Processing of paginated or nested content
- Extraction of specific content patterns
- Integration with RAG systems
Use Map when you need:
- Quick site structure discovery
- URL collection without content extraction
- Sitemap generation
- Path pattern matching
- Domain structure analysis
Crawl Parameters
Instructions
Guide the crawl with natural language to focus on relevant content:
{
"url": "example.com",
"max_depth": 2,
"instructions": "Find all documentation pages about Python"
}
When to use instructions:
- To focus crawling on specific topics or content types
- When you need semantic filtering of pages
- For agentic use cases where relevance is critical
Chunks per Source
Control the amount of content returned per page to prevent context window explosion:
{
"url": "example.com",
"instructions": "Find all documentation about authentication",
"chunks_per_source": 3
}
Key benefits:
- Returns only relevant content snippets (max 500 characters each) instead of full page content
- Prevents context window from exploding in agentic use cases
- Chunks appear in
raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>
chunks_per_source is only available when instructions are provided.
Depth and breadth
| Parameter | Description | Impact |
|---|
max_depth | How many levels deep to crawl from starting URL | Exponential latency growth |
max_breadth | Maximum links to follow per page | Horizontal spread |
limit | Total maximum pages to crawl | Hard cap on pages |
Performance tip: Each level of depth increases crawl time exponentially. Start with max_depth=1 and increase as needed.
// Conservative crawl
{
"url": "example.com",
"max_depth": 1,
"max_breadth": 20,
"limit": 20
}
// Comprehensive crawl
{
"url": "example.com",
"max_depth": 3,
"max_breadth": 100,
"limit": 500
}
Filtering and Focusing
Path patterns
Use regex patterns to include or exclude specific paths:
// Target specific sections
{
"url": "example.com",
"select_paths": ["/blog/.*", "/docs/.*", "/guides/.*"],
"exclude_paths": ["/private/.*", "/admin/.*", "/test/.*"]
}
// Paginated content
{
"url": "example.com/blog",
"max_depth": 2,
"select_paths": ["/blog/.*", "/blog/page/.*"],
"exclude_paths": ["/blog/tag/.*"]
}
Domain filtering
Control which domains to crawl:
// Stay within subdomain
{
"url": "docs.example.com",
"select_domains": ["^docs.example.com$"],
"max_depth": 2
}
// Exclude specific domains
{
"url": "example.com",
"exclude_domains": ["^ads.example.com$", "^tracking.example.com$"],
"max_depth": 2
}
Controls extraction quality vs. speed.
| Depth | When to use |
|---|
basic (default) | Simple content, faster processing |
advanced | Complex pages, tables, structured data |
{
"url": "docs.example.com",
"max_depth": 2,
"extract_depth": "advanced",
"select_paths": ["/docs/.*"]
}
Use Cases
1. Deep or Unlinked Content
Many sites have content that’s difficult to access through standard means:
- Deeply nested pages not in main navigation
- Paginated archives (old blog posts, changelogs)
- Internal search-only content
Best Practice:
{
"url": "example.com",
"max_depth": 3,
"max_breadth": 50,
"limit": 200,
"select_paths": ["/blog/.*", "/changelog/.*"],
"exclude_paths": ["/private/.*", "/admin/.*"]
}
2. Structured but Nonstandard Layouts
For content that’s structured but not marked up in schema.org:
- Documentation
- Changelogs
- FAQs
Best Practice:
{
"url": "docs.example.com",
"max_depth": 2,
"extract_depth": "advanced",
"select_paths": ["/docs/.*"]
}
When you need to combine information from multiple sections:
- Cross-referencing content
- Finding related information
- Building comprehensive knowledge bases
Best Practice:
{
"url": "example.com",
"max_depth": 2,
"instructions": "Find all documentation pages that link to API reference docs",
"extract_depth": "advanced"
}
4. Rapidly Changing Content
For content that updates frequently:
- API documentation
- Product announcements
- News sections
Best Practice:
{
"url": "api.example.com",
"max_depth": 1,
"max_breadth": 100
}
5. Behind Auth / Paywalls
For content requiring authentication:
- Internal knowledge bases
- Customer help centers
- Gated documentation
Best Practice:
{
"url": "help.example.com",
"max_depth": 2,
"select_domains": ["^help.example.com$"],
"exclude_domains": ["^public.example.com$"]
}
6. Complete Coverage / Auditing
For comprehensive content analysis:
- Legal compliance checks
- Security audits
- Policy verification
Best Practice:
{
"url": "example.com",
"max_depth": 3,
"max_breadth": 100,
"limit": 1000,
"extract_depth": "advanced",
"instructions": "Find all mentions of GDPR and data protection policies"
}
7. Semantic Search or RAG Integration
For feeding content into LLMs or search systems:
- RAG systems
- Enterprise search
- Knowledge bases
Best Practice:
{
"url": "docs.example.com",
"max_depth": 2,
"extract_depth": "advanced",
"include_images": true
}
8. Known URL Patterns
When you have specific paths to crawl:
- Sitemap-based crawling
- Section-specific extraction
- Pattern-based content collection
Best Practice:
{
"url": "example.com",
"max_depth": 1,
"select_paths": ["/docs/.*", "/api/.*", "/guides/.*"],
"exclude_paths": ["/private/.*", "/admin/.*"]
}
- Each level of depth increases crawl time exponentially
- Start with max_depth: 1 and increase as needed
- Use max_breadth to control horizontal expansion
- Set appropriate limit to prevent excessive crawling
Rate Limiting
- Respect site’s robots.txt
- Implement appropriate delays between requests
- Monitor API usage and limits
- Use appropriate error handling for rate limits
Integration with Map
Consider using Map before Crawl to:
- Discover site structure
- Identify relevant paths
- Plan crawl strategy
- Validate URL patterns
Example workflow:
- Use Map to get site structure
- Analyze paths and patterns
- Configure Crawl with discovered paths
- Execute focused crawl
Benefits:
- Discover site structure before crawling
- Identify relevant path patterns
- Avoid unnecessary crawling
- Validate URL patterns work correctly
Common Pitfalls
Excessive depth
- Problem: Setting
max_depth=4 or higher
- Impact: Exponential crawl time, unnecessary pages
- Solution: Start with 1-2 levels, increase only if needed
Unfocused crawling
- Problem: No
instructions provided, crawling entire site
- Impact: Wasted resources, irrelevant content, context explosion
- Solution: Use instructions to focus the crawl semantically
Missing limits
- Problem: No
limit parameter set
- Impact: Runaway crawls, unexpected costs
- Solution: Always set a reasonable
limit value
Ignoring failed results
- Problem: Not checking which pages failed extraction
- Impact: Incomplete data, missed content
- Solution: Monitor failed results and adjust parameters
Summary
- Use instructions and chunks_per_source for focused, relevant results in agentic use cases
- Start with conservative parameters (
max_depth=1, max_breadth=20)
- Use path patterns to focus crawling on relevant content
- Choose appropriate extract_depth based on content complexity
- Set reasonable limits to prevent excessive crawling
- Monitor failed results and adjust patterns accordingly
- Use Map first to understand site structure
- Implement error handling for rate limits and failures
- Respect robots.txt and site policies
- Optimize for your use case (speed vs. completeness)
- Process results incrementally rather than waiting for full crawl
Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.