Best Practices for Crawl
Learn how to effectively use Tavily’s Crawl API to extract and process web content.
When to Use Crawl vs. Map
Use Crawl when you need:
- Full content extraction from pages
- Deep content analysis
- Processing of paginated or nested content
- Extraction of specific content patterns
- Integration with RAG systems
Use Map when you need:
- Quick site structure discovery
- URL collection without content extraction
- Sitemap generation
- Path pattern matching
- Domain structure analysis
Use Cases
1. Deep or Unlinked Content
Many sites have content that’s difficult to access through standard means:
- Deeply nested pages not in main navigation
- Paginated archives (old blog posts, changelogs)
- Internal search-only content
Best Practice:
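One way to reach nested or paginated pages is to allow a deeper crawl while keeping the total page count bounded. The sketch below assumes the tavily-python SDK exposes a crawl method that accepts the parameters described in this guide; the entry URL is illustrative.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# Reach deeply nested or paginated pages by allowing more link-following
# levels, while capping the total number of pages so the crawl stays bounded.
response = client.crawl(
    "https://example.com/blog",  # illustrative entry point
    max_depth=3,      # follow links up to three levels from the start URL
    max_breadth=20,   # limit how many links are followed per page
    limit=100,        # hard cap on total pages processed
)

for result in response.get("results", []):
    print(result["url"])
```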
2. Structured but Nonstandard Layouts
For content that’s structured but not marked up in schema.org:
- Documentation
- Changelogs
- FAQs
Best Practice:
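Natural-language instructions can steer the crawler toward consistently structured sections such as docs, changelogs, or FAQs, even when the site has no schema.org markup. A minimal sketch, assuming an instructions parameter as described later in this guide.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# Guide the crawler with natural-language instructions instead of relying
# on structured markup that the site does not provide.
response = client.crawl(
    "https://example.com",
    instructions="Focus on documentation pages, changelog entries, and FAQs",
    max_depth=2,
    limit=50,
)
```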
3. Multi-modal Information Needs
When you need to combine information from multiple sections:
- Cross-referencing content
- Finding related information
- Building comprehensive knowledge bases
Best Practice:
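When related information is spread across several sections, one pattern is to crawl each section separately and merge the results keyed by URL. The field names used below ("url", "raw_content") are assumptions about the response schema; verify them against your SDK version.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# Crawl several sections of the same site and merge the results into one
# collection, so related pages can be cross-referenced downstream.
sections = ["https://example.com/docs", "https://example.com/blog"]

pages = {}
for url in sections:
    response = client.crawl(url, max_depth=2, limit=50)
    for result in response.get("results", []):
        # Field names ("url", "raw_content") are assumed; check the
        # response schema returned by your SDK version.
        pages[result["url"]] = result.get("raw_content", "")

print(f"Collected {len(pages)} unique pages across {len(sections)} sections")
```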
4. Rapidly Changing Content
For content that updates frequently:
- API documentation
- Product announcements
- News sections
Best Practice:
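For fast-moving sections, keep each crawl narrow and shallow so it is cheap enough to re-run on a schedule. A sketch, assuming the parameters described in this guide; the section URL is illustrative.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

def refresh_changelog():
    # Re-crawl only the fast-moving section, shallowly, so the job stays
    # cheap enough to run on a frequent schedule (cron, Airflow, etc.).
    return client.crawl(
        "https://example.com/changelog",  # illustrative section URL
        max_depth=1,
        limit=25,
        extract_depth="basic",
    )

latest = refresh_changelog()
```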
5. Behind Auth / Paywalls
For content requiring authentication:
- Internal knowledge bases
- Customer help centers
- Gated documentation
Best Practice:
6. Complete Coverage / Auditing
For comprehensive content analysis:
- Legal compliance checks
- Security audits
- Policy verification
Best Practice:
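For audits, widen the breadth, raise the page limit, and record every URL that was actually processed so coverage can be verified afterwards. A sketch under the same assumptions as the earlier examples.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# For audits, widen breadth and raise the page limit, then record every
# URL that was actually processed so coverage can be verified afterwards.
response = client.crawl(
    "https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=500,
    extract_depth="advanced",  # richer extraction, at a higher cost
)

crawled_urls = {result["url"] for result in response.get("results", [])}
print(f"Audited {len(crawled_urls)} pages")
```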
7. Semantic Search or RAG Integration
For feeding content into LLMs or search systems:
- RAG systems
- Enterprise search
- Knowledge bases
Best Practice:
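Crawl output can be chunked and tagged with its source URL before being embedded into a vector store. The chunking below is deliberately naive, the response field names are assumptions, and the embedding and storage steps are left to your RAG stack.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

def chunk(text, size=1000):
    # Naive fixed-size chunking; swap in your own splitter.
    return [text[i:i + size] for i in range(0, len(text), size)]

response = client.crawl("https://example.com/docs", max_depth=2, limit=100)

documents = []
for result in response.get("results", []):
    for piece in chunk(result.get("raw_content", "")):
        # Keep the source URL as metadata so answers can cite their origin.
        documents.append({"text": piece, "source": result["url"]})

# `documents` can now be embedded and loaded into the vector store of your
# choice (e.g. FAISS, pgvector); that step is omitted here.
```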
8. Known URL Patterns
When you have specific paths to crawl:
- Sitemap-based crawling
- Section-specific extraction
- Pattern-based content collection
Best Practice:
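When the relevant paths are already known, select_paths and exclude_paths keep the crawl focused. The regex-style patterns below assume path patterns are passed as a list; check the parameter format in your SDK version.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# Restrict the crawl to known sections with path patterns, and explicitly
# exclude areas that are never relevant.
response = client.crawl(
    "https://example.com",
    select_paths=["/docs/.*", "/api/v1/.*"],   # regex-style patterns (assumed format)
    exclude_paths=["/docs/archive/.*"],
    max_depth=2,
)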
Performance Considerations
Depth vs. Performance
- Each additional level of depth increases crawl time exponentially, since the number of candidate pages multiplies at every level
- Start with max_depth: 1 and increase it as needed (see the sketch after this list)
- Use max_breadth to control horizontal expansion
- Set an appropriate limit to prevent excessive crawling
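For example, a conservative starting configuration might look like the following sketch; raise max_depth only if the results miss pages you need. It assumes the tavily-python SDK's crawl method accepts the parameters described in this guide.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# A conservative starting configuration: shallow, narrow, and capped.
# Increase max_depth only if the results miss pages you need.
response = client.crawl(
    "https://example.com",
    max_depth=1,
    max_breadth=20,
    limit=50,
)
```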
Resource Optimization
- Use basic extract_depth for simple content
- Use advanced extract_depth only when needed
- Set an appropriate max_breadth based on the site's structure
- Use select_paths and exclude_paths to focus the crawl
Rate Limiting
- Respect the site's robots.txt
- Implement appropriate delays between requests
- Monitor API usage and limits
- Use appropriate error handling for rate limits (see the retry sketch below)
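A simple retry wrapper with exponential backoff covers transient failures and rate-limit responses. The specific exception type raised by the client may vary, so this sketch catches a generic Exception; narrow it once you know which errors your client raises.

```python
import time

from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

def crawl_with_retries(url, retries=3, backoff=2.0, **params):
    """Retry a crawl with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return client.crawl(url, **params)
        except Exception as exc:  # the SDK's rate-limit error type may differ
            if attempt == retries - 1:
                raise
            delay = backoff ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

result = crawl_with_retries("https://example.com", max_depth=1, limit=50)
```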
Best Practices Summary
1. Start Small
   - Begin with limited depth and breadth
   - Gradually increase based on needs
   - Monitor performance and adjust
2. Be Specific
   - Use path patterns to focus crawling
   - Exclude irrelevant sections
   - Set appropriate categories
3. Optimize Resources
   - Choose the appropriate extract_depth
   - Set reasonable limits
   - Use include_images only when needed
4. Handle Errors
   - Implement retry logic
   - Monitor failed results
   - Handle rate limits appropriately
5. Security
   - Respect robots.txt
   - Use appropriate authentication
   - Exclude sensitive paths
6. Integration
   - Plan for data processing
   - Consider storage requirements
   - Design for scalability
Common Pitfalls
1. Excessive Depth
   - Avoid setting max_depth too high
   - Start with 1-2 levels
   - Increase only if necessary
2. Unfocused Crawling
   - Set appropriate categories
   - Use instructions for guidance
3. Resource Overuse
   - Monitor API usage
   - Set appropriate limits
   - Use basic extract_depth when possible
4. Missing Content
   - Verify path patterns
   - Monitor crawl coverage
Integration with Map
Consider using Map before Crawl to:
- Discover site structure
- Identify relevant paths
- Plan crawl strategy
- Validate URL patterns
Example workflow:
1. Use Map to get the site structure
2. Analyze paths and patterns
3. Configure Crawl with the discovered paths
4. Execute a focused crawl (sketched below)
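A sketch of this workflow, assuming the SDK exposes both map and crawl methods and that the map response lists discovered URLs under "results"; adjust field names to match your SDK version.

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR_API_KEY")

# Step 1: map the site to discover its structure without extracting content.
# The shape of the map response (a list of URLs under "results") is an
# assumption; check your SDK version.
site_map = client.map("https://example.com")
urls = site_map.get("results", [])

# Step 2: inspect the discovered paths to decide what is worth crawling.
doc_urls = [u for u in urls if "/docs/" in u]
print(f"Found {len(doc_urls)} documentation URLs out of {len(urls)} total")

# Step 3: run a focused crawl restricted to the paths that matter.
response = client.crawl(
    "https://example.com",
    select_paths=["/docs/.*"],
    max_depth=2,
    limit=100,
)
```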
Conclusion
Tavily’s Crawl API is a powerful tool for extracting structured content from websites. By following these best practices, you can:
- Optimize crawl performance
- Ensure complete coverage
- Maintain resource efficiency
- Build robust content extraction pipelines
Remember to:
- Start with limited scope
- Use appropriate parameters
- Monitor performance
- Handle errors gracefully
- Respect site policies