Crawl vs Map

Understanding when to use each API:
Feature            | Crawl                        | Map
Content extraction | Full content                 | URLs only
Use case           | Deep content analysis        | Site structure discovery
Speed              | Slower (extracts content)    | Faster (URLs only)
Best for           | RAG, analysis, documentation | Sitemap generation

Use Crawl when you need:

  • Full content extraction from pages
  • Deep content analysis
  • Processing of paginated or nested content
  • Extraction of specific content patterns
  • Integration with RAG systems

Use Map when you need:

  • Quick site structure discovery
  • URL collection without content extraction
  • Sitemap generation
  • Path pattern matching
  • Domain structure analysis
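
To make the trade-off concrete, here is a minimal Python sketch of both calls using the requests library. The endpoint URLs, auth scheme, and response field names below are placeholders, not a verified client; substitute your provider's actual API contract.

import os
import requests

# Placeholders: substitute your provider's real endpoints and auth scheme.
CRAWL_ENDPOINT = "https://api.provider.example/crawl"
MAP_ENDPOINT = "https://api.provider.example/map"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWL_API_KEY']}"}

def map_site(url: str) -> list[str]:
    """Map: fast structure discovery, URLs only."""
    resp = requests.post(MAP_ENDPOINT, json={"url": url},
                         headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("results", [])  # response field name assumed

def crawl_site(url: str, **params) -> list[dict]:
    """Crawl: slower, returns extracted content per page."""
    resp = requests.post(CRAWL_ENDPOINT, json={"url": url, **params},
                         headers=HEADERS, timeout=300)
    resp.raise_for_status()
    return resp.json().get("results", [])  # response field name assumed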

Crawl Parameters

Instructions

Guide the crawl with natural language to focus on relevant content:
{
  "url": "example.com",
  "max_depth": 2,
  "instructions": "Find all documentation pages about Python"
}
When to use instructions:
  • To focus crawling on specific topics or content types
  • When you need semantic filtering of pages
  • For agentic use cases where relevance is critical

Chunks per Source

Control the amount of content returned per page to prevent context window explosion:
{
  "url": "example.com",
  "instructions": "Find all documentation about authentication",
  "chunks_per_source": 3
}
Key benefits:
  • Returns only relevant content snippets (max 500 characters each) instead of full page content
  • Prevents context window from exploding in agentic use cases
  • Chunks appear in raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>
chunks_per_source is only available when instructions are provided.
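
Since chunks arrive embedded in raw_content with those markers, a small parser is handy downstream. A minimal sketch, assuming the <chunk N> marker format shown above:

import re

def split_chunks(raw_content: str) -> list[str]:
    # Split "<chunk 1> ... <chunk 2> ..." into individual snippets
    parts = re.split(r"<chunk \d+>", raw_content)
    return [p.strip() for p in parts if p.strip()]

# split_chunks("<chunk 1> intro text <chunk 2> auth setup")
# -> ["intro text", "auth setup"]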

Depth and breadth

Parameter   | Description                                          | Impact
max_depth   | How many levels deep to crawl from the starting URL  | Exponential latency growth
max_breadth | Maximum links to follow per page                     | Horizontal spread
limit       | Total maximum pages to crawl                         | Hard cap on pages
Performance tip: Each level of depth increases crawl time exponentially. Start with max_depth=1 and increase as needed.
// Conservative crawl
{
  "url": "example.com",
  "max_depth": 1,
  "max_breadth": 20,
  "limit": 20
}

// Comprehensive crawl
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 100,
  "limit": 500
}
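
The exponential warning is easy to quantify: in the worst case each level multiplies candidate pages by max_breadth, and limit is the only hard cap. A quick back-of-the-envelope check:

def worst_case_pages(max_depth: int, max_breadth: int, limit: int) -> int:
    # Sum of max_breadth^level for each level, capped by limit
    total = sum(max_breadth ** level for level in range(1, max_depth + 1))
    return min(total, limit)

print(worst_case_pages(1, 20, 20))    # 20: the conservative crawl above
print(worst_case_pages(3, 100, 500))  # 500: limit caps 1,010,100 candidates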

Filtering and Focusing

Path patterns

Use regex patterns to include or exclude specific paths:
// Target specific sections
{
  "url": "example.com",
  "select_paths": ["/blog/.*", "/docs/.*", "/guides/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*", "/test/.*"]
}

// Paginated content
{
  "url": "example.com/blog",
  "max_depth": 2,
  "select_paths": ["/blog/.*", "/blog/page/.*"],
  "exclude_paths": ["/blog/tag/.*"]
}
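
Because a typo in a pattern can silently select nothing (or everything), it is worth testing patterns against sample paths locally before spending a crawl on them. A sketch using Python's re module; server-side matching semantics may differ slightly, so treat this as a sanity check:

import re

select_paths = [r"/blog/.*", r"/docs/.*"]
exclude_paths = [r"/private/.*", r"/admin/.*"]

def is_selected(path: str) -> bool:
    # Exclusions win, then any select pattern admits the path
    if any(re.search(p, path) for p in exclude_paths):
        return False
    return any(re.search(p, path) for p in select_paths)

assert is_selected("/blog/2024/launch")
assert not is_selected("/admin/users")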

Domain filtering

Control which domains to crawl:
// Stay within subdomain
{
  "url": "docs.example.com",
  "select_domains": ["^docs.example.com$"],
  "max_depth": 2
}

// Exclude specific domains
{
  "url": "example.com",
  "exclude_domains": ["^ads.example.com$", "^tracking.example.com$"],
  "max_depth": 2
}

Extract depth

Controls extraction quality vs. speed.
Depth           | When to use
basic (default) | Simple content, faster processing
advanced        | Complex pages, tables, structured data
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "select_paths": ["/docs/.*"]
}

Use Cases

1. Deep or Unlinked Content

Many sites have content that’s hard to reach through search engines or on-site navigation:
  • Deeply nested pages not in main navigation
  • Paginated archives (old blog posts, changelogs)
  • Internal search-only content
Best Practice:
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 50,
  "limit": 200,
  "select_paths": ["/blog/.*", "/changelog/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*"]
}

2. Structured but Nonstandard Layouts

For content that’s structured but not marked up with schema.org metadata:
  • Documentation
  • Changelogs
  • FAQs
Best Practice:
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "select_paths": ["/docs/.*"]
}

3. Multi-modal Information Needs

When you need to combine information from multiple sections:
  • Cross-referencing content
  • Finding related information
  • Building comprehensive knowledge bases
Best Practice:
{
  "url": "example.com",
  "max_depth": 2,
  "instructions": "Find all documentation pages that link to API reference docs",
  "extract_depth": "advanced"
}

4. Rapidly Changing Content

For content that updates frequently:
  • API documentation
  • Product announcements
  • News sections
Best Practice:
{
  "url": "api.example.com",
  "max_depth": 1,
  "max_breadth": 100
}

5. Behind Auth / Paywalls

For content requiring authentication:
  • Internal knowledge bases
  • Customer help centers
  • Gated documentation
Best Practice:
{
  "url": "help.example.com",
  "max_depth": 2,
  "select_domains": ["^help.example.com$"],
  "exclude_domains": ["^public.example.com$"]
}

6. Complete Coverage / Auditing

For comprehensive content analysis:
  • Legal compliance checks
  • Security audits
  • Policy verification
Best Practice:
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 100,
  "limit": 1000,
  "extract_depth": "advanced",
  "instructions": "Find all mentions of GDPR and data protection policies"
}

7. Semantic Search or RAG Integration

For feeding content into LLMs or search systems:
  • RAG systems
  • Enterprise search
  • Knowledge bases
Best Practice:
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "include_images": true
}
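
Downstream, crawl results typically need to be re-chunked and embedded. A minimal sketch of turning results into RAG-ready records, assuming each result carries url and raw_content fields; the embedding and vector-store steps are left to your stack:

def to_rag_records(results: list[dict], chunk_size: int = 1000) -> list[dict]:
    # Fixed-size character chunks with source attribution for citation
    records = []
    for page in results:
        text = page.get("raw_content") or ""
        for i in range(0, len(text), chunk_size):
            records.append({"source_url": page.get("url"),
                            "text": text[i:i + chunk_size]})
    return records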

8. Known URL Patterns

When you have specific paths to crawl:
  • Sitemap-based crawling
  • Section-specific extraction
  • Pattern-based content collection
Best Practice:
{
  "url": "example.com",
  "max_depth": 1,
  "select_paths": ["/docs/.*", "/api/.*", "/guides/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*"]
}

Performance Optimization

Depth vs. Performance

  • Each level of depth increases crawl time exponentially
  • Start with max_depth=1 and increase as needed
  • Use max_breadth to control horizontal expansion
  • Set appropriate limit to prevent excessive crawling

Rate Limiting

  • Respect the site’s robots.txt
  • Add appropriate delays between requests
  • Monitor API usage and limits
  • Handle rate-limit errors with retries and backoff, as sketched below
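
A minimal sketch of that last point, retrying HTTP 429 responses with exponential backoff; the status code and retry policy are generic HTTP practice, not documented behavior of any specific crawl API:

import time
import requests

def post_with_backoff(url: str, payload: dict, headers: dict,
                      max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers, timeout=300)
        if resp.status_code != 429:       # not rate limited: succeed or raise
            resp.raise_for_status()
            return resp
        time.sleep(delay)                 # wait, then retry with doubled delay
        delay *= 2
    raise RuntimeError("still rate limited after retries")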

Integration with Map

Consider using Map before Crawl to:
  1. Discover site structure
  2. Identify relevant paths
  3. Plan crawl strategy
  4. Validate URL patterns
Example workflow (see the code sketch after the benefits list):
  1. Use Map to get site structure
  2. Analyze paths and patterns
  3. Configure Crawl with discovered paths
  4. Execute focused crawl
Benefits:
  • Discover site structure before crawling
  • Identify relevant path patterns
  • Avoid unnecessary crawling
  • Validate URL patterns work correctly
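
A sketch of that workflow end to end, reusing the hypothetical map_site/crawl_site helpers from the first sketch above. It assumes Map returns absolute URLs with a scheme; the pattern-derivation step is illustrative only:

from urllib.parse import urlparse

urls = map_site("https://docs.example.com")       # 1. discover structure
sections = set()
for u in urls:                                    # 2. collect top-level paths
    parts = urlparse(u).path.strip("/").split("/")
    if parts and parts[0]:
        sections.add("/" + parts[0])
patterns = [s + "/.*" for s in sorted(sections)]  # 3. build select_paths
results = crawl_site("https://docs.example.com",  # 4. run the focused crawl
                     max_depth=2, select_paths=patterns, limit=200)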

Common Pitfalls

Excessive depth

  • Problem: Setting max_depth=4 or higher
  • Impact: Exponential crawl time, unnecessary pages
  • Solution: Start with 1-2 levels, increase only if needed

Unfocused crawling

  • Problem: No instructions provided, crawling entire site
  • Impact: Wasted resources, irrelevant content, context explosion
  • Solution: Use instructions to focus the crawl semantically

Missing limits

  • Problem: No limit parameter set
  • Impact: Runaway crawls, unexpected costs
  • Solution: Always set a reasonable limit value

Ignoring failed results

  • Problem: Not checking which pages failed extraction
  • Impact: Incomplete data, missed content
  • Solution: Monitor failed results and adjust parameters
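
A sketch of that monitoring step, reusing the post_with_backoff helper and placeholder endpoint from earlier. Whether the response reports failures, and under which field, is an assumption here (failed_results is a guess); inspect a real response and adjust:

payload = {"url": "example.com", "max_depth": 2, "limit": 100}
data = post_with_backoff(CRAWL_ENDPOINT, payload, HEADERS).json()

failed = data.get("failed_results", [])   # field name assumed, verify locally
if failed:
    print(f"{len(failed)} pages failed extraction:")
    for item in failed:
        print(" -", item)                 # then adjust depth/patterns and rerun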

Summary

  • Use instructions and chunks_per_source for focused, relevant results in agentic use cases
  • Start with conservative parameters (max_depth=1, max_breadth=20)
  • Use path patterns to focus crawling on relevant content
  • Choose appropriate extract_depth based on content complexity
  • Set reasonable limits to prevent excessive crawling
  • Monitor failed results and adjust patterns accordingly
  • Use Map first to understand site structure
  • Implement error handling for rate limits and failures
  • Respect robots.txt and site policies
  • Optimize for your use case (speed vs. completeness)
  • Process results incrementally rather than waiting for full crawl
Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.