Crawl vs Map
Understanding when to use each API:
| Feature | Crawl | Map |
|---|---|---|
| Content extraction | Full content | URLs only |
| Use case | Deep content analysis | Site structure discovery |
| Speed | Slower (extracts content) | Faster (URLs only) |
| Best for | RAG, analysis, documentation | Sitemap generation |
Use Crawl when you need:
- Full content extraction from pages
- Deep content analysis
- Processing of paginated or nested content
- Extraction of specific content patterns
- Integration with RAG systems
Use Map when you need:
- Quick site structure discovery
- URL collection without content extraction
- Sitemap generation
- Path pattern matching
- Domain structure analysis
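To make the trade-off concrete, here is a minimal sketch of both calls against a REST-style API. The base URL, auth header, and response field names are placeholders (assumptions), not documented values; only the Crawl/Map distinction and parameter names used elsewhere in this guide come from the text.

```python
import requests

API_BASE = "https://api.example.com"  # placeholder base URL (assumption)
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # placeholder auth scheme (assumption)

# Map: fast URL discovery, no content extraction
site_map = requests.post(
    f"{API_BASE}/map",
    headers=HEADERS,
    json={"url": "https://docs.example.com"},
).json()
print(site_map.get("urls", []))  # assumed response field: list of discovered URLs

# Crawl: slower, but returns page content for analysis or RAG
crawl = requests.post(
    f"{API_BASE}/crawl",
    headers=HEADERS,
    json={"url": "https://docs.example.com", "max_depth": 1, "limit": 20},
).json()
for page in crawl.get("results", []):  # assumed response field
    print(page.get("url"), len(page.get("raw_content") or ""))
```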
Crawl Parameters
Instructions
Guide the crawl with natural language to focus on relevant content (see the sketch after this list):
- To focus crawling on specific topics or content types
- When you need semantic filtering of pages
- For agentic use cases where relevance is critical
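A sketch of a semantically focused crawl using the instructions parameter described above. The endpoint, auth header, and response shape are assumptions; the instructions, max_depth, and limit parameter names come from this guide.

```python
import requests

payload = {
    "url": "https://docs.example.com",
    # Natural-language guidance keeps the crawl on relevant pages only
    "instructions": "Only crawl pages about authentication and API keys",
    "max_depth": 2,
    "limit": 50,
}

resp = requests.post(
    "https://api.example.com/crawl",  # placeholder endpoint (assumption)
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
results = resp.json().get("results", [])  # assumed response field
```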
Chunks per Source
Control the amount of content returned per page to prevent context window explosion (see the sketch after this list):
- Returns only relevant content snippets (max 500 characters each) instead of full page content
- Prevents context window from exploding in agentic use cases
- Chunks appear in raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>
chunks_per_source is only available when instructions are provided.
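A sketch combining instructions with chunks_per_source so each page contributes at most a few short snippets instead of its full content. Request and response shapes are assumptions; the parameter names and the chunked raw_content format come from this guide.

```python
import requests

payload = {
    "url": "https://docs.example.com",
    "instructions": "Find pages describing rate limits",  # required for chunks_per_source
    "chunks_per_source": 3,  # at most 3 relevant snippets (~500 chars each) per page
    "max_depth": 1,
    "limit": 25,
}

resp = requests.post(
    "https://api.example.com/crawl",  # placeholder endpoint (assumption)
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
for page in resp.json().get("results", []):  # assumed response field
    # raw_content holds the selected chunks: "<chunk 1> [...] <chunk 2> ..."
    print(page.get("url"), (page.get("raw_content") or "")[:200])
```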
Depth and breadth
| Parameter | Description | Impact |
|---|---|---|
| max_depth | How many levels deep to crawl from starting URL | Exponential latency growth |
| max_breadth | Maximum links to follow per page | Horizontal spread |
| limit | Total maximum pages to crawl | Hard cap on pages |
Start with max_depth=1 and increase as needed.
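A conservative starting configuration as a sketch. The endpoint and response field are placeholders (assumptions); max_depth, max_breadth, and limit come from the table above.

```python
import requests

# Start narrow: one level deep, bounded fan-out, hard cap on total pages.
payload = {
    "url": "https://docs.example.com",
    "max_depth": 1,     # levels below the starting URL
    "max_breadth": 20,  # links followed per page
    "limit": 50,        # absolute cap on pages crawled
}

resp = requests.post(
    "https://api.example.com/crawl",  # placeholder endpoint (assumption)
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
print(len(resp.json().get("results", [])), "pages crawled")  # assumed response field
```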
Filtering and Focusing
Path patterns
Use regex patterns to include or exclude specific paths:
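A sketch of regex-based path filtering. The parameter names select_paths and exclude_paths are hypothetical illustrations, not documented names; only the idea of regex include/exclude patterns comes from the text.

```python
import requests

payload = {
    "url": "https://docs.example.com",
    # Hypothetical parameter names, for illustration only:
    "select_paths": [r"^/docs/.*", r"^/api/.*"],  # only crawl documentation and API paths
    "exclude_paths": [r".*/changelog/.*"],        # skip changelog pages
    "max_depth": 2,
    "limit": 100,
}

requests.post(
    "https://api.example.com/crawl",  # placeholder endpoint (assumption)
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
```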
Domain filtering
Control which domains to crawl:
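Similarly, a sketch of restricting the crawl to specific domains. The allow_external and select_domains names are hypothetical, used only to illustrate domain filtering.

```python
import requests

payload = {
    "url": "https://docs.example.com",
    # Hypothetical parameter names, for illustration only:
    "allow_external": False,                      # stay on the starting domain
    "select_domains": [r"^docs\.example\.com$"],  # or whitelist specific (sub)domains
    "max_depth": 1,
    "limit": 50,
}

requests.post(
    "https://api.example.com/crawl",  # placeholder endpoint (assumption)
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
```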
Extract depth
Controls extraction quality vs. speed.
| Depth | When to use |
|---|---|
| basic (default) | Simple content, faster processing |
| advanced | Complex pages, tables, structured data |
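A sketch switching to advanced extraction for pages with tables and structured data. The extract_depth values come from the table above; the endpoint is a placeholder assumption.

```python
import requests

payload = {
    "url": "https://docs.example.com/pricing",
    "extract_depth": "advanced",  # heavier extraction for tables and structured data
    "max_depth": 1,
    "limit": 10,
}

requests.post(
    "https://api.example.com/crawl",  # placeholder endpoint (assumption)
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
```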
Use Cases
1. Deep or Unlinked Content
Many sites have content that’s difficult to access through standard means:
- Deeply nested pages not in main navigation
- Paginated archives (old blog posts, changelogs)
- Internal search-only content
2. Structured but Nonstandard Layouts
For content that’s structured but not marked up in schema.org:
- Documentation
- Changelogs
- FAQs
3. Multi-modal Information Needs
When you need to combine information from multiple sections:
- Cross-referencing content
- Finding related information
- Building comprehensive knowledge bases
4. Rapidly Changing Content
For content that updates frequently:
- API documentation
- Product announcements
- News sections
5. Behind Auth / Paywalls
For content requiring authentication:
- Internal knowledge bases
- Customer help centers
- Gated documentation
6. Complete Coverage / Auditing
For comprehensive content analysis:
- Legal compliance checks
- Security audits
- Policy verification
7. Semantic Search or RAG Integration
For feeding content into LLMs or search systems (see the sketch after this list):
- RAG systems
- Enterprise search
- Knowledge bases
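A sketch of turning crawl output into documents for a RAG index. Only raw_content and the page URL are taken from this guide; the crawl_results variable and the simple character-based chunker are illustrative assumptions, not part of the API.

```python
# Assume crawl_results is the list of page dicts returned by a crawl
# (each with "url" and "raw_content", as described in this guide).
def to_rag_documents(crawl_results, chunk_size=1000, overlap=100):
    """Split each crawled page into overlapping text chunks with source metadata."""
    documents = []
    step = chunk_size - overlap
    for page in crawl_results:
        text = page.get("raw_content") or ""
        for start in range(0, max(len(text), 1), step):
            chunk = text[start:start + chunk_size]
            if chunk.strip():
                documents.append({"text": chunk, "source": page.get("url")})
    return documents

# The resulting documents can then be embedded and stored in whatever vector store you use.
```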
8. Known URL Patterns
When you have specific paths to crawl:
- Sitemap-based crawling
- Section-specific extraction
- Pattern-based content collection
Performance Optimization
Depth vs. Performance
- Each level of depth increases crawl time exponentially
- Start with max_depth: 1 and increase as needed
- Use max_breadth to control horizontal expansion
- Set appropriate limit to prevent excessive crawling
Rate Limiting
- Respect site’s robots.txt
- Implement appropriate delays between requests
- Monitor API usage and limits
- Use appropriate error handling for rate limits
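A sketch of basic rate-limit handling with exponential backoff. Handling HTTP 429 is a common convention, but the actual status codes, headers, and endpoint of your crawl API are assumptions here.

```python
import time
import requests

def crawl_with_backoff(payload, max_retries=5):
    """POST a crawl request, retrying with exponential backoff when rate limited."""
    for attempt in range(max_retries):
        resp = requests.post(
            "https://api.example.com/crawl",  # placeholder endpoint (assumption)
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json=payload,
            timeout=300,
        )
        if resp.status_code == 429:  # rate limited (assumed status code)
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Rate limit retries exhausted")
```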
Integration with Map
Consider using Map before Crawl to:
- Discover site structure
- Identify relevant paths
- Plan crawl strategy
- Validate URL patterns
A typical sequence, sketched below:
1. Use Map to get site structure
2. Analyze paths and patterns
3. Configure Crawl with discovered paths
4. Execute focused crawl
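A sketch of this Map-then-Crawl sequence. The /map and /crawl endpoints, the urls response field, and the select_paths parameter are assumptions used for illustration; only the two-step pattern comes from this guide.

```python
import re
import requests

API = "https://api.example.com"  # placeholder base URL (assumption)
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Use Map to get the site structure (fast, URLs only)
urls = requests.post(
    f"{API}/map", headers=HEADERS, json={"url": "https://docs.example.com"}
).json().get("urls", [])  # assumed response field

# 2. Analyze paths and keep only the relevant section
doc_urls = sorted({u for u in urls if re.search(r"/docs/", u)})

# 3-4. Configure and execute a focused crawl on the discovered section
crawl = requests.post(f"{API}/crawl", headers=HEADERS, json={
    "url": "https://docs.example.com",
    "select_paths": [r"^/docs/.*"],  # hypothetical parameter name
    "max_depth": 1,
    "limit": len(doc_urls) or 50,
}).json()
```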
This approach lets you:
- Discover site structure before crawling
- Identify relevant path patterns
- Avoid unnecessary crawling
- Validate that URL patterns work correctly
Common Pitfalls
Excessive depth
- Problem: Setting max_depth=4 or higher
- Impact: Exponential crawl time, unnecessary pages
- Solution: Start with 1-2 levels, increase only if needed
Unfocused crawling
- Problem: No instructions provided, crawling the entire site
- Impact: Wasted resources, irrelevant content, context explosion
- Solution: Use instructions to focus the crawl semantically
Missing limits
- Problem: No limit parameter set
- Impact: Runaway crawls, unexpected costs
- Solution: Always set a reasonable limit value
Ignoring failed results
- Problem: Not checking which pages failed extraction
- Impact: Incomplete data, missed content
- Solution: Monitor failed results and adjust parameters
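A sketch of inspecting failures after a crawl. The failed_results field name and its item shape are assumptions, intended only to show the pattern of checking what failed and adjusting parameters accordingly.

```python
# Assume `response` is the parsed JSON returned by a crawl request.
def report_failures(response):
    """Log pages that failed extraction so crawl parameters can be adjusted."""
    failed = response.get("failed_results", [])  # assumed field name
    if not failed:
        print("All pages extracted successfully.")
        return
    print(f"{len(failed)} pages failed extraction:")
    for item in failed:
        print(" -", item.get("url"), item.get("error", "unknown error"))
```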
Summary
- Use instructions and chunks_per_source for focused, relevant results in agentic use cases
- Start with conservative parameters (max_depth=1, max_breadth=20)
- Use path patterns to focus crawling on relevant content
- Choose appropriate extract_depth based on content complexity
- Set reasonable limits to prevent excessive crawling
- Monitor failed results and adjust patterns accordingly
- Use Map first to understand site structure
- Implement error handling for rate limits and failures
- Respect robots.txt and site policies
- Optimize for your use case (speed vs. completeness)
- Process results incrementally rather than waiting for full crawl
Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.