Crawl vs Map

Understanding when to use each API:
Feature            | Crawl                        | Map
Content extraction | Full content                 | URLs only
Use case           | Deep content analysis        | Site structure discovery
Speed              | Slower (extracts content)    | Faster (URLs only)
Best for           | RAG, analysis, documentation | Sitemap generation

Use Crawl when you need:

  • Full content extraction from pages
  • Deep content analysis
  • Processing of paginated or nested content
  • Extraction of specific content patterns
  • Integration with RAG systems

Use Map when you need:

  • Quick site structure discovery
  • URL collection without content extraction
  • Sitemap generation
  • Path pattern matching
  • Domain structure analysis
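
To make the trade-off concrete, here is a minimal Python sketch of both calls using the requests library. The endpoint URLs, auth scheme, and response field names below are placeholders, not a verified client; substitute your provider's actual API contract.

import os
import requests

# Placeholders: substitute your provider's real endpoints and auth scheme.
CRAWL_ENDPOINT = "https://api.provider.example/crawl"
MAP_ENDPOINT = "https://api.provider.example/map"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWL_API_KEY']}"}

def map_site(url: str) -> list[str]:
    """Map: fast structure discovery, URLs only."""
    resp = requests.post(MAP_ENDPOINT, json={"url": url},
                         headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("results", [])  # response field name assumed

def crawl_site(url: str, **params) -> list[dict]:
    """Crawl: slower, returns extracted content per page."""
    resp = requests.post(CRAWL_ENDPOINT, json={"url": url, **params},
                         headers=HEADERS, timeout=300)
    resp.raise_for_status()
    return resp.json().get("results", [])  # response field name assumed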

Crawl Parameters

Instructions

Guide the crawl with natural language to focus on relevant content:
{
  "url": "example.com",
  "max_depth": 2,
  "instructions": "Find all documentation pages about Python"
}
When to use instructions:
  • To focus crawling on specific topics or content types
  • When you need semantic filtering of pages
  • For agentic use cases where relevance is critical

Chunks per Source

Control the amount of content returned per page to prevent context window explosion:
{
  "url": "example.com",
  "instructions": "Find all documentation about authentication",
  "chunks_per_source": 3
}
Key benefits:
  • Returns only relevant content snippets (max 500 characters each) instead of full page content
  • Prevents context window from exploding in agentic use cases
  • Chunks appear in raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>
chunks_per_source is only available when instructions are provided.
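
Since chunks arrive embedded in raw_content with those markers, a small parser is handy downstream. A minimal sketch, assuming the <chunk N> marker format shown above:

import re

def split_chunks(raw_content: str) -> list[str]:
    # Split "<chunk 1> ... <chunk 2> ..." into individual snippets
    parts = re.split(r"<chunk \d+>", raw_content)
    return [p.strip() for p in parts if p.strip()]

# split_chunks("<chunk 1> intro text <chunk 2> auth setup")
# -> ["intro text", "auth setup"]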

Depth and breadth

Parameter   | Description                                          | Impact
max_depth   | How many levels deep to crawl from the starting URL  | Exponential latency growth
max_breadth | Maximum links to follow per page                     | Horizontal spread
limit       | Total maximum pages to crawl                         | Hard cap on pages
Performance tip: Each level of depth increases crawl time exponentially. Start with max_depth=1 and increase as needed.
// Conservative crawl
{
  "url": "example.com",
  "max_depth": 1,
  "max_breadth": 20,
  "limit": 20
}

// Comprehensive crawl
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 100,
  "limit": 500
}
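
The exponential warning is easy to quantify: in the worst case each level multiplies candidate pages by max_breadth, and limit is the only hard cap. A quick back-of-the-envelope check:

def worst_case_pages(max_depth: int, max_breadth: int, limit: int) -> int:
    # Sum of max_breadth^level for each level, capped by limit
    total = sum(max_breadth ** level for level in range(1, max_depth + 1))
    return min(total, limit)

print(worst_case_pages(1, 20, 20))    # 20: the conservative crawl above
print(worst_case_pages(3, 100, 500))  # 500: limit caps 1,010,100 candidates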

Filtering and Focusing

Path patterns

Use regex patterns to include or exclude specific paths:
// Target specific sections
{
  "url": "example.com",
  "select_paths": ["/blog/.*", "/docs/.*", "/guides/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*", "/test/.*"]
}

// Paginated content
{
  "url": "example.com/blog",
  "max_depth": 2,
  "select_paths": ["/blog/.*", "/blog/page/.*"],
  "exclude_paths": ["/blog/tag/.*"]
}
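
Because a typo in a pattern can silently select nothing (or everything), it is worth testing patterns against sample paths locally before spending a crawl on them. A sketch using Python's re module; server-side matching semantics may differ slightly, so treat this as a sanity check:

import re

select_paths = [r"/blog/.*", r"/docs/.*"]
exclude_paths = [r"/private/.*", r"/admin/.*"]

def is_selected(path: str) -> bool:
    # Exclusions win, then any select pattern admits the path
    if any(re.search(p, path) for p in exclude_paths):
        return False
    return any(re.search(p, path) for p in select_paths)

assert is_selected("/blog/2024/launch")
assert not is_selected("/admin/users")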

Domain filtering

Control which domains to crawl:
// Stay within subdomain
{
  "url": "docs.example.com",
  "select_domains": ["^docs.example.com$"],
  "max_depth": 2
}

// Exclude specific domains
{
  "url": "example.com",
  "exclude_domains": ["^ads.example.com$", "^tracking.example.com$"],
  "max_depth": 2
}

Extract depth

Controls extraction quality vs. speed.
Depth           | When to use
basic (default) | Simple content, faster processing
advanced        | Complex pages, tables, structured data
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "select_paths": ["/docs/.*"]
}

Use Cases

1. Deep or Unlinked Content

Many sites have content that’s hard to reach through search engines or on-site navigation:
  • Deeply nested pages not in main navigation
  • Paginated archives (old blog posts, changelogs)
  • Internal search-only content
Best Practice:
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 50,
  "limit": 200,
  "select_paths": ["/blog/.*", "/changelog/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*"]
}

2. Structured but Nonstandard Layouts

For content that’s structured but not marked up with schema.org metadata:
  • Documentation
  • Changelogs
  • FAQs
Best Practice:
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "select_paths": ["/docs/.*"]
}

3. Multi-modal Information Needs

When you need to combine information from multiple sections:
  • Cross-referencing content
  • Finding related information
  • Building comprehensive knowledge bases
Best Practice:
{
  "url": "example.com",
  "max_depth": 2,
  "instructions": "Find all documentation pages that link to API reference docs",
  "extract_depth": "advanced"
}

4. Rapidly Changing Content

For content that updates frequently:
  • API documentation
  • Product announcements
  • News sections
Best Practice:
{
  "url": "api.example.com",
  "max_depth": 1,
  "max_breadth": 100
}

5. Behind Auth / Paywalls

For content requiring authentication:
  • Internal knowledge bases
  • Customer help centers
  • Gated documentation
Best Practice:
{
  "url": "help.example.com",
  "max_depth": 2,
  "select_domains": ["^help.example.com$"],
  "exclude_domains": ["^public.example.com$"]
}

6. Complete Coverage / Auditing

For comprehensive content analysis:
  • Legal compliance checks
  • Security audits
  • Policy verification
Best Practice:
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 100,
  "limit": 1000,
  "extract_depth": "advanced",
  "instructions": "Find all mentions of GDPR and data protection policies"
}

7. Semantic Search or RAG Integration

For feeding content into LLMs or search systems:
  • RAG systems
  • Enterprise search
  • Knowledge bases
Best Practice:
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "include_images": true
}
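
Downstream, crawl results typically need to be re-chunked and embedded. A minimal sketch of turning results into RAG-ready records, assuming each result carries url and raw_content fields; the embedding and vector-store steps are left to your stack:

def to_rag_records(results: list[dict], chunk_size: int = 1000) -> list[dict]:
    # Fixed-size character chunks with source attribution for citation
    records = []
    for page in results:
        text = page.get("raw_content") or ""
        for i in range(0, len(text), chunk_size):
            records.append({"source_url": page.get("url"),
                            "text": text[i:i + chunk_size]})
    return records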

8. Known URL Patterns

When you have specific paths to crawl:
  • Sitemap-based crawling
  • Section-specific extraction
  • Pattern-based content collection
Best Practice:
{
  "url": "example.com",
  "max_depth": 1,
  "select_paths": ["/docs/.*", "/api/.*", "/guides/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*"]
}

Performance Optimization

Depth vs. Performance

  • Each level of depth increases crawl time exponentially
  • Start with max_depth=1 and increase as needed
  • Use max_breadth to control horizontal expansion
  • Set appropriate limit to prevent excessive crawling

Rate Limiting

  • Respect the site’s robots.txt
  • Add appropriate delays between requests
  • Monitor API usage and limits
  • Handle rate-limit errors with retries and backoff, as sketched below
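
A minimal sketch of that last point, retrying HTTP 429 responses with exponential backoff; the status code and retry policy are generic HTTP practice, not documented behavior of any specific crawl API:

import time
import requests

def post_with_backoff(url: str, payload: dict, headers: dict,
                      max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers, timeout=300)
        if resp.status_code != 429:       # not rate limited: succeed or raise
            resp.raise_for_status()
            return resp
        time.sleep(delay)                 # wait, then retry with doubled delay
        delay *= 2
    raise RuntimeError("still rate limited after retries")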

Integration with Map

Consider using Map before Crawl to:
  1. Discover site structure
  2. Identify relevant paths
  3. Plan crawl strategy
  4. Validate URL patterns
Example workflow (see the code sketch after the benefits list):
  1. Use Map to get site structure
  2. Analyze paths and patterns
  3. Configure Crawl with discovered paths
  4. Execute focused crawl
Benefits:
  • Discover site structure before crawling
  • Identify relevant path patterns
  • Avoid unnecessary crawling
  • Validate URL patterns work correctly
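
A sketch of that workflow end to end, reusing the hypothetical map_site/crawl_site helpers from the first sketch above. It assumes Map returns absolute URLs with a scheme; the pattern-derivation step is illustrative only:

from urllib.parse import urlparse

urls = map_site("https://docs.example.com")       # 1. discover structure
sections = set()
for u in urls:                                    # 2. collect top-level paths
    parts = urlparse(u).path.strip("/").split("/")
    if parts and parts[0]:
        sections.add("/" + parts[0])
patterns = [s + "/.*" for s in sorted(sections)]  # 3. build select_paths
results = crawl_site("https://docs.example.com",  # 4. run the focused crawl
                     max_depth=2, select_paths=patterns, limit=200)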

Common Pitfalls

Excessive depth

  • Problem: Setting max_depth=4 or higher
  • Impact: Exponential crawl time, unnecessary pages
  • Solution: Start with 1-2 levels, increase only if needed

Unfocused crawling

  • Problem: No instructions provided, crawling entire site
  • Impact: Wasted resources, irrelevant content, context explosion
  • Solution: Use instructions to focus the crawl semantically

Missing limits

  • Problem: No limit parameter set
  • Impact: Runaway crawls, unexpected costs
  • Solution: Always set a reasonable limit value

Ignoring failed results

  • Problem: Not checking which pages failed extraction
  • Impact: Incomplete data, missed content
  • Solution: Monitor failed results and adjust parameters
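
A sketch of that monitoring step, reusing the post_with_backoff helper and placeholder endpoint from earlier. Whether the response reports failures, and under which field, is an assumption here (failed_results is a guess); inspect a real response and adjust:

payload = {"url": "example.com", "max_depth": 2, "limit": 100}
data = post_with_backoff(CRAWL_ENDPOINT, payload, HEADERS).json()

failed = data.get("failed_results", [])   # field name assumed, verify locally
if failed:
    print(f"{len(failed)} pages failed extraction:")
    for item in failed:
        print(" -", item)                 # then adjust depth/patterns and rerun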

Summary

  • Use instructions and chunks_per_source for focused, relevant results in agentic use cases
  • Start with conservative parameters (max_depth=1, max_breadth=20)
  • Use path patterns to focus crawling on relevant content
  • Choose appropriate extract_depth based on content complexity
  • Set reasonable limits to prevent excessive crawling
  • Monitor failed results and adjust patterns accordingly
  • Use Map first to understand site structure
  • Implement error handling for rate limits and failures
  • Respect robots.txt and site policies
  • Optimize for your use case (speed vs. completeness)
  • Process results incrementally rather than waiting for full crawl
Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.