What You’ll Learn
- Crawling a website and extracting clean content from its pages
- Using path filters and instructions for selective crawling
- When to use Crawl vs Map
- Feeding crawled content into a retrieval pipeline
How Does It Work?
Tavily Crawl follows links from a starting URL and extracts clean content from each page it visits. Unlike Map (which only discovers URLs), Crawl returns the full page content as markdown or text, ready for LLM consumption.

| Feature | Crawl | Map |
|---|---|---|
| Returns | URLs + full page content | URL list only |
| Speed | Slower (extracts content) | Fast (seconds) |
| Cost | Higher (extraction per page) | Lower |
| Best for | RAG pipelines, content analysis, documentation | Site discovery, URL filtering, sitemap generation |
Getting Started
Get your Tavily API key
Documentation Ingestion
Crawl a docs site with `select_paths` to focus on the pages that matter, and `extract_depth: "advanced"` for complex pages with tables or code blocks.
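A minimal sketch of such a request against the Crawl REST endpoint, using only the standard library. The parameter names come from this page; the exact endpoint URL, auth header, and response shape should be confirmed against the API reference before relying on them:

```python
import json
import os
import urllib.request


def build_crawl_payload(url: str) -> dict:
    """Assemble a docs-focused crawl request using the knobs from this page."""
    return {
        "url": url,
        "max_depth": 2,                # docs sites are usually shallow
        "limit": 50,                   # hard cap to control cost
        "select_paths": ["/docs/.*"],  # regex: only documentation pages
        "extract_depth": "advanced",   # handles tables and code blocks
        "format": "markdown",
    }


def crawl(url: str, api_key: str) -> dict:
    """POST the payload to the crawl endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        "https://api.tavily.com/crawl",
        data=json.dumps(build_crawl_payload(url)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__" and os.environ.get("TAVILY_API_KEY"):
    result = crawl("https://docs.tavily.com", os.environ["TAVILY_API_KEY"])
    for page in result["results"]:
        print(page["url"], len(page.get("raw_content") or ""))
```

The SDKs wrap this same call (e.g. `client.crawl(...)` in Python); the raw request is shown here only to make the parameters explicit.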
Selective Path Crawling
Combine path patterns with natural-language `instructions` to focus the crawl semantically. When `instructions` are set, you can also use `chunks_per_source` to get only the most relevant snippets per page. With `chunks_per_source`, the `raw_content` field contains the top relevant chunks separated by `[...]` instead of the full page, keeping context windows small.
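Splitting those chunks back out downstream is a one-liner on the `[...]` separator described above:

```python
def split_chunks(raw_content: str) -> list[str]:
    """Split a chunks_per_source-style raw_content field on the '[...]'
    separator, dropping empty fragments and surrounding whitespace."""
    return [part.strip() for part in raw_content.split("[...]") if part.strip()]
```

Each resulting string is one retrieved chunk, ready to embed or feed to a model individually.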
Crawl-to-Retrieval Pipeline
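The crawl→chunk→index pattern can be sketched in plain Python. The `url` and `raw_content` field names follow the Crawl response; the character-window chunker and keyword index are illustrative stand-ins, not Tavily APIs — a real pipeline would swap in an embedding model and vector store:

```python
import re
from collections import defaultdict


def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-size character windows with overlap — a deliberately simple chunker."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: list[dict]) -> tuple[list[dict], dict]:
    """pages: [{'url': ..., 'raw_content': ...}] as returned by Crawl."""
    chunks, inverted = [], defaultdict(set)
    for page in pages:
        for piece in chunk_text(page.get("raw_content") or ""):
            chunk_id = len(chunks)
            chunks.append({"url": page["url"], "text": piece})
            for token in set(tokenize(piece)):
                inverted[token].add(chunk_id)
    return chunks, inverted


def search(chunks: list[dict], inverted: dict, query: str, k: int = 3) -> list[dict]:
    """Rank chunks by how many query tokens they contain."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for chunk_id in inverted.get(token, ()):
            scores[chunk_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [chunks[i] for i in ranked]
```

Every search hit carries its source `url`, so retrieved context stays citable.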
Crawl a site, chunk the content, and build a searchable index. This sketch shows the pattern — for a complete implementation, see the Crawl to RAG app example.

Critical Knobs
`max_depth`
- Range: 1–5, default: 1
- Each level increases crawl time exponentially
- Start at 1 and increase only after verifying results
`max_breadth`
- Range: 1–500, default: 20
- Controls how many links are followed per page level
`limit`
- Hard cap on total pages crawled
- Always set this to prevent runaway crawls and unexpected costs
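To see why `limit` matters, note how the three knobs interact: page count grows with `max_breadth` raised to the power of `max_depth`. A back-of-envelope estimate (my own arithmetic, not Tavily's accounting — real crawls revisit and dedupe links, so this is an upper bound):

```python
def worst_case_pages(max_depth: int, max_breadth: int, limit: int) -> int:
    """Upper bound on pages visited: the root plus up to max_breadth
    new links per page at each level, capped by limit."""
    total, frontier = 1, 1
    for _ in range(max_depth):
        frontier *= max_breadth
        total += frontier
    return min(total, limit)
```

With the defaults (`max_depth=1`, `max_breadth=20`) the bound is 21 pages; at `max_depth=3` it jumps past 8,000 unless `limit` caps it.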
`select_paths` / `exclude_paths`
- Regex patterns to include or exclude URL paths
- Example: `"/docs/.*"` to target docs, `"/blog/.*"` to skip blog posts
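As a mental model for how such patterns behave, here is an illustrative filter in plain Python — exclusions win, then the path must match a select pattern if any are given. (Tavily's exact matching semantics may differ; treat this as a sketch, not the implementation.)

```python
import re


def path_allowed(path: str, select_paths=None, exclude_paths=None) -> bool:
    """Illustrative URL-path filter: exclude patterns take priority,
    then select patterns act as an allowlist when present."""
    if exclude_paths and any(re.fullmatch(p, path) for p in exclude_paths):
        return False
    if select_paths:
        return any(re.fullmatch(p, path) for p in select_paths)
    return True
```

Testing your patterns locally like this is cheaper than discovering mid-crawl that a regex matched nothing.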
`extract_depth`
- `"basic"` (default) — standard content, faster
- `"advanced"` — tables, embedded content, JS-rendered pages; slower but more thorough
`instructions`
- Natural-language guidance for the crawler
- Enables semantic filtering of pages
- Unlocks `chunks_per_source` for targeted content retrieval
Production Notes
- Cost control: Always set `limit` to cap the number of pages. Each crawled page consumes credits based on `extract_depth`.
- Timeouts: Large crawls can take time. Use the `timeout` parameter (10–150s) to set upper bounds.
- Failed pages: Check `response["failed_results"]` for pages that couldn't be extracted. Adjust `extract_depth` or path filters accordingly.
- Map first: Consider using Map to discover the site structure before crawling. This lets you identify the right `select_paths` patterns and set a realistic `limit`.
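A small triage helper for the failed-pages note above. The `failed_results` key comes from this page; the per-item `url` and `error` field names are assumptions to verify against the response schema:

```python
def summarize_failures(response: dict) -> dict:
    """Group failed pages by error message so you can decide whether to
    retry with extract_depth='advanced' or tighten path filters.
    NOTE: the 'error' key per failed item is an assumed field name."""
    by_error = {}
    for item in response.get("failed_results", []):
        by_error.setdefault(item.get("error", "unknown"), []).append(item.get("url"))
    return by_error
```

Scanning the grouped output makes it obvious whether failures cluster (e.g. all timeouts under one path prefix) or are scattered one-offs worth ignoring.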
Next Steps
Crawl API Reference
Full parameter list, response schema, and interactive playground.
Crawl Best Practices
Depth tuning, path filtering, domain controls, and common pitfalls.
Python SDK Reference
Python client methods, async support, and type details.
JavaScript SDK Reference
JavaScript/TypeScript client methods and usage.