What You’ll Learn

  • Crawling a website and extracting clean content from its pages
  • Using path filters and instructions for selective crawling
  • When to use Crawl vs Map
  • Feeding crawled content into a retrieval pipeline

How Does It Work?

Tavily Crawl follows links from a starting URL and extracts clean content from each page it visits. Unlike Map (which only discovers URLs), Crawl returns the full page content as markdown or text, ready for LLM consumption.
| Feature  | Crawl                                          | Map                                              |
| -------- | ---------------------------------------------- | ------------------------------------------------ |
| Returns  | URLs + full page content                       | URL list only                                    |
| Speed    | Slower (extracts content)                      | Fast (seconds)                                   |
| Cost     | Higher (extraction per page)                   | Lower                                            |
| Best for | RAG pipelines, content analysis, documentation | Site discovery, URL filtering, sitemap generation |
Rule of thumb: Use Map when you need to find pages. Use Crawl when you need to read pages.
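One practical consequence of this split: the regex patterns you would pass to Crawl's select_paths can be previewed locally against a URL list like the one Map returns, before paying for extraction. A sketch with a hypothetical URL list (the URLs and pattern below are illustrative, not taken from a real Map response):

```python
import re

# Hypothetical URLs, shaped like what Map returns (URL strings only).
urls = [
    "https://docs.tavily.com/documentation/api-reference/introduction",
    "https://docs.tavily.com/blog/announcement",
    "https://docs.tavily.com/sdk/python/quick-start",
]

# Preview a select_paths-style regex locally before running a full crawl.
pattern = re.compile(r"/documentation/.*|/sdk/.*")
to_crawl = [u for u in urls if pattern.search(u)]
print(to_crawl)  # the /blog/ URL is filtered out
```

This map-then-filter pattern is the cheap way to size a crawl before committing to it.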

Getting Started

1. Get your Tavily API key

2. Install the Tavily Python SDK

uv venv
uv pip install tavily-python
3. Crawl a website

import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.crawl(
    url="https://docs.tavily.com",
    max_depth=1,
    limit=10,
)

for page in response["results"]:
    print(f"\n--- {page['url']} ---")
    print(f"Content length: {len(page['raw_content'])} chars")
    print(page["raw_content"][:200])
Output

--- https://docs.tavily.com/ ---
Content length: 4040 chars
# Tavily docs

Search, crawl, and extract content from the web with APIs
built for LLMs and autonomous agents...

--- https://docs.tavily.com/documentation/api-reference/introduction ---
Content length: 3647 chars
# API Reference Introduction

This section covers Tavily endpoint APIs, request parameters,
and response schemas...

Documentation Ingestion

Crawl a docs site with select_paths to focus on the pages that matter, and extract_depth: "advanced" for complex pages with tables or code blocks.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.crawl(
    url="https://docs.tavily.com",
    max_depth=2,
    limit=50,
    select_paths=["/documentation/.*", "/sdk/.*"],
    exclude_paths=["/changelog/.*"],
    extract_depth="advanced",
)

pages = response["results"]
print(f"Crawled {len(pages)} pages")

for page in pages:
    print(f"  {page['url']} ({len(page['raw_content'])} chars)")
Start with max_depth=1 and a conservative limit. Each level of depth increases crawl time exponentially — scale up only after verifying results.
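Once a crawl like the one above returns, a common next step is persisting each page as a markdown file for later processing. A minimal sketch; url_to_filename is a hypothetical helper, and the commented-out loop assumes the response shape shown earlier:

```python
import re

def url_to_filename(url: str) -> str:
    """Turn a page URL into a filesystem-safe markdown filename."""
    path = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-")
    return slug + ".md"

print(url_to_filename("https://docs.tavily.com/sdk/python/quick-start"))
# → docs-tavily-com-sdk-python-quick-start.md

# Writing the crawled pages out would then look like:
# os.makedirs("docs_dump", exist_ok=True)
# for page in response["results"]:
#     path = os.path.join("docs_dump", url_to_filename(page["url"]))
#     with open(path, "w") as f:
#         f.write(page["raw_content"] or "")
```

Keeping the source URL recoverable from the filename makes it easy to cite pages later in a RAG pipeline.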

Selective Path Crawling

Combine path patterns with natural-language instructions to focus the crawl semantically. When instructions are set, you can also use chunks_per_source to get only the most relevant snippets per page.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.crawl(
    url="https://docs.tavily.com",
    max_depth=2,
    select_paths=["/documentation/api-reference/.*"],
    exclude_paths=["/documentation/api-reference/endpoint/research-streaming"],
    instructions="Find pages about Search and Extract endpoints",
    chunks_per_source=3,
)

for page in response["results"]:
    print(f"\n{page['url']}")
    print(page["raw_content"])
With chunks_per_source, the raw_content field contains the top relevant chunks separated by [...] instead of the full page, keeping context windows small.
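If you want those chunks individually rather than as one joined string, splitting on the separator is straightforward. A sketch (split_chunks is a hypothetical helper, and sample stands in for a real raw_content value):

```python
def split_chunks(raw_content: str) -> list[str]:
    """Split a chunks_per_source-style raw_content into individual snippets,
    assuming the '[...]' separator described above."""
    return [part.strip() for part in raw_content.split("[...]") if part.strip()]

sample = "First relevant chunk.\n[...]\nSecond relevant chunk."
print(split_chunks(sample))  # → ['First relevant chunk.', 'Second relevant chunk.']
```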

Crawl-to-Retrieval Pipeline

Crawl a site, chunk the content, and build a searchable index. This sketch shows the pattern — for a complete implementation, see the Crawl to RAG app example.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

response = client.crawl(
    url="https://docs.tavily.com",
    max_depth=2,
    limit=30,
    extract_depth="advanced",
)

all_chunks = []
for page in response["results"]:
    chunks = chunk_text(page["raw_content"])
    for chunk in chunks:
        all_chunks.append({
            "text": chunk,
            "url": page["url"],
        })

print(f"Created {len(all_chunks)} chunks from {len(response['results'])} pages")

# Next steps: embed chunks and load into a vector store
# See the Crawl to RAG example for the full pipeline
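Before wiring in embeddings, the chunk index built above can already be queried with simple keyword overlap, which is useful for smoke-testing the pipeline end to end. A sketch under that assumption; the scoring is deliberately naive and the demo chunks are illustrative, not real crawl output:

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def keyword_score(query: str, text: str) -> float:
    """Fraction of query words that appear in the chunk."""
    q = _tokens(query)
    return len(q & _tokens(text)) / len(q) if q else 0.0

def keyword_search(chunks: list[dict], query: str, k: int = 3) -> list[dict]:
    """Return the top-k chunks by keyword overlap with the query."""
    return sorted(chunks, key=lambda c: keyword_score(query, c["text"]), reverse=True)[:k]

# Illustrative chunks, shaped like the all_chunks entries built above.
demo_chunks = [
    {"text": "The crawl endpoint follows links and extracts content.", "url": "a"},
    {"text": "Pricing depends on extract depth.", "url": "b"},
]
top = keyword_search(demo_chunks, "how does crawl extract content", k=1)
print(top[0]["url"])  # → a
```

Swapping keyword_score for embedding similarity turns this directly into the vector-store pipeline the comment above points to.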

Critical Knobs

max_depth

  • Range: 1–5, default: 1
  • Each level increases crawl time exponentially
  • Start at 1 and increase only after verifying results

max_breadth

  • Range: 1–500, default: 20
  • Controls how many links are followed per page level

limit

  • Hard cap on total pages crawled
  • Always set this to prevent runaway crawls and unexpected costs

select_paths / exclude_paths

  • Regex patterns to include or exclude URL paths
  • Example: select "/docs/.*" to target docs, exclude "/blog/.*" to skip blog posts

extract_depth

  • "basic" (default) — standard content, faster
  • "advanced" — tables, embedded content, JS-rendered pages; slower but more thorough

instructions

  • Natural-language guidance for the crawler
  • Enables semantic filtering of pages
  • Unlocks chunks_per_source for targeted content retrieval
For the complete parameter list, see the Crawl API reference.
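Putting the knobs together, a conservative first crawl might look like the following. A sketch: the parameter values are illustrative starting points, not recommendations from the reference.

```python
# Conservative settings for a first exploratory crawl.
crawl_params = {
    "url": "https://docs.tavily.com",
    "max_depth": 1,                         # start shallow; depth cost grows exponentially
    "limit": 25,                            # hard cap to bound pages and spend
    "select_paths": ["/documentation/.*"],  # focus on the pages that matter
    "extract_depth": "basic",               # upgrade to "advanced" only if tables/JS pages fail
}
# response = client.crawl(**crawl_params)   # assumes a TavilyClient as in the examples above
print(sorted(crawl_params))
```

Widen one knob at a time (depth first, then limit) so you can attribute any jump in crawl time or cost to a single change.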

Production Notes

  • Cost control: Always set limit to cap the number of pages. Each crawled page consumes credits based on extract_depth.
  • Timeouts: Large crawls can take time. Use the timeout parameter (10–150s) to set upper bounds.
  • Failed pages: Check response["failed_results"] for pages that couldn’t be extracted. Adjust extract_depth or path filters accordingly.
  • Map first: Consider using Map to discover the site structure before crawling. This lets you identify the right select_paths patterns and set a realistic limit.
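The failed-page check from the notes above can be sketched as a small helper. summarize_failures is hypothetical, and sample_response only assumes that failed_results is a list of entries carrying a url; check the API reference for the exact shape:

```python
def summarize_failures(response: dict) -> list[str]:
    """Collect URLs of pages that could not be extracted."""
    return [entry.get("url", "") for entry in response.get("failed_results", [])]

# Illustrative response fragment, not real API output.
sample_response = {
    "results": [{"url": "https://docs.tavily.com/", "raw_content": "# Tavily docs"}],
    "failed_results": [{"url": "https://docs.tavily.com/some-js-heavy-page"}],
}
print(summarize_failures(sample_response))
# → ['https://docs.tavily.com/some-js-heavy-page']
```

Retrying the failed URLs with extract_depth="advanced", or excluding their paths, is the usual follow-up.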

Next Steps

Crawl API Reference

Full parameter list, response schema, and interactive playground.

Crawl Best Practices

Depth tuning, path filtering, domain controls, and common pitfalls.

Python SDK Reference

Python client methods, async support, and type details.

JavaScript SDK Reference

JavaScript/TypeScript client methods and usage.