> ## Documentation Index
> Fetch the complete documentation index at: https://docs.tavily.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Best Practices for Crawl

> Learn how to optimize crawl parameters, focus your crawls, and efficiently extract content from websites.

## Crawl vs Map

Understanding when to use each API:

| Feature                | Crawl                        | Map                      |
| ---------------------- | ---------------------------- | ------------------------ |
| **Content extraction** | Full content                 | URLs only                |
| **Use case**           | Deep content analysis        | Site structure discovery |
| **Speed**              | Slower (extracts content)    | Faster (URLs only)       |
| **Best for**           | RAG, analysis, documentation | Sitemap generation       |

### Use Crawl when you need:

* Full content extraction from pages
* Deep content analysis
* Processing of paginated or nested content
* Extraction of specific content patterns
* Integration with RAG systems

### Use Map when you need:

* Quick site structure discovery
* URL collection without content extraction
* Sitemap generation
* Path pattern matching
* Domain structure analysis

## Crawl Parameters

### Instructions

Guide the crawl with natural language to focus on relevant content:

```json theme={null}
{
  "url": "example.com",
  "max_depth": 2,
  "instructions": "Find all documentation pages about Python"
}
```

**When to use instructions:**

* To focus crawling on specific topics or content types
* When you need semantic filtering of pages
* For agentic use cases where relevance is critical

### Chunks per Source

Control the amount of content returned per page to prevent context window explosion:

```json theme={null}
{
  "url": "example.com",
  "instructions": "Find all documentation about authentication",
  "chunks_per_source": 3
}
```

**Key benefits:**

* Returns only relevant content snippets (max 500 characters each) instead of full page content
* Prevents context window from exploding in agentic use cases
* Chunks appear in `raw_content` as: `<chunk 1> [...] <chunk 2> [...] <chunk 3>`

> `chunks_per_source` is only available when instructions are provided.

### Depth and breadth

| Parameter     | Description                                     | Impact                     |
| ------------- | ----------------------------------------------- | -------------------------- |
| `max_depth`   | How many levels deep to crawl from starting URL | Exponential latency growth |
| `max_breadth` | Maximum links to follow per page                | Horizontal spread          |
| `limit`       | Total maximum pages to crawl                    | Hard cap on pages          |

**Performance tip:** Each level of depth increases crawl time exponentially. Start with `max_depth=1` and increase as needed.

```json theme={null}
// Conservative crawl
{
  "url": "example.com",
  "max_depth": 1,
  "max_breadth": 20,
  "limit": 20
}

// Comprehensive crawl
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 100,
  "limit": 500
}
```

## Filtering and Focusing

### Path patterns

Use regex patterns to include or exclude specific paths:

```json theme={null}
// Target specific sections
{
  "url": "example.com",
  "select_paths": ["/blog/.*", "/docs/.*", "/guides/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*", "/test/.*"]
}

// Paginated content
{
  "url": "example.com/blog",
  "max_depth": 2,
  "select_paths": ["/blog/.*", "/blog/page/.*"],
  "exclude_paths": ["/blog/tag/.*"]
}
```

### Domain filtering

Control which domains to crawl:

```json theme={null}
// Stay within subdomain
{
  "url": "docs.example.com",
  "select_domains": ["^docs.example.com$"],
  "max_depth": 2
}

// Exclude specific domains
{
  "url": "example.com",
  "exclude_domains": ["^ads.example.com$", "^tracking.example.com$"],
  "max_depth": 2
}
```

### Extract depth

Controls extraction quality vs. speed.

| Depth             | When to use                            |
| ----------------- | -------------------------------------- |
| `basic` (default) | Simple content, faster processing      |
| `advanced`        | Complex pages, tables, structured data |

```json theme={null}
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "select_paths": ["/docs/.*"]
}
```

## Use Cases

### 1. Deep or Unlinked Content

Many sites have content that's difficult to access through standard means:

* Deeply nested pages not in main navigation
* Paginated archives (old blog posts, changelogs)
* Internal search-only content

**Best Practice:**

```json theme={null}
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 50,
  "limit": 200,
  "select_paths": ["/blog/.*", "/changelog/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*"]
}
```

### 2. Structured but Nonstandard Layouts

For content that's structured but not marked up in schema.org:

* Documentation
* Changelogs
* FAQs

**Best Practice:**

```json theme={null}
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "select_paths": ["/docs/.*"]
}
```

### 3. Multi-modal Information Needs

When you need to combine information from multiple sections:

* Cross-referencing content
* Finding related information
* Building comprehensive knowledge bases

**Best Practice:**

```json theme={null}
{
  "url": "example.com",
  "max_depth": 2,
  "instructions": "Find all documentation pages that link to API reference docs",
  "extract_depth": "advanced"
}
```

### 4. Rapidly Changing Content

For content that updates frequently:

* API documentation
* Product announcements
* News sections

**Best Practice:**

```json theme={null}
{
  "url": "api.example.com",
  "max_depth": 1,
  "max_breadth": 100
}
```

### 5. Behind Auth / Paywalls

For content requiring authentication:

* Internal knowledge bases
* Customer help centers
* Gated documentation

**Best Practice:**

```json theme={null}
{
  "url": "help.example.com",
  "max_depth": 2,
  "select_domains": ["^help.example.com$"],
  "exclude_domains": ["^public.example.com$"]
}
```

### 6. Complete Coverage / Auditing

For comprehensive content analysis:

* Legal compliance checks
* Security audits
* Policy verification

**Best Practice:**

```json theme={null}
{
  "url": "example.com",
  "max_depth": 3,
  "max_breadth": 100,
  "limit": 1000,
  "extract_depth": "advanced",
  "instructions": "Find all mentions of GDPR and data protection policies"
}
```

### 7. Semantic Search or RAG Integration

For feeding content into LLMs or search systems:

* RAG systems
* Enterprise search
* Knowledge bases

**Best Practice:**

```json theme={null}
{
  "url": "docs.example.com",
  "max_depth": 2,
  "extract_depth": "advanced",
  "include_images": true
}
```

### 8. Known URL Patterns

When you have specific paths to crawl:

* Sitemap-based crawling
* Section-specific extraction
* Pattern-based content collection

**Best Practice:**

```json theme={null}
{
  "url": "example.com",
  "max_depth": 1,
  "select_paths": ["/docs/.*", "/api/.*", "/guides/.*"],
  "exclude_paths": ["/private/.*", "/admin/.*"]
}
```

## Performance Optimization

### Depth vs. Performance

* Each level of depth increases crawl time exponentially
* Start with max\_depth: 1 and increase as needed
* Use max\_breadth to control horizontal expansion
* Set appropriate limit to prevent excessive crawling

### Rate Limiting

* Respect site's robots.txt
* Implement appropriate delays between requests
* Monitor API usage and limits
* Use appropriate error handling for rate limits

## Integration with Map

Consider using Map before Crawl to:

1. Discover site structure
2. Identify relevant paths
3. Plan crawl strategy
4. Validate URL patterns

**Example workflow:**

1. Use Map to get site structure
2. Analyze paths and patterns
3. Configure Crawl with discovered paths
4. Execute focused crawl

**Benefits:**

* Discover site structure before crawling
* Identify relevant path patterns
* Avoid unnecessary crawling
* Validate URL patterns work correctly

## Common Pitfalls

### Excessive depth

* **Problem:** Setting `max_depth=4` or higher
* **Impact:** Exponential crawl time, unnecessary pages
* **Solution:** Start with 1-2 levels, increase only if needed

### Unfocused crawling

* **Problem:** No `instructions` provided, crawling entire site
* **Impact:** Wasted resources, irrelevant content, context explosion
* **Solution:** Use instructions to focus the crawl semantically

### Missing limits

* **Problem:** No `limit` parameter set
* **Impact:** Runaway crawls, unexpected costs
* **Solution:** Always set a reasonable `limit` value

### Ignoring failed results

* **Problem:** Not checking which pages failed extraction
* **Impact:** Incomplete data, missed content
* **Solution:** Monitor failed results and adjust parameters

## Summary

* Use instructions and chunks\_per\_source for focused, relevant results in agentic use cases
* Start with conservative parameters (`max_depth=1, max_breadth=20`)
* Use path patterns to focus crawling on relevant content
* Choose appropriate extract\_depth based on content complexity
* Set reasonable limits to prevent excessive crawling
* Monitor failed results and adjust patterns accordingly
* Use Map first to understand site structure
* Implement error handling for rate limits and failures
* Respect robots.txt and site policies
* Optimize for your use case (speed vs. completeness)
* Process results incrementally rather than waiting for full crawl

> Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.
