What You’ll Learn

  • Extracting clean content from one or many URLs
  • Basic vs advanced extraction depth
  • Query-focused extraction for targeted content retrieval
  • Batch extraction (up to 20 URLs in a single call)

How Does It Work?

Tavily Extract takes a URL (or list of URLs) and returns the page content as clean markdown or plain text. It handles JavaScript-rendered pages, removes boilerplate (ads, navigation, footers), and returns structured content ready for LLM consumption. Two extraction depths are available:
| Depth | Speed | Success Rate | Content | Cost |
| --- | --- | --- | --- | --- |
| basic | Fast | Good | Standard page content | 1 credit per 5 URLs |
| advanced | Slower | Higher | Tables, embedded content, JS-rendered pages | 2 credits per 5 URLs |
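Since pricing is quoted per block of 5 URLs, you can estimate credit cost up front. A minimal sketch, assuming partial blocks are billed as a full block (the rounding behavior is an assumption, not something this page confirms):

```python
import math

def estimated_credits(n_urls: int, depth: str = "basic") -> int:
    """Estimate Extract credit cost from the pricing table above.

    Assumption: partial blocks of 5 URLs are billed as a full block.
    """
    per_block = {"basic": 1, "advanced": 2}[depth]
    return math.ceil(n_urls / 5) * per_block

print(estimated_credits(3))                # 1 (one partial block at basic depth)
print(estimated_credits(20, "advanced"))   # 8 (four blocks at 2 credits each)
```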

Getting Started

Get your Tavily API key

1. Install the Tavily Python SDK

uv venv
uv pip install tavily-python
2. Extract content from a URL

import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls="https://en.wikipedia.org/wiki/Artificial_intelligence",
    extract_depth="advanced",
)

result = response["results"][0]
print(f"URL: {result['url']}")
print(f"Content length: {len(result['raw_content'])} chars")
print(result["raw_content"][:500])
3. Output

URL: https://en.wikipedia.org/wiki/Artificial_intelligence
Content length: 48231 chars
# Artificial intelligence

**Artificial intelligence (AI)**, in its broadest sense,
is intelligence exhibited by machines, particularly
computer systems. It is a field of research in computer
science that develops and studies methods and software
that enable machines to perceive their environment and
use learning and intelligence to take actions...

Batch Extraction

Extract content from up to 20 URLs in a single call. Failed URLs are reported separately without blocking successful ones.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

urls = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Data_science",
]

response = client.extract(urls=urls, include_images=True)

for result in response["results"]:
    print(f"{result['url']}: {len(result['raw_content'])} chars")

if response["failed_results"]:
    for fail in response["failed_results"]:
        print(f"Failed: {fail['url']} - {fail['error']}")

Query-Focused Extraction

When you pass a query parameter, Extract reranks the content chunks by relevance to your question. Combined with chunks_per_source, this returns only the most relevant portions of each page.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls="https://en.wikipedia.org/wiki/Artificial_intelligence",
    query="What are the main ethical concerns with AI?",
    chunks_per_source=3,
)

print(response["results"][0]["raw_content"])
The raw_content field will contain the top 3 most relevant chunks separated by [...], rather than the full page content. This is useful for keeping LLM context windows small while maintaining relevance.
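Because the chunks come back joined by the [...] separator, they are easy to split apart client-side. A minimal sketch using a hand-written sample string in place of a live response (the splitting assumes the literal [...] delimiter described above):

```python
# Sample raw_content in the shape of a query-focused response:
# top chunks joined by the "[...]" separator.
raw_content = (
    "Ethical concerns include algorithmic bias learned from training data."
    "[...]"
    "Accountability for automated decisions remains an open question."
    "[...]"
    "Large-scale data collection raises privacy concerns."
)

chunks = [chunk.strip() for chunk in raw_content.split("[...]")]
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

Feeding individual chunks (rather than the joined string) to an LLM lets you cite or filter each snippet independently.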

Choosing the Right Extraction Depth

Use basic for:

  • Static HTML pages (blogs, articles, documentation)
  • When speed matters more than completeness
  • High-volume batch jobs where cost is a concern
  • Pages with straightforward content structure

Use advanced for:

  • JavaScript-rendered single-page applications
  • Pages with tables, charts, or embedded content
  • When you need the highest success rate
  • Complex pages where basic extraction misses content
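These guidelines can be folded into a small per-job helper. The function below is illustrative only — the SPA-detection hints are assumptions, and the SDK provides no such helper:

```python
def choose_depth(url: str, has_tables: bool = False) -> str:
    """Pick an extract_depth per the guidance above (illustrative heuristic)."""
    # Assumption: these substrings are rough signals of a JS-rendered SPA.
    spa_hints = ("app.", "/#/", "dashboard")
    if has_tables or any(hint in url for hint in spa_hints):
        return "advanced"
    # Static pages and cost-sensitive batch jobs default to the cheaper depth.
    return "basic"

print(choose_depth("https://blog.example.com/post"))           # basic
print(choose_depth("https://app.example.com/#/report", True))  # advanced
```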

Critical Knobs

extract_depth

  • "basic" (default) — standard HTML pages, 1 credit per 5 URLs
  • "advanced" — JS-rendered pages, tables, embedded content, 2 credits per 5 URLs

query

  • Pass a query to rerank content by relevance to your question
  • Pair with chunks_per_source (1–5) to return only the top snippets
  • Without a query, full page content is returned

format

  • "markdown" (default) — preserves headings, links, and structure
  • "text" — plain text, lighter for simple pipelines
For the complete parameter list, see the Extract API reference.
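These knobs combine freely in a single extract call. The helper below just assembles the keyword arguments — it is an illustration, not part of the tavily-python SDK, and the format parameter name is taken from the knob list above:

```python
def build_extract_kwargs(urls, query=None, js_heavy=False, plain_text=False):
    """Assemble keyword arguments for client.extract() from the knobs above.

    Illustrative helper, not part of the tavily-python SDK.
    """
    kwargs = {"urls": urls, "extract_depth": "advanced" if js_heavy else "basic"}
    if query:
        kwargs["query"] = query
        kwargs["chunks_per_source"] = 3  # return only the top 3 snippets
    if plain_text:
        kwargs["format"] = "text"
    return kwargs

kwargs = build_extract_kwargs(
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    query="What are the main ethical concerns with AI?",
    js_heavy=True,
)
print(kwargs)
# Then: response = client.extract(**kwargs)
```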

Next Steps

Extract API Reference

Full parameter list, response schema, and interactive playground.

Extract Best Practices

Depth selection, two-step search-then-extract, and optimization tips.

Python SDK Reference

Python client methods, async support, and type details.

JavaScript SDK Reference

JavaScript/TypeScript client methods and usage.