What You’ll Learn

  • Extracting clean content from one or many URLs
  • Basic vs advanced extraction depth
  • Query-focused extraction for targeted content retrieval
  • Batch extraction (up to 20 URLs in a single call)

How Does It Work?

Tavily Extract takes a URL (or list of URLs) and returns the page content as clean markdown or plain text. It handles JavaScript-rendered pages, removes boilerplate (ads, navigation, footers), and returns structured content ready for LLM consumption. Two extraction depths are available:
| Depth | Speed | Success Rate | Content | Cost |
| --- | --- | --- | --- | --- |
| basic | Fast | Good | Standard page content | 1 credit per 5 URLs |
| advanced | Slower | Higher | Tables, embedded content, JS-rendered pages | 2 credits per 5 URLs |
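Since pricing is quoted per block of 5 URLs, you can estimate credit cost up front. A minimal sketch, assuming partial blocks are billed as a full block (the rounding behavior is an assumption, not something this page confirms):

```python
import math

def estimated_credits(n_urls: int, depth: str = "basic") -> int:
    """Estimate Extract credit cost from the pricing table above.

    Assumption: partial blocks of 5 URLs are billed as a full block.
    """
    per_block = {"basic": 1, "advanced": 2}[depth]
    return math.ceil(n_urls / 5) * per_block

print(estimated_credits(3))                # 1 (one partial block at basic depth)
print(estimated_credits(20, "advanced"))   # 8 (four blocks at 2 credits each)
```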

Getting Started

Get your Tavily API key

1. Install the Tavily Python SDK

uv venv
uv pip install tavily-python
2. Extract content from a URL

import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls="https://en.wikipedia.org/wiki/Artificial_intelligence",
    extract_depth="advanced",
)

result = response["results"][0]
print(f"URL: {result['url']}")
print(f"Content length: {len(result['raw_content'])} chars")
print(result["raw_content"][:500])
3. Output

URL: https://en.wikipedia.org/wiki/Artificial_intelligence
Content length: 48231 chars
# Artificial intelligence

**Artificial intelligence (AI)**, in its broadest sense,
is intelligence exhibited by machines, particularly
computer systems. It is a field of research in computer
science that develops and studies methods and software
that enable machines to perceive their environment and
use learning and intelligence to take actions...

Batch Extraction

Extract content from up to 20 URLs in a single call. Failed URLs are reported separately without blocking successful ones.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

urls = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Data_science",
]

response = client.extract(urls=urls, include_images=True)

for result in response["results"]:
    print(f"{result['url']}: {len(result['raw_content'])} chars")

if response["failed_results"]:
    for fail in response["failed_results"]:
        print(f"Failed: {fail['url']} - {fail['error']}")

Query-Focused Extraction

When you pass a query parameter, Extract reranks the content chunks by relevance to your question. Combined with chunks_per_source, this returns only the most relevant portions of each page.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls="https://en.wikipedia.org/wiki/Artificial_intelligence",
    query="What are the main ethical concerns with AI?",
    chunks_per_source=3,
)

print(response["results"][0]["raw_content"])
The raw_content field will contain the top 3 most relevant chunks separated by [...], rather than the full page content. This is useful for keeping LLM context windows small while maintaining relevance.
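Because the chunks come back joined by the [...] separator, they are easy to split apart client-side. A minimal sketch using a hand-written sample string in place of a live response (the splitting assumes the literal [...] delimiter described above):

```python
# Sample raw_content in the shape of a query-focused response:
# top chunks joined by the "[...]" separator.
raw_content = (
    "Ethical concerns include algorithmic bias learned from training data."
    "[...]"
    "Accountability for automated decisions remains an open question."
    "[...]"
    "Large-scale data collection raises privacy concerns."
)

chunks = [chunk.strip() for chunk in raw_content.split("[...]")]
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

Feeding individual chunks (rather than the joined string) to an LLM lets you cite or filter each snippet independently.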

Choosing the Right Extraction Depth

Use basic for:

  • Static HTML pages (blogs, articles, documentation)
  • When speed matters more than completeness
  • High-volume batch jobs where cost is a concern
  • Pages with straightforward content structure

Use advanced for:

  • JavaScript-rendered single-page applications
  • Pages with tables, charts, or embedded content
  • When you need the highest success rate
  • Complex pages where basic extraction misses content
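These guidelines can be folded into a small per-job helper. The function below is illustrative only — the SPA-detection hints are assumptions, and the SDK provides no such helper:

```python
def choose_depth(url: str, has_tables: bool = False) -> str:
    """Pick an extract_depth per the guidance above (illustrative heuristic)."""
    # Assumption: these substrings are rough signals of a JS-rendered SPA.
    spa_hints = ("app.", "/#/", "dashboard")
    if has_tables or any(hint in url for hint in spa_hints):
        return "advanced"
    # Static pages and cost-sensitive batch jobs default to the cheaper depth.
    return "basic"

print(choose_depth("https://blog.example.com/post"))           # basic
print(choose_depth("https://app.example.com/#/report", True))  # advanced
```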

Critical Knobs

extract_depth

  • "basic" (default) — standard HTML pages, 1 credit per 5 URLs
  • "advanced" — JS-rendered pages, tables, embedded content, 2 credits per 5 URLs

query

  • Pass a query to rerank content by relevance to your question
  • Pair with chunks_per_source (1–5) to return only the top snippets
  • Without a query, full page content is returned

format

  • "markdown" (default) — preserves headings, links, and structure
  • "text" — plain text, lighter for simple pipelines
For the complete parameter list, see the Extract API reference.
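These knobs combine freely in a single extract call. The helper below just assembles the keyword arguments — it is an illustration, not part of the tavily-python SDK, and the format parameter name is taken from the knob list above:

```python
def build_extract_kwargs(urls, query=None, js_heavy=False, plain_text=False):
    """Assemble keyword arguments for client.extract() from the knobs above.

    Illustrative helper, not part of the tavily-python SDK.
    """
    kwargs = {"urls": urls, "extract_depth": "advanced" if js_heavy else "basic"}
    if query:
        kwargs["query"] = query
        kwargs["chunks_per_source"] = 3  # return only the top 3 snippets
    if plain_text:
        kwargs["format"] = "text"
    return kwargs

kwargs = build_extract_kwargs(
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    query="What are the main ethical concerns with AI?",
    js_heavy=True,
)
print(kwargs)
# Then: response = client.extract(**kwargs)
```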

Next Steps

Extract API Reference

Full parameter list, response schema, and interactive playground.

Extract Best Practices

Depth selection, two-step search-then-extract, and optimization tips.

Python SDK Reference

Python client methods, async support, and type details.

JavaScript SDK Reference

JavaScript/TypeScript client methods and usage.