Introduction

Every data science enthusiast knows that a vital first step in building a successful model or algorithm is having a reliable evaluation set to measure against. In the rapidly evolving landscape of Retrieval-Augmented Generation (RAG) and AI-driven search systems, high-quality evaluation datasets are more important than ever.

In this article, we introduce an agentic workflow designed to generate subject-specific, dynamic evaluation datasets, enabling precise validation of web-search-augmented agents' performance.

Well-known RAG evaluation datasets, such as HotPotQA, CRAG, and MultiHop-RAG, have been pivotal in benchmarking and fine-tuning models. However, these datasets primarily focus on evaluating performance against static, pre-defined document sets. As a result, they fall short when it comes to evaluating web-based RAG systems, where data is dynamic, contextual, and ever-changing.

This gap presents a significant challenge: how do we effectively test and refine RAG systems designed for real-world web search scenarios? Enter the Real-Time Dataset Generator for RAG Evals, an agentic tool leveraging Tavily's Search Layer and the LangGraph framework to create diverse, relevant, and dynamic datasets tailored specifically for web-based RAG agents.

How does it work?

The Real-Time Dataset Generator follows a systematic workflow to create high-quality evaluation datasets:

1. Input

The workflow begins with user-provided inputs, such as the subject of interest and the configuration for where the finished dataset should be saved.
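Concretely, the inputs might look something like the following sketch; the field names (subject, num_queries, save_to_langsmith) are assumptions for illustration, not the tool's actual schema.

```python
# A minimal sketch of the workflow's inputs; the field names here are
# illustrative assumptions, not the tool's actual schema.
from typing import TypedDict

class GeneratorInput(TypedDict):
    subject: str             # e.g., "NBA Basketball"
    num_queries: int         # how many search queries to generate
    save_to_langsmith: bool  # False saves the dataset locally instead

config: GeneratorInput = {
    "subject": "NBA Basketball",
    "num_queries": 5,
    "save_to_langsmith": True,
}
```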

2. Domain-Specific Search Query Generation

If a subject is provided (e.g., "NBA Basketball"), the system generates a set of search queries tailored to gather high-quality, recent, and subject-specific information.
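One common way to implement this step is to ask an LLM for structured output. Here is a minimal sketch, assuming a LangChain OpenAI model; the prompt and the SearchQueries schema are illustrative, not the tool's actual code.

```python
# A sketch of domain-specific query generation via structured LLM output.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class SearchQueries(BaseModel):
    queries: list[str] = Field(description="Web search queries for the subject")

llm = ChatOpenAI(model="gpt-4o").with_structured_output(SearchQueries)

def generate_queries(subject: str, n: int) -> list[str]:
    # Ask the model for n recent, subject-specific search queries.
    result = llm.invoke(
        f"Generate {n} diverse web search queries to gather recent, "
        f"high-quality information about: {subject}"
    )
    return result.queries
```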

3. Web Search with Tavily

The generated queries are then executed against the live web using Tavily's Search API. This step guarantees that the dataset reflects current and relevant information, which is particularly important for web search RAG evaluation, where up-to-date data is crucial. It is the heart of the RAG Dataset Generator, transforming queries into actionable, high-quality data that forms the foundation of the evaluation set.
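A minimal sketch of this step, assuming the tavily-python client and a TAVILY_API_KEY environment variable:

```python
# Run every generated query through Tavily and pool the results.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def search_all(queries: list[str]) -> list[dict]:
    results = []
    for query in queries:
        response = client.search(query)  # returns a dict with a "results" list
        results.extend(response["results"])  # each result has "url" and "content"
    return results
```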

4. Q&A Pair Generation

For each website returned by Tavily, the system generates a question-answer pair, using a map-reduce paradigm to process the multiple sources efficiently in parallel. This step is implemented using LangGraph's Send API.
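The Send API lets a graph fan out work dynamically: one "map" task per document, with results concatenated back into shared state as the "reduce" step. Here is a minimal sketch of that pattern, assuming a simplified state and a placeholder generate_qa node (the real implementation would call an LLM):

```python
# A sketch of map-reduce fan-out with LangGraph's Send API.
import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

class OverallState(TypedDict):
    documents: list[dict]                          # Tavily search results
    qa_pairs: Annotated[list[dict], operator.add]  # reducer: concatenate lists

class DocState(TypedDict):
    document: dict

def fan_out(state: OverallState):
    # "Map": dispatch one generate_qa task per retrieved document.
    return [Send("generate_qa", {"document": d}) for d in state["documents"]]

def generate_qa(state: DocState) -> dict:
    # Placeholder: an LLM call would turn the page content into a Q&A pair.
    doc = state["document"]
    qa = {"question": f"What does {doc['url']} report?",
          "answer": doc["content"][:200]}
    return {"qa_pairs": [qa]}  # "Reduce": merged via the operator.add reducer

builder = StateGraph(OverallState)
builder.add_node("generate_qa", generate_qa)
builder.add_conditional_edges(START, fan_out, ["generate_qa"])
builder.add_edge("generate_qa", END)
graph = builder.compile()
```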

5. Saving the Evaluation Set

Finally, the generated dataset is saved either locally or to LangSmith, based on the input configuration.
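When the LangSmith option is selected, saving could look like the sketch below, assuming the langsmith SDK with an API key in the environment; the dataset name and helper are illustrative.

```python
# Persist the Q&A pairs as a LangSmith dataset of input/output examples.
from langsmith import Client

client = Client()

def save_to_langsmith(qa_pairs: list[dict], name: str = "web-rag-eval-set"):
    dataset = client.create_dataset(dataset_name=name)
    client.create_examples(
        inputs=[{"question": qa["question"]} for qa in qa_pairs],
        outputs=[{"answer": qa["answer"]} for qa in qa_pairs],
        dataset_id=dataset.id,
    )
```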

6. Output

The result is a well-structured, subject-specific evaluation dataset, ready for use in advanced evaluation methods like LLM-as-a-Judge.
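For instance, here is a sketch of plugging the saved dataset into LangSmith's evaluate() helper with a simple LLM-as-a-Judge evaluator; my_rag_agent and correctness_judge are hypothetical stand-ins, not part of the tool.

```python
# A sketch of LLM-as-a-Judge evaluation over the saved dataset.
from langsmith import evaluate
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o")

def my_rag_agent(inputs: dict) -> dict:
    # Hypothetical stand-in for the web-RAG agent under evaluation.
    return {"answer": "..."}

def correctness_judge(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Ask a judge model whether the agent's answer matches the reference.
    verdict = judge.invoke(
        f"Question: {inputs['question']}\n"
        f"Reference answer: {reference_outputs['answer']}\n"
        f"Agent answer: {outputs['answer']}\n"
        "Reply YES if the agent answer is correct, otherwise NO."
    )
    return "YES" in verdict.content.upper()

results = evaluate(
    my_rag_agent,
    data="web-rag-eval-set",  # the dataset saved in step 5
    evaluators=[correctness_judge],
)
```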

Learn More

Want to dive deeper into web-based RAG evaluation? Check out these resources: