Retrieval evaluation using LLM-as-a-judge via Pydantic AI
We’ll explore how to evaluate retrieval systems using Large Language Models (LLMs) as judges. Retrieval evaluation is a critical component in building high-quality information retrieval systems, and recent advancements in LLMs have made it possible to automate this evaluation process.
What we’ll cover
- How to query the Wikipedia API
- How to implement and compare two different retrieval approaches:
  - Using the original search rankings from the Wikipedia API
  - Using Cohere’s reranking model to rerank those results
- How to set up an LLM-as-a-judge evaluation framework using Pydantic AI
Tools we’ll use
- Cohere’s API: For reranking search results and providing evaluation models
- Wikipedia’s API: As our information source
- Pydantic AI: For creating evaluation agents
This tutorial demonstrates a methodology for comparing different retrieval systems objectively. By the end, you’ll have an example you can adapt to evaluate your own retrieval systems across different domains and use cases.
Setup
First, let’s import the necessary libraries.
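A minimal setup sketch is shown below; the exact packages, the `ClientV2` constructor, and the `COHERE_API_KEY` environment variable are assumptions about your environment rather than requirements from the original notebook.

```python
import os

import requests            # for calling the Wikipedia search API
import cohere              # Cohere SDK, used for reranking (and judge models)
from pydantic import BaseModel
from pydantic_ai import Agent

# Assumes the Cohere API key is available as an environment variable.
co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
```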
Perform Wikipedia search
Next, we implement a function to query Wikipedia for relevant information based on user input. The search_wikipedia() function allows us to retrieve a specified number of Wikipedia search results, extracting their titles, snippets, and page IDs.
This will provide us with the dataset for our retrieval evaluation experiment, where we’ll compare different approaches to finding and ranking relevant information.
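Here is one way such a function might look. The request parameters follow the public MediaWiki search API; the default limit and the returned field layout are illustrative choices.

```python
def search_wikipedia(query: str, limit: int = 10) -> list[dict]:
    """Return Wikipedia search results as dicts with title, snippet, and page ID."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": limit,
            "format": "json",
        },
    )
    response.raise_for_status()
    return [
        {"title": hit["title"], "snippet": hit["snippet"], "pageid": hit["pageid"]}
        for hit in response.json()["query"]["search"]
    ]
```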
We’ll use a small dataset of 10 questions about geography to test the Wikipedia search.
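The questions below are placeholders standing in for that geography dataset; swap in your own queries as needed.

```python
# Illustrative geography questions (the original dataset contains ten of these).
questions = [
    "What is the longest river in the world?",
    "Which country has the most time zones?",
    "What is the capital of Australia?",
    # ... and so on, up to ten questions
]

# Fetch the raw Wikipedia results once so both engines work from the same data.
search_results = {q: search_wikipedia(q, limit=10) for q in questions}
```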
Rerank the search results and filter the top_n results (“Engine A”)
In this section, we’ll implement our first retrieval approach using Cohere’s reranking model. Reranking is a technique that takes an initial set of search results and reorders them based on their relevance to the original query.
We’ll use Cohere’s rerank API to:
- Take the Wikipedia search results we obtained earlier
- Send them to Cohere’s reranking model along with the original query
- Filter to keep only the top-n most relevant results
This approach will be referred to as “Engine A” in our evaluation, and we’ll compare its performance against the original Wikipedia search rankings.
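A sketch of Engine A is shown below, reusing the `co` client from the setup step. The model name `rerank-v3.5` and the cutoff `top_n=3` are illustrative assumptions, not values taken from the original notebook.

```python
TOP_N = 3  # assumed cutoff, shared by both engines

def engine_a(query: str, results: list[dict], top_n: int = TOP_N) -> list[dict]:
    """Rerank Wikipedia results with Cohere and keep the top_n most relevant ones."""
    documents = [f"{r['title']}: {r['snippet']}" for r in results]
    reranked = co.rerank(
        model="rerank-v3.5",   # assumed reranking model name
        query=query,
        documents=documents,
        top_n=top_n,
    )
    # Each rerank result carries the index of the original document it refers to.
    return [results[item.index] for item in reranked.results]
```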
Take the original search results and filter the top_n results (“Engine B”)
In this section, we’ll implement our second retrieval approach as a baseline comparison. For “Engine B”, we’ll simply take the original Wikipedia search results without any reranking and select the top-n results.
This approach reflects how many traditional search engines work - returning results based on their original relevance score from the data source. By comparing this baseline against our reranked approach (Engine A), we can evaluate whether reranking provides meaningful improvements in result quality.
We’ll use the same number of results (top_n) as Engine A to ensure a fair comparison in our evaluation.
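Engine B is just a truncation of the original ordering; a minimal sketch:

```python
def engine_b(results: list[dict], top_n: int = TOP_N) -> list[dict]:
    """Keep Wikipedia's original ranking and simply truncate to the top_n results."""
    return results[:top_n]
```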
Run LLM-as-a-judge evaluation to compare the two engines
Now we’ll implement an evaluation framework using LLMs as judges to compare our two retrieval approaches:
- Engine A: Wikipedia results reranked by Cohere’s reranking model
- Engine B: Original Wikipedia search results
Using LLMs as evaluators allows us to programmatically assess the quality of search results without human annotation. The code below implements these steps:
- Setting up the evaluation protocol
- First, define a clear protocol for how the LLM judges will evaluate the search results. This includes creating a system prompt and a template for each evaluation.
- Using multiple models as independent judges
- To get more robust results, use multiple LLM models as independent judges. This reduces bias from any single model.
- Implementing a majority voting system
- Combine judgments from multiple models using a majority voting system to determine which engine performed better for each query (see the sketch after this list).
- Presenting the results
- After evaluating all queries, present the results to determine which retrieval approach performed better overall.
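The sketch below shows one way to wire these steps together with Pydantic AI. It assumes a recent Pydantic AI release (an `Agent` constructed with `output_type` and `system_prompt`, with `run_sync(...).output` returning the structured result); the judge model identifiers, the `Verdict` schema, and the prompt wording are all illustrative.

```python
from typing import Literal

from pydantic import BaseModel
from pydantic_ai import Agent

class Verdict(BaseModel):
    winner: Literal["A", "B"]
    reasoning: str

SYSTEM_PROMPT = (
    "You are an impartial judge of search result quality. Given a query and two "
    "sets of search results, decide which set better answers the query. "
    "Answer with the winning engine ('A' or 'B') and a brief justification."
)

# Assumed judge model identifiers; use whichever models you have access to.
JUDGE_MODELS = ["cohere:command-r-plus", "cohere:command-r"]

judges = [
    Agent(model, output_type=Verdict, system_prompt=SYSTEM_PROMPT)
    for model in JUDGE_MODELS
]

def format_results(results: list[dict]) -> str:
    return "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)

def evaluate_query(query: str, results_a: list[dict], results_b: list[dict]) -> str:
    """Ask every judge for a verdict and combine them with a majority vote."""
    prompt = (
        f"Query: {query}\n\n"
        f"Engine A results:\n{format_results(results_a)}\n\n"
        f"Engine B results:\n{format_results(results_b)}"
    )
    votes = [judge.run_sync(prompt).output.winner for judge in judges]
    a_votes, b_votes = votes.count("A"), votes.count("B")
    if a_votes == b_votes:
        return "tie"
    return "A" if a_votes > b_votes else "B"

# Tally wins across the whole question set and present the results.
tally = {"A": 0, "B": 0, "tie": 0}
for q in questions:
    raw = search_results[q]
    tally[evaluate_query(q, engine_a(q, raw), engine_b(raw))] += 1

print(f"Engine A (reranked) wins: {tally['A']}")
print(f"Engine B (original) wins: {tally['B']}")
print(f"Ties: {tally['tie']}")
```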
This approach provides a scalable, reproducible method to evaluate and compare retrieval systems quantitatively.
Conclusion
This tutorial demonstrates how to evaluate retrieval systems using LLMs as judges through Pydantic AI, comparing original Wikipedia search results against those reranked by Cohere’s reranking model.
The evaluation framework uses multiple Cohere models as independent judges with majority voting to determine which system provides more relevant results.
Results showed the reranked approach (Engine A) outperformed the original search rankings (Engine B) by winning 80% of queries, demonstrating the effectiveness of neural reranking in improving search relevance.