Retrieval evaluation using LLM-as-a-judge via Pydantic AI

We’ll explore how to evaluate retrieval systems using Large Language Models (LLMs) as judges. Retrieval evaluation is a critical component in building high-quality information retrieval systems, and recent advancements in LLMs have made it possible to automate this evaluation process.

What we’ll cover

How to query the Wikipedia API
How to implement and compare two different retrieval approaches:
- The original search results from the Wikipedia API
- Using Cohere’s reranking model to rerank the search results
How to set up an LLM-as-a-judge evaluation framework using Pydantic AI

Tools We’ll Use

Cohere’s API: For reranking search results and providing evaluation models
Wikipedia’s API: As our information source
Pydantic AI: For creating evaluation agents

This tutorial demonstrates a methodology for comparing different retrieval systems objectively. By the end, you’ll have an example you can adapt to evaluate your own retrieval systems across different domains and use cases.

Setup

First, let’s install and import the necessary libraries.

PYTHON

%pip install -U cohere pydantic-ai requests pandas nest_asyncio

PYTHON

import os
from collections import Counter

import cohere
import pandas as pd
import requests
from pydantic_ai import Agent

# Read the API key from the environment rather than hard-coding it
co = cohere.ClientV2(api_key=os.getenv("COHERE_API_KEY"))

PYTHON

import nest_asyncio

# Allow nested event loops so that agent.run_sync() works inside a notebook
nest_asyncio.apply()

Perform Wikipedia search

Next, we implement a function to query Wikipedia for relevant information based on user input. The search_wikipedia() function retrieves a specified number of Wikipedia search results (10 by default) and extracts the title and snippet of each one.

This will provide us with the dataset for our retrieval evaluation experiment, where we’ll compare different approaches to finding and ranking relevant information.

We’ll use a small dataset of 10 questions about geography to test the Wikipedia search.

PYTHON

def search_wikipedia(query, limit=10):
    """Return up to `limit` Wikipedia search results as title/snippet dicts."""
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
        "srlimit": limit,
    }

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    data = response.json()

    # Keep the title and snippet, stripping the search-match highlight tags
    results = []
    for item in data["query"]["search"]:
        snippet = item["snippet"].replace('<span class="searchmatch">', "").replace("</span>", "")
        results.append({"title": item["title"], "snippet": snippet})

    return results
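The snippets returned by the search API can still contain HTML entities (the sample outputs later in this tutorial show `&quot;` and `&#039;`). As a minimal sketch, assuming you want plain text before reranking or judging, a hypothetical helper `clean_snippet` (not part of the original notebook) could strip any remaining tags and decode entities using only the standard library:

```python
import html
import re

# Hypothetical helper, not part of the original notebook
_TAG_RE = re.compile(r"<[^>]+>")

def clean_snippet(snippet: str) -> str:
    """Remove HTML tags first, then decode entities such as &quot; and &#039;."""
    without_tags = _TAG_RE.sub("", snippet)
    return html.unescape(without_tags)

raw = 'What is the <span class="searchmatch">capital</span> of France? (&quot;Paris&quot;)'
print(clean_snippet(raw))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in the text is not mistaken for markup. Note that the printed results shown in this tutorial were produced without this step, so applying it would change how the snippets look.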

PYTHON

# Ten geography questions used to test the Wikipedia search
geography_questions = [
    "What is the capital of France?",
    "What is the longest river in the world?",
    "What is the largest desert in the world?",
    "What is the highest mountain peak on Earth?",
    "What are the major tectonic plates?",
    "What is the Ring of Fire?",
    "What is the largest ocean on Earth?",
    "What are the Seven Wonders of the Natural World?",
    "What causes the Northern Lights?",
    "What is the Great Barrier Reef?",
]

PYTHON

# Run search_wikipedia for each question
results = []

for question in geography_questions:
    question_results = search_wikipedia(question, limit=10)

    # Represent each result as a single "title\nsnippet" string
    formatted_results = [
        f"{item['title']}\n{item['snippet']}" for item in question_results
    ]

    # Add to the results list
    results.append({
        "question": question,
        "search_results": formatted_results,
    })

Rerank the search results and filter the top_n results (“Engine A”)

In this section, we’ll implement our first retrieval approach using Cohere’s reranking model. Reranking is a technique that takes an initial set of search results and reorders them based on their relevance to the original query.

We’ll use Cohere’s rerank API to:

Take the Wikipedia search results we obtained earlier
Send them to Cohere’s reranking model along with the original query
Filter to keep only the top-n most relevant results

This approach will be referred to as “Engine A” in our evaluation, and we’ll compare its performance against the original Wikipedia search rankings.

PYTHON

# Rerank the search results for each question
top_n = 3
results_reranked_top_n = []

for item in results:
    question = item["question"]
    documents = item["search_results"]

    # Rerank the documents using Cohere, keeping only the top_n results
    reranked = co.rerank(
        model="rerank-v3.5",
        query=question,
        documents=documents,
        top_n=top_n,
    )

    # Each result carries the index of the document in the original list
    top_results = [documents[result.index] for result in reranked.results]

    # Add to the reranked results list
    results_reranked_top_n.append({
        "question": question,
        "search_results": top_results,
    })

# Print a sample of the reranked results
print(f"Original question: {results_reranked_top_n[0]['question']}")
print(f"Top {top_n} reranked results:")
for i, result in enumerate(results_reranked_top_n[0]['search_results']):
    print(f"\n{i+1}. {result}")

Original question: What is the capital of France?
Top 3 reranked results:
1. France
semi-presidential republic and its capital, largest city and main cultural and economic centre is Paris. Metropolitan France was settled during the Iron Age by Celtic
2. Closed-ended question
variants of the above closed-ended questions that possess specific responses are: On what day were you born? (&quot;Saturday.&quot;) What is the capital of France? (&quot;Paris
3. Capital city
seat of the government. A capital is typically a city that physically encompasses the government&#039;s offices and meeting places; the status as capital is often

Take the original search results and filter the top_n results (“Engine B”)

In this section, we’ll implement our second retrieval approach as a baseline comparison. For “Engine B”, we’ll simply take the original Wikipedia search results without any reranking and select the top-n results.

This approach reflects how many traditional search engines work: they return results in the order given by the data source’s own relevance score. By comparing this baseline against our reranked approach (Engine A), we can evaluate whether reranking provides meaningful improvements in result quality.

We’ll use the same number of results (top_n) as Engine A to ensure a fair comparison in our evaluation.

PYTHON

results_top_n = []

for item in results:
    results_top_n.append({
        "question": item["question"],
        "search_results": item["search_results"][:top_n]
    })

# Print a sample of the top_n results (without reranking)
print(f"Original question: {results_top_n[0]['question']}")
print(f"Top {top_n} results (without reranking):")
for i, result in enumerate(results_top_n[0]['search_results']):
    print(f"\n{i+1}. {result}")

Original question: What is the capital of France?
Top 3 results (without reranking):
1. Closed-ended question
variants of the above closed-ended questions that possess specific responses are: On what day were you born? (&quot;Saturday.&quot;) What is the capital of France? (&quot;Paris
2. France
semi-presidential republic and its capital, largest city and main cultural and economic centre is Paris. Metropolitan France was settled during the Iron Age by Celtic
3. What Is a Nation?
&quot;What Is a Nation?&quot; (French: Qu&#039;est-ce qu&#039;une nation ?) is an 1882 lecture by French historian Ernest Renan (1823–1892) at the Sorbonne, known for the
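
Before involving any judge, a quick sanity check is to see how much the two engines’ top-n lists overlap. The sketch below is only an illustration: `top_n_overlap` is a hypothetical helper that is not part of the original notebook, and the titles are copied from the sample outputs shown for the first question.

```python
def top_n_overlap(engine_a_docs, engine_b_docs):
    """Return the documents that appear in both lists, in Engine A's order."""
    engine_b_set = set(engine_b_docs)
    return [doc for doc in engine_a_docs if doc in engine_b_set]

# Titles taken from the sample outputs for "What is the capital of France?"
engine_a_titles = ["France", "Closed-ended question", "Capital city"]
engine_b_titles = ["Closed-ended question", "France", "What Is a Nation?"]

shared = top_n_overlap(engine_a_titles, engine_b_titles)
print(f"{len(shared)} of {len(engine_a_titles)} results are shared: {shared}")
```

With the notebook’s data, the same function could be called on `results_reranked_top_n[i]['search_results']` and `results_top_n[i]['search_results']`, since both lists hold the same document strings. A high overlap means the judges are mostly comparing ordering rather than content.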

Run LLM-as-a-judge evaluation to compare the two engines

Now we’ll implement an evaluation framework using LLMs as judges to compare our two retrieval approaches:

Engine A: Wikipedia results reranked by Cohere’s reranking model
Engine B: Original Wikipedia search results

Using LLMs as evaluators allows us to programmatically assess the quality of search results without human annotation. The code below implements the following steps:

Setting up the evaluation protocol
- First, define a clear protocol for how the LLM judges will evaluate the search results. This includes creating a system prompt and a template for each evaluation.
Using multiple models as independent judges
- To get more robust results, use multiple LLMs as independent judges. This reduces bias from any single model.
Implementing a majority voting system
- Combine the judgments from the models by majority vote to determine which engine performed better for each query. An even split between the judges is recorded as a tie.
Presenting the results
- After evaluating all queries, present the results to determine which retrieval approach performed better overall.

This approach provides a scalable, reproducible method to evaluate and compare retrieval systems quantitatively.

PYTHON

# System prompt for the AI evaluator
SYSTEM_PROMPT = """
You are an AI search evaluator. You will compare search results from two engines and
determine which set provides more relevant and diverse information. You will only
answer with the verdict rather than explaining your reasoning; simply say "Engine A" or
"Engine B".
"""

# Prompt template for each evaluation
PROMPT_TEMPLATE = """
For the following question, which search engine provides more relevant results?

## Question:
{query}

## Engine A:
{engine_a_results}

## Engine B:
{engine_b_results}
"""

def format_results(results):
    """Format search results as a numbered list, truncating each to 200 characters."""
    formatted = []
    for i, result in enumerate(results):
        formatted.append(f"Result {i+1}: {result[:200]}...")
    return "\n\n".join(formatted)

def judge_query(query, engine_a_results, engine_b_results, model_name):
    """Use a single model to judge which engine has better results."""
    agent = Agent(model_name, system_prompt=SYSTEM_PROMPT)

    # Format the results
    engine_a_formatted = format_results(engine_a_results)
    engine_b_formatted = format_results(engine_b_results)

    # Create the prompt
    prompt = PROMPT_TEMPLATE.format(
        query=query,
        engine_a_results=engine_a_formatted,
        engine_b_results=engine_b_formatted
    )

    # Get the model's judgment
    response = agent.run_sync(prompt)
    # Recent Pydantic AI releases expose the reply as `.output` (older ones used `.data`)
    return response.output.strip()

def evaluate_search_results(reranked_results, regular_results, models):
    """
    Evaluate both sets of search results using multiple models.

    Args:
        reranked_results: List of dictionaries with 'question' and 'search_results'
        regular_results: List of dictionaries with 'question' and 'search_results'
        models: List of model names to use as judges

    Returns:
        DataFrame with one row per question: each judge's verdict and the overall winner
    """
    # Prepare data structure for results
    evaluation_results = []

    # Evaluate each query (both lists are in the same question order)
    for reranked, regular in zip(reranked_results, regular_results):
        query = reranked['question']
        engine_a_results = reranked['search_results']  # Reranked results
        engine_b_results = regular['search_results']   # Regular results

        # Get judgments from each model
        judgments = []
        for model in models:
            judgment = judge_query(query, engine_a_results, engine_b_results, model)
            judgments.append(judgment)

        # Determine winner by majority vote; an even split is a tie
        votes = Counter(judgments)
        if votes["Engine A"] > votes["Engine B"]:
            winner = "Engine A"
        elif votes["Engine B"] > votes["Engine A"]:
            winner = "Engine B"
        else:
            winner = "Tie"

        # Add results for this query
        row = [query] + judgments + [winner]
        evaluation_results.append(row)

    # Create DataFrame
    column_names = ["question"] + [f"judge_{i+1} ({model})" for i, model in enumerate(models)] + ["winner"]
    df = pd.DataFrame(evaluation_results, columns=column_names)

    return df
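
The majority vote above only counts replies that are exactly "Engine A" or "Engine B". LLM judges sometimes add punctuation or vary the casing, so a defensive normalization step can help. The following is a minimal sketch under stated assumptions: `normalize_verdict` and `majority_winner` are hypothetical helpers, and the "Invalid" label is an assumption rather than part of the original notebook.

```python
import re
from collections import Counter

# Matches "Engine A" or "Engine B" in any casing, as whole words
_VERDICT_RE = re.compile(r"\bengine\s+([ab])\b", re.IGNORECASE)

def normalize_verdict(raw_reply: str) -> str:
    """Map a judge's free-text reply to 'Engine A', 'Engine B', or 'Invalid'."""
    mentioned = {match.lower() for match in _VERDICT_RE.findall(raw_reply)}
    if mentioned == {"a"}:
        return "Engine A"
    if mentioned == {"b"}:
        return "Engine B"
    return "Invalid"  # No engine named, or both named: excluded from the vote

def majority_winner(raw_replies):
    """Majority vote over normalized verdicts; an even split is a tie."""
    votes = Counter(normalize_verdict(reply) for reply in raw_replies)
    if votes["Engine A"] > votes["Engine B"]:
        return "Engine A"
    if votes["Engine B"] > votes["Engine A"]:
        return "Engine B"
    return "Tie"

print(majority_winner(["Engine A", "engine a.", "Engine B", "I cannot decide"]))
```

Treating ambiguous replies as "Invalid" instead of guessing keeps a malformed answer from silently counting toward either engine.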

PYTHON

# Define the search engines
engine_a = results_reranked_top_n
engine_b = results_top_n

# Define the models to use as judges
models = [
    "cohere:command-a-03-2025",
    "cohere:command-r-plus-08-2024",
    "cohere:command-r-08-2024",
    "cohere:c4ai-aya-expanse-32b",
]

# Get evaluation results
results_df = evaluate_search_results(engine_a, engine_b, models)

# Calculate overall statistics
winner_counts = Counter(results_df["winner"])
total_queries = len(results_df)

# Display summary of results
print("\nPercentage of questions won by each engine:")
for engine, count in winner_counts.items():
    percentage = (count / total_queries) * 100
    print(f"{engine}: {percentage:.2f}% ({count}/{total_queries})")

# Save to CSV (to_csv returns None when given a path, so nothing is assigned)
results_df.to_csv("search_results_evaluation.csv", index=False)

# Display the first rows; as the last expression in the cell, this renders in a notebook
results_df.head()

Percentage of questions won by each engine:
Engine A: 80.00% (8/10)
Tie: 10.00% (1/10)
Engine B: 10.00% (1/10)
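
With only 10 questions, it can be useful to check how easily a split like this could arise by chance. As a rough sketch (an addition, not part of the original analysis), the snippet below runs an exact two-sided sign test on the reported counts. It drops the tie and assumes that, if neither engine were better, each remaining question would be an independent fair coin flip. `sign_test_p_value` is a hypothetical helper.

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value; ties must be excluded beforehand."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction when p = 0.5
    one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * one_sided)

# Counts from the summary above: Engine A won 8, Engine B won 1, and 1 tie is dropped
p_value = sign_test_p_value(wins_a=8, wins_b=1)
print(f"p = {p_value:.4f}")
```

This only addresses sampling noise across questions. It does not tell us whether the LLM judges’ preferences match human relevance judgments.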

Conclusion

This tutorial demonstrates how to evaluate retrieval systems using LLMs as judges through Pydantic AI, comparing original Wikipedia search results against those reranked by Cohere’s reranking model.

The evaluation framework uses multiple Cohere models as independent judges with majority voting to determine which system provides more relevant results.

On this 10-question sample, the reranked approach (Engine A) won 8 queries (80%), the original search ranking (Engine B) won 1, and 1 ended in a tie. This suggests that neural reranking improved search relevance here, although a larger and more varied query set would be needed to generalize the result.