Retrieval evaluation using LLM-as-a-judge via Pydantic AI

We’ll explore how to evaluate retrieval systems using Large Language Models (LLMs) as judges. Retrieval evaluation is a critical component of building high-quality information retrieval systems, and recent advances in LLMs have made it possible to automate this evaluation process.

What we’ll cover

  • How to query the Wikipedia API
  • How to implement and compare two different retrieval approaches:
    • The original search results from the Wikipedia API
    • Using Cohere’s reranking model to rerank the search results
  • How to set up an LLM-as-a-judge evaluation framework using Pydantic AI

Tools we’ll use

  • Cohere’s API: For reranking search results and providing evaluation models
  • Wikipedia’s API: As our information source
  • Pydantic AI: For creating evaluation agents

This tutorial demonstrates a methodology for comparing different retrieval systems objectively. By the end, you’ll have an example you can adapt to evaluate your own retrieval systems across different domains and use cases.

Setup

First, let’s install the required packages, import the necessary libraries, and set up the Cohere client.

PYTHON
%pip install -U cohere pydantic-ai
PYTHON
import requests
import cohere
import pandas as pd
from pydantic_ai import Agent
from pydantic_ai.models import KnownModelName
from collections import Counter

import os
co = cohere.ClientV2(os.getenv("COHERE_API_KEY"))
PYTHON
# Allow nested event loops so Pydantic AI's async agents can run inside a notebook
import nest_asyncio
nest_asyncio.apply()

Next, we implement a function to query Wikipedia for relevant information based on user input. The search_wikipedia() function retrieves a specified number of Wikipedia search results and extracts their titles and snippets.

This will provide us with the dataset for our retrieval evaluation experiment, where we’ll compare different approaches to finding and ranking relevant information.

We’ll use a small dataset of 10 questions about geography to test the Wikipedia search.

PYTHON
import requests

def search_wikipedia(query, limit=10):
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        'action': 'query',
        'list': 'search',
        'srsearch': query,
        'format': 'json',
        'srlimit': limit
    }

    response = requests.get(url, params=params)
    data = response.json()

    # Format the results
    results = []
    for item in data['query']['search']:
        results.append({
            "title": item["title"],
            "snippet": item["snippet"].replace("<span class=\"searchmatch\">", "").replace("</span>", ""),
        })

    return results
PYTHON
# Generate 10 questions about geography to test the Wikipedia search
geography_questions = [
    "What is the capital of France?",
    "What is the longest river in the world?",
    "What is the largest desert in the world?",
    "What is the highest mountain peak on Earth?",
    "What are the major tectonic plates?",
    "What is the Ring of Fire?",
    "What is the largest ocean on Earth?",
    "What are the Seven Wonders of the Natural World?",
    "What causes the Northern Lights?",
    "What is the Great Barrier Reef?"
]
PYTHON
# Run search_wikipedia for each question
results = []

for question in geography_questions:
    question_results = search_wikipedia(question, limit=10)

    # Format each result as a "title + snippet" string
    formatted_results = []
    for item in question_results:
        formatted_result = f"{item['title']}\n{item['snippet']}"
        formatted_results.append(formatted_result)

    # Add to the results list
    results.append({
        "question": question,
        "search_results": formatted_results
    })

Rerank the search results and filter the top_n results (“Engine A”)

In this section, we’ll implement our first retrieval approach using Cohere’s reranking model. Reranking is a technique that takes an initial set of search results and reorders them based on their relevance to the original query.

We’ll use Cohere’s rerank API to:

  1. Take the Wikipedia search results we obtained earlier
  2. Send them to Cohere’s reranking model along with the original query
  3. Filter to keep only the top-n most relevant results

This approach will be referred to as “Engine A” in our evaluation, and we’ll compare its performance against the original Wikipedia search rankings.

PYTHON
# Rerank the search results for each question
top_n = 3
results_reranked_top_n = []

for item in results:
    question = item["question"]
    documents = item["search_results"]

    # Rerank the documents using Cohere
    reranked = co.rerank(
        model="rerank-v3.5",
        query=question,
        documents=documents,
        top_n=top_n  # Keep only the top 3 results
    )

    # Format the reranked results
    top_results = []
    for result in reranked.results:
        top_results.append(documents[result.index])

    # Add to the reranked results list
    results_reranked_top_n.append({
        "question": question,
        "search_results": top_results
    })

# Print a sample of the reranked results
print(f"Original question: {results_reranked_top_n[0]['question']}")
print(f"Top {top_n} reranked results:")
for i, result in enumerate(results_reranked_top_n[0]['search_results']):
    print(f"\n{i+1}. {result}")
Original question: What is the capital of France?
Top 3 reranked results:
1. France
semi-presidential republic and its capital, largest city and main cultural and economic centre is Paris. Metropolitan France was settled during the Iron Age by Celtic
2. Closed-ended question
variants of the above closed-ended questions that possess specific responses are: On what day were you born? (&quot;Saturday.&quot;) What is the capital of France? (&quot;Paris
3. Capital city
seat of the government. A capital is typically a city that physically encompasses the government&#039;s offices and meeting places; the status as capital is often

Take the original search results and filter the top_n results (“Engine B”)

In this section, we’ll implement our second retrieval approach as a baseline comparison. For “Engine B”, we’ll simply take the original Wikipedia search results without any reranking and select the top-n results.

This approach reflects how many traditional search engines work: returning results in the order of their original relevance scores from the data source. By comparing this baseline against our reranked approach (Engine A), we can evaluate whether reranking provides meaningful improvements in result quality.

We’ll use the same number of results (top_n) as Engine A to ensure a fair comparison in our evaluation.

PYTHON
results_top_n = []

for item in results:
    results_top_n.append({
        "question": item["question"],
        "search_results": item["search_results"][:top_n]
    })

# Print a sample of the top_n results (without reranking)
print(f"Original question: {results_top_n[0]['question']}")
print(f"Top {top_n} results (without reranking):")
for i, result in enumerate(results_top_n[0]['search_results']):
    print(f"\n{i+1}. {result}")
Original question: What is the capital of France?
Top 3 results (without reranking):
1. Closed-ended question
variants of the above closed-ended questions that possess specific responses are: On what day were you born? (&quot;Saturday.&quot;) What is the capital of France? (&quot;Paris
2. France
semi-presidential republic and its capital, largest city and main cultural and economic centre is Paris. Metropolitan France was settled during the Iron Age by Celtic
3. What Is a Nation?
&quot;What Is a Nation?&quot; (French: Qu&#039;est-ce qu&#039;une nation ?) is an 1882 lecture by French historian Ernest Renan (1823–1892) at the Sorbonne, known for the

Run LLM-as-a-judge evaluation to compare the two engines

Now we’ll implement an evaluation framework using LLMs as judges to compare our two retrieval approaches:

  • Engine A: Wikipedia results reranked by Cohere’s reranking model
  • Engine B: Original Wikipedia search results

Using LLMs as evaluators allows us to programmatically assess the quality of search results without human annotation. The code below implements the following steps:

  • Setting up the evaluation protocol
    • First, define a clear protocol for how the LLM judges will evaluate the search results. This includes creating a system prompt and a template for each evaluation.
  • Using multiple models as independent judges
    • To get more robust results, use several LLMs as independent judges. This reduces bias from any single model.
  • Implementing a majority voting system
    • Combine judgments from multiple models using a majority voting system to determine which engine performed better for each query.
  • Presenting the results
    • After evaluating all queries, present the results to determine which retrieval approach performed better overall.

This approach provides a scalable, reproducible method to evaluate and compare retrieval systems quantitatively.

PYTHON
# System prompt for the AI evaluator
SYSTEM_PROMPT = """
You are an AI search evaluator. You will compare search results from two engines and
determine which set provides more relevant and diverse information. You will only
answer with the verdict rather than explaining your reasoning; simply say "Engine A" or
"Engine B".
"""

# Prompt template for each evaluation
PROMPT_TEMPLATE = """
For the following question, which search engine provides more relevant results?

## Question:
{query}

## Engine A:
{engine_a_results}

## Engine B:
{engine_b_results}
"""

def format_results(results):
    """Format search results in a readable way"""
    formatted = []
    for i, result in enumerate(results):
        formatted.append(f"Result {i+1}: {result[:200]}...")
    return "\n\n".join(formatted)

def judge_query(query, engine_a_results, engine_b_results, model_name):
    """Use a single model to judge which engine has better results"""
    agent = Agent(model_name, system_prompt=SYSTEM_PROMPT)

    # Format the results
    engine_a_formatted = format_results(engine_a_results)
    engine_b_formatted = format_results(engine_b_results)

    # Create the prompt
    prompt = PROMPT_TEMPLATE.format(
        query=query,
        engine_a_results=engine_a_formatted,
        engine_b_results=engine_b_formatted
    )

    # Get the model's judgment
    response = agent.run_sync(prompt)
    return response.data

def evaluate_search_results(reranked_results, regular_results, models):
    """
    Evaluate both sets of search results using multiple models.

    Args:
        reranked_results: List of dictionaries with 'question' and 'search_results'
        regular_results: List of dictionaries with 'question' and 'search_results'
        models: List of model names to use as judges

    Returns:
        DataFrame with evaluation results
    """
    # Prepare data structure for results
    evaluation_results = []

    # Evaluate each query
    for i in range(len(reranked_results)):
        query = reranked_results[i]['question']
        engine_a_results = reranked_results[i]['search_results']  # Reranked results
        engine_b_results = regular_results[i]['search_results']  # Regular results

        # Get judgments from each model
        judgments = []
        for model in models:
            judgment = judge_query(query, engine_a_results, engine_b_results, model)
            judgments.append(judgment.strip())  # Strip whitespace so votes match exactly

        # Determine winner by majority vote
        votes = Counter(judgments)
        if votes["Engine A"] > votes["Engine B"]:
            winner = "Engine A"
        elif votes["Engine B"] > votes["Engine A"]:
            winner = "Engine B"
        else:
            winner = "Tie"

        # Add results for this query
        row = [query] + judgments + [winner]
        evaluation_results.append(row)

    # Create DataFrame
    column_names = ["question"] + [f"judge_{i+1} ({model})" for i, model in enumerate(models)] + ["winner"]
    df = pd.DataFrame(evaluation_results, columns=column_names)

    return df
PYTHON
# Define the search engines
engine_a = results_reranked_top_n
engine_b = results_top_n

# Define the models to use as judges
models = [
    "cohere:command-a-03-2025",
    "cohere:command-r-plus-08-2024",
    "cohere:command-r-08-2024",
    "cohere:c4ai-aya-expanse-32b",
    "cohere:c4ai-aya-expanse-8b",
]

# Get evaluation results
results_df = evaluate_search_results(engine_a, engine_b, models)

# Calculate overall statistics
winner_counts = Counter(results_df["winner"])
total_queries = len(results_df)

# Display summary of results
print("\nPercentage of questions won by each engine:")
for engine, count in winner_counts.items():
    percentage = (count / total_queries) * 100
    print(f"{engine}: {percentage:.2f}% ({count}/{total_queries})")

# Display dataframe
results_df.head()

# Save to CSV
results_df.to_csv("search_results_evaluation.csv", index=False)
Percentage of questions won by each engine:
Engine A: 80.00% (8/10)
Tie: 10.00% (1/10)
Engine B: 10.00% (1/10)
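
As an optional extension, not part of the run above, you can also inspect how the individual judges voted and how often each one agreed with the majority verdict. The sketch below assumes the results_df produced by evaluate_search_results() and the judge column naming used there; judge_columns and decided are helper names introduced here for illustration.

PYTHON
# Per-judge vote breakdown (assumes results_df and Counter from the cells above)
judge_columns = [col for col in results_df.columns if col.startswith("judge_")]

for col in judge_columns:
    vote_counts = Counter(results_df[col])
    print(f"{col}: {dict(vote_counts)}")

# How often each judge matched the majority verdict (ties excluded)
decided = results_df[results_df["winner"] != "Tie"]
for col in judge_columns:
    agreement = (decided[col] == decided["winner"]).mean() * 100
    print(f"{col} agrees with the majority on {agreement:.0f}% of decided queries")

A breakdown like this can help flag whether one judge systematically disagrees with the others, which would be a reason to review its prompts or replace it.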

Conclusion

This tutorial demonstrates how to evaluate retrieval systems using LLMs as judges through Pydantic AI, comparing original Wikipedia search results against those reranked by Cohere’s reranking model.

The evaluation framework uses multiple Cohere models as independent judges with majority voting to determine which system provides more relevant results.

On this small 10-question sample, the reranked approach (Engine A) outperformed the original search rankings (Engine B), winning 80% of the queries and illustrating how neural reranking can improve search relevance.
