Deep Dive Into RAG Evaluation
In this notebook, we’ll show you how to evaluate the output of a RAG system. The high-level RAG flow is depicted in the diagram below.
We will focus on the evaluation of Retrieve and Response (or Generation), and present a set of metrics for each phase. We will deep dive into each metric to give you a full understanding of how we evaluate models and why we do it this way, and provide code so you can reproduce the evaluation on your own data.
To demonstrate the metrics, we will use data from Docugami’s KG-RAG dataset, a RAG dataset of financial 10-Q filing reports. We will focus only on evaluation, without performing the actual Retrieval and Response Generation steps.
Table of Contents
Getting Started
Let’s start by setting up the environment and downloading the dataset.
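Below is a minimal, hypothetical setup sketch: the dependency list and the file name `kg_rag_10q_sample.jsonl` are assumptions made for illustration, not the exact ones used here.

```python
# Hedged setup sketch: install the judge-model client and load a local export
# of the KG-RAG 10-Q data (file name and schema are assumptions).
# !pip install openai pandas

import pandas as pd

# Each row is assumed to hold: question, gold documents, retrieved documents,
# gold answer, and the generated answer to evaluate.
data = pd.read_json("kg_rag_10q_sample.jsonl", lines=True)
data.head()
```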
For Response evaluation, we will use an LLM as a judge.
Any LLM can be used for this purpose, but because evaluation is a very challenging task, we recommend using powerful LLMs, possibly as an ensemble of models. Previous work has shown that models tend to assign higher scores to their own output. Since the answers in this notebook were generated with `command-r`, we will not use it for evaluation. We provide two alternatives, `gpt-4` and `mistral`, and we set `gpt-4` as the default model because, as mentioned above, evaluation is challenging, and `gpt-4` is powerful enough to perform the task effectively.
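As a hedged sketch of how the judge can be wired up, here is a thin wrapper around the OpenAI chat API with `gpt-4` as the default evaluator; the helper name `llm_judge` is ours, and swapping in another provider such as Mistral only requires changing this function.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment

def llm_judge(prompt: str, model: str = "gpt-4") -> str:
    """Send a single evaluation prompt to the judge model and return its text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judgements make evaluation reproducible
    )
    return response.choices[0].message.content
```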
Retrieval Evaluation
In the Retrieval phase, we evaluate the set of retrieved documents against the golden documents set.
We use three standard metrics to evaluate retrieval:
- Precision: the proportion of returned documents that are relevant, according to the gold annotation
- Recall: the proportion of relevant documents in the gold data found in the retrieved documents
- Mean Average Precision (MAP): measures the capability of the retriever to return relevant documents at the top of the list
We implement these three metrics in the class below:
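Here is a minimal sketch of such a class. It assumes documents are identified by IDs and that `retrieved_docs` is ordered by the retriever’s ranking; the class and method names are illustrative.

```python
class RetrievalEvaluator:
    """Compute Precision, Recall and (single-query) Average Precision
    for one datapoint, given gold and retrieved document IDs."""

    def __init__(self, golden_docs: list[str], retrieved_docs: list[str]):
        self.golden_docs = golden_docs
        self.retrieved_docs = retrieved_docs  # assumed to be in rank order

    def precision(self) -> float:
        # proportion of retrieved docs that are relevant according to the gold annotation
        hits = [d for d in self.retrieved_docs if d in self.golden_docs]
        return len(hits) / len(self.retrieved_docs)

    def recall(self) -> float:
        # proportion of gold docs that appear among the retrieved docs
        found = [d for d in self.golden_docs if d in self.retrieved_docs]
        return len(found) / len(self.golden_docs)

    def average_precision(self) -> float:
        # average of precision@k over the ranks k holding a relevant doc;
        # averaging this value over all queries gives MAP
        precisions_at_hits = []
        num_hits = 0
        for k, doc in enumerate(self.retrieved_docs, start=1):
            if doc in self.golden_docs:
                num_hits += 1
                precisions_at_hits.append(num_hits / k)
        return sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0
```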
Let’s now see how to use the class above to compute the results on a single datapoint.
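The document IDs below are hypothetical, but they are chosen to reproduce the figures discussed next: three retrieved documents, of which the first and the third are relevant, against four gold documents.

```python
golden_docs = ["doc_1", "doc_2", "doc_3", "doc_4"]   # relevant docs (gold annotation)
retrieved_docs = ["doc_1", "doc_7", "doc_2"]         # retriever output, in rank order

evaluator = RetrievalEvaluator(golden_docs, retrieved_docs)
print(f"Precision: {evaluator.precision():.2f}")      # 0.67
print(f"Recall: {evaluator.recall():.2f}")            # 0.50
print(f"MAP: {evaluator.average_precision():.2f}")    # 0.83
```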
What are the figures above telling us?
- Precision (0.67) tells us that 2 out of 3 of the retrieved docs are correct
- Recall (0.5) means that 2 out of 4 relevant docs have been retrieved
- MAP (0.83) is computed as the average of 1/1 (the highest ranked doc is correct) and 2/3 (the 2nd ranked doc is wrong, the 3rd is correct).
While the example here focuses on a single datapoint, you can easily apply the same metrics to your whole dataset and get the overall performance of your Retrieve phase.
Generation Evaluation
Evaluating grounded generation (the second step of RAG) is notoriously difficult, because generations are usually complex and rich in information, and simply labelling an answer as “good” or “bad” is not enough. To overcome this issue, we first decompose complex answers into a set of basic claims, where a claim is any sentence or part of a sentence in the answer that expresses a verifiable fact. We then check the validity of each claim independently, and define the overall quality of the answer based on the correctness of the claims it includes.
We use claims to compute three metrics:
- Faithfulness, which measures how many of the claims in the generated response are supported by the retrieved documents. This is a fundamental metric, as it tells us how grounded in the documents the response is and, at the same time, allows us to spot hallucinations.
- Correctness, which checks which claims in the response also occur in the gold answer.
- Coverage, which assesses how many of the claims in the gold answer are included in the generated response.
Note that Faithfulness and Correctness share the exact same approach: the difference is that the former checks the claims against the supporting docs, while the latter checks them against the gold answer. Also, while Correctness measures the precision of the claims in the response, Coverage can be seen as its complement, as it measures recall.
Claim Extraction
Let’s now see how to implement the evaluation described above using LLMs, starting with claim extraction.
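Below is a minimal sketch of claim extraction with the judge defined above; the prompt wording and the variable `generated_answer` (the model response taken from the datapoint) are illustrative assumptions.

```python
# Hedged sketch: ask the judge to decompose an answer into verifiable claims.
CLAIM_EXTRACTION_PROMPT = """Decompose the following answer into a list of short, \
self-contained, verifiable claims. Return one claim per line.

Answer:
{answer}"""

def extract_claims(answer: str) -> list[str]:
    raw = llm_judge(CLAIM_EXTRACTION_PROMPT.format(answer=answer))
    # one claim per line, stripping bullet markers the model may add
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

claims = extract_claims(generated_answer)  # `generated_answer` comes from the datapoint
claims
```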
Claim Assessment
Nice! Now that we have the list of claims, we can go ahead and assess the validity of each of them.
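The sketch below assesses each claim against a context string; `retrieved_documents` (the list of retrieved passages for this datapoint) and the prompt wording are assumptions for illustration.

```python
# Hedged sketch: the judge labels each claim as 1 (supported by the context)
# or 0 (not supported).
CLAIM_ASSESSMENT_PROMPT = """Context:
{context}

Claim:
{claim}

Does the context support the claim? Answer only with 1 (supported) or 0 (not supported)."""

def assess_claims(claims: list[str], context: str) -> list[int]:
    assessments = []
    for claim in claims:
        verdict = llm_judge(CLAIM_ASSESSMENT_PROMPT.format(context=context, claim=claim))
        assessments.append(1 if verdict.strip().startswith("1") else 0)
    return assessments

# For Faithfulness, the context is the concatenation of the retrieved documents.
assessments = assess_claims(claims, context="\n\n".join(retrieved_documents))
assessments
```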
Faithfulness
Great, we now have an assessment for each claim: as a last step, we just need to use these assessments to compute the final score.
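With the `assessments` list from the previous step, the score is simply the fraction of supported claims:

```python
# Faithfulness = fraction of claims in the response supported by the retrieved documents.
faithfulness = sum(assessments) / len(assessments)
print(f"Faithfulness: {faithfulness:.2f}")
```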
The final Faithfulness score is 1, which means that the model’s response is fully grounded in the retrieved documents: that’s very good news :)
Before moving on, let’s modify the model’s response by adding a piece of information which is not grounded in any document, and re-compute Faithfulness.
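As an illustration, the appended sentence below is a made-up, ungrounded statement; the rest of the pipeline is unchanged.

```python
# Hypothetical corruption: append a statement not supported by any retrieved document.
corrupted_answer = generated_answer + " The company also announced a 10% dividend increase."
corrupted_claims = extract_claims(corrupted_answer)
corrupted_assessments = assess_claims(corrupted_claims, context="\n\n".join(retrieved_documents))
print(f"Faithfulness: {sum(corrupted_assessments) / len(corrupted_assessments):.2f}")  # now below 1
```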
As you can see, by assessing claims one by one, we are able to spot hallucinations, that is, cases (like the corrupted one above) in which the information provided by the model is not grounded in any of the retrieved documents.
Correctness
As said, Faithfulness and Correctness share the same logic, the only difference being that we now check the claims against the gold answer. We can therefore repeat the process above and just substitute the `context`, as shown in the sketch below.
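A sketch of the substitution, assuming `gold_answer` holds the reference answer for this datapoint:

```python
# Correctness: same claim assessment, but the gold answer plays the role of the context.
correctness_assessments = assess_claims(claims, context=gold_answer)
correctness = sum(correctness_assessments) / len(correctness_assessments)
print(f"Correctness: {correctness:.2f}")
```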
As mentioned above, automatic evaluation is a hard task, and even when using powerful models, claim assessment can present problems: for example, the third claim is labelled as 0, even though it might be inferred from the information in the gold answer.
For Correctness, we found that only half of the claims in the generated response are found in the gold answer. Note that this is not necessarily an issue: reference answers are often non-exhaustive, especially in datasets with open-ended questions, like the one we are considering in this post, and both the generated and the gold answer can include relevant information.
Coverage
We finally move to Coverage. Remember that, in this case, we want to check how many of the claims in the gold answer are included in the generated response. To do this, we first need to extract the claims from the gold answer.
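This step reuses the extraction function sketched above:

```python
# Extract the claims from the gold answer (same extraction function as before).
gold_claims = extract_claims(gold_answer)
gold_claims
```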
Then, we check which of these claims are present in the response generated by the model.
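A sketch of the Coverage computation: here the generated response plays the role of the context, and we measure the fraction of gold claims it supports.

```python
coverage_assessments = assess_claims(gold_claims, context=generated_answer)
coverage = sum(coverage_assessments) / len(coverage_assessments)
print(f"Coverage: {coverage:.2f}")
```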
The Coverage score tells us that 1/3 of the information in the gold answer is present in the generated answer. This is useful information that, similarly to what we said above regarding Correctness, can raise further questions, such as: is it acceptable to have diverging information in the generated answer? Is any crucial piece of information missing from the generated answer?
The answer to these questions is use case-specific and has to be made by the end user: the claim-based approach implemented here supports the user by providing a clear and detailed view of what the model is assessing and how.
Final Comments
RAG evaluation is a hard task, especially the evaluation of the generated response. In this notebook, we presented a clear, robust, and replicable approach that you can build on to create your own evaluation pipeline.