This cookbook demonstrates an approach we use to evaluate summarization tasks with LLM-based evaluation.
You’ll need a Cohere API key to run this notebook. If you don’t have a key, head to https://cohere.com/ to generate your key.
As test data, we’ll use transcripts from the QMSum dataset. Note that in addition to the transcripts, this dataset also contains reference summaries — we will use only the transcripts as our approach is reference-free.
We are interested in evaluating summarization in real-world, enterprise use cases, which typically have two distinguishing features as compared to academic summarization benchmarks:
Therefore, we must first create a dataset that contains diverse summarization prompts. We will do this programmatically by building prompts from their components, as defined below:
First, we define the prompt that combines the text and instructions. Here, we use a very basic prompt:
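As a rough illustration of what such a basic prompt could look like, the sketch below fills a two-section template with a transcript and an instruction string. The template wording and the names `PROMPT_TEMPLATE` and `build_prompt` are assumptions made for this example, not the cookbook's actual prompt.

```python
# Hypothetical template: the section headers and wording are assumptions.
PROMPT_TEMPLATE = """## meeting transcript
{transcript}

## instructions
{instructions}"""


def build_prompt(transcript: str, instructions: str) -> str:
    """Fill the template with one transcript and one instruction string."""
    return PROMPT_TEMPLATE.format(
        transcript=transcript.strip(),
        instructions=instructions.strip(),
    )


example_prompt = build_prompt(
    "Speaker A: Hello.\nSpeaker B: Hi.",
    "Summarize the meeting.",
)
print(example_prompt)
```

Keeping the template separate from the instructions makes it easy to vary the instructions while the surrounding prompt stays fixed.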
Next, we build the instructions. Because each instruction may have a different objective and modifiers, we track them using metadata. This will later be required for evaluation (i.e. to know what the prompt is asking).
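One possible way to keep the metadata attached to each instruction is a small record type, sketched below. The class name `Instruction`, its field names, and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Instruction:
    """An instruction string plus the metadata needed later for grading."""
    text: str                      # what the model is asked to do
    objective: str                 # hypothetical label for the objective
    format: Optional[str] = None   # hypothetical format modifier label
    length: Optional[str] = None   # hypothetical length modifier label


example = Instruction(
    text="Summarize the meeting in three bullet points.",
    objective="general_summary",
    format="bullets",
    length="3",
)
record = asdict(example)  # plain dict, convenient for storing next to each prompt
print(record)
```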
Let’s combine the objectives and format/length modifiers to finish building the instructions.
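The combination step can be sketched as a Cartesian product over objectives and modifiers. The objective and modifier texts below are invented placeholders, and `build_instructions` is a hypothetical helper; an empty string stands for "no modifier".

```python
from itertools import product

# Placeholder components; the real objectives and modifiers may differ.
objectives = {
    "general": "Summarize the meeting.",
    "focused": "Summarize what was decided about the budget.",
}
format_modifiers = {
    "none": "",
    "bullets": "Format your answer as bullet points.",
}
length_modifiers = {
    "none": "",
    "short": "Keep it short.",
}


def build_instructions(objectives, format_modifiers, length_modifiers):
    """Return one instruction (text plus metadata) per combination of components."""
    instructions = []
    for (obj, obj_text), (fmt, fmt_text), (length, len_text) in product(
        objectives.items(), format_modifiers.items(), length_modifiers.items()
    ):
        # Skip empty modifiers so the text has no stray spaces.
        text = " ".join(part for part in (obj_text, fmt_text, len_text) if part)
        instructions.append(
            {"instruction": text, "objective": obj, "format": fmt, "length": length}
        )
    return instructions


instructions = build_instructions(objectives, format_modifiers, length_modifiers)
print(len(instructions), "instructions built")
```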
Finally, let’s build the final prompts by semi-randomly pairing the instructions with transcripts from the QMSum dataset.
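"Semi-random" pairing can be read in several ways. The sketch below assumes one interpretation: shuffle the instructions once with a fixed seed, then cycle through them so that every instruction is used about equally often. The function `pair_prompts` and the toy inputs are hypothetical.

```python
import random


def pair_prompts(transcripts, instructions, seed=42):
    """Assign one instruction to each transcript, cycling through a seeded shuffle."""
    rng = random.Random(seed)      # fixed seed keeps the pairing reproducible
    shuffled = list(instructions)  # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    pairs = []
    for i, transcript in enumerate(transcripts):
        instruction = shuffled[i % len(shuffled)]
        pairs.append({"transcript": transcript, **instruction})
    return pairs


toy_transcripts = [f"transcript {i}" for i in range(5)]
toy_instructions = [
    {"instruction": "Summarize the meeting.", "objective": "general"},
    {"instruction": "List the action items.", "objective": "action_items"},
]
pairs = pair_prompts(toy_transcripts, toy_instructions)
print(pairs[0])
```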
We now set up the tools we will use for evaluation.
We use three criteria that are graded using LLMs:
In this cookbook, we will use Command-R to grade the completions. In practice, however, we typically use an ensemble of multiple LLM evaluators to reduce the bias of any single evaluator.
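The ensemble idea can be sketched by treating each evaluator as a function that returns a score and averaging the results. This is a minimal sketch: the `Grader` type, `ensemble_score`, and the stub graders are hypothetical, and averaging is only one of several possible aggregation rules.

```python
from statistics import mean
from typing import Callable, Sequence

# A grader takes (prompt, completion) and returns a score in [0, 1].
Grader = Callable[[str, str], float]


def ensemble_score(prompt: str, completion: str, graders: Sequence[Grader]) -> float:
    """Average the scores of several graders for one completion."""
    if not graders:
        raise ValueError("at least one grader is required")
    return mean(grader(prompt, completion) for grader in graders)


# Stub graders standing in for real LLM evaluators.
stub_graders = [
    lambda prompt, completion: 1.0,
    lambda prompt, completion: 0.5,
    lambda prompt, completion: 0.75,
]
score = ensemble_score("Summarize the meeting.", "The team met.", stub_graders)
print(score)
```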
In addition, we have two criteria that are graded programmatically:
Now that we have built our evaluation dataset and defined our evaluation functions, let’s run the evaluations!
First, we generate completions to be graded. We will use Cohere’s Command-R model, which has a context length of 128K tokens.
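A generation helper might look like the sketch below. It assumes a client object exposing `chat(message=..., model=...)` and a response with a `.text` attribute, as in the v1-style `cohere.Client` interface; check this against the SDK version you have installed. `generate_completion` and `FakeClient` are hypothetical names, and the fake client exists only so the sketch runs without an API key.

```python
from types import SimpleNamespace


def generate_completion(client, prompt: str, model: str = "command-r") -> str:
    """Send one prompt to the chat endpoint and return the generated text."""
    response = client.chat(message=prompt, model=model)
    return response.text


class FakeClient:
    """Offline stand-in for a real client (assumption, for illustration only)."""

    def chat(self, message: str, model: str):
        return SimpleNamespace(text=f"[{model}] summary of {len(message)} characters")


# With the real SDK, the client would instead be created along the lines of:
#   client = cohere.Client("<YOUR API KEY>")
completion = generate_completion(FakeClient(), "Summarize the meeting.")
print(completion)
```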
PhD D is transcribing recorded sessions to locate overlapping speech zones and categorizing them as acoustic events. The team discusses the parameters PhD D should use and how to define these events, considering the number of speakers and silence.
Let’s grade the completions using our LLM and non-LLM checks.
Finally, let’s print the average score per criterion.
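Averaging per criterion amounts to grouping the graded rows by criterion name and taking the mean of each group. In this sketch the row layout, the criterion names `criterion_a` and `criterion_b`, and the scores are all made-up placeholders.

```python
from collections import defaultdict
from statistics import mean

# Placeholder graded rows: one row per (completion, criterion) pair.
results = [
    {"criterion": "criterion_a", "score": 1.0},
    {"criterion": "criterion_a", "score": 0.5},
    {"criterion": "criterion_b", "score": 0.0},
    {"criterion": "criterion_b", "score": 1.0},
]


def average_by_criterion(rows):
    """Group scores by criterion name and return the mean of each group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["criterion"]].append(row["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


averages = average_by_criterion(results)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```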