Summarization Evals
In this cookbook, we demonstrate an approach to evaluating summarization tasks using LLMs as evaluators.
Get Started
You’ll need a Cohere API key to run this notebook. If you don’t have a key, head to https://cohere.com/ to generate one.
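A minimal setup sketch, assuming the Cohere Python SDK is installed (`pip install cohere`) and using a placeholder API key:

```python
import cohere

# Instantiate the Cohere client with your API key (placeholder shown here).
co = cohere.Client("YOUR_API_KEY")
```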
As test data, we’ll use transcripts from the QMSum dataset. Note that in addition to the transcripts, this dataset also contains reference summaries — we will use only the transcripts as our approach is reference-free.
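As a sketch, here is one way to load the transcripts, assuming you have cloned the QMSum repository (https://github.com/Yale-LILY/QMSum) locally and that each JSONL record stores the meeting as a `meeting_transcripts` list of speaker turns; adjust the path and field names to match your copy of the data:

```python
import json

def load_transcripts(path: str) -> list[str]:
    """Read a QMSum JSONL split and flatten each meeting into a single transcript string."""
    transcripts = []
    with open(path) as f:
        for line in f:
            meeting = json.loads(line)
            # Each turn is assumed to have "speaker" and "content" fields.
            turns = [f"{t['speaker']}: {t['content']}" for t in meeting["meeting_transcripts"]]
            transcripts.append("\n".join(turns))
    return transcripts

transcripts = load_transcripts("QMSum/data/ALL/jsonl/test.jsonl")
```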
Construct the evaluation dataset
We are interested in evaluating summarization in real-world, enterprise use cases, which typically have two distinguishing features as compared to academic summarization benchmarks:
- Enterprise use cases often focus on specific summarization objectives, e.g. “summarize action items”.
- Enterprise use cases often feature specific instruction constraints, e.g. “summarize in bullets with each bullet under 20 words”.
Therefore, we must first create a dataset that contains diverse summarization prompts. We will do this programmatically by building prompts from their components, as defined below:
- Prompt = text (e.g. transcript to be summarized) + instruction
- Instruction = instruction objective (e.g. “summarize action items”) + modifiers
- Modifiers = format/length modifiers (e.g. “use bullets”) + style/tone modifiers (e.g. “do not mention names”) + …
First, we define the prompt that combines the text and instructions. Here, we use a very basic prompt:
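A minimal sketch of such a prompt; the `## meeting transcript` header matches the prompts shown in the results table later in this cookbook, while the instructions header and exact wording are assumptions:

```python
PROMPT_TEMPLATE = """## meeting transcript
{transcript}

## instructions
{instruction}"""

def build_prompt(transcript: str, instruction: str) -> str:
    # Combine the text to be summarized with the summarization instruction.
    return PROMPT_TEMPLATE.format(transcript=transcript, instruction=instruction)
```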
Next, we build the instructions. Because each instruction may have a different objective and modifiers, we track them using metadata. This will later be required for evaluation (i.e. to know what the prompt is asking).
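A sketch of the instruction objectives and their metadata; the two objectives mirror the ones used later in this cookbook (general summarization and action items), but the exact wording is illustrative:

```python
objectives = [
    {
        "objective": "general_summarization",
        "instruction": "Summarize the meeting based on the transcript.",
    },
    {
        "objective": "action_items",
        "instruction": "What are the follow-up items based on the meeting transcript?",
    },
]
```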
Let’s combine the objectives and format/length modifiers to finish building the instructions.
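One way to cross the objectives with format/length modifiers, keeping the metadata alongside each instruction; the modifier values and phrasing below are assumptions:

```python
import itertools

# Format/length modifiers; lengths are word counts (per bullet for the bullet format).
modifiers = [
    {"format": "paragraphs", "min_length": 10, "max_length": 50},
    {"format": "paragraphs", "min_length": 50, "max_length": 150},
    {"format": "bullets", "number": 2, "min_length": 5, "max_length": 20},
    {"format": "bullets", "number": 3, "min_length": 5, "max_length": 20},
]

def render_modifier(m: dict) -> str:
    """Turn a modifier dict into natural-language constraints appended to the instruction."""
    if m["format"] == "bullets":
        return (f" Use {m['number']} bullets, each between "
                f"{m['min_length']} and {m['max_length']} words long.")
    return f" Use paragraphs, between {m['min_length']} and {m['max_length']} words in total."

instructions = []
for obj, mod in itertools.product(objectives, modifiers):
    instructions.append({
        "instruction": obj["instruction"] + render_modifier(mod),
        "objective": obj["objective"],
        "eval_metadata": mod,  # kept so the graders know what was requested
    })
```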
Finally, let’s build the prompts by semi-randomly pairing the instructions with transcripts from the QMSum dataset.
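A sketch of the pairing step; the fixed seed is only for reproducibility:

```python
import random

random.seed(42)

eval_set = []
for item in instructions:
    # Semi-random pairing: each instruction is matched with a randomly chosen transcript.
    transcript = random.choice(transcripts)
    eval_set.append({
        **item,
        "transcript": transcript,
        "prompt": build_prompt(transcript, item["instruction"]),
    })
```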
Build the evaluation framework
We now set up the tools we will use for evaluation.
We use three criteria that are graded using LLMs:
- Completeness: checks if the summary includes all the important information from the original text that it should include
- Correctness: checks if there are any hallucinations or factual inaccuracies in the summary
- Conciseness: checks if the summary includes any unnecessary information or wordiness
In this cookbook, we will use Command-R to grade the completions. However, note that in practice, we typically use an ensemble of multiple LLM evaluators to reduce any bias.
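A simplified sketch of an LLM grader: Command-R is asked to rate each criterion on a 1-to-5 scale, which we rescale to [0, 1]. The rubric wording, the scale, and the single-model setup are assumptions; a more detailed rubric and an ensemble of evaluators would be used in practice:

```python
import re

CRITERIA = {
    "completeness": "Does the summary include all the important information from the transcript that it should include?",
    "correctness": "Is the summary free of hallucinations and factual inaccuracies with respect to the transcript?",
    "conciseness": "Is the summary free of unnecessary information and wordiness?",
}

def llm_grade(criterion: str, transcript: str, completion: str) -> float:
    """Ask the evaluator LLM to rate one criterion and rescale the rating to [0, 1]."""
    grading_prompt = (
        "You are grading a summary of a meeting transcript.\n\n"
        f"## transcript\n{transcript}\n\n"
        f"## summary\n{completion}\n\n"
        f"{CRITERIA[criterion]} Answer with a single integer from 1 (worst) to 5 (best)."
    )
    response = co.chat(model="command-r", message=grading_prompt, temperature=0.0)
    match = re.search(r"[1-5]", response.text)
    return (int(match.group()) - 1) / 4 if match else 0.0
```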
In addition, we have two criteria that are graded programmatically:
- Format: checks if the summary follows the format (e.g. bullets) that was requested in the prompt
- Length: checks if the summary stays within the length requested in the prompt
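A minimal sketch of these two checks, assuming bullets are lines starting with "-" and length constraints are word counts (per bullet for bulleted summaries, total otherwise):

```python
def format_score(completion: str, metadata: dict) -> int:
    """1 if the completion matches the requested format (bullet count, or no bullets), else 0."""
    bullets = [l for l in completion.splitlines() if l.strip().startswith("-")]
    if metadata["format"] == "bullets":
        return int(len(bullets) == metadata["number"])
    return int(len(bullets) == 0)

def length_score(completion: str, metadata: dict) -> int:
    """1 if the completion respects the requested word-count range, else 0."""
    if metadata["format"] == "bullets":
        bullets = [l for l in completion.splitlines() if l.strip().startswith("-")]
        return int(all(
            metadata["min_length"] <= len(b.split()) <= metadata["max_length"]
            for b in bullets
        ))
    n_words = len(completion.split())
    return int(metadata["min_length"] <= n_words <= metadata["max_length"])
```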
Run evaluations
Now that we have our evaluation dataset and defined our evaluation functions, let’s run evaluations!
First, we generate completions to be graded. We will use Cohere’s Command-R model, which supports a 128K-token context length.
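A sketch of the generation step (the temperature setting is an assumption), printing one completion to inspect it:

```python
for item in eval_set:
    # Generate a summary for each prompt with Command-R.
    response = co.chat(model="command-r", message=item["prompt"], temperature=0.3)
    item["completion"] = response.text

print(eval_set[0]["completion"])
```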
PhD D is transcribing recorded sessions to locate overlapping speech zones and categorizing them as acoustic events. The team discusses the parameters PhD D should use and how to define these events, considering the number of speakers and silence.
Let’s grade the completions using our LLM and non-LLM checks.
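A sketch that applies all five checks to each completion and collects the results in a dataframe (pandas is assumed to be installed):

```python
import pandas as pd

for item in eval_set:
    item["format_score"] = format_score(item["completion"], item["eval_metadata"])
    item["length_score"] = length_score(item["completion"], item["eval_metadata"])
    for criterion in ("completeness", "correctness", "conciseness"):
        item[f"{criterion}_score"] = llm_grade(
            criterion, item["transcript"], item["completion"]
        )

df = pd.DataFrame(eval_set)
df
```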
|   | instruction | eval_metadata | objective | transcript | prompt | transcript_token_len | completion | format_score | length_score | completeness_score | correctness_score | conciseness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Summarize the meeting based on the transcript… | {'format': 'paragraphs', 'min_length': 10, 'ma… | general_summarization | PhD F: As opposed to the rest of us \nPhD D: W… | ## meeting transcript\nPhD F: As opposed to th… | 1378 | PhD D is transcribing recorded sessions to loc… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 1 | Summarize the meeting based on the transcript… | {'format': 'paragraphs', 'min_length': 50, 'ma… | general_summarization | Lynne Neagle AM: Thank you very much And the n… | ## meeting transcript\nLynne Neagle AM: Thank … | 1649 | The discussion focused on the impact of COVID1… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 2 | Summarize the meeting based on the transcript… | {'format': 'bullets', 'number': 3, 'min_length… | general_summarization | Industrial Designer: Yep So we are to mainly d… | ## meeting transcript\nIndustrial Designer: Ye… | 1100 | - The team is designing a remote control with … | 1 | 0 | 0.8 | 1.0 | 0.8 |
| 3 | Summarize the meeting based on the transcript… | {'format': 'bullets', 'number': 2, 'min_length… | general_summarization | Industrial Designer: Mm I think one of the ver… | ## meeting transcript\nIndustrial Designer: Mm… | 2618 | - The team discusses the target demographic fo… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 4 | What are the follow-up items based on the meet… | {'format': 'bullets', 'number': 3, 'min_length… | action_items | Marketing: so a lot of people have to be able … | ## meeting transcript\nMarketing: so a lot of … | 2286 | - Investigate how the remote will interact wit… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 5 | What are the follow-up items based on the meet… | {'format': 'bullets', 'number': 2, 'min_length… | action_items | Project Manager: Alright So finance And we wil… | ## meeting transcript\nProject Manager: Alrigh… | 1965 | - The project manager will send the updated de… | 1 | 1 | 0.8 | 1.0 | 0.8 |
Finally, let’s print the average scores per criterion.
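For example, assuming the `df` dataframe built above:

```python
score_columns = [
    "format_score", "length_score",
    "completeness_score", "correctness_score", "conciseness_score",
]
# Average each score across the evaluation set.
print(df[score_columns].mean())
```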