Summarization Evals

In this cookbook, we demonstrate an approach we use to evaluate summarization tasks, combining LLM-based grading with programmatic checks.

Get Started

You’ll need a Cohere API key to run this notebook. If you don’t have a key, head to https://cohere.com/ to generate your key.

PYTHON
!pip install cohere datasets --quiet
PYTHON
import json
import random
import re
from typing import List, Optional

import cohere
from getpass import getpass
from datasets import load_dataset
import pandas as pd

co_api_key = getpass("Enter your Cohere API key: ")
co_model = "command-r"
co = cohere.Client(api_key=co_api_key)

As test data, we’ll use transcripts from the QMSum dataset. In addition to the transcripts, this dataset contains reference summaries, but we use only the transcripts because our approach is reference-free.

PYTHON
qmsum = load_dataset("MocktaiLEngineer/qmsum-processed", split="validation")
transcripts = [x for x in qmsum["meeting_transcript"] if x is not None]
Output
Generating train split: 0%| | 0/1095 [00:00<?, ? examples/s]
Generating validation split: 0%| | 0/237 [00:00<?, ? examples/s]
Generating test split: 0%| | 0/244 [00:00<?, ? examples/s]

Construct the evaluation dataset

We are interested in evaluating summarization in real-world, enterprise use cases, which typically have two distinguishing features as compared to academic summarization benchmarks:

  • Enterprise use cases often focus on specific summarization objectives, e.g. “summarize action items”.
  • Enterprise use cases often feature specific instruction constraints, e.g. “summarize in bullets with each bullet under 20 words”.

Therefore, we must first create a dataset that contains diverse summarization prompts. We will do this programmatically by building prompts from their components, as defined below:

  • Prompt = text (e.g. transcript to be summarized) + instruction
  • Instruction = instruction objective (e.g. “summarize action items”) + modifiers
  • Modifiers = format/length modifiers (e.g. “use bullets”) + style/tone modifiers (e.g. “do not mention names”) + …

First, we define the prompt that combines the text and instructions. Here, we use a very basic prompt:

PYTHON
prompt_template = """## meeting transcript
{transcript}

## instructions
{instructions}"""

Next, we build the instructions. Because each instruction may have a different objective and modifiers, we track them using metadata. This will later be required for evaluation (i.e. to know what the prompt is asking).

PYTHON
instruction_objectives = {
    "general_summarization": "Summarize the meeting based on the transcript.",
    "action_items": "What are the follow-up items based on the meeting transcript?",
}

format_length_modifiers = {
    "paragraphs_short": {
        "text": "In paragraph form, output your response. Use at least 10 words and at most 50 words in total.",
        "objectives": ["general_summarization"],
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 10,
            "max_length": 50,
        },
    },
    "paragraphs_medium": {
        "text": "Return the answer in the form of paragraphs. Make sure your answer is between 50 and 200 words long.",
        "objectives": ["general_summarization"],
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 50,
            "max_length": 200,
        },
    },
    "bullets_short_3": {
        "text": "Format your answer in the form of bullets. Use exactly 3 bullets. Each bullet should be at least 10 words and at most 20 words.",
        "objectives": ["general_summarization", "action_items"],
        "eval_metadata": {
            "format": "bullets",
            "number": 3,
            "min_length": 10,
            "max_length": 20,
        },
    },
    "bullets_medium_2": {
        "text": "In bullets, output your response. Make sure to use exactly 2 bullets. Make sure each bullet is between 20 and 80 words long.",
        "objectives": ["general_summarization", "action_items"],
        "eval_metadata": {
            "format": "bullets",
            "number": 2,
            "min_length": 20,
            "max_length": 80,
        },
    },
}

Let’s combine the objectives and format/length modifiers to finish building the instructions.

PYTHON
instructions = []
for obj_name, obj_text in instruction_objectives.items():
    for mod_data in format_length_modifiers.values():
        for mod_obj in mod_data["objectives"]:
            if mod_obj == obj_name:
                instruction = {
                    "instruction": f"{obj_text} {mod_data['text']}",
                    "eval_metadata": mod_data["eval_metadata"],
                    "objective": obj_name,
                }
                instructions.append(instruction)

print(json.dumps(instructions[:2], indent=4))
Output
[
    {
        "instruction": "Summarize the meeting based on the transcript. In paragraph form, output your response. Use at least 10 words and at most 50 words in total.",
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 10,
            "max_length": 50
        },
        "objective": "general_summarization"
    },
    {
        "instruction": "Summarize the meeting based on the transcript. Return the answer in the form of paragraphs. Make sure your answer is between 50 and 200 words long.",
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 50,
            "max_length": 200
        },
        "objective": "general_summarization"
    }
]

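Finally, let’s build the final prompts by semi-randomly pairing the instructions with transcripts from the QMSum dataset. We keep the longest 25% of the transcripts, shuffle them with a fixed seed, and assign one transcript to each instruction.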

PYTHON
data = pd.DataFrame(instructions)

transcripts = sorted(transcripts, key=lambda x: len(x), reverse=True)[:int(len(transcripts) * 0.25)]
random.seed(42)
random.shuffle(transcripts)
data["transcript"] = transcripts[:len(data)]

data["prompt"] = data.apply(lambda x: prompt_template.format(transcript=x["transcript"], instructions=x["instruction"]), axis=1)
PYTHON
data["transcript_token_len"] = [len(x) for x in co.batch_tokenize(data["transcript"].tolist(), model=co_model)]
PYTHON
print(data["prompt"][0])
Output
## meeting transcript
PhD F: As opposed to the rest of us
PhD D: Well comment OK I I remind that me my first objective eh in the project is to to study difference parameters to to find a a good solution to detect eh the overlapping zone in eh speech recorded But eh tsk comment ehhh comment In that way comment I I I begin to to study and to analyze the ehn the recorded speech eh the different session to to find and to locate and to mark eh the the different overlapping zone And eh so eh I was eh I am transcribing the the first session and I I have found eh eh one thousand acoustic events eh besides the overlapping zones eh I I I mean the eh breaths eh aspiration eh eh talk eh eh clap eh comment I do not know what is the different names eh you use to to name the the pause n speech
Grad G: Oh I do not think we ve been doing it at that level of detail So
PhD D: Eh I I I do I do not need to to to mmm to m to label the the different acoustic but I prefer because eh I would like to to study if eh I I will find eh eh a good eh parameters eh to detect overlapping I would like to to to test these parameters eh with the another eh eh acoustic events to nnn to eh to find what is the ehm the false eh the false eh hypothesis eh nnn which eh are produced when we use the the ehm this eh parameter eh I mean pitch eh eh difference eh feature
PhD A: You know I think some of these that are the nonspeech overlapping events may be difficult even for humans to tell that there s two there I mean if it s a tapping sound you would not necessarily or you know something like that it would be it might be hard to know that it was two separate events
Grad G: Well You were not talking about just overlaps were you ? You were just talking about acoustic events
PhD D: I I I I t I t I talk eh about eh acoustic events in general but eh my my objective eh will be eh to study eh overlapping zone Eh ? comment n Eh in twelve minutes I found eh eh one thousand acoustic events
Professor E: How many overlaps were there in it ? No no how many of them were the overlaps of speech though ?
PhD D: How many ? Eh almost eh three hundred eh in one session in five eh in forty five minutes Alm Three hundred overlapping zone With the overlapping zone overlapping speech speech what eh different duration
Postdoc B: Does this ? So if you had an overlap involving three people how many times was that counted ?
PhD D: three people two people Eh I would like to consider eh one people with difference noise eh in the background be
Professor E: No no but I think what she s asking is pause if at some particular for some particular stretch you had three people talking instead of two did you call that one event ?
PhD D: Oh Oh I consider one event eh for th for that eh for all the zone This th I I I con I consider I consider eh an acoustic event the overlapping zone the period where three speaker or eh are talking together
Grad G: So let s say me and Jane are talking at the same time and then Liz starts talking also over all of us How many events would that be ?
PhD D: So I do not understand
Grad G: So two people are talking comment and then a third person starts talking Is there an event right here ?
PhD D: Eh no No no For me is the overlapping zone because because you you have s you have more one eh more one voice eh eh produced in a in in a moment
Grad G: So i if two or more people are talking
Professor E: OK So I think We just wanted to understand how you are defining it So then in the region between since there there is some continuous region in between regions where there is only one person speaking And one contiguous region like that you are calling an event Is it Are you calling the beginning or the end of it the event or are you calling the entire length of it the event ?
PhD D: I consider the the nnn the nnn nnn eh the entirety eh eh all all the time there were the voice has overlapped This is the idea But eh I I do not distinguish between the the numbers of eh speaker I m not considering eh the the ehm eh the fact of eh eh for example what did you say ? Eh at first eh eh two talkers are eh speaking and eh eh a third person eh join to to that For me it s eh it s eh all overlap zone with eh several numbers of speakers is eh eh the same acoustic event Wi but without any mark between the zone of the overlapping zone with two speakers eh speaking together and the zone with the three speakers
Postdoc B: That would j just be one
PhD D: Eh with eh a beginning mark and the ending mark Because eh for me is the is the zone with eh some kind of eh distortion the spectral I do not mind By the moment by the moment
Grad G: Well but But you could imagine that three people talking has a different spectral characteristic than two
PhD D: I I do not but eh but eh I have to study comment What will happen in a general way
Grad G: So You had to start somewhere
PhD C: So there s a lot of overlap
PhD D: I I do not know what eh will will happen with the
Grad G: That s a lot of overlap
Professor E: So again that s that s three three hundred in forty five minutes that are that are speakers just speakers
Postdoc B: But a a a th
Professor E: So that s about eight per minute
Postdoc B: But a thousand events in twelve minutes that s
PhD C: But that can include taps
Postdoc B: Well but a thousand taps in eight minutes is a l in twelve minutes is a lot
PhD D: I I con I consider I consider acoustic events eh the silent too
Grad G: Silence starting or silence ending
PhD D: silent ground to bec to detect eh because I consider acoustic event all the things are not eh speech In ge in in in a general point of view
Professor E: OK so how many of those thousand were silence ?
PhD F: Not speech not speech or too much speech
Professor E: Right So how many of those thousand were silence silent sections ?
PhD D: silent I I I I do not I I have not the eh I I would like to to do a stylistic study
## instructions
Summarize the meeting based on the transcript. In paragraph form, output your response. Use at least 10 words and at most 50 words in total.

Build the evaluation framework

We now set up the tools we will use for evaluation.

We use three criteria that are graded using LLMs:

  • Completeness: checks if the summary includes all the important information from the original text that it should include
  • Correctness: checks if there are any hallucinations or factual inaccuracies in the summary
  • Conciseness: checks if the summary includes any unnecessary information or wordiness

In this cookbook, we will use Command-R to grade the completions. However, note that in practice, we typically use an ensemble of multiple LLM evaluators to reduce any bias.
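
As a rough sketch of the ensemble idea, the snippet below averages the normalized scores returned by several graders. The `ensemble_score` helper and the two stub graders are hypothetical names introduced for illustration only; in practice each grader would wrap a call to a different LLM.

```python
from statistics import mean
from typing import Callable, List

# A grader takes (prompt, completion, criteria) and returns a score in [0, 1].
Grader = Callable[[str, str, str], float]


def ensemble_score(graders: List[Grader], prompt: str, completion: str, criteria: str) -> float:
    """Average the scores of several graders for one completion (hypothetical helper)."""
    if not graders:
        raise ValueError("At least one grader is required")
    return mean(grader(prompt, completion, criteria) for grader in graders)


# Stub graders standing in for calls to different LLMs (illustrative only).
def lenient_grader(prompt: str, completion: str, criteria: str) -> float:
    return 1.0


def strict_grader(prompt: str, completion: str, criteria: str) -> float:
    return 0.5


score = ensemble_score([lenient_grader, strict_grader], "prompt", "completion", "criterion")
print(score)
```

Averaging is only one way to combine graders; a median or majority vote may be less sensitive to a single outlier.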

PYTHON
grading_prompt_template = """You are an AI grader that given a prompt, a completion, and a criterion, grades the completion based on the prompt and criterion. Below is a prompt, a completion, and a criterion with which to grade the completion. You need to respond according to the criterion instructions.

PROMPT:
{prompt}

COMPLETION:
{completion}

CRITERION:
{criteria}

Do not provide an explanation or respond with additional text. Only respond with the final score in the format SCORE: <final_score>

SCORE:"""


criteria_completeness = """Please judge the above completion on whether or not the completion fully addresses the task posed in the prompt. The completion should contain all the information that is asked for by the prompt while also upholding the constraints mentioned in the prompt. Please return a grade between 1 and 5, where 1 means the completion does not address the prompt at all, 2 means the completion mostly does not address the prompt, 3 is neutral, 4 means the completion mostly addresses the prompt, and 5 means the completion fully addresses the prompt."""

criteria_correctness = """Please judge the above completion on whether or not all information was derived from the prompt and that no information is false. Return a grade between 1 and 5, where 1 means the answers are entirely untruthful relative to the prompt, 2 means the answers are somewhat untruthful relative to the prompt, 3 means neutral, 4 means the answers are somewhat truthful relative to the prompt, and 5 means the answers are entirely truthful relative to the prompt. Evaluate only the truthfulness of the answers, not whether or not they capture all the relevant information in the prompt."""

criteria_conciseness = """Please judge the above completion on whether or not the completion contains any unnecessary information or wordiness that does not help answer the specific instruction given in the prompt. Return a grade between 1 and 5, where 1 means the completion contains many unnecessary details and wordiness that do not answer the specific instruction given in the prompt, 2 means the completion contains some unnecessary details or wordiness, 3 means neutral, 4 means the completion contains few unnecessary details or wordiness, and 5 means the completion contains only necessary details that answer the specific instruction given in the prompt."""


def score_llm(prompt: str, completion: str, criteria: str) -> float:
    """
    Score a completion based on a prompt and a criterion using an LLM. Because we
    grade all completions on a scale of 1-5, we normalize the scores by 5 so that
    the final score is between 0 and 1.
    """
    grading_prompt = grading_prompt_template.format(
        prompt=prompt, completion=completion, criteria=criteria
    )
    # Use Cohere to grade the completion
    grader_response = co.chat(message=grading_prompt, model=co_model, temperature=0.2).text

    ### Alternatively, use OpenAI to grade the completion (requires key)
    # import openai
    # grader_response = openai.OpenAI(api_key="INSERT OPENAI KEY HERE").chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": grading_prompt}],
    #     temperature=0.2,
    # ).choices[0].message.content

    # Extract the first digit from 1 to 5 in the grader's response
    match = re.search(r"[12345]", grader_response)
    if match is None:
        raise ValueError(f"No score found in grader response: {grader_response!r}")
    return float(match.group()) / 5
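
The regular expression above returns the first digit from 1 to 5 anywhere in the response, so a reply such as "On a scale of 1 to 5, SCORE: 4" would be read as 1. A slightly stricter parser could look for the `SCORE:` label first and only then fall back to a bare digit. The following is a minimal sketch under that assumption; `parse_score` is a hypothetical helper, not part of the Cohere SDK.

```python
import re
from typing import Optional


def parse_score(response: str) -> Optional[float]:
    """Return the grade normalized to [0, 1], or None if no grade from 1 to 5 is found."""
    # Prefer a digit that directly follows the "SCORE:" label.
    match = re.search(r"SCORE:\s*([1-5])\b", response, flags=re.IGNORECASE)
    if match is None:
        # Fall back to a response that consists of a bare digit, e.g. "4".
        match = re.fullmatch(r"\s*([1-5])\s*", response)
    if match is None:
        return None
    return int(match.group(1)) / 5


print(parse_score("SCORE: 4"))
print(parse_score("5"))
print(parse_score("On a scale of 1 to 5, SCORE: 3"))
print(parse_score("I cannot grade this."))
```

Returning `None` instead of raising lets the caller decide whether to retry the grading call or drop the example.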

In addition, we have two criteria that are graded programmatically:

  • Format: checks if the summary follows the format (e.g. bullets) that was requested in the prompt
  • Length: checks if the summary follows the length that was requested in the prompt.
PYTHON
def score_format(completion: str, format_type: str) -> int:
    """
    Returns 1 if the completion is in the correct format, 0 otherwise.
    """
    if format_type == "paragraphs":
        return int(_is_only_paragraphs(completion))
    elif format_type == "bullets":
        return int(_is_only_bullets(completion))
    return 0


def score_length(
    completion: str,
    format_type: str,
    min_val: Optional[int],
    max_val: Optional[int],
    number: Optional[int] = None,
) -> int:
    """
    Returns 1 if the completion has the correct length for the given format, 0 otherwise. This
    includes both word count and number of items (optional).
    """
    # Split into items (each bullet for bullets or each paragraph for paragraphs)
    if format_type == "bullets":
        items = _extract_markdown_bullets(completion, include_bullet=False)
    elif format_type == "paragraphs":
        items = completion.split("\n")
    else:
        # Unknown formats cannot be checked, so they fail the length check
        return 0

    # Remove empty items
    items = [item for item in items if item.strip() != ""]

    # Check number of items if provided
    if number is not None and len(items) != number:
        return 0

    # Check the word count of each item against whichever bounds are set
    for item in items:
        num_words = len(item.strip().split())
        if min_val is not None and num_words < min_val:
            return 0
        if max_val is not None and num_words > max_val:
            return 0
    return 1


def _is_only_bullets(text: str) -> bool:
    """
    Returns True if text is only markdown bullets.
    """
    bullets = _extract_markdown_bullets(text, include_bullet=True)

    for bullet in bullets:
        text = text.replace(bullet, "")

    return text.strip() == ""


def _is_only_paragraphs(text: str) -> bool:
    """
    Returns True if text is only paragraphs (no bullets).
    """
    bullets = _extract_markdown_bullets(text, include_bullet=True)

    return len(bullets) == 0


def _extract_markdown_bullets(text: str, include_bullet: bool = False) -> List[str]:
    """
    Extracts markdown bullets from text as a list. If include_bullet is True, the bullet will be
    included in the output. The list of accepted bullets is: -, *, +, •, and any number followed by
    a period.
    """
    if include_bullet:
        return re.findall(r"^[ \t]*(?:[-*+•]|[\d]+\.).*\w+.*$", text, flags=re.MULTILINE)
    return re.findall(r"^[ \t]*(?:[-*+•]|[\d]+\.)(.*\w+.*)$", text, flags=re.MULTILINE)
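
To see how the bullet pattern behaves, the self-contained sketch below applies the same regular expression to a small made-up completion. The sample text and the `BULLET_PATTERN` name are illustrative and not part of the evaluation dataset.

```python
import re

# Same pattern as in _extract_markdown_bullets (variant that drops the bullet marker).
BULLET_PATTERN = r"^[ \t]*(?:[-*+•]|[\d]+\.)(.*\w+.*)$"

sample = """Here are the action items:
- Send the updated design to the team
* Book a room for the next meeting
2. Review the budget
Not a bullet"""

# Each match is the text after the bullet marker; strip the leading space.
bullets = [b.strip() for b in re.findall(BULLET_PATTERN, sample, flags=re.MULTILINE)]
print(bullets)

# Word counts per bullet, as used by the length check.
word_counts = [len(b.split()) for b in bullets]
print(word_counts)
```

The introductory line and the last line are not matched, which is also why `_is_only_bullets` would reject a completion like this one.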

Run evaluations

Now that we have built our evaluation dataset and defined our evaluation functions, let’s run the evaluations!

First, we generate completions to be graded. We will use Cohere’s Command-R model, which has a context length of 128K tokens.

PYTHON
completions = []
for prompt in data["prompt"]:
    completion = co.chat(message=prompt, model="command-r", temperature=0.2).text
    completions.append(completion)

data["completion"] = completions
PYTHON
print(data["completion"][0])
Output
PhD D is transcribing recorded sessions to locate overlapping speech zones and categorizing them as acoustic events. The team discusses the parameters PhD D should use and how to define these events, considering the number of speakers and silence.

Let’s grade the completions using our LLM and non-LLM checks.

PYTHON
data["format_score"] = data.apply(
    lambda x: score_format(x["completion"], x["eval_metadata"]["format"]), axis=1
)

data["length_score"] = data.apply(
    lambda x: score_length(
        x["completion"],
        x["eval_metadata"]["format"],
        x["eval_metadata"].get("min_length"),
        x["eval_metadata"].get("max_length"),
        # Also check the requested number of bullets when the instruction sets one
        x["eval_metadata"].get("number"),
    ),
    axis=1,
)

data["completeness_score"] = data.apply(
    lambda x: score_llm(x["prompt"], x["completion"], criteria_completeness), axis=1
)

data["correctness_score"] = data.apply(
    lambda x: score_llm(x["prompt"], x["completion"], criteria_correctness), axis=1
)

data["conciseness_score"] = data.apply(
    lambda x: score_llm(x["prompt"], x["completion"], criteria_conciseness), axis=1
)
PYTHON
data
Output
| | instruction | eval_metadata | objective | transcript | prompt | transcript_token_len | completion | format_score | length_score | completeness_score | correctness_score | conciseness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Summarize the meeting based on the transcript… | {'format': 'paragraphs', 'min_length': 10, 'ma… | general_summarization | PhD F: As opposed to the rest of us \nPhD D: W… | ## meeting transcript\nPhD F: As opposed to th… | 1378 | PhD D is transcribing recorded sessions to loc… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 1 | Summarize the meeting based on the transcript… | {'format': 'paragraphs', 'min_length': 50, 'ma… | general_summarization | Lynne Neagle AM: Thank you very much And the n… | ## meeting transcript\nLynne Neagle AM: Thank … | 1649 | The discussion focused on the impact of COVID1… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 2 | Summarize the meeting based on the transcript… | {'format': 'bullets', 'number': 3, 'min_length… | general_summarization | Industrial Designer: Yep So we are to mainly d… | ## meeting transcript\nIndustrial Designer: Ye… | 1100 | - The team is designing a remote control with … | 1 | 0 | 0.8 | 1.0 | 0.8 |
| 3 | Summarize the meeting based on the transcript… | {'format': 'bullets', 'number': 2, 'min_length… | general_summarization | Industrial Designer: Mm I think one of the ver… | ## meeting transcript\nIndustrial Designer: Mm… | 2618 | - The team discusses the target demographic fo… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 4 | What are the follow-up items based on the meet… | {'format': 'bullets', 'number': 3, 'min_length… | action_items | Marketing: so a lot of people have to be able … | ## meeting transcript\nMarketing: so a lot of … | 2286 | - Investigate how the remote will interact wit… | 1 | 1 | 0.8 | 1.0 | 0.8 |
| 5 | What are the follow-up items based on the meet… | {'format': 'bullets', 'number': 2, 'min_length… | action_items | Project Manager: Alright So finance And we wil… | ## meeting transcript\nProject Manager: Alrigh… | 1965 | - The project manager will send the updated de… | 1 | 1 | 0.8 | 1.0 | 0.8 |

Finally, let’s print the average score for each criterion.

PYTHON
avg_scores = data[["format_score", "length_score", "completeness_score", "correctness_score", "conciseness_score"]].mean()
print(avg_scores)
Output
format_score 1.000000
length_score 0.833333
completeness_score 0.800000
correctness_score 1.000000
conciseness_score 0.800000
dtype: float64
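
Averages over the whole dataset can hide differences between objectives or formats. As a minimal sketch, the snippet below groups one score by objective using only the standard library. The `records` list is made-up illustrative data, standing in for rows exported with something like `data.to_dict("records")`.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration; real rows would come from the `data` DataFrame.
records = [
    {"objective": "general_summarization", "length_score": 1},
    {"objective": "general_summarization", "length_score": 0},
    {"objective": "action_items", "length_score": 1},
]

# Collect the scores of each objective, then average them.
scores_by_objective = defaultdict(list)
for row in records:
    scores_by_objective[row["objective"]].append(row["length_score"])

avg_by_objective = {name: mean(scores) for name, scores in scores_by_objective.items()}
print(avg_by_objective)
```

The same grouping could be applied to the format or to any other field tracked in the evaluation metadata.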