Summarization Evals

Summarization Evals

In this cookbook, we will be demonstrating an approach we use for evaluating summarization tasks using LLM evaluation.

Table of contents:

  1. Get started
  2. Construct the evaluation dataset
  3. Build the evaluation framework
  4. Run evaluations


Get Started

You'll need a Cohere API key to run this notebook. If you don't have a key, head to https://cohere.com/ to generate your key.

!pip install cohere datasets --quiet
import json
import random
import re
from typing import List, Optional

import cohere
from getpass import getpass
from datasets import load_dataset
import pandas as pd

co_api_key = getpass("Enter your Cohere API key: ")
co_model = "command-r"
co = cohere.Client(api_key=co_api_key)

As test data, we'll use transcripts from the QMSum dataset. Note that in addition to the transcripts, this dataset also contains reference summaries -- we will use only the transcripts as our approach is reference-free.

qmsum = load_dataset("MocktaiLEngineer/qmsum-processed", split="validation")
transcripts = [x for x in qmsum["meeting_transcript"] if x is not None]
Generating train split:   0%|          | 0/1095 [00:00<?, ? examples/s]



Generating validation split:   0%|          | 0/237 [00:00<?, ? examples/s]



Generating test split:   0%|          | 0/244 [00:00<?, ? examples/s]


Construct the evaluation dataset

We are interested in evaluating summarization in real-world, enterprise use cases, which typically have two distinguishing features as compared to academic summarization benchmarks:

  • Enterprise use cases often focus on specific summarization objectives, e.g. "summarize action items".
  • Enterprise use cases often feature specific instruction constraints, e.g. "summarize in bullets with each bullet under 20 words".

Therefore, we must first create a dataset that contains diverse summarization prompts. We will do this programmatically by building prompts from their components, as defined below:

  • Prompt = text (e.g. transcript to be summarized) + instruction
  • Instruction = instruction objective (e.g. "summarize action items") + modifiers
  • Modifiers = format/length modifiers (e.g. "use bullets") + style/tone modifiers (e.g. "do not mention names") + ...

First, we define the prompt that combines the text and instructions. Here, we use a very basic prompt:

prompt_template = """## meeting transcript
{transcript}

## instructions
{instructions}"""

Next, we build the instructions. Because each instruction may have a different objective and modifiers, we track them using metadata. This will later be required for evaluation (i.e. to know what the prompt is asking).


instruction_objectives = {
    "general_summarization": "Summarize the meeting based on the transcript.",
    "action_items": "What are the follow-up items based on the meeting transcript?",
}

format_length_modifiers = {
    "paragraphs_short": {
        "text": "In paragraph form, output your response. Use at least 10 words and at most 50 words in total.",
        "objectives": ["general_summarization"],
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 10,
            "max_length": 50,
        },
    },
    "paragraphs_medium": {
        "text": "Return the answer in the form of paragraphs. Make sure your answer is between 50 and 200 words long.",
        "objectives": ["general_summarization"],
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 50,
            "max_length": 200,
        },
    },
    "bullets_short_3": {
        "text": "Format your answer in the form of bullets. Use exactly 3 bullets. Each bullet should be at least 10 words and at most 20 words.",
        "objectives": ["general_summarization", "action_items"],
        "eval_metadata": {
            "format": "bullets",
            "number": 3,
            "min_length": 10,
            "max_length": 20,
        },
    },
    "bullets_medium_2": {
        "text": "In bullets, output your response. Make sure to use exactly 2 bullets. Make sure each bullet is between 20 and 80 words long.",
        "objectives": ["general_summarization", "action_items"],
        "eval_metadata": {
            "format": "bullets",
            "number": 2,
            "min_length": 20,
            "max_length": 80,
        },
    },
}

Let's combine the objectives and format/length modifiers to finish building the instructions.

instructions = []
for obj_name, obj_text in instruction_objectives.items():
    for mod_data in format_length_modifiers.values():
        for mod_obj in mod_data["objectives"]:
            if mod_obj == obj_name:
                instruction = {
                        "instruction": f"{obj_text} {mod_data['text']}",
                        "eval_metadata": mod_data["eval_metadata"],
                        "objective": obj_name,
                    }
                instructions.append(instruction)

print(json.dumps(instructions[:2], indent=4))
[
    {
        "instruction": "Summarize the meeting based on the transcript. In paragraph form, output your response. Use at least 10 words and at most 50 words in total.",
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 10,
            "max_length": 50
        },
        "objective": "general_summarization"
    },
    {
        "instruction": "Summarize the meeting based on the transcript. Return the answer in the form of paragraphs. Make sure your answer is between 50 and 200 words long.",
        "eval_metadata": {
            "format": "paragraphs",
            "min_length": 50,
            "max_length": 200
        },
        "objective": "general_summarization"
    }
]

Finally, let's build the final prompts by semi-randomly pairing the instructions with transcripts from the QMSum dataset.

data = pd.DataFrame(instructions)

transcripts = sorted(transcripts, key=lambda x: len(x), reverse=True)[:int(len(transcripts) * 0.25)]
random.seed(42)
random.shuffle(transcripts)
data["transcript"] = transcripts[:len(data)]

data["prompt"] = data.apply(lambda x: prompt_template.format(transcript=x["transcript"], instructions=x["instruction"]), axis=1)
data["transcript_token_len"] = [len(x) for x in co.batch_tokenize(data["transcript"].tolist(), model=co_model)]
print(data["prompt"][0])
## meeting transcript
PhD F: As opposed to the rest of us 
PhD D: Well comment OK I I remind that me my first objective eh in the project is to to study difference parameters to to find a a good solution to detect eh the overlapping zone in eh speech recorded But eh tsk comment ehhh comment In that way comment I I I begin to to study and to analyze the ehn the recorded speech eh the different session to to find and to locate and to mark eh the the different overlapping zone And eh so eh I was eh I am transcribing the the first session and I I have found eh eh one thousand acoustic events eh besides the overlapping zones eh I I I mean the eh breaths eh aspiration eh eh talk eh eh clap eh comment I do not know what is the different names eh you use to to name the the pause n speech 
Grad G: Oh I do not think we ve been doing it at that level of detail So 
PhD D: Eh I I I do I do not need to to to mmm to m to label the the different acoustic but I prefer because eh I would like to to study if eh I I will find eh eh a good eh parameters eh to detect overlapping I would like to to to test these parameters eh with the another eh eh acoustic events to nnn to eh to find what is the ehm the false eh the false eh hypothesis eh nnn which eh are produced when we use the the ehm this eh parameter eh I mean pitch eh eh difference eh feature 
PhD A: You know I think some of these that are the nonspeech overlapping events may be difficult even for humans to tell that there s two there I mean if it s a tapping sound you would not necessarily or you know something like that it would be it might be hard to know that it was two separate events 
Grad G: Well You were not talking about just overlaps were you ? You were just talking about acoustic events 
PhD D: I I I I t I t I talk eh about eh acoustic events in general but eh my my objective eh will be eh to study eh overlapping zone Eh ? comment n Eh in twelve minutes I found eh eh one thousand acoustic events 
Professor E: How many overlaps were there in it ? No no how many of them were the overlaps of speech though ? 
PhD D: How many ? Eh almost eh three hundred eh in one session in five eh in forty five minutes Alm Three hundred overlapping zone With the overlapping zone overlapping speech speech what eh different duration 
Postdoc B: Does this ? So if you had an overlap involving three people how many times was that counted ? 
PhD D: three people two people Eh I would like to consider eh one people with difference noise eh in the background be 
Professor E: No no but I think what she s asking is pause if at some particular for some particular stretch you had three people talking instead of two did you call that one event ? 
PhD D: Oh Oh I consider one event eh for th for that eh for all the zone This th I I I con I consider I consider eh an acoustic event the overlapping zone the period where three speaker or eh are talking together 
Grad G: So let s say me and Jane are talking at the same time and then Liz starts talking also over all of us How many events would that be ? 
PhD D: So I do not understand 
Grad G: So two people are talking comment and then a third person starts talking Is there an event right here ? 
PhD D: Eh no No no For me is the overlapping zone because because you you have s you have more one eh more one voice eh eh produced in a in in a moment 
Grad G: So i if two or more people are talking 
Professor E: OK So I think We just wanted to understand how you are defining it So then in the region between since there there is some continuous region in between regions where there is only one person speaking And one contiguous region like that you are calling an event Is it Are you calling the beginning or the end of it the event or are you calling the entire length of it the event ? 
PhD D: I consider the the nnn the nnn nnn eh the entirety eh eh all all the time there were the voice has overlapped This is the idea But eh I I do not distinguish between the the numbers of eh speaker I m not considering eh the the ehm eh the fact of eh eh for example what did you say ? Eh at first eh eh two talkers are eh speaking and eh eh a third person eh join to to that For me it s eh it s eh all overlap zone with eh several numbers of speakers is eh eh the same acoustic event Wi but without any mark between the zone of the overlapping zone with two speakers eh speaking together and the zone with the three speakers 
Postdoc B: That would j just be one 
PhD D: Eh with eh a beginning mark and the ending mark Because eh for me is the is the zone with eh some kind of eh distortion the spectral I do not mind By the moment by the moment 
Grad G: Well but But you could imagine that three people talking has a different spectral characteristic than two 
PhD D: I I do not but eh but eh I have to study comment What will happen in a general way 
Grad G: So You had to start somewhere 
PhD C: So there s a lot of overlap 
PhD D: I I do not know what eh will will happen with the 
Grad G: That s a lot of overlap 
Professor E: So again that s that s three three hundred in forty five minutes that are that are speakers just speakers 
Postdoc B: But a a a th 
Professor E: So that s about eight per minute 
Postdoc B: But a thousand events in twelve minutes that s 
PhD C: But that can include taps 
Postdoc B: Well but a thousand taps in eight minutes is a l in twelve minutes is a lot 
PhD D: I I con I consider I consider acoustic events eh the silent too 
Grad G: Silence starting or silence ending 
PhD D: silent ground to bec to detect eh because I consider acoustic event all the things are not eh speech In ge in in in a general point of view 
Professor E: OK so how many of those thousand were silence ? 
PhD F: Not speech not speech or too much speech 
Professor E: Right So how many of those thousand were silence silent sections ? 
PhD D: silent I I I I do not I I have not the eh I I would like to to do a stylistic study

## instructions
Summarize the meeting based on the transcript. In paragraph form, output your response. Use at least 10 words and at most 50 words in total.


Build the evaluation framework

We now setup the tools we will use for evaluation.

We use three criteria that are graded using LLMs:

  • Completeness: checks if the summary includes all the important information from the original text that it should include
  • Correctness: checks if there are any hallucinations or factual inaccuracies in the summary
  • Conciseness: checks if the summary includes any unnecessary information or wordiness

In this cookbook, we will use Command-R to grade the completions. However, note that in practice, we typically use an ensemble of multiple LLM evaluators to reduce any bias.


grading_prompt_template = """You are an AI grader that given a prompt, a completion, and a criterion, grades the completion based on the prompt and criterion. Below is a prompt, a completion, and a criterion with which to grade the completion. You need to respond according to the criterion instructions.

PROMPT:
{prompt}

COMPLETION:
{completion}

CRITERION:
{criteria}

Do not provide an explanation or respond with additional text. Only respond with the final score in the format SCORE: <final_score>

SCORE:"""


criteria_completeness = """Please judge the above completion on whether or not all the completion fully addresses the task posed in the prompt. The completion should contain all the information that is asked for by the prompt while also upholding the constraints mentioned in the prompt. Please return a grade between 1 and 5, where 1 means the completion does not address the prompt at all, 2 means the completion mostly does not address the prompt, 3 is neutral, 4 means the completion mostly addresses the prompt, and 5 means the completion fully addresses the prompt."""

criteria_correctness = """Please judge the above completion on whether or not all information was derived from the prompt and that no information is false. Return a grade between 1 and 5, where 1 means the answers are entirely untruthful relative to the prompt, 2 means the answers are somewhat untruthful relative to the prompt, 3 means neutral, 4 means the answers are somewhat truthful relative to the prompt, and 5 means the answers are entirely truthful relative to the prompt. Evaluate only the truthfulness of the answers, not whether or not they capture all the relevant information in the prompt."""

criteria_conciseness = """Please judge the above completion on whether or not the completion contains any unnecessary information or wordiness that does not help answer the specific instruction given in the prompt. Return a grade between 1 and 5, where 1 means the completion contains many unnecessary details and wordiness that do not answer the specific instruction given in the prompt, 2 means the completion contains some unnecessary details or wordiness, 3 means neutral, 4 means the completion contains few unnecessary details or wordiness, and 5 means the completion contains only necessary details that answer the specific instruction given in the prompt."""


def score_llm(prompt: str, completion: str, criteria: str) -> int:
    """
    Score a completion based on a prompt and a criterion using LLM Because we
    grade all completions on a scale of 1-5, we will normalize the scores by 5 so that the final score
    is between 0 and 1.
    """
    grading_prompt = grading_prompt_template.format(
        prompt=prompt, completion=completion, criteria=criteria
    )
    # Use Cohere to grade the completion
    completion = co.chat(message=grading_prompt, model=co_model, temperature=0.2).text

    ### Alternatively, use OpenAI to grade the completion (requires key)
    # import openai
    # completion = openai.OpenAI(api_key="INSERT OPENAI KEY HERE").chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": grading_prompt}],
    #     temperature=0.2,
    # ).choices[0].message.content

    # Extract the score from the completion
    score = float(re.search(r"[12345]", completion).group()) / 5
    return score

In addition, we have two criteria that are graded programmatically:

  • Format: checks if the summary follows the format (e.g. bullets) that was requested in the prompt
  • Length: checks if the summary follows the length that was requested in the prompt.

def score_format(completion: str, format_type: str) -> int:
    """
    Returns 1 if the completion is in the correct format, 0 otherwise.
    """
    if format_type == "paragraphs":
        return int(_is_only_paragraphs(completion))
    elif format_type == "bullets":
        return int(_is_only_bullets(completion))
    return 0

def score_length(
    completion: str,
    format_type: str,
    min_val: int,
    max_val: int,
    number: Optional[int] = None
) -> int:
    """
    Returns 1 if the completion has the correct length for the given format, 0 otherwise. This
    includes both word count and number of items (optional).
    """
    # Split into items (each bullet for bullets or each paragraph for paragraphs)
    if format_type == "bullets":
        items = _extract_markdown_bullets(completion, include_bullet=False)
    elif format_type == "paragraphs":
        items = completion.split("\n")

    # Strip whitespace and remove empty items
    items = [item for item in items if item.strip() != ""]

    # Check number of items if provided
    if number is not None and len(items) != number:
        return 0

    # Check length of each item
    for item in items:
        num_words = item.strip().split()
        if min_val is None and len(num_words) > max_val:
            return 0
        elif max_val is None and len(num_words) < min_val:
            return 0
        elif not min_val <= len(num_words) <= max_val:
            return 0
    return 1


def _is_only_bullets(text: str) -> bool:
    """
    Returns True if text is only markdown bullets.
    """
    bullets = _extract_markdown_bullets(text, include_bullet=True)

    for bullet in bullets:
        text = text.replace(bullet, "")

    return text.strip() == ""


def _is_only_paragraphs(text: str) -> bool:
    """
    Returns True if text is only paragraphs (no bullets).
    """
    bullets = _extract_markdown_bullets(text, include_bullet=True)

    return len(bullets) == 0


def _extract_markdown_bullets(text: str, include_bullet: bool = False) -> List[str]:
    """
    Extracts markdown bullets from text as a list. If include_bullet is True, the bullet will be
    included in the output. The list of accepted bullets is: -, *, +, •, and any number followed by
    a period.
    """
    if include_bullet:
        return re.findall(r"^[ \t]*(?:[-*+•]|[\d]+\.).*\w+.*$", text, flags=re.MULTILINE)
    return re.findall(r"^[ \t]*(?:[-*+•]|[\d]+\.)(.*\w+.*)$", text, flags=re.MULTILINE)


Run evaluations

Now that we have our evaluation dataset and defined our evaluation functions, let's run evaluations!

First, we generate completions to be graded. We will use Cohere's Command-R model, boasting a context length of 128K.

completions = []
for prompt in data["prompt"]:
    completion = co.chat(message=prompt, model="command-r", temperature=0.2).text
    completions.append(completion)

data["completion"] = completions
print(data["completion"][0])
PhD D is transcribing recorded sessions to locate overlapping speech zones and categorizing them as acoustic events. The team discusses the parameters PhD D should use and how to define these events, considering the number of speakers and silence.

Let's grade the completions using our LLM and non-LLM checks.

data["format_score"] = data.apply(
    lambda x: score_format(x["completion"], x["eval_metadata"]["format"]), axis=1
)

data["length_score"] = data.apply(
    lambda x: score_length(
        x["completion"],
        x["eval_metadata"]["format"],
        x["eval_metadata"].get("min_length"),
        x["eval_metadata"].get("max_length"),
    ),
    axis=1,
)

data["completeness_score"] = data.apply(
    lambda x: score_llm(x["prompt"], x["completion"], criteria_completeness), axis=1
)

data["correctness_score"] = data.apply(
    lambda x: score_llm(x["prompt"], x["completion"], criteria_correctness), axis=1
)

data["conciseness_score"] = data.apply(
    lambda x: score_llm(x["prompt"], x["completion"], criteria_conciseness), axis=1
)
data
instruction eval_metadata objective transcript prompt transcript_token_len completion format_score length_score completeness_score correctness_score conciseness_score
0 Summarize the meeting based on the transcript.... {'format': 'paragraphs', 'min_length': 10, 'ma... general_summarization PhD F: As opposed to the rest of us \nPhD D: W... ## meeting transcript\nPhD F: As opposed to th... 1378 PhD D is transcribing recorded sessions to loc... 1 1 0.8 1.0 0.8
1 Summarize the meeting based on the transcript.... {'format': 'paragraphs', 'min_length': 50, 'ma... general_summarization Lynne Neagle AM: Thank you very much And the n... ## meeting transcript\nLynne Neagle AM: Thank ... 1649 The discussion focused on the impact of COVID1... 1 1 0.8 1.0 0.8
2 Summarize the meeting based on the transcript.... {'format': 'bullets', 'number': 3, 'min_length... general_summarization Industrial Designer: Yep So we are to mainly d... ## meeting transcript\nIndustrial Designer: Ye... 1100 - The team is designing a remote control with ... 1 0 0.8 1.0 0.8
3 Summarize the meeting based on the transcript.... {'format': 'bullets', 'number': 2, 'min_length... general_summarization Industrial Designer: Mm I think one of the ver... ## meeting transcript\nIndustrial Designer: Mm... 2618 - The team discusses the target demographic fo... 1 1 0.8 1.0 0.8
4 What are the follow-up items based on the meet... {'format': 'bullets', 'number': 3, 'min_length... action_items Marketing: so a lot of people have to be able ... ## meeting transcript\nMarketing: so a lot of ... 2286 - Investigate how the remote will interact wit... 1 1 0.8 1.0 0.8
5 What are the follow-up items based on the meet... {'format': 'bullets', 'number': 2, 'min_length... action_items Project Manager: Alright So finance And we wil... ## meeting transcript\nProject Manager: Alrigh... 1965 - The project manager will send the updated de... 1 1 0.8 1.0 0.8

Finally, let's print the average scores per critiera.

avg_scores = data[["format_score", "length_score", "completeness_score", "correctness_score", "conciseness_score"]].mean()
print(avg_scores)
format_score          1.000000
length_score          0.833333
completeness_score    0.800000
correctness_score     1.000000
conciseness_score     0.800000
dtype: float64