Long-Form Text Strategies with Cohere

Ania Bialas

Large Language Models (LLMs) are becoming increasingly capable of comprehending text and excel, among other things, at document analysis. The new Cohere model, Command-R, has a context length of 128k tokens, which makes it particularly effective for such tasks. Nevertheless, even with the extended context window, some documents might be too long to fit in full.

In this cookbook, we’ll explore techniques to address cases when relevant information doesn’t fit in the model context window.

We’ll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a “text rank” approach we use internally at Cohere.

Summary

Approach	Description	Pros	Cons	When to use?
Truncation	Truncate the document to fit the context window.	- Simplicity of implementation (does not rely on external infrastructure)	- Loses information at the end of the document	Utilize when all relevant information is contained at the beginning of the document.
Query Based Retrieval	Utilize semantic similarity to retrieve text chunks that are most relevant to the query.	- Focuses on sections directly relevant to the query	- Relies on a semantic similarity algorithm. - Might lose broader context	Employ when seeking specific information within the text.
Text Rank	Apply graph theory to generate a cohesive set of chunks that effectively represent the document.	- Preserves the broader picture.	- Might lose detailed information.	Utilize in summaries and when the question requires broader context.

Getting Started

PYTHON

%%capture
!pip install cohere
!pip install python-dotenv
!pip install tokenizers
!pip install langchain
!pip install nltk
!pip install networkx
!pip install pypdf2

PYTHON

import os
import requests
from collections import deque
from typing import List, Tuple

import cohere

import numpy as np

import PyPDF2
from dotenv import load_dotenv

from tokenizers import Tokenizer

import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization
from nltk.tokenize import sent_tokenize

import networkx as nx
from getpass import getpass
from IPython.display import HTML, display

Output

[nltk_data] Downloading package punkt to
[nltk_data]     /home/anna_cohere_com/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

PYTHON

# Set up Cohere client
co_model = 'command-r'
co_api_key = getpass("Enter your Cohere API key: ")
co = cohere.Client(api_key=co_api_key)

PYTHON

def load_long_pdf(file_path):
    """
    Load a long PDF file and extract its text content.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The extracted text content of the PDF file.
    """
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        full_text = ''
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            full_text += page.extract_text()
    return full_text

def save_pdf_from_url(pdf_url, save_path):
    """
    Download a PDF from a URL and save it to a local file.

    Args:
        pdf_url (str): The URL of the PDF file.
        save_path (str): The local path to write the file to.
    """
    try:
        # Send a GET request to the PDF URL
        response = requests.get(pdf_url, stream=True)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Open the local file for writing in binary mode
        with open(save_path, 'wb') as file:
            # Write the content of the response to the local file
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF saved successfully to '{save_path}'")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")

In this example we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence from 26 January 2024.

PYTHON

# Download the PDF file from the URL
pdf_url = 'https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf'
save_path = 'example.pdf'
save_pdf_from_url(pdf_url, save_path)

# Load the PDF file and extract its text content
long_text = load_long_pdf(save_path)
long_text = long_text.replace('\n', ' ')

# Print the length of the document
print("Document length - #tokens:", len(co.tokenize(text=long_text, model=co_model).tokens))

Output

PDF saved successfully to 'example.pdf'
Document length - #tokens: 134184
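
The token count above already shows why the full document cannot be sent as is. Here is a quick back-of-the-envelope check, assuming that "128k" means 128,000 tokens (the exact figure is an assumption) and that some room is reserved for the generated reply. `fits_in_context` is a hypothetical helper, not part of the Cohere SDK.

```python
def fits_in_context(n_doc_tokens: int, context_limit: int, reserved_tokens: int = 0) -> bool:
    """Return True if the document plus the reserved tokens fits in the context window."""
    return n_doc_tokens + reserved_tokens <= context_limit


DOC_TOKENS = 134_184     # token count printed above
CONTEXT_LIMIT = 128_000  # assumed value of "128k"
RESERVED = 300           # illustrative budget for the generated reply

# How many tokens would have to be removed for the request to fit
overflow = DOC_TOKENS + RESERVED - CONTEXT_LIMIT

print("Fits in context:", fits_in_context(DOC_TOKENS, CONTEXT_LIMIT, RESERVED))
print("Tokens over budget:", overflow)
```

Under these assumptions the document is several thousand tokens too long even before the prompt template is added.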

Summarizing the text

PYTHON

def generate_response(message, max_tokens=300, temperature=0.2, k=0):
  """
  A wrapper around the Cohere API to generate a response based on a given prompt.

  Args:
    message (str): The input message for generating the response.
    max_tokens (int, optional): The maximum number of tokens in the generated response. Defaults to 300.
    temperature (float, optional): Controls the randomness of the generated response. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.2) make it more deterministic. Defaults to 0.2.
    k (int, optional): Top-k sampling: only the k most likely tokens are considered at each step. A value of 0 disables top-k sampling. Defaults to 0.

  Returns:
    str: The generated response.

  """
  response = co.chat(
    model=co_model,
    message=message,
    max_tokens=max_tokens,
    temperature=temperature,
    k=k,  # previously documented but never passed to the API
    return_prompt=True,
  )
  return response.text

PYTHON

1 # Example summary prompt.
2 prompt_template = """
3 ## Instruction
4 Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.
5 
6 ## Document
7 {document}
8 
9 ## Summary
10 """.strip()

If you run the cell below, an error will occur. Therefore, in the following sections, we will explore some techniques to address this limitation.

Error: CohereAPIError: too many tokens

PYTHON

prompt = prompt_template.format(document=long_text)
# print(generate_response(message=prompt))

Approach 1: Truncate

First we try to truncate the document so that it meets the length constraints. This approach is simple to implement and understand. However, it drops potentially important information contained towards the end of the document.

PYTHON

# The new Cohere model has a context limit of 128k tokens. However, for the purpose of this exercise, we will assume a smaller context window.
# Employing a smaller context window also has the additional benefit of reducing the cost per request, especially if billed by the number of tokens.

MAX_TOKENS = 40000

def truncate(long: str, max_tokens: int) -> str:
    """
    Shortens `long` by brutally truncating it to the first `max_tokens` tokens.
    This can break up sentences, passages, etc.
    """

    tokenized = co.tokenize(text=long, model=co_model).token_strings
    truncated = tokenized[:max_tokens]
    short = "".join(truncated)
    return short
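
Because `truncate` cuts at an arbitrary token, the last sentence is often left incomplete. One possible refinement, sketched below, is to keep only whole sentences within the budget. To stay self-contained, the sketch approximates token counts by whitespace-separated words and splits sentences with a simple regular expression. `count_tokens` and `truncate_at_sentence` are hypothetical helpers; in the notebook you would likely swap in `co.tokenize` and `sent_tokenize`.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_at_sentence(
    text: str,
    max_tokens: int,
    token_counter: Callable[[str], int] = count_tokens,
) -> str:
    """Keep leading sentences while the running token count stays within max_tokens."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences: List[str] = re.split(r"(?<=[.!?])\s+", text.strip())
    kept: List[str] = []
    used = 0
    for sentence in sentences:
        cost = token_counter(sentence)
        if used + cost > max_tokens:
            break  # stop at the first sentence that does not fit
        kept.append(sentence)
        used += cost
    return " ".join(kept)


sample = "The regulation applies to providers. It also covers deployers. Fines are proposed."
print(truncate_at_sentence(sample, max_tokens=10))
```

The trade-off is the same as with plain truncation: everything after the cut-off is lost, but the text that remains ends on a sentence boundary.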

PYTHON

short_text = truncate(long_text, MAX_TOKENS)

prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt))

Output

The document discusses the impact of a specific protein, p53, on the process of angiogenesis, which is the growth of new blood vessels. Angiogenesis plays a critical role in various physiological processes, including wound healing and embryonic development. The presence of the p53 protein can inhibit angiogenesis by regulating the expression of certain genes and proteins. This inhibition can have significant implications for tumor growth, as angiogenesis is essential for tumor progression. Therefore, understanding the role of p53 in angiogenesis can contribute to our knowledge of tumor suppression and potential therapeutic interventions.

Additionally, the document mentions that the regulation of angiogenesis by p53 occurs independently of the protein’s role in cell cycle arrest and apoptosis, which are other key functions of p53 in tumor suppression. This suggests that p53 has a complex and multifaceted impact on cellular processes.

Approach 2: Query Based Retrieval

In this section we show how to use a query-based retrieval approach to answer the following question: What does the report say about biometric identification?

The solution is outlined below and can be broken down into four functional steps.

1. Chunk the text into units
   - Here we employ a simple chunking algorithm that groups consecutive sentences into passages.
2. Use a ranking algorithm to rank chunks against the query
   - We leverage another Cohere endpoint, co.rerank, to rank each chunk against the query.
3. Keep the most relevant chunks until the context limit is reached
   - co.rerank returns a relevance score for each chunk, which we use to select the most pertinent chunks.
4. Put the condensed text back in its original order
   - Finally, we arrange the chosen chunks in the sequence in which they appear in the document.
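
The steps above can be sketched end to end without any API calls. In the sketch below, `overlap_score` is a deliberately crude, hypothetical stand-in for `co.rerank` (it only counts shared words), and token counts are approximated by word counts. Only the control flow is meant to carry over to the real implementation.

```python
from typing import List, Tuple


def overlap_score(query: str, chunk: str) -> int:
    """Hypothetical relevance score: number of distinct query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)


def select_relevant_chunks(chunks: List[str], query: str, max_tokens: int) -> str:
    # Step 2: rank chunk indices by relevance, highest first.
    ranked = sorted(range(len(chunks)), key=lambda i: overlap_score(query, chunks[i]), reverse=True)

    # Step 3: keep the most relevant chunks until the budget is reached.
    selected: List[Tuple[int, str]] = []
    used = 0
    for idx in ranked:
        cost = len(chunks[idx].split())  # word count as a token proxy
        if used + cost > max_tokens:
            break
        selected.append((idx, chunks[idx]))
        used += cost

    # Step 4: restore the original document order.
    return " ".join(chunk for _, chunk in sorted(selected))


# Step 1 is assumed done: the document is already split into chunks.
chunks = [
    "The regulation sets out general provisions.",
    "Biometric identification in public spaces is restricted.",
    "Providers must keep records.",
    "Remote biometric identification requires authorisation.",
]
print(select_relevant_chunks(chunks, "remote biometric identification", max_tokens=12))
```

Note that the last chunk ranks highest for this query, yet it still appears after the second chunk in the result because the final step sorts by position rather than by score.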

See query_based_retrieval function for the starting point.

Query based retrieval implementation

PYTHON

def split_text_into_sentences(text) -> List[str]:
    """
    Split the input text into a list of sentences.
    """
    sentences = sent_tokenize(text)

    return sentences

def group_sentences_into_passages(sentence_list, n_sentences_per_passage=5):
    """
    Group sentences into passages of n_sentences_per_passage sentences.
    The final passage may be shorter if sentences are left over.
    """
    passages = []
    passage = ""
    for i, sentence in enumerate(sentence_list):
        passage += sentence + " "
        if (i + 1) % n_sentences_per_passage == 0:
            passages.append(passage)
            passage = ""
    # Keep leftover sentences that did not fill a complete passage
    if passage:
        passages.append(passage)
    return passages

def build_simple_chunks(text, n_sentences=5):
    """
    Build chunks of text from the input text.
    """
    sentences = split_text_into_sentences(text)
    chunks = group_sentences_into_passages(sentences, n_sentences_per_passage=n_sentences)
    return chunks

PYTHON

sentences = split_text_into_sentences(long_text)
passages = group_sentences_into_passages(sentences, n_sentences_per_passage=5)
print('Example sentence:', np.random.choice(np.asarray(sentences), size=1, replace=False))
print()
print('Example passage:', np.random.choice(np.asarray(passages), size=1, replace=False))

Output

Example sentence: ['4.']
Example passage: ['T echnical robustness and safety means that AI systems are developed  and used in a way that allows robustness in case of problems and resilience against  attempts to alter the use or performance of the AI system so as to allow unlawful use by  third parties, a nd minimise unintended harm. Privacy and data governance means that AI  systems are developed and used in compliance with existing privacy and data protection  rules, while processing data that meets high standards in terms of quality and integrity. Transpar ency means that AI systems are developed and used in a way that allows  appropriate traceability and explainability, while making humans aware that they  communicate or interact with an AI system, as well as duly informing deployers of the  capabilities and l imitations of that AI system and affected persons about their rights. Diversity, non - discrimination and fairness means that AI systems are developed and used  in a way that includes diverse actors and promotes equal access, gender equality and  cultural dive rsity, while avoiding discriminatory impacts and unfair biases that are  prohibited by Union or national law. Social and environmental well - being means that AI  systems are developed and used in a sustainable and environmentally friendly manner as  well as in   a way to benefit all human beings, while monitoring and assessing the long - term  impacts on the individual, society and democracy. ']

PYTHON

def _add_chunks_by_priority(
    chunks: List[str],
    idcs_sorted_by_priority: List[int],
    max_tokens: int,
) -> List[Tuple[int, str]]:
    """
    Given chunks of text and their indices sorted by priority (highest priority first), this function
    fills the model context window with as many highest-priority chunks as possible.

    The output is a list of (index, chunk) pairs, ordered by priority. To stitch back the chunks into
    a cohesive text that preserves chronological order, sort the output on its index.
    """

    selected = []
    num_tokens = 0
    idcs_queue = deque(idcs_sorted_by_priority)

    while num_tokens < max_tokens and len(idcs_queue) > 0:
        next_idx = idcs_queue.popleft()
        num_tokens += len(co.tokenize(text=chunks[next_idx], model=co_model).tokens)
        # keep index and chunk, to reorder chronologically
        selected.append((next_idx, chunks[next_idx]))
    # the last chunk added may have pushed us over the budget; drop it if so
    if num_tokens > max_tokens:
        selected.pop()

    return selected

def query_based_retrieval(
    long: str,
    max_tokens: int,
    query: str,
    n_sentences_per_passage: int = 5,
) -> str:
    """
    Performs query-based retrieval on a long text document.
    """
    # 1. Chunk text into units
    chunks = build_simple_chunks(long, n_sentences_per_passage)

    # 2. Use co.rerank to rank chunks vs. query
    chunks_reranked = co.rerank(query=query, documents=chunks, model="rerank-english-v3.0")
    idcs_sorted_by_relevance = [
        chunk.index for chunk in sorted(chunks_reranked.results, key=lambda c: c.relevance_score, reverse=True)
    ]

    # 3. Add chunks back in order of relevance
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_relevance, max_tokens)

    # 4. Put condensed text back in original order
    separator = " "
    short = separator.join([chunk for index, chunk in sorted(selected, key=lambda item: item[0], reverse=False)])
    return short
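
`_add_chunks_by_priority` calls `co.tokenize` once per candidate chunk and stops as soon as one chunk overflows the budget. A possible variation, sketched here under the assumption that token counts were computed up front, skips a chunk that does not fit and keeps trying lower-priority ones. `fill_budget` and its `token_counts` argument are hypothetical names, not part of the notebook.

```python
from typing import List, Sequence, Tuple


def fill_budget(
    chunks: Sequence[str],
    token_counts: Sequence[int],
    idcs_sorted_by_priority: Sequence[int],
    max_tokens: int,
) -> List[Tuple[int, str]]:
    """Greedily pick chunks by priority, skipping any chunk that would exceed max_tokens."""
    selected: List[Tuple[int, str]] = []
    used = 0
    for idx in idcs_sorted_by_priority:
        if used + token_counts[idx] > max_tokens:
            continue  # too large for the remaining budget; try the next one
        selected.append((idx, chunks[idx]))
        used += token_counts[idx]
    return selected


chunks = ["a", "b", "c", "d"]
token_counts = [50, 30, 40, 10]  # illustrative, precomputed token counts
picked = fill_budget(chunks, token_counts, idcs_sorted_by_priority=[2, 0, 1, 3], max_tokens=100)
print(sorted(picked))  # sorted by index, i.e. original document order
```

Skipping fills the window more completely, at the cost of sometimes admitting a lower-priority chunk in place of a larger, higher-priority one. Whether that is desirable depends on the task.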

PYTHON

1 # Example prompt
2 prompt_template = """
3 ## Instruction
4 {query}
5 
6 ## Document
7 {document}
8 
9 ## Answer
10 """.strip()

PYTHON

query = "What does the report say about biometric identification? Answer only based on the document."
short_text = query_based_retrieval(long_text, MAX_TOKENS, query)
prompt = prompt_template.format(query=query, document=short_text)
print(generate_response(message=prompt, max_tokens=300))

Output

The report discusses the restrictions on the use of biometric identification by law enforcement in publicly accessible spaces. According to the document, real-time biometric identification is prohibited unless in exceptional cases where its use is strictly necessary and proportionate to achieving a substantial public interest. The use of post-remote biometric identification systems is also mentioned, noting the requirements for authorization and limitations on its use.
The report also highlights the classification of certain AI systems as high-risk, including biometric identification systems, emotion recognition systems, and biometric categorisation systems, with the exception of systems used for biometric verification. High-risk AI systems are subject to specific requirements and obligations.

Approach 3: Text rank

In the final section we show how to leverage graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is: the higher the centrality, the more strongly the node is connected to the other nodes in the graph. In our case, a chunk with high centrality is similar to many other chunks, which makes it a good representative of the document as a whole.
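
As a small worked example, take the centrality of a chunk to be the sum of its similarities to all other chunks. The similarity values below are made up for illustration, and self-similarity is left out for simplicity.

```python
# Symmetric similarity matrix for three chunks (illustrative values only).
similarities = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.7],
    [0.1, 0.7, 1.0],
]

# Centrality of chunk i = sum of its similarities to every other chunk.
centralities = [
    sum(weight for j, weight in enumerate(row) if j != i)
    for i, row in enumerate(similarities)
]

# Chunk indices ordered from most to least central.
ranking = sorted(range(len(centralities)), key=lambda i: centralities[i], reverse=True)
print(ranking)
```

Chunk 1 is similar to both of its neighbours, so it ranks first, while chunks 0 and 2 are each strongly tied to only one other chunk.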

The solution presented in this document can be broken down into five functional steps:

1. Break the document into chunks.
   - This mirrors the first step in Approach 2.
2. Embed each chunk using an embedding model and construct a similarity matrix.
   - We use the co.embed endpoint.
3. Compute the centrality of each chunk.
   - We employ a package called NetworkX. It constructs a graph where the chunks are nodes, and the similarity score between them serves as the weight of the edges. Then, we calculate the centrality of each chunk as the sum of the edge weights adjacent to the node representing that chunk.
4. Retain the highest-centrality chunks until the context limit is reached.
   - This step follows a similar approach to Approach 2.
5. Reassemble the shortened text by reordering chunks in their original order.
   - This step mirrors the last step in Approach 2.
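
To make the five steps concrete without calling any API, here is a minimal sketch in plain Python. `toy_embed` is a hypothetical stand-in for `co.embed` that builds normalised bag-of-words vectors, word counts stand in for token counts, and self-similarity is ignored. It illustrates the control flow only and is not the implementation used in this notebook.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def words(text: str) -> List[str]:
    """Lowercase word tokens (letters and hyphens only)."""
    return re.findall(r"[a-z-]+", text.lower())


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for co.embed: an L2-normalised bag-of-words vector."""
    counts = Counter(words(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def dot(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as dicts."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def toy_text_rank(chunks: List[str], max_tokens: int) -> str:
    # Step 2: embed each chunk.
    vectors = [toy_embed(chunk) for chunk in chunks]
    n = len(chunks)
    # Step 3: centrality = sum of similarities to all other chunks.
    centrality = [sum(dot(vectors[i], vectors[j]) for j in range(n) if j != i) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    # Step 4: keep the most central chunks within the budget.
    kept: List[int] = []
    used = 0
    for idx in ranked:
        cost = len(words(chunks[idx]))  # word count as a token proxy
        if used + cost > max_tokens:
            break
        kept.append(idx)
        used += cost
    # Step 5: restore the original order.
    return " ".join(chunks[i] for i in sorted(kept))


# Step 1 is assumed done: the document is already split into chunks.
chunks = [
    "AI systems must be transparent.",
    "Fines apply for non-compliance.",
    "Providers of high-risk AI systems must keep records.",
    "Deployers keep records.",
]
print(toy_text_rank(chunks, max_tokens=13))
```

In this toy example the third chunk shares vocabulary with two others, so it is the most central. The chunk about fines shares no words with the rest and is dropped first.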

See text_rank as the starting point.

Text rank implementation

PYTHON

def text_rank(text: str, max_tokens: int, n_sentences_per_passage: int) -> str:
    """
    Shortens text by extracting key units of text from it based on their centrality.
    The output is the concatenation of those key units, in their original order.
    """

    # 1. Chunk text into units
    chunks = build_simple_chunks(text, n_sentences_per_passage)

    # 2. Embed and construct similarity matrix
    embeddings = np.array(
        co.embed(
            texts=chunks,
            model="embed-v4.0",
            input_type="clustering",
        ).embeddings
    )
    similarities = np.dot(embeddings, embeddings.T)

    # 3. Compute centrality and sort chunks by centrality
    # Easiest to use networkx's `degree` function with similarity as weight
    g = nx.from_numpy_array(similarities, edge_attr="weight")
    centralities = g.degree(weight="weight")
    idcs_sorted_by_centrality = [node for node, degree in sorted(centralities, key=lambda item: item[1], reverse=True)]

    # 4. Add chunks back in order of centrality
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_centrality, max_tokens)

    # 5. Put condensed text back in original order
    short = " ".join([chunk for index, chunk in sorted(selected, key=lambda item: item[0], reverse=False)])

    return short

PYTHON

1 # Example summary prompt.
2 prompt_template = """
3 ## Instruction
4 Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.
5 
6 ## Document
7 {document}
8 
9 ## Summary
10 """.strip()

PYTHON

short_text = text_rank(long_text, MAX_TOKENS, 5)
prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt, max_tokens=600))

Output

The document outlines the requirements and obligations for developing and deploying AI systems in the European Union. It aims to establish a regulatory framework to foster innovation while ensuring the protection of fundamental rights and public interests. The regulation applies to providers and deployers of AI systems, including those established outside the EU. High-risk AI systems are subject to specific requirements, such as risk management, data governance, and transparency. Providers must ensure compliance and keep records, and deployers must use AI systems responsibly. The regulation also establishes an AI Office, advisory bodies, and a database for high-risk AI systems. Additionally, it addresses issues like testing, codes of conduct, and cooperation with third countries. Fines and penalties are proposed for non-compliance.

Summary

In this notebook we presented three useful methods to overcome the limitations of context window size. In the following blog post, we talk more about how these methods can be evaluated.