Long-Form Text Strategies with Cohere

Ania Bialas

Large Language Models (LLMs) are becoming increasingly capable of comprehending text, and they are particularly strong at document analysis. The new Cohere model, Command A, boasts a context length of 256k tokens, which makes it especially effective for such tasks. Nevertheless, even with this extended context window, some documents might be too lengthy to accommodate in full.

In this cookbook, we’ll explore techniques to address cases when relevant information doesn’t fit in the model context window.

We’ll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a “text rank” approach we use internally at Cohere.

Summary

| Approach | Description | Pros | Cons | When to use? |
| --- | --- | --- | --- | --- |
| Truncation | Truncate the document to fit the context window. | Simple to implement (does not rely on external infrastructure). | Loses information at the end of the document. | Use when all relevant information is contained at the beginning of the document. |
| Query-Based Retrieval | Use semantic similarity to retrieve the text chunks that are most relevant to the query. | Focuses on the sections directly relevant to the query. | Relies on a semantic similarity algorithm; might lose broader context. | Use when seeking specific information within the text. |
| Text Rank | Apply graph theory to generate a cohesive set of chunks that effectively represents the document. | Preserves the broader picture. | Might lose detailed information. | Use for summaries and for questions that require broader context. |

Getting Started

PYTHON
%%capture
!pip install cohere
!pip install python-dotenv
!pip install tokenizers
!pip install langchain
!pip install nltk
!pip install networkx
!pip install pypdf2
PYTHON
import os
import requests
from collections import deque
from typing import List, Tuple

import cohere

import numpy as np

import PyPDF2
from dotenv import load_dotenv

from tokenizers import Tokenizer

import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization
from nltk.tokenize import sent_tokenize

import networkx as nx
from getpass import getpass
Output
[nltk_data] Downloading package punkt to
[nltk_data] /home/anna_cohere_com/nltk_data...
[nltk_data] Package punkt is already up-to-date!
PYTHON
# Set up the Cohere client
co_model = 'command-a-03-2025'
co_api_key = getpass("Enter your Cohere API key: ")
co = cohere.Client(api_key=co_api_key)
PYTHON
def load_long_pdf(file_path):
    """
    Load a long PDF file and extract its text content.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The extracted text content of the PDF file.
    """
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        full_text = ''
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            full_text += page.extract_text()
    return full_text

def save_pdf_from_url(pdf_url, save_path):
    try:
        # Send a GET request to the PDF URL
        response = requests.get(pdf_url, stream=True)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Open the local file for writing in binary mode
        with open(save_path, 'wb') as file:
            # Write the content of the response to the local file
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF saved successfully to '{save_path}'")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")

In this example, we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence from 26 January 2024; the PDF is downloaded from the URL in the cell below.

PYTHON
# Download the PDF file from the URL
pdf_url = 'https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf'
save_path = 'example.pdf'
save_pdf_from_url(pdf_url, save_path)

# Load the PDF file and extract its text content
long_text = load_long_pdf(save_path)
long_text = long_text.replace('\n', ' ')

# Print the length of the document
print("Document length - #tokens:", len(co.tokenize(text=long_text, model=co_model).tokens))
Output
PDF saved successfully to 'example.pdf'
Document length - #tokens: 134184

Summarizing the text

PYTHON
def generate_response(message, max_tokens=300, temperature=0.2, k=0):
    """
    A wrapper around the Cohere API to generate a response based on a given prompt.

    Args:
        message (str): The input message for generating the response.
        max_tokens (int, optional): The maximum number of tokens in the generated response. Defaults to 300.
        temperature (float, optional): Controls the randomness of the generated response. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.2) make it more deterministic. Defaults to 0.2.
        k (int, optional): Controls the diversity of the generated response. Higher values (e.g., 5) make the output more diverse, while lower values (e.g., 0) make it more focused. Defaults to 0.

    Returns:
        str: The generated response.
    """
    response = co.chat(
        model=co_model,
        message=message,
        max_tokens=max_tokens,
        temperature=temperature,
        k=k,  # pass k through so the documented parameter actually takes effect
    )
    return response.text
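
As a quick, illustrative sanity check of the wrapper (the question below is an arbitrary example, not part of the pipeline):

PYTHON
# Illustrative usage of the generate_response wrapper defined above.
print(generate_response("In one sentence, what is summarization?", max_tokens=50))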
PYTHON
# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document
{document}

## Summary
""".strip()

If the document is longer than the model's context window, running the cell below fails with an error like the following:

Error: CohereAPIError: too many tokens

PYTHON
prompt = prompt_template.format(document=long_text)
# print(generate_response(message=prompt))

Therefore, in the following sections, we will explore some techniques to address this limitation.
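
One way to catch this before calling the model is to count tokens with co.tokenize first. Below is a minimal sketch; MAX_CONTEXT is an assumed limit that you should set to your model's actual context window:

PYTHON
# MAX_CONTEXT is an assumed limit for illustration; set it to your model's context window.
MAX_CONTEXT = 256_000

n_tokens = len(co.tokenize(text=prompt, model=co_model).tokens)
if n_tokens > MAX_CONTEXT:
    print(f"Prompt is {n_tokens} tokens and will not fit; shorten it first.")
else:
    print(generate_response(message=prompt))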

Approach 1: Truncation

First, we truncate the document so that it meets the length constraint. This approach is simple to implement and understand, but it drops potentially important information contained towards the end of the document.

PYTHON
# Command A has a context limit of 256k tokens. However, for the purpose of this exercise, we will assume a smaller context window.
# Employing a smaller context window also has the additional benefit of reducing the cost per request, especially if billed by the number of tokens.

MAX_TOKENS = 40000

def truncate(long: str, max_tokens: int) -> str:
    """
    Shortens `long` by brutally truncating it to the first `max_tokens` tokens.
    This can break up sentences, passages, etc.
    """
    tokenized = co.tokenize(text=long, model=co_model).token_strings
    truncated = tokenized[:max_tokens]
    short = "".join(truncated)
    return short
PYTHON
short_text = truncate(long_text, MAX_TOKENS)

prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt))

Output
The document is a proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (the AI Act). It establishes a risk-based framework for placing AI systems on the Union market: certain AI practices are prohibited outright, while high-risk AI systems are subject to requirements on risk management, data governance, transparency, and human oversight. Providers of high-risk systems must carry out conformity assessments before placing them on the market. The regulation aims to protect health, safety, and fundamental rights while supporting innovation.
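
If breaking the text mid-sentence is a concern, a slightly gentler variant (a sketch, not part of the pipeline above) truncates on sentence boundaries instead, reusing the sent_tokenize helper imported earlier:

PYTHON
def truncate_on_sentences(long: str, max_tokens: int) -> str:
    """
    Keeps whole leading sentences until the token budget is exhausted.
    Note: this makes one tokenize call per sentence, which is fine for a sketch.
    """
    kept = []
    num_tokens = 0
    for sentence in sent_tokenize(long):
        num_tokens += len(co.tokenize(text=sentence, model=co_model).tokens)
        if num_tokens > max_tokens:
            break
        kept.append(sentence)
    return " ".join(kept)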

Approach 2: Query Based Retrieval

In this section, we present how to leverage a query-based retrieval approach to generate an answer to the following question: What does the report say about biometric identification?

The solution is outlined below and can be broken down into four functional steps.

  1. Chunk the text into units

    • Here we employ a simple chunking algorithm; more sophisticated chunking strategies exist but are beyond the scope of this cookbook.
  2. Use a ranking algorithm to rank chunks against the query

    • We leverage another Cohere endpoint, co.rerank, to rank each chunk against the query.
  3. Keep the most relevant chunks until the context limit is reached

    • co.rerank returns a relevance score, facilitating the selection of the most pertinent chunks. We choose the most relevant chunks based on this score.
  4. Put the condensed text back in the original order

    • Finally, we arrange the chosen chunks in their original sequence as they appear in the document.

See query_based_retrieval function for the starting point.
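
Before diving into the implementation, it can help to see what co.rerank returns on its own. The two documents below are toy strings invented for illustration:

PYTHON
# Toy demonstration of the rerank endpoint; the documents are invented examples.
demo = co.rerank(
    query="What does the report say about biometric identification?",
    documents=[
        "Real-time biometric identification in public spaces is subject to strict conditions.",
        "Providers of high-risk AI systems must register them in an EU database.",
    ],
    model="rerank-english-v3.0",
)
for result in demo.results:
    print(result.index, round(result.relevance_score, 3))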

Query based retrieval implementation

PYTHON
def split_text_into_sentences(text) -> List[str]:
    """
    Split the input text into a list of sentences.
    """
    sentences = sent_tokenize(text)

    return sentences

def group_sentences_into_passages(sentence_list, n_sentences_per_passage=5):
    """
    Group sentences into passages of n_sentences_per_passage sentences.
    """
    passages = []
    passage = ""
    for i, sentence in enumerate(sentence_list):
        passage += sentence + " "
        if (i + 1) % n_sentences_per_passage == 0:
            passages.append(passage)
            passage = ""
    # Keep any trailing sentences that do not fill a complete passage
    if passage:
        passages.append(passage)
    return passages

def build_simple_chunks(text, n_sentences=5):
    """
    Build chunks of text from the input text.
    """
    sentences = split_text_into_sentences(text)
    chunks = group_sentences_into_passages(sentences, n_sentences_per_passage=n_sentences)
    return chunks
PYTHON
sentences = split_text_into_sentences(long_text)
passages = group_sentences_into_passages(sentences, n_sentences_per_passage=5)
print('Example sentence:', np.random.choice(np.asarray(sentences), size=1, replace=False))
print()
print('Example passage:', np.random.choice(np.asarray(passages), size=1, replace=False))
Output
Example sentence: ['The European Data Protection Supervisor may also establish an AI regulatory sandbox for the EU institutions, bodies and agencies and exercise the roles and the tasks of national competent authorities in accordance with this chapter.']
Example passage: ['This flexibility could mean, for example a decision by the provider to integrate a part of the necessary testing and reporting processes, information and documentation required under this Regulation into already existing documentation and procedu res required under the existing Union harmonisation legislation listed in Annex II, Section A. This however should not in any way undermine the obligation of the provider to comply with all the applicable requirements. (42a) The risk management system shou ld consist of a continuous, iterative process that is planned and run throughout the entire lifecycle of a high - risk AI system. This process should be aimed at identifying and mitigating the relevant risks of artificial intelligence systems on health, safe ty and fundamental rights. The risk management system should be regularly reviewed and updated to ensure its continuing effectiveness, as well as justification and documentation of any significant decisions and actions taken subject to this Regulation. ']
PYTHON
def _add_chunks_by_priority(
    chunks: List[str],
    idcs_sorted_by_priority: List[int],
    max_tokens: int,
) -> List[Tuple[int, str]]:
    """
    Given chunks of text and their indices sorted by priority (highest priority first), this function
    fills the model context window with as many highest-priority chunks as possible.

    The output is a list of (index, chunk) pairs, ordered by priority. To stitch back the chunks into
    a cohesive text that preserves chronological order, sort the output on its index.
    """
    selected = []
    num_tokens = 0
    idcs_queue = deque(idcs_sorted_by_priority)

    while num_tokens < max_tokens and len(idcs_queue) > 0:
        next_idx = idcs_queue.popleft()
        num_tokens += len(co.tokenize(text=chunks[next_idx], model=co_model).tokens)
        # keep index and chunk, to reorder chronologically
        selected.append((next_idx, chunks[next_idx]))
        if num_tokens > max_tokens:
            # the last chunk pushed us over budget: drop it and stop
            selected.pop()

    return selected

def query_based_retrieval(
    long: str,
    max_tokens: int,
    query: str,
    n_sentences_per_passage: int = 5,
) -> str:
    """
    Performs query-based retrieval on a long text document.
    """
    # 1. Chunk text into units
    chunks = build_simple_chunks(long, n_sentences_per_passage)

    # 2. Use co.rerank to rank chunks vs. query
    chunks_reranked = co.rerank(query=query, documents=chunks, model="rerank-english-v3.0")
    idcs_sorted_by_relevance = [
        chunk.index for chunk in sorted(chunks_reranked.results, key=lambda c: c.relevance_score, reverse=True)
    ]

    # 3. Add chunks in order of relevance until the token budget is reached
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_relevance, max_tokens)

    # 4. Put condensed text back in original order
    separator = " "
    short = separator.join([chunk for index, chunk in sorted(selected, key=lambda item: item[0])])
    return short
PYTHON
# Example prompt
prompt_template = """
## Instruction
{query}

## Document
{document}

## Answer
""".strip()
PYTHON
query = "What does the report say about biometric identification? Answer only based on the document."
short_text = query_based_retrieval(long_text, MAX_TOKENS, query)
prompt = prompt_template.format(query=query, document=short_text)
print(generate_response(message=prompt, max_tokens=300))
Output
The report outlines several key points regarding biometric identification within the context of the proposed Artificial Intelligence Act:
1. **Prohibition of Real-Time Biometric Identification in Public Spaces**: The report proposes a ban on real-time biometric identification by law enforcement authorities in publicly accessible spaces, with specific exceptions. These exceptions are detailed in Article 5(1)(d) and are subject to safeguards, including monitoring, oversight, and limited reporting obligations at the EU level.
2. **Exceptions to the Prohibition**: The exceptions to the ban on real-time biometric identification include:
- Search for victims of specific crimes (e.g., abduction, trafficking, sexual exploitation).
- Prevention of imminent threats to life or physical safety, including terrorist attacks.
- Localization or identification of suspects for serious criminal offenses (as defined in Annex IIa) punishable by a custodial sentence of at least four years.
3. **Safeguards and Conditions**: The use of real-time biometric identification systems in these exceptional cases must comply with specific safeguards and conditions, including:
- A fundamental rights impact assessment.
- Registration of the system in a database.
- Prior authorization by a judicial or independent administrative authority, except in urgent situations where authorization can be sought within 24 hours.
- Limitation to what is strictly necessary in terms of time, geography, and personal scope.
4. **Post-Remote Biometric Identification**: The use of post-remote biometric identification

Approach 3: Text rank

In this final section, we show how to leverage graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is; here, a chunk with high centrality is one that is highly similar to many other chunks, which makes it representative of the document as a whole.

The solution presented in this document can be broken down into five functional steps:

  1. Break the document into chunks.

  2. Embed each chunk using an embedding model and construct a similarity matrix.

  3. Compute the centrality of each chunk.

    • We employ a package called NetworkX. It constructs a graph where the chunks are nodes, and the similarity score between them serves as the weight of the edges. Then, we calculate the centrality of each chunk as the sum of the edge weights adjacent to the node representing that chunk.
  4. Retain the highest-centrality chunks until the context limit is reached.

    • This step follows a similar approach to Approach 2.
  5. Reassemble the shortened text by reordering chunks in their original order.

See text_rank as the starting point.
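
To build intuition for the centrality computation in step 3, here is a toy example with an invented 3×3 similarity matrix (the diagonal is zeroed so that self-similarity does not create self-loops):

PYTHON
# Toy similarity matrix for three chunks; the values are invented for illustration.
toy_similarities = np.array([
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.2],
    [0.1, 0.2, 0.0],
])
g = nx.from_numpy_array(toy_similarities, edge_attr="weight")
print(dict(g.degree(weight="weight")))
# Chunk 1 is similar to both others, so it has the highest weighted degree (~1.0).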

Text rank implementation

PYTHON
def text_rank(text: str, max_tokens: int, n_sentences_per_passage: int) -> str:
    """
    Shortens text by extracting key units of text from it based on their centrality.
    The output is the concatenation of those key units, in their original order.
    """
    # 1. Chunk text into units
    chunks = build_simple_chunks(text, n_sentences_per_passage)

    # 2. Embed and construct similarity matrix
    embeddings = np.array(
        co.embed(
            texts=chunks,
            model="embed-v4.0",
            input_type="clustering",
        ).embeddings
    )
    similarities = np.dot(embeddings, embeddings.T)

    # 3. Compute centrality and sort chunks by centrality
    # Easiest to use networkx's `degree` function with similarity as weight
    g = nx.from_numpy_array(similarities, edge_attr="weight")
    centralities = g.degree(weight="weight")
    idcs_sorted_by_centrality = [node for node, degree in sorted(centralities, key=lambda item: item[1], reverse=True)]

    # 4. Add chunks in order of centrality until the token budget is reached
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_centrality, max_tokens)

    # 5. Put condensed text back in original order
    short = " ".join([chunk for index, chunk in sorted(selected, key=lambda item: item[0])])

    return short
PYTHON
# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document
{document}

## Summary
""".strip()
PYTHON
short_text = text_rank(long_text, MAX_TOKENS, 5)
prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt, max_tokens=600))
Output
The document outlines the European Union's regulatory framework for artificial intelligence (AI) systems, focusing on high-risk AI applications. It establishes rules for placing AI systems on the market, including prohibitions on certain practices, requirements for high-risk systems, and transparency obligations. The regulation defines high-risk AI systems based on their intended use and potential risks to health, safety, and fundamental rights. Providers of high-risk AI systems must comply with specific requirements, such as risk management, data governance, and human oversight. The regulation also mandates conformity assessments, registration in an EU database, and post-market monitoring. It emphasizes the importance of AI literacy, prohibits manipulative or exploitative AI practices, and ensures compliance through market surveillance and enforcement mechanisms. Additionally, the regulation addresses general-purpose AI models, requiring providers to meet specific obligations, especially for models with systemic risks. The framework aims to promote trustworthy AI while safeguarding public interests and supporting innovation.

Summary

In this notebook, we presented three useful methods to overcome the limitations of the context window: truncation, query-based retrieval, and text rank. In a follow-up blog post, we talk more about how these methods can be evaluated.