Long Form General Strategies

Ania Bialas

Large Language Models (LLMs) are becoming increasingly capable of comprehending text, excelling in particular at document analysis. The new Cohere model, Command-R, boasts a context length of 128k tokens, which makes it particularly effective for such tasks. Nevertheless, even with this extended context window, some documents are too long to fit in full.

In this cookbook, we’ll explore techniques to address cases when relevant information doesn’t fit in the model context window.

We’ll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a “text rank” approach we use internally at Cohere.

Summary

| Approach | Description | Pros | Cons | When to use? |
| --- | --- | --- | --- | --- |
| Truncation | Truncate the document to fit the context window. | Simple to implement (does not rely on external infrastructure). | Loses information at the end of the document. | Use when all relevant information is contained at the beginning of the document. |
| Query Based Retrieval | Use semantic similarity to retrieve the text chunks most relevant to the query. | Focuses on the sections directly relevant to the query. | Relies on a semantic similarity algorithm; might lose the broader context. | Use when seeking specific information within the text. |
| Text Rank | Apply graph theory to generate a cohesive set of chunks that effectively represents the document. | Preserves the broader picture. | Might lose detailed information. | Use for summaries and questions that require broader context. |

Getting Started

PYTHON
%%capture
!pip install cohere
!pip install python-dotenv
!pip install tokenizers
!pip install langchain
!pip install nltk
!pip install networkx
!pip install pypdf2
PYTHON
import os
import requests
from collections import deque
from typing import List, Tuple

import cohere

import numpy as np

import PyPDF2
from dotenv import load_dotenv

from tokenizers import Tokenizer

import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization
from nltk.tokenize import sent_tokenize

import networkx as nx
from getpass import getpass
from IPython.display import HTML, display
Output
[nltk_data] Downloading package punkt to
[nltk_data] /home/anna_cohere_com/nltk_data...
[nltk_data] Package punkt is already up-to-date!
PYTHON
# Set up Cohere client
co_model = 'command-r'
co_api_key = getpass("Enter your Cohere API key: ")
co = cohere.Client(api_key=co_api_key)
PYTHON
def load_long_pdf(file_path):
    """
    Load a long PDF file and extract its text content.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The extracted text content of the PDF file.
    """
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        full_text = ''
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            full_text += page.extract_text()
    return full_text

def save_pdf_from_url(pdf_url, save_path):
    try:
        # Send a GET request to the PDF URL
        response = requests.get(pdf_url, stream=True)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Open the local file for writing in binary mode
        with open(save_path, 'wb') as file:
            # Write the content of the response to the local file
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF saved successfully to '{save_path}'")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")

In this example we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence from 26 January 2024, available at https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf.

PYTHON
# Download the PDF file from the URL
pdf_url = 'https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf'
save_path = 'example.pdf'
save_pdf_from_url(pdf_url, save_path)

# Load the PDF file and extract its text content
long_text = load_long_pdf(save_path)
long_text = long_text.replace('\n', ' ')

# Print the length of the document
print("Document length - #tokens:", len(co.tokenize(text=long_text, model=co_model).tokens))
Output
PDF saved successfully to 'example.pdf'
Document length - #tokens: 134184

Summarizing the text

PYTHON
def generate_response(message, max_tokens=300, temperature=0.2, k=0):
    """
    A wrapper around the Cohere API to generate a response based on a given prompt.

    Args:
        message (str): The input message for generating the response.
        max_tokens (int, optional): The maximum number of tokens in the generated response. Defaults to 300.
        temperature (float, optional): Controls the randomness of the generated response. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.2) make it more deterministic. Defaults to 0.2.
        k (int, optional): Controls the diversity of the generated response. Higher values (e.g., 5) make the output more diverse, while lower values (e.g., 0) make it more focused. Defaults to 0.

    Returns:
        str: The generated response.
    """
    response = co.chat(
        model=co_model,
        message=message,
        max_tokens=max_tokens,
        temperature=temperature,
        k=k,
        return_prompt=True,
    )
    return response.text
PYTHON
# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document
{document}

## Summary
""".strip()

If you run the cell below, the call will fail because the document does not fit in the model's context window:

Error: CohereAPIError: too many tokens

PYTHON
prompt = prompt_template.format(document=long_text)
# print(generate_response(message=prompt))

Therefore, in the following sections, we will explore some techniques to address this limitation.
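A quick way to catch this failure ahead of time is to count the prompt's tokens with co.tokenize before calling the model. The sketch below is illustrative rather than part of the original workflow; the 128,000-token limit is an assumption based on Command-R's advertised context length.

PYTHON
# Illustrative pre-check: count tokens before calling the model.
# CONTEXT_LIMIT is assumed from Command-R's 128k context window.
CONTEXT_LIMIT = 128_000

n_tokens = len(co.tokenize(text=prompt, model=co_model).tokens)
if n_tokens > CONTEXT_LIMIT:
    print(f"Prompt has {n_tokens} tokens, exceeding the {CONTEXT_LIMIT}-token context window.")
else:
    print(generate_response(message=prompt))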

Approach 1 - Truncate

First we try to truncate the document so that it meets the length constraints. This approach is simple to implement and understand. However, it drops potentially important information contained towards the end of the document.

PYTHON
# The new Cohere model has a context limit of 128k tokens. However, for the purpose of this exercise, we will assume a smaller context window.
# Employing a smaller context window also has the additional benefit of reducing the cost per request, especially if billed by the number of tokens.

MAX_TOKENS = 40000

def truncate(long: str, max_tokens: int) -> str:
    """
    Shortens `long` by brutally truncating it to the first `max_tokens` tokens.
    This can break up sentences, passages, etc.
    """
    tokenized = co.tokenize(text=long, model=co_model).token_strings
    truncated = tokenized[:max_tokens]
    short = "".join(truncated)
    return short
PYTHON
short_text = truncate(long_text, MAX_TOKENS)

prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt))
Output
The document discusses the impact of a specific protein, p53, on the process of angiogenesis, which is the growth of new blood vessels. Angiogenesis plays a critical role in various physiological processes, including wound healing and embryonic development. The presence of the p53 protein can inhibit angiogenesis by regulating the expression of certain genes and proteins. This inhibition can have significant implications for tumor growth, as angiogenesis is essential for tumor progression. Therefore, understanding the role of p53 in angiogenesis can contribute to our knowledge of tumor suppression and potential therapeutic interventions.

Additionally, the document mentions that the regulation of angiogenesis by p53 occurs independently of the protein’s role in cell cycle arrest and apoptosis, which are other key functions of p53 in tumor suppression. This suggests that p53 has a complex and multifaceted impact on cellular processes.

Approach 2: Query Based Retrieval

In this section we demonstrate how to leverage a query-based retrieval approach to answer a specific question about the document, such as: What does the report say about biometric identification?

The solution is outlined below and can be broken down into four functional steps.

  1. Chunk the text into units

    • Here we employ a simple chunking algorithm. More information about different chunking strategies can be found [here](TODO: link to chunking post).
  2. Use a ranking algorithm to rank chunks against the query

    • We leverage another Cohere endpoint, co.rerank, to rank each chunk against the query; a toy call is sketched below.
  3. Keep the most relevant chunks until the context limit is reached

    • co.rerank returns a relevance score, facilitating the selection of the most pertinent chunks.
  4. Put the condensed text back in the original order

    • Finally, we arrange the chosen chunks in their original sequence as they appear in the document.

See the query_based_retrieval function for the starting point.
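Before wiring the reranker into the pipeline, it can help to see what co.rerank returns on its own. The following is a minimal, illustrative call; the toy chunks and query are made up for this sketch, and the model name matches the one used in the implementation below.

PYTHON
# Illustrative only: rank three toy chunks against a query and inspect the scores.
toy_chunks = [
    "The regulation applies to providers of AI systems.",
    "Biometric identification in publicly accessible spaces is restricted.",
    "Member states shall designate national competent authorities.",
]
toy_query = "What does the report say about biometric identification?"

reranked = co.rerank(query=toy_query, documents=toy_chunks, model="rerank-english-v3.0")
for result in reranked.results:
    # Each result carries the index of the original chunk and a relevance score.
    print(result.index, round(result.relevance_score, 3), toy_chunks[result.index])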

Query based retrieval implementation

PYTHON
def split_text_into_sentences(text) -> List[str]:
    """
    Split the input text into a list of sentences.
    """
    sentences = sent_tokenize(text)

    return sentences

def group_sentences_into_passages(sentence_list, n_sentences_per_passage=5):
    """
    Group sentences into passages of n_sentences_per_passage sentences each.
    """
    passages = []
    passage = ""
    for i, sentence in enumerate(sentence_list):
        passage += sentence + " "
        if (i + 1) % n_sentences_per_passage == 0:
            passages.append(passage)
            passage = ""
    # Keep any trailing sentences that do not fill a complete passage.
    if passage:
        passages.append(passage)
    return passages

def build_simple_chunks(text, n_sentences=5):
    """
    Build chunks of text from the input text.
    """
    sentences = split_text_into_sentences(text)
    chunks = group_sentences_into_passages(sentences, n_sentences_per_passage=n_sentences)
    return chunks
PYTHON
sentences = split_text_into_sentences(long_text)
passages = group_sentences_into_passages(sentences, n_sentences_per_passage=5)
print('Example sentence:', np.random.choice(np.asarray(sentences), size=1, replace=False))
print()
print('Example passage:', np.random.choice(np.asarray(passages), size=1, replace=False))
Output
Example sentence: ['4.']
Example passage: ['T echnical robustness and safety means that AI systems are developed and used in a way that allows robustness in case of problems and resilience against attempts to alter the use or performance of the AI system so as to allow unlawful use by third parties, a nd minimise unintended harm. Privacy and data governance means that AI systems are developed and used in compliance with existing privacy and data protection rules, while processing data that meets high standards in terms of quality and integrity. Transpar ency means that AI systems are developed and used in a way that allows appropriate traceability and explainability, while making humans aware that they communicate or interact with an AI system, as well as duly informing deployers of the capabilities and l imitations of that AI system and affected persons about their rights. Diversity, non - discrimination and fairness means that AI systems are developed and used in a way that includes diverse actors and promotes equal access, gender equality and cultural dive rsity, while avoiding discriminatory impacts and unfair biases that are prohibited by Union or national law. Social and environmental well - being means that AI systems are developed and used in a sustainable and environmentally friendly manner as well as in a way to benefit all human beings, while monitoring and assessing the long - term impacts on the individual, society and democracy. ']
PYTHON
def _add_chunks_by_priority(
    chunks: List[str],
    idcs_sorted_by_priority: List[int],
    max_tokens: int,
) -> List[Tuple[int, str]]:
    """
    Given chunks of text and their indices sorted by priority (highest priority first), this function
    fills the model context window with as many highest-priority chunks as possible.

    The output is a list of (index, chunk) pairs, ordered by priority. To stitch back the chunks into
    a cohesive text that preserves chronological order, sort the output on its index.
    """
    selected = []
    num_tokens = 0
    idcs_queue = deque(idcs_sorted_by_priority)

    while num_tokens < max_tokens and len(idcs_queue) > 0:
        next_idx = idcs_queue.popleft()
        num_tokens += len(co.tokenize(text=chunks[next_idx], model=co_model).tokens)
        # keep index and chunk, to reorder chronologically
        selected.append((next_idx, chunks[next_idx]))
        if num_tokens > max_tokens:
            # The last chunk pushed us over the limit; drop it.
            selected.pop()

    return selected

def query_based_retrieval(
    long: str,
    max_tokens: int,
    query: str,
    n_sentences_per_passage: int = 5,
) -> str:
    """
    Performs query-based retrieval on a long text document.
    """
    # 1. Chunk text into units
    chunks = build_simple_chunks(long, n_sentences_per_passage)

    # 2. Use co.rerank to rank chunks vs. query
    chunks_reranked = co.rerank(query=query, documents=chunks, model="rerank-english-v3.0")
    idcs_sorted_by_relevance = [
        chunk.index for chunk in sorted(chunks_reranked.results, key=lambda c: c.relevance_score, reverse=True)
    ]

    # 3. Keep the most relevant chunks until the context limit is reached
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_relevance, max_tokens)

    # 4. Put condensed text back in original order
    separator = " "
    short = separator.join([chunk for index, chunk in sorted(selected, key=lambda item: item[0])])
    return short
PYTHON
# Example prompt
prompt_template = """
## Instruction
{query}

## Document
{document}

## Answer
""".strip()
PYTHON
query = "What does the report say about biometric identification? Answer only based on the document."
short_text = query_based_retrieval(long_text, MAX_TOKENS, query)
prompt = prompt_template.format(query=query, document=short_text)
print(generate_response(message=prompt, max_tokens=300))
Output
The report discusses the restrictions on the use of biometric identification by law enforcement in publicly accessible spaces. According to the document, real-time biometric identification is prohibited unless in exceptional cases where its use is strictly necessary and proportionate to achieving a substantial public interest. The use of post-remote biometric identification systems is also mentioned, noting the requirements for authorization and limitations on its use.
The report also highlights the classification of certain AI systems as high-risk, including biometric identification systems, emotion recognition systems, and biometric categorisation systems, with the exception of systems used for biometric verification. High-risk AI systems are subject to specific requirements and obligations.

Approach 3: Text rank

In the final section, we show how to leverage graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is; the higher the centrality, the more strongly the node is connected to the surrounding nodes.

The solution presented in this document can be broken down into five functional steps:

  1. Break the document into chunks.

  2. Embed each chunk using an embedding model and construct a similarity matrix.

  3. Compute the centrality of each chunk.

    • We employ a package called NetworkX. It constructs a graph where the chunks are nodes, and the similarity score between them serves as the weight of the edges. Then, we calculate the centrality of each chunk as the sum of the edge weights adjacent to the node representing that chunk.
  4. Retain the highest-centrality chunks until the context limit is reached.

    • This step follows a similar approach to Approach 2.
  5. Reassemble the shortened text by reordering chunks in their original order.

See text_rank as the starting point.
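To make the centrality computation concrete, here is a small, self-contained sketch of weighted degree centrality on a hand-written similarity matrix. The values are invented for illustration, and the diagonal is zeroed so that no self-loops are created.

PYTHON
# Toy example (values are illustrative, not taken from the document).
toy_similarities = np.array([
    [0.0, 0.8, 0.3],
    [0.8, 0.0, 0.5],
    [0.3, 0.5, 0.0],
])  # diagonal zeroed to ignore self-similarity

g = nx.from_numpy_array(toy_similarities, edge_attr="weight")
# Weighted degree = sum of the weights of the edges adjacent to each node.
for node, degree in g.degree(weight="weight"):
    print(f"chunk {node}: centrality {degree:.2f}")
# chunk 1 is the most central (0.8 + 0.5 = 1.3), so it would be retained first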

Text rank implementation

PYTHON
def text_rank(text: str, max_tokens: int, n_sentences_per_passage: int) -> str:
    """
    Shortens text by extracting key units of text from it based on their centrality.
    The output is the concatenation of those key units, in their original order.
    """
    # 1. Chunk text into units
    chunks = build_simple_chunks(text, n_sentences_per_passage)

    # 2. Embed and construct similarity matrix
    embeddings = np.array(
        co.embed(
            texts=chunks,
            model="embed-english-v3.0",
            input_type="clustering",
        ).embeddings
    )
    similarities = np.dot(embeddings, embeddings.T)

    # 3. Compute centrality and sort sentences by centrality
    # Easiest to use networkx's `degree` function with similarity as weight
    g = nx.from_numpy_array(similarities, edge_attr="weight")
    centralities = g.degree(weight="weight")
    idcs_sorted_by_centrality = [node for node, degree in sorted(centralities, key=lambda item: item[1], reverse=True)]

    # 4. Add chunks back in order of centrality
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_centrality, max_tokens)

    # 5. Put condensed text back in original order
    short = " ".join([chunk for index, chunk in sorted(selected, key=lambda item: item[0])])

    return short
PYTHON
# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document
{document}

## Summary
""".strip()
PYTHON
short_text = text_rank(long_text, MAX_TOKENS, 5)
prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt, max_tokens=600))
Output
The document outlines the requirements and obligations for developing and deploying AI systems in the European Union. It aims to establish a regulatory framework to foster innovation while ensuring the protection of fundamental rights and public interests. The regulation applies to providers and deployers of AI systems, including those established outside the EU. High-risk AI systems are subject to specific requirements, such as risk management, data governance, and transparency. Providers must ensure compliance and keep records, and deployers must use AI systems responsibly. The regulation also establishes an AI Office, advisory bodies, and a database for high-risk AI systems. Additionally, it addresses issues like testing, codes of conduct, and cooperation with third countries. Fines and penalties are proposed for non-compliance.

Summary

In this notebook we presented three useful methods to overcome the limitations of the context window size. In a follow-up blog post, we talk more about how these methods can be evaluated.