Long Form General Strategies

Long Form General Strategies

Ania Bialas

Ania Bialas

Large Language Models (LLMs) are becoming increasingly capable of comprehending text, among others excelling in document analysis. The new Cohere model, Command-R, boasts a context length of 128k, which makes it particularly effective for such tasks. Nevertheless, even with the extended context window, some documents might be too lengthy to accommodate in full.

In this cookbook, we'll explore techniques to address cases when relevant information doesn't fit in the model context window.

We'll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a "text rank" approach we use internally at Cohere.

Table of contents:

  1. Getting started
  2. Approach 1: Truncate
  3. Approach 2: Query Based Retrieval
  4. Approach 3: Text Rank


ApproachDescriptionProsConsWhen to use?
TruncationTruncate the document to fit the context window.- Simplicity of implementation
(does not rely on extrenal infrastructure)
- Loses information at the end of the documentUtilize when all relevant information is contained
at the beginning of the document.
Query Based RetrievalUtilize semantic similarity to retrieve text chunks
that are most relevant to the query.
- Focuses on sections directly relevant to
the query
- Relies on a semantic similarity algorithm.
- Might lose broader context
Employ when seeking specific
information within the text.
Text RankApply graph theory to generate a cohesive set
of chunks that effectively represent the document.
- Preserves the broader picture.- Might lose detailed information.Utilize in summaries and when the question
requires broader context.

Getting Started

!pip install cohere
!pip install python-dotenv
!pip install tokenizers
!pip install langchain
!pip install nltk
!pip install networkx
!pip install pypdf2
import os
import requests
from collections import deque
from typing import List, Tuple

import cohere

import numpy as np

import PyPDF2
from dotenv import load_dotenv

from tokenizers import Tokenizer

import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization
from nltk.tokenize import sent_tokenize

import networkx as nx
from getpass import getpass
from IPython.display import HTML, display
[nltk_data] Downloading package punkt to
[nltk_data]     /home/anna_cohere_com/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
# Set up Cohere client
co_model = 'command-r'
co_api_key = getpass("Enter your Cohere API key: ")
co = cohere.Client(api_key=co_api_key)
def load_long_pdf(file_path):
    Load a long PDF file and extract its text content.

        file_path (str): The path to the PDF file.

        str: The extracted text content of the PDF file.
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        full_text = ''
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            full_text += page.extract_text()
    return full_text

def save_pdf_from_url(pdf_url, save_path):
        # Send a GET request to the PDF URL
        response = requests.get(pdf_url, stream=True)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Open the local file for writing in binary mode
        with open(save_path, 'wb') as file:
            # Write the content of the response to the local file
            for chunk in response.iter_content(chunk_size=8192):

        print(f"PDF saved successfully to '{save_path}'")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")

In this example we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence from 26 January 2024, link.

# Download the PDF file from the URL
pdf_url = 'https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf'
save_path = 'example.pdf'
save_pdf_from_url(pdf_url, save_path)

# Load the PDF file and extract its text content
long_text = load_long_pdf(save_path)
long_text = long_text.replace('\n', ' ')

# Print the length of the document
print("Document length - #tokens:", len(co.tokenize(text=long_text, model=co_model).tokens))
PDF saved successfully to 'example.pdf'
Document length - #tokens: 134184

Summarizing the text

def generate_response(message, max_tokens=300, temperature=0.2, k=0):
  A wrapper around the Cohere API to generate a response based on a given prompt.

    messsage (str): The input message for generating the response.
    max_tokens (int, optional): The maximum number of tokens in the generated response. Defaults to 300.
    temperature (float, optional): Controls the randomness of the generated response. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.2) make it more deterministic. Defaults to 0.2.
    k (int, optional): Controls the diversity of the generated response. Higher values (e.g., 5) make the output more diverse, while lower values (e.g., 0) make it more focused. Defaults to 0.

    str: The generated response.

  response = co.chat(
    model = co_model,
  return response.text
# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document

## Summary

If you run the cell below, an error will occur. Therefore, in the following sections, we will explore some techniques to address this limitation.

Error: :CohereAPIError: too many tokens:

prompt = prompt_template.format(document=long_text)
# print(generate_response(message=prompt))

Therefore, in the following sections, we will explore some techniques to address this limitation.

Approach 1 - Truncate

First we try to truncate the document so that it meets the length constraints. This approach is simple to implement and understand. However, it drops potentially important information contained towards the end of the document.

# The new Cohere model has a context limit of 128k tokens. However, for the purpose of this exercise, we will assume a smaller context window.
# Employing a smaller context window also has the additional benefit of reducing the cost per request, especially if billed by the number of tokens.

MAX_TOKENS = 40000

def truncate(long: str, max_tokens: int) -> str:
    Shortens `long` by brutally truncating it to the first `max_tokens` tokens.
    This can break up sentences, passages, etc.

    tokenized = co.tokenize(text=long, model=co_model).token_strings
    truncated = tokenized[:max_tokens]
    short = "".join(truncated)
    return short
short_text = truncate(long_text, MAX_TOKENS)

prompt = prompt_template.format(document=short_text)
The document discusses the impact of a specific protein, p53, on the process of angiogenesis, which is the growth of new blood vessels. Angiogenesis plays a critical role in various physiological processes, including wound healing and embryonic development. The presence of the p53 protein can inhibit angiogenesis by regulating the expression of certain genes and proteins. This inhibition can have significant implications for tumor growth, as angiogenesis is essential for tumor progression. Therefore, understanding the role of p53 in angiogenesis can contribute to our knowledge of tumor suppression and potential therapeutic interventions.

Additionally, the document mentions that the regulation of angiogenesis by p53 occurs independently of the protein's role in cell cycle arrest and apoptosis, which are other key functions of p53 in tumor suppression. This suggests that p53 has a complex and multifaceted impact on cellular processes.

Approach 2: Query Based Retrieval

In this section we present how we can leverage a query retriereval based approach to generate an answer to the following question: Based on the document, are there any risks related to Elon Musk?.

The solution is outlined below and can be broken down into four functional steps.

  1. Chunk the text into units

    • Here we employ a simple chunking algorithm. More information about different chunking strategies can be found [here](TODO: link to chunking post).
  2. Use a ranking algorithm to rank chunks against the query

    • We leverage another Cohere endpoint, co.rerank (docs link), to rank each chunk against the query.
  3. Keep the most-relevant chunks until context limit is reached

    • co.rerank returns a relevance score, facilitating the selection of the most pertinent chunks. We can choose the most relevant chunks based on this score.
  4. Put condensed text back in original order

    • Finally, we arrange the chosen chunks in their original sequence as they appear in the document.

See query_based_retrieval function for the starting point.

Query based retrieval implementation

def split_text_into_sentences(text) -> List[str]:
    Split the input text into a list of sentences.
    sentences = sent_tokenize(text)

    return sentences

def group_sentences_into_passages(sentence_list, n_sentences_per_passage=5):
    Group sentences into passages of n_sentences sentences.
    passages = []
    passage = ""
    for i, sentence in enumerate(sentence_list):
        passage += sentence + " "
        if (i + 1) % n_sentences_per_passage == 0:
            passage = ""
    return passages

def build_simple_chunks(text, n_sentences=5):
    Build chunks of text from the input text.
    sentences = split_text_into_sentences(text)
    chunks = group_sentences_into_passages(sentences, n_sentences_per_passage=n_sentences)
    return chunks
sentences = split_text_into_sentences(long_text)
passages = group_sentences_into_passages(sentences, n_sentences_per_passage=5)
print('Example sentence:', np.random.choice(np.asarray(sentences), size=1, replace=False))
print('Example passage:', np.random.choice(np.asarray(passages), size=1, replace=False))
Example sentence: ['4.']

Example passage: ['T echnical robustness and safety means that AI systems are developed  and used in a way that allows robustness in case of problems and resilience against  attempts to alter the use or performance of the AI system so as to allow unlawful use by  third parties, a nd minimise unintended harm. Privacy and data governance means that AI  systems are developed and used in compliance with existing privacy and data protection  rules, while processing data that meets high standards in terms of quality and integrity. Transpar ency means that AI systems are developed and used in a way that allows  appropriate traceability and explainability, while making humans aware that they  communicate or interact with an AI system, as well as duly informing deployers of the  capabilities and l imitations of that AI system and affected persons about their rights. Diversity, non - discrimination and fairness means that AI systems are developed and used  in a way that includes diverse actors and promotes equal access, gender equality and  cultural dive rsity, while avoiding discriminatory impacts and unfair biases that are  prohibited by Union or national law. Social and environmental well - being means that AI  systems are developed and used in a sustainable and environmentally friendly manner as  well as in   a way to benefit all human beings, while monitoring and assessing the long - term  impacts on the individual, society and democracy. ']
def _add_chunks_by_priority(
    chunks: List[str],
    idcs_sorted_by_priority: List[int],
    max_tokens: int,
) -> List[Tuple[int, str]]:
    Given chunks of text and their indices sorted by priority (highest priority first), this function
    fills the model context window with as many highest-priority chunks as possible.

    The output is a list of (index, chunk) pairs, ordered by priority. To stitch back the chunks into
    a cohesive text that preserves chronological order, sort the output on its index.

    selected = []
    num_tokens = 0
    idcs_queue = deque(idcs_sorted_by_priority)

    while num_tokens < max_tokens and len(idcs_queue) > 0:
        next_idx = idcs_queue.popleft()
        num_tokens += len(co.tokenize(text=chunks[next_idx], model=co_model).tokens)
        # keep index and chunk, to reorder chronologically
        selected.append((next_idx, chunks[next_idx]))
    if num_tokens > max_tokens:

    return selected

def query_based_retrieval(
    long: str,
    max_tokens: int,
    query: str,
    n_setences_per_passage: int = 5,
) -> str:
    Performs query-based retrieval on a long text document.
    # 1. Chunk text into units
    chunks = build_simple_chunks(long, n_setences_per_passage)

    # 2. Use co.rerank to rank chunks vs. query
    chunks_reranked = co.rerank(query=query, documents=chunks, model="rerank-english-v3.0")
    idcs_sorted_by_relevance = [
        chunk.index for chunk in sorted(chunks_reranked.results, key=lambda c: c.relevance_score, reverse=True)

    # 3. Add chunks back in order of relevance
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_relevance, max_tokens)

    # 4. Put condensed text back in original order
    separator = " "
    short = separator.join([chunk for index, chunk in sorted(selected, key=lambda item: item[0], reverse=False)])
    return short
# Example prompt
prompt_template = """
## Instruction

## Document

## Answer
query = "What does the report say about biometric identification? Answer only based on the document."
short_text = query_based_retrieval(long_text, MAX_TOKENS, query)
prompt = prompt_template.format(query=query, document=short_text)
print(generate_response(message=prompt, max_tokens=300))
The report discusses the restrictions on the use of biometric identification by law enforcement in publicly accessible spaces. According to the document, real-time biometric identification is prohibited unless in exceptional cases where its use is strictly necessary and proportionate to achieving a substantial public interest. The use of post-remote biometric identification systems is also mentioned, noting the requirements for authorization and limitations on its use.

The report also highlights the classification of certain AI systems as high-risk, including biometric identification systems, emotion recognition systems, and biometric categorisation systems, with the exception of systems used for biometric verification. High-risk AI systems are subject to specific requirements and obligations.

Approach 3: Text rank

In the final section we will show how we leverage graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is; the higher the centrality, the more connected the node is to surrounding nodes (with fewer connections among those neighbors).

The solution presented in this document can be broken down into five functional steps:

  1. Break the document into chunks.

  2. Embed each chunk using an embedding model and construct a similarity matrix.

  3. Compute the centrality of each chunk.

    • We employ a package called NetworkX. It constructs a graph where the chunks are nodes, and the similarity score between them serves as the weight of the edges. Then, we calculate the centrality of each chunk as the sum of the edge weights adjacent to the node representing that chunk.
  4. Retain the highest-centrality chunks until the context limit is reached.

    • This step follows a similar approach to Approach 2.
  5. Reassemble the shortened text by reordering chunks in their original order.

See text_rank as the starting point.

Text rank implementation

def text_rank(text: str, max_tokens: int, n_setences_per_passage: int) -> str:
    Shortens text by extracting key units of text from it based on their centrality.
    The output is the concatenation of those key units, in their original order.

    # 1. Chunk text into units
    chunks = build_simple_chunks(text, n_setences_per_passage)

    # 2. Embed and construct similarity matrix
    embeddings = np.array(
    similarities = np.dot(embeddings, embeddings.T)

    # 3. Compute centrality and sort sentences by centrality
    # Easiest to use networkx's `degree` function with similarity as weight
    g = nx.from_numpy_array(similarities, edge_attr="weight")
    centralities = g.degree(weight="weight")
    idcs_sorted_by_centrality = [node for node, degree in sorted(centralities, key=lambda item: item[1], reverse=True)]

    # 4. Add chunks back in order of centrality
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_centrality, max_tokens)

    # 5. Put condensed text back in original order
    short = " ".join([chunk for index, chunk in sorted(selected, key=lambda item: item[0], reverse=False)])

    return short
# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document

## Summary
short_text = text_rank(long_text, MAX_TOKENS, 5)
prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt, max_tokens=600))
The document outlines the requirements and obligations for developing and deploying AI systems in the European Union. It aims to establish a regulatory framework to foster innovation while ensuring the protection of fundamental rights and public interests. The regulation applies to providers and deployers of AI systems, including those established outside the EU. High-risk AI systems are subject to specific requirements, such as risk management, data governance, and transparency. Providers must ensure compliance and keep records, and deployers must use AI systems responsibly. The regulation also establishes an AI Office, advisory bodies, and a database for high-risk AI systems. Additionally, it addresses issues like testing, codes of conduct, and cooperation with third countries. Fines and penalties are proposed for non-compliance.


In this notebook we present three useful methods to over come the limitations of context window size. In the following blog post, we talk more about how these methods can be evaluated.