Long Form General Strategies
Large Language Models (LLMs) are becoming increasingly capable of comprehending text, and they excel at document analysis in particular. The Cohere model Command-R boasts a context length of 128k tokens, which makes it especially effective for such tasks. Nevertheless, even with this extended context window, some documents are too long to fit in full.
In this cookbook, we’ll explore techniques to address cases when relevant information doesn’t fit in the model context window.
We’ll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a “text rank” approach we use internally at Cohere.
Getting Started
In this example we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence from 26 January 2024, link.
Summarizing the text
If you run the cell below, an error will occur. Therefore, in the following sections, we will explore some techniques to address this limitation.
Error: CohereAPIError: too many tokens
Approach 1 - Truncate
First we try to truncate the document so that it meets the length constraints. This approach is simple to implement and understand. However, it drops potentially important information contained towards the end of the document.
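As a rough sketch of this approach, the document can be cut down to a token budget before summarizing. The helper below is illustrative, not the notebook's exact code: it approximates the token count at about four characters per token, whereas the notebook would count tokens exactly (e.g. via the Cohere tokenizer).

```python
def truncate_to_fit(text: str, max_tokens: int = 100_000, chars_per_token: float = 4.0) -> str:
    """Roughly truncate text to fit a token budget.

    Assumes ~4 characters per token as a heuristic; a production version
    would use an exact tokenizer to count tokens.
    """
    max_chars = int(max_tokens * chars_per_token)
    if len(text) <= max_chars:
        return text
    # Cut at the last whitespace before the limit to avoid splitting a word.
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut != -1 else max_chars]
```

Anything past the budget is simply discarded, which is why information near the end of the document is lost.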
The document discusses the impact of a specific protein, p53, on the process of angiogenesis, which is the growth of new blood vessels. Angiogenesis plays a critical role in various physiological processes, including wound healing and embryonic development. The presence of the p53 protein can inhibit angiogenesis by regulating the expression of certain genes and proteins. This inhibition can have significant implications for tumor growth, as angiogenesis is essential for tumor progression. Therefore, understanding the role of p53 in angiogenesis can contribute to our knowledge of tumor suppression and potential therapeutic interventions.
Additionally, the document mentions that the regulation of angiogenesis by p53 occurs independently of the protein’s role in cell cycle arrest and apoptosis, which are other key functions of p53 in tumor suppression. This suggests that p53 has a complex and multifaceted impact on cellular processes.
Approach 2: Query Based Retrieval
In this section we show how to leverage a query-based retrieval approach to generate an answer to the following question: Based on the document, are there any risks related to Elon Musk?
The solution is outlined below and can be broken down into four functional steps.

- Chunk the text into units
  - Here we employ a simple chunking algorithm. More information about different chunking strategies can be found [here](TODO: link to chunking post).
- Use a ranking algorithm to rank chunks against the query
  - We leverage another Cohere endpoint, co.rerank (docs link), to rank each chunk against the query.
- Keep the most relevant chunks until the context limit is reached
  - co.rerank returns a relevance score, facilitating the selection of the most pertinent chunks.
- Put the condensed text back in the original order
  - Finally, we arrange the chosen chunks in their original sequence as they appear in the document.

See the query_based_retrieval function for the starting point.
Query based retrieval implementation
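The four steps above can be sketched as follows. In the notebook the relevance scores come from co.rerank; here the scoring function is passed in as a parameter so the chunk/select/reorder logic stands on its own. The helper names (chunk_text, score_fn) are illustrative, not the notebook's exact implementation.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int = 512) -> List[str]:
    """Step 1: naive fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def query_based_retrieval(
    text: str,
    score_fn: Callable[[str, List[str]], List[float]],  # e.g. a wrapper around co.rerank
    query: str,
    max_chunks: int = 4,
    chunk_size: int = 512,
) -> str:
    chunks = chunk_text(text, chunk_size)
    # Step 2: score every chunk against the query.
    scores = score_fn(query, chunks)
    # Step 3: keep the highest-scoring chunks up to the budget.
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:max_chunks])  # Step 4: restore original document order.
    return " ".join(chunks[i] for i in kept)
```

Using a fixed number of chunks here stands in for the notebook's token-based context limit.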
Approach 3: Text rank
In the final section we show how to leverage graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is: the higher the centrality, the more strongly the node is connected to the surrounding nodes.
The solution presented in this document can be broken down into five functional steps:

- Break the document into chunks.
  - This mirrors the first step in Approach 2.
- Embed each chunk using an embedding model and construct a similarity matrix.
  - We utilize co.embed (documentation link).
- Compute the centrality of each chunk.
  - We employ a package called NetworkX. It constructs a graph where the chunks are nodes, and the similarity score between them serves as the weight of the edges. Then, we calculate the centrality of each chunk as the sum of the edge weights adjacent to the node representing that chunk.
- Retain the highest-centrality chunks until the context limit is reached.
  - This step follows a similar approach to Approach 2.
- Reassemble the shortened text by reordering the chunks in their original order.
  - This step mirrors the last step in Approach 2.

See the text_rank function as the starting point.
Text rank implementation
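The selection logic can be sketched as below, assuming the chunk embeddings have already been computed (in the notebook via co.embed). As described above, chunks become nodes in a NetworkX graph, pairwise cosine similarities become edge weights, and each chunk's centrality is the sum of its adjacent edge weights (its weighted degree). The function name mirrors the text_rank starting point, but the body is an illustrative reconstruction, not the notebook's exact code.

```python
import numpy as np
import networkx as nx


def text_rank(chunks, embeddings, max_chunks=4):
    """Select the most central chunks and return them in original order."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # normalise rows
    sim = emb @ emb.T                                       # cosine similarity matrix

    # Build a weighted graph: one node per chunk, similarities as edge weights.
    g = nx.Graph()
    n = len(chunks)
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            g.add_edge(i, j, weight=sim[i, j])

    # Centrality of a chunk = sum of adjacent edge weights (weighted degree).
    centrality = dict(g.degree(weight="weight"))
    top = sorted(range(n), key=lambda i: centrality[i], reverse=True)[:max_chunks]
    return " ".join(chunks[i] for i in sorted(top))         # restore original order
```

As in the previous sketch, a fixed chunk count stands in for the notebook's token-based context limit.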
Summary
In this notebook we presented three useful methods to overcome the limitations of the context window size. In the following blog post, we talk more about how these methods can be evaluated.