Long-Form Text Strategies with Cohere
Large Language Models (LLMs) are becoming increasingly capable of comprehending text, and they excel at tasks such as document analysis. The new Cohere model, Command A, has a context length of 256k tokens, which makes it particularly effective for such tasks. Nevertheless, even with the extended context window, some documents are too long to fit in full.
In this cookbook, we’ll explore techniques to address cases when relevant information doesn’t fit in the model context window.
We’ll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a “text rank” approach we use internally at Cohere.
In this example we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence, dated 26 January 2024.
If you run the cell below, an error will occur. Therefore, in the following sections, we will explore some techniques to address this limitation.
Error: CohereAPIError: too many tokens
First we try to truncate the document so that it meets the length constraints. This approach is simple to implement and understand. However, it drops potentially important information contained towards the end of the document.
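A minimal sketch of truncation follows. It assumes a word count as a rough proxy for the token budget; the function name `truncate_words` and the `max_words` parameter are illustrative and not part of the Cohere SDK. In practice the budget would be measured in tokens with the model's tokenizer.

```python
def truncate_words(text: str, max_words: int) -> str:
    """Keep only the first `max_words` whitespace-separated words of `text`.

    Assumption: words approximate tokens. A real budget should be
    computed with the target model's tokenizer.
    """
    words = text.split()
    if len(words) <= max_words:
        # The document already fits, so return it unchanged.
        return text
    # Everything after the cutoff is dropped, including any relevant
    # material near the end of the document.
    return " ".join(words[:max_words])


document = "one two three four five six"
short_document = truncate_words(document, max_words=4)
print(short_document)
```

The cutoff is blind to content, which is why the later strategies select text by relevance or centrality instead of position.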
The document discusses the impact of a specific protein, p53, on the process of angiogenesis, which is the growth of new blood vessels. Angiogenesis plays a critical role in various physiological processes, including wound healing and embryonic development. The presence of the p53 protein can inhibit angiogenesis by regulating the expression of certain genes and proteins. This inhibition can have significant implications for tumor growth, as angiogenesis is essential for tumor progression. Therefore, understanding the role of p53 in angiogenesis can contribute to our knowledge of tumor suppression and potential therapeutic interventions.
Additionally, the document mentions that the regulation of angiogenesis by p53 occurs independently of the protein’s role in cell cycle arrest and apoptosis, which are other key functions of p53 in tumor suppression. This suggests that p53 has a complex and multifaceted impact on cellular processes.
In this section we show how a query-based retrieval approach can be used to generate an answer to the following question: "Based on the document, are there any risks related to Elon Musk?"
The solution is outlined below and can be broken down into four functional steps.
1. Chunk the text into units.
2. Use a ranking algorithm to rank chunks against the query.
   We use co.rerank (see the Cohere docs) to rank each chunk against the query.
3. Keep the most relevant chunks until the context limit is reached.
   co.rerank returns a relevance score for each chunk, so we can select the most pertinent chunks based on this score.
4. Put the condensed text back in its original order.
See query_based_retrieval function for the starting point.
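The four steps can be sketched as below. This is not the cookbook's `query_based_retrieval` function; it is a self-contained illustration under stated assumptions. The names `chunk_words`, `overlap_score`, and `select_by_relevance` are hypothetical, the budget is counted in words rather than tokens, and a simple word-overlap score stands in for the relevance scores that co.rerank would return.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Step 1: split the text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> float:
    """Stand-in relevance score: fraction of query words found in the chunk.

    Assumption: in the real pipeline this score would come from co.rerank.
    """
    query_words = set(query.lower().split())
    chunk_words_set = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words_set) / len(query_words)


def select_by_relevance(
    text: str,
    query: str,
    chunk_size: int,
    max_words: int,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> str:
    chunks = chunk_words(text, chunk_size)
    # Step 2: score every chunk against the query.
    scores = [score_fn(query, chunk) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    # Step 3: keep the best chunks until the word budget is exhausted.
    kept, used = [], 0
    for i in ranked:
        size = len(chunks[i].split())
        if used + size > max_words:
            break
        kept.append(i)
        used += size
    # Step 4: restore the original document order.
    return " ".join(chunks[i] for i in sorted(kept))


text = "alpha beta gamma delta risk of fire here epsilon zeta eta theta"
print(select_by_relevance(text, "fire risk", chunk_size=4, max_words=4))
```

Passing the scorer as `score_fn` keeps the selection logic independent of the ranking model, so a call to a reranker can be swapped in without changing the rest of the function.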
In the final section we will show how we leverage graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is; the higher the centrality, the more connected the node is to surrounding nodes (with fewer connections among those neighbors).
The solution presented in this document can be broken down into five functional steps:
1. Break the document into chunks.
2. Embed each chunk using an embedding model and construct a similarity matrix.
   We use co.embed (see the Cohere documentation) to embed each chunk.
3. Compute the centrality of each chunk.
   We use NetworkX to construct a graph where the chunks are nodes and the similarity score between two chunks is the weight of the edge connecting them. We then calculate the centrality of each chunk as the sum of the weights of the edges adjacent to the node representing that chunk.
4. Retain the highest-centrality chunks until the context limit is reached.
5. Reassemble the shortened text by restoring the chunks to their original order.
See text_rank as the starting point.
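The centrality computation can be illustrated with a small sketch. It is not the cookbook's `text_rank` function, and it makes several simplifying assumptions: bag-of-words vectors stand in for co.embed embeddings, plain Python replaces NetworkX, and the budget is a chunk count (`max_chunks`) rather than a token limit. The names `bow_embed`, `centrality_scores`, and `select_by_centrality` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bow_embed(chunk: str) -> Counter:
    """Stand-in embedding: word counts. Assumption: replaces co.embed."""
    return Counter(chunk.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centrality_scores(chunks: List[str]) -> List[float]:
    """Centrality of each chunk: sum of similarities to all other chunks.

    This equals the weighted degree of the chunk's node in a graph whose
    edge weights are the pairwise similarity scores.
    """
    vectors = [bow_embed(chunk) for chunk in chunks]
    n = len(chunks)
    return [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def select_by_centrality(chunks: List[str], max_chunks: int) -> List[str]:
    scores = centrality_scores(chunks)
    # Keep the indices of the most central chunks.
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    top = top[:max_chunks]
    # Return the kept chunks in their original order.
    return [chunks[i] for i in sorted(top)]


chunks = [
    "ai systems carry risk",
    "unrelated cooking recipe text",
    "high risk ai systems rules",
    "ai risk rules apply",
]
print(select_by_centrality(chunks, max_chunks=3))
```

Unlike query-based retrieval, this selection needs no query: a chunk is kept because it resembles many other chunks, so off-topic passages such as the second chunk above score lowest.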
In this notebook we presented three useful methods to overcome the limitations of context window size. A separate blog post discusses how these methods can be evaluated.