Retrieval-augmented generation (RAG) allows language models to generate grounded answers to questions about documents. However, the complexity of the documents can significantly influence overall RAG performance. For instance, the documents may be PDFs that contain a mix of text and tables.
More broadly, the implementation of a RAG pipeline (parsing and chunking the documents, then embedding and retrieving the chunks) is critical to the accuracy of grounded answers. It is also sometimes not sufficient to merely retrieve an answer; a user may want further postprocessing performed on the output. This use case benefits from giving the model access to tools.
In this notebook, we will guide you through best practices for setting up a RAG pipeline to process documents that contain both tables and text. We will also demonstrate how to create a ReAct agent with a Cohere model, and then give the agent access to a RAG pipeline tool to improve accuracy. The general structure of the notebook is as follows:
We recommend the following notebook as a guide to semi-structured RAG.
We also recommend the following notebook to explore various parsing techniques for PDFs.
Various LangChain-supported parsers can be found here.
To improve RAG performance on PDFs with mixed types (text and tables), we investigated a number of parsing and chunking strategies from various libraries:
We have found that the best option for parsing is unstructured.io since the parser can:
There are many options for setting up a vector store. Here, we show how to do so using Chroma and LangChain's multi-vector retrieval. As the name implies, multi-vector retrieval allows us to store multiple vectors per document; for instance, for a single document chunk, one could keep embeddings for both the chunk itself and a summary of that chunk. A summary may distill what a chunk is about more accurately than the raw text does, leading to better retrieval.
You can read more about this here: https://python.langchain.com/docs/how_to/multi_vector/
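To make the idea concrete, here is a minimal, library-free sketch of multi-vector retrieval. It is not the LangChain or Chroma API: the `MultiVectorStore` class, the toy bag-of-words `embed` function, and the sample chunks are all made up for illustration. The point it shows is that a chunk and its summary are indexed as separate vectors under the same parent id, and a hit on either one returns the full parent chunk.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption); a real pipeline calls an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorStore:
    """Hypothetical store: several vectors may point at one parent chunk."""

    def __init__(self):
        self.docstore = {}  # chunk_id -> full chunk text
        self.vectors = []   # list of (embedding, chunk_id)

    def add(self, chunk_id: str, chunk: str, summary: str) -> None:
        self.docstore[chunk_id] = chunk
        # Index both the raw chunk and its summary under the same parent id.
        self.vectors.append((embed(chunk), chunk_id))
        self.vectors.append((embed(summary), chunk_id))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.vectors, key=lambda v: cosine(q, v[0]), reverse=True)
        seen, results = set(), []
        for _, chunk_id in ranked:
            if chunk_id not in seen:  # deduplicate hits on the same parent
                seen.add(chunk_id)
                results.append(self.docstore[chunk_id])
            if len(results) == k:
                break
        return results


# Made-up sample data: a table chunk whose raw text shares no words with the query.
store = MultiVectorStore()
store.add("t1", "Q1 | 120 | 95\nQ2 | 140 | 101", "table of quarterly revenue and costs")
store.add("p1", "The committee met in March to review staffing.", "staffing review meeting notes")
top = store.search("quarterly revenue")
```

In this toy case the raw table text has no overlap with the query, so only the summary vector matches, yet the search still returns the original table chunk. That is the behavior that makes summaries useful for tables.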
Below, we demonstrate the following process:
With our database in place, we can run queries against it. The query process can be broken down into the following steps:
We can now test out a query. In this example, the final answer can be found on page 12 of the PDF, which aligns with the response provided by the model:
In the example below, we ask a follow up question that relies on the chat history, but does not require a rerun of the RAG pipeline.
We detect questions that do not require RAG by examining the search_queries object returned by calling co.chat to generate candidate queries to answer our question. If this object is empty, then the model has determined that a document query is not needed to answer the question.
In the example below, the else statement is invoked based on query2. We still pass in the chat history, allowing the question to be answered with only the prior context.
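The branching described above can be sketched as follows. This is an illustration under assumptions rather than the notebook's actual code: `generate_queries`, `retrieve`, and `chat` are hypothetical callables standing in for the query-generation call to co.chat, the retriever, and the final chat call, and the stubs below only mimic a response that exposes a `search_queries` list whose items carry a `text` field.

```python
from types import SimpleNamespace


def answer(question, chat_history, generate_queries, retrieve, chat):
    """Run retrieval only when the model proposes search queries."""
    response = generate_queries(question, chat_history)
    if response.search_queries:
        documents = []
        for query in response.search_queries:
            documents.extend(retrieve(query.text))
        return chat(question, chat_history, documents)
    else:
        # No queries proposed: answer from the chat history alone.
        return chat(question, chat_history, None)


# Hypothetical stubs so the sketch runs without network access.
def fake_generate_queries(question, chat_history):
    # Pretend the model needs no document query once there is prior context.
    queries = [] if chat_history else [SimpleNamespace(text=question)]
    return SimpleNamespace(search_queries=queries)


def fake_retrieve(query_text):
    return [{"text": f"passage for: {query_text}"}]


def fake_chat(question, chat_history, documents):
    return "rag" if documents else "history-only"


first = answer("What was the total?", [], fake_generate_queries, fake_retrieve, fake_chat)
history = [{"role": "USER", "message": "What was the total?"}]
follow_up = answer("Can you restate that?", history, fake_generate_queries, fake_retrieve, fake_chat)
```

Passing the three steps in as callables keeps the routing logic separate from any particular client or retriever, which makes the empty-`search_queries` path easy to test in isolation.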
Here, we connect all of the pieces discussed above into one class object, which is then used as a tool for a Cohere ReAct agent. This class definition consolidates and clarifies the key parameters used to define the RAG pipeline.
Finally, we build a simple agent that utilizes the RAG pipeline defined above. We do this by granting the agent access to two tools:
The intention behind coupling these tools is to enable the model to perform mathematical and other postprocessing operations on RAG outputs using Python.
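As a rough illustration of this kind of postprocessing, the sketch below evaluates an arithmetic expression over figures that a RAG step might have returned. It is a deliberately restricted stand-in and not the Python tool the agent uses: the `evaluate` helper is hypothetical, it accepts only numeric literals and basic operators, and the figures in the usage line are made up.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def evaluate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


# Made-up figures: percent change between two values pulled from a table.
percent_change = evaluate("(140 - 120) / 120 * 100")
```

Delegating the arithmetic to code rather than asking the model to compute it in text is the motivation for pairing the two tools: the RAG tool supplies the numbers and the interpreter performs the calculation.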
As before, we can also pass the chat history to the LangChain agent so that it can draw on prior context when answering follow-up queries.
As you can see, the RAG pipeline can be used as a tool for a Cohere ReAct agent. This gives the agent access to the RAG pipeline for document retrieval and generation, as well as to a Python interpreter for mathematical postprocessing. This setup can improve the accuracy of grounded answers to questions about documents that contain both tables and text.