Agentic RAG for PDFs with mixed data
Motivation
Retrieval-augmented generation (RAG) allows language models to generate grounded answers to questions about documents. However, the complexity of the documents can significantly influence overall RAG performance. For instance, the documents may be PDFs that contain a mix of text and tables.
More broadly, the implementation of a RAG pipeline - including parsing and chunking of documents, along with the embedding and retrieval of the chunks - is critical to the accuracy of grounded answers. Additionally, it is sometimes not sufficient to merely retrieve the answers; a user may want further postprocessing performed on the output. This use case would benefit from giving the model access to tools.
Objective
In this notebook, we will guide you through best practices for setting up a RAG pipeline to process documents that contain both tables and text. We will also demonstrate how to create a ReAct agent with a Cohere model, and then give the agent access to a RAG pipeline tool to improve accuracy. The general structure of the notebook is as follows:
- individual components around parsing, retrieval and generation are covered for documents with mixed tabular and textual data
- a class object is created that can be used to instantiate the pipeline with parametric input
- the RAG pipeline is then used as a tool for a Cohere ReAct agent
Reference Documents
We recommend the following notebook as a guide to semi-structured RAG.
We also recommend the following notebook to explore various parsing techniques for PDFs.
Various LangChain-supported parsers can be found here.
Install Dependencies
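The exact dependency set is an assumption based on the libraries used below; pin versions as needed for your environment:

```python
# Assumed dependencies for this notebook; adjust to your environment.
%pip install cohere langchain langchain-cohere langchain-community \
    langchain-experimental "unstructured[pdf]" chromadb
```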
Parsing
To improve RAG performance on PDFs with mixed types (text and tables), we investigated a number of parsing and chunking strategies from various libraries:
- PyPDFLoader (LangChain)
- LlamaParse (LlamaIndex)
- Unstructured
We have found that the best option for parsing is unstructured.io since the parser can:
- separate tables from text
- automatically chunk the tables and text by title during the parsing step so that similar elements are grouped (see the sketch after this list)
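A minimal parsing sketch using unstructured's `partition_pdf` with the `by_title` chunking strategy; the file path and the chunk-size parameters here are illustrative, not recommendations:

```python
from unstructured.partition.pdf import partition_pdf

# Parse the PDF, keep table structure, and chunk elements by title so that
# text and tables under the same heading stay together.
elements = partition_pdf(
    filename="example.pdf",          # illustrative path
    strategy="hi_res",               # layout model needed to detect tables
    infer_table_structure=True,      # keep tables as structured elements
    chunking_strategy="by_title",    # group elements under their section title
    max_characters=4000,             # illustrative chunk-size limits
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

# Separate tables from narrative text for downstream summarization/embedding.
tables = [el for el in elements if el.category == "Table"]
texts = [el for el in elements if el.category != "Table"]
```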
Vector Store Setup
There are many options for setting up a vector store. Here, we show how to do so using Chroma and LangChain's multi-vector retrieval. As the name implies, multi-vector retrieval allows us to store multiple vectors per document; for instance, for a single document chunk, one could keep embeddings for both the chunk itself and a summary of that chunk. A summary may distill more accurately what a chunk is about, leading to better retrieval.
You can read more about this here: https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector/
Below, we demonstrate the following process:
- summaries of each chunk are embedded
- during inference, multi-vector retrieval matches the query against the summaries but returns the full document chunk associated with each matched summary (see the sketch after this list)
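A minimal sketch of this setup, assuming `chunks` and `summaries` lists produced by the parsing and summarization steps; the embedding model name is an assumption:

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_cohere import CohereEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

id_key = "doc_id"

# The vector store indexes the summaries; the docstore holds the full chunks.
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=CohereEmbeddings(model="embed-english-v3.0"),  # assumed model
)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    id_key=id_key,
)

# Embed each summary, keyed back to its full chunk via a shared doc_id.
doc_ids = [str(uuid.uuid4()) for _ in chunks]
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, chunks)))
```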
RAG Pipeline
With our database in place, we can run queries against it. The query process can be broken down into the following steps:
- augment the query: generating several candidate search queries helps retrieve all of the relevant information
- use each augmented query to retrieve the top-k documents, then rerank the combined results
- concatenate the shortlisted/reranked documents and pass them to the generation model (see the sketch after this list)
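A minimal sketch of the query path, assuming the `retriever` built in the previous section; the model names are assumptions and `co` is a `cohere.Client()`:

```python
import cohere

co = cohere.Client()  # reads the API key from the environment

query = "What was the total revenue in 2023?"  # illustrative query

# 1. Augment the query: ask the model for candidate search queries.
response = co.chat(message=query, search_queries_only=True)
queries = [q.text for q in response.search_queries] or [query]

# 2. Retrieve top-k documents for each augmented query, then rerank.
docs = []
for q in queries:
    docs.extend(retriever.get_relevant_documents(q))
texts = list({d.page_content for d in docs})  # de-duplicate

reranked = co.rerank(
    query=query,
    documents=texts,
    top_n=3,
    model="rerank-english-v3.0",  # assumed rerank model
)
top_docs = [texts[r.index] for r in reranked.results]

# 3. Pass the shortlisted documents to the generation model.
answer = co.chat(
    message=query,
    documents=[{"snippet": d} for d in top_docs],
    model="command-r",  # assumed generation model
)
print(answer.text)
```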
Example
We can now test out a query. In this example, the final answer can be found on page 12 of the PDF, which aligns with the response provided by the model:
Chat History Management
In the example below, we ask a follow-up question that relies on the chat history but does not require a rerun of the RAG pipeline.
We detect questions that do not require RAG by examining the `search_queries` object returned by calling `co.chat` to generate candidate queries for answering our question. If this object is empty, then the model has determined that a document query is not needed to answer the question. In the example below, the `else` statement is invoked based on `query2`. We still pass in the chat history, allowing the question to be answered with only the prior context.
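A minimal sketch of this check, reusing `query` and `answer` from the previous section; the follow-up question is illustrative:

```python
# Earlier turns, in the shape the Cohere chat API expects.
chat_history = [
    {"role": "USER", "message": query},
    {"role": "CHATBOT", "message": answer.text},
]

query2 = "And what about the year before that?"  # illustrative follow-up

# Ask only for candidate search queries, without generating an answer.
check = co.chat(message=query2, search_queries_only=True)

if check.search_queries:
    # The model wants fresh documents: rerun the retrieval, rerank, and
    # generation steps from the previous section with query2.
    ...
else:
    # No retrieval needed: answer from the prior context alone.
    response = co.chat(message=query2, chat_history=chat_history)
    print(response.text)
```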
RAG Pipeline Class
Here, we connect all of the pieces discussed above into one class object, which is then used as a tool for a Cohere ReAct agent. This class definition consolidates and clarifies the key parameters used to define the RAG pipeline.
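A skeleton of what such a class might look like; the class name, method names, and parameters are illustrative, and each method corresponds to a step shown earlier:

```python
import cohere


class RAGPipeline:
    """Illustrative end-to-end pipeline: parse -> index -> retrieve -> generate."""

    def __init__(
        self,
        files,                                  # paths to the PDFs to index
        top_n=3,                                # documents kept after reranking
        generation_model="command-r",           # assumed model names
        rerank_model="rerank-english-v3.0",
    ):
        self.co = cohere.Client()
        self.top_n = top_n
        self.generation_model = generation_model
        self.rerank_model = rerank_model
        self.retriever = self.build_retriever(files)

    def build_retriever(self, files):
        # Parse and chunk with unstructured, then build the multi-vector
        # retriever over chunk summaries, as in the sections above.
        ...

    def query(self, question, chat_history=None):
        # Augment the query, retrieve and rerank, then generate a grounded
        # answer; fall back to the chat history when no search queries are
        # returned.
        ...
```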
Cohere ReAct Agent with RAG Tool
Finally, we build a simple agent that utilizes the RAG pipeline defined above. We do this by granting the agent access to two tools:
- the end-to-end RAG pipeline
- a Python interpreter
The intention behind coupling these tools is to enable the model to perform mathematical and other postprocessing operations on RAG outputs using Python, as in the sketch below.
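A minimal sketch of the agent wiring, assuming the `RAGPipeline` class sketched above; the tool names, descriptions, file path, and model name are assumptions:

```python
from langchain.agents import AgentExecutor, Tool
from langchain_cohere import ChatCohere, create_cohere_react_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.utilities import PythonREPL

rag = RAGPipeline(files=["example.pdf"])  # illustrative path

tools = [
    Tool(
        name="rag_pipeline",
        description="Answers questions grounded in the PDF's text and tables.",
        func=lambda q: rag.query(q),
    ),
    Tool(
        name="python_interpreter",
        description="Executes Python code, e.g. arithmetic on retrieved values.",
        func=PythonREPL().run,
    ),
]

llm = ChatCohere(model="command-r-plus")  # assumed model
prompt = ChatPromptTemplate.from_template("{input}")

agent = create_cohere_react_agent(llm=llm, tools=tools, prompt=prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

agent_executor.invoke(
    {"input": "Sum the revenue figures from the table and report the total."}
)
```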
Just like earlier, we can also pass chat history to the LangChain agent so that follow-up queries can draw on the prior context.
Conclusion
As you can see, the RAG pipeline can be used as a tool for a Cohere ReAct agent. This gives the agent access to the RAG pipeline for document retrieval and generation, as well as a Python interpreter for mathematical and other postprocessing operations. This setup can be used to improve the accuracy of grounded answers to questions about documents that contain both tables and text.