The bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.
In the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.
In this notebook, we will use a real-world pharmaceutical drug label to test several approaches to parsing PDFs. This will allow us to use Cohere’s Command-R model in a RAG setting to answer questions and requests about this label, such as “I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of” a given pharmaceutical.
We will go over five options for processing PDFs, some proprietary and some open source. The parsing mechanisms demonstrated in the following sections are Google Document AI, AWS Textract, Unstructured.io, LlamaParse, and pdf2image with pytesseract.
By way of example, we will be parsing a 21-page PDF containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with our different parsings and evaluate their performance.
Before we dive into the technical weeds, we need to set up the notebook’s runtime and filesystem environments. The code cells below do the following:
data/document-parsing and contain the following:
fda-approved-drug.pdf (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)
Make sure to include the notebook’s utility functions in the runtime.
For demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the next section to run RAG with Command-R on the pre-fetched versions. You can find all parsed resources in detail at the link here.
Document AI helps developers create high-accuracy processors to extract, classify, and split documents.
External documentation: https://cloud.google.com/document-ai
The following block can be executed in one of two ways:
Note: You can skip to the next block if you want to use the pre-existing parsed version.
Amazon Textract is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract’s asynchronous API.
We assume that you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, etc.) with valid credentials. Much of the code here is from supplemental materials created by AWS and offered here:
At minimum, you will need access to the following AWS resources to get started:
fda-approved-drug.pdf file
First, we bring in the TextractWrapper class provided in the AWS Code Examples repository. This class makes it simpler to interface with the Textract service.
Next, we set up Textract and S3, and provide this to an instance of TextractWrapper.
We are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported asynchronously. So for our purposes here, we will only explore the asynchronous route.
Asynchronous calls follow the below process:
Once the job completes, this will return a dictionary with the following keys:
dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])
This response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are Blocks and NextToken. Blocks contains all of the information that was extracted from this chunk, while NextToken tells us what chunk comes next, if any.
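The polling and pagination described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes a boto3 Textract client (or any object exposing the same get_document_analysis call), and the helper names wait_for_job and collect_blocks are hypothetical.

```python
import time


def wait_for_job(client, job_id, poll_seconds=5.0, max_polls=120):
    """Poll the job until it leaves IN_PROGRESS and return its final status.

    `client` is assumed to behave like boto3.client("textract").
    """
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status  # e.g. SUCCEEDED, FAILED, or PARTIAL_SUCCESS
        time.sleep(poll_seconds)
    raise TimeoutError(f"Textract job {job_id} did not finish in time")


def collect_blocks(client, job_id):
    """Follow NextToken from chunk to chunk and gather every Block."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:  # no further chunks
            return blocks
```

Passing the client in as an argument keeps the helpers independent of how credentials are configured, and makes them easy to exercise with a stub.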
Textract returns an information-rich representation of the extracted text, including each element's position on the page and its hierarchical relationships with other entities, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their Blocks. Lucky for us, Amazon provides some helper functions for this purpose, which we utilize below.
We feed in the Job ID from before into the function get_text_results_from_textract to fetch all of the chunks associated with this job. Then, we pass the resulting list into get_the_text_with_required_info and get_text_with_line_spacing_info to organize the text into lines.
Finally, we can concatenate the lines into one string to pass into our downstream RAG pipeline.
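For readers who want a rough picture of what this extraction amounts to, here is a simplified stand-in for the AWS helper functions. It ignores the line-spacing information those helpers use and simply keeps the Text of every Block whose BlockType is LINE; the function name blocks_to_text is hypothetical.

```python
def blocks_to_text(blocks):
    """Join the text of all LINE blocks, one line of output per block.

    Simplified sketch: PAGE, WORD, TABLE and other block types are
    skipped, and no positional information is used.
    """
    lines = [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)
```

Because WORD blocks repeat the content of their parent LINE blocks, filtering on LINE alone avoids duplicating text in the output.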
Unstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML and Word Documents.
External documentation: https://github.com/Unstructured-IO/unstructured-api
The guide assumes an endpoint exists that hosts this service. The API is offered in two forms
Note: You can skip to the next block if you want to use the pre-existing parsed version.
LlamaParse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks.
External documentation: https://github.com/run-llama/llama_parse
The following block uses the LlamaParse cloud offering. You can learn more and obtain an API key for the service here.
LlamaParse offers two output modes, both of which we will explore and compare below.
Note: You can skip to the next block if you want to use the pre-existing parsed version.
The final parsing method we examine does not rely on cloud services. Instead, it uses two libraries: pdf2image and pytesseract. pytesseract lets you perform OCR locally on images, but not on PDF files. So, we first convert our PDF into a set of images via pdf2image.
Now, we can process the image of each page with pytesseract and concatenate the results to get our parsed document.
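The two steps can be sketched together as below. This is an illustrative outline under a few assumptions: pdf2image needs the Poppler utilities installed, pytesseract needs the Tesseract binary, and the function name ocr_pdf and its injectable to_images and image_to_text parameters are hypothetical conveniences, not part of either library.

```python
def ocr_pdf(pdf_path, to_images=None, image_to_text=None):
    """Convert a PDF to page images, OCR each page, and join the results.

    By default this uses pdf2image.convert_from_path and
    pytesseract.image_to_string; both can be swapped out for testing.
    """
    if to_images is None:
        from pdf2image import convert_from_path  # requires Poppler
        to_images = convert_from_path
    if image_to_text is None:
        import pytesseract  # requires the Tesseract binary
        image_to_text = pytesseract.image_to_string

    pages = to_images(pdf_path)  # one image per PDF page
    page_texts = [image_to_text(page) for page in pages]
    return "\n".join(page_texts)
```

Calling ocr_pdf with only a path, for example the fda-approved-drug.pdf file, runs the real libraries.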
We can now ask a set of simple and complex questions and see how each parsing solution performs with Command-R. The questions are
In order to set up our RAG implementation, we need to separate the parsed text into chunks and load the chunks to an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple implementation of indexing using the hnswlib library. Note that there are many different indexing solutions that are appropriate for specific production use cases.
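As an illustration of the chunking step, here is one simple way to split parsed text into overlapping word windows. The word-based splitting, the default sizes, and the name chunk_text are assumptions made for this sketch; the notebook's own chunking strategy may differ.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so that a passage cut at a
    chunk boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and added to the index under its position in the returned list, so that retrieved ids map back to passages.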
In this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere’s reranker to reorder the documents by relevance to our input search query.
Run the code cells below to make head-to-head comparisons of the different parsing techniques across different questions.
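To make the retrieval idea concrete, the sketch below performs an exact, brute-force nearest-neighbor search by cosine similarity in plain Python. It is a stand-in for the approximate search that hnswlib performs, not the notebook's implementation, and it omits the reranking step; the function names are hypothetical and the vectors are assumed to be precomputed embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]
```

An exact scan like this is linear in the number of chunks, which is fine for a single 21-page document; an approximate index becomes worthwhile as the corpus grows.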