Advanced Document Parsing For Enterprises
Introduction
The bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.
In the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.
In this notebook, we will use a real-world pharmaceutical drug label to test several performant approaches to parsing PDFs. This will allow us to use Cohere’s Command-R model in a RAG setting to answer questions and fulfill requests about this label, such as “I need a succinct summary of the compound name, indication, route of administration, and mechanism of action” for a given pharmaceutical.
PDF Parsing
We will go over five options for processing PDFs, both proprietary and open source. The parsing mechanisms demonstrated in the following sections are:
- Google Cloud Document AI
- AWS Textract
- Unstructured.io
- LlamaParse
- pdf2image + pytesseract
By way of example, we will be parsing a 21-page PDF containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with our different parsings and evaluate their performance.
Getting Set Up
Before we dive into the technical weeds, we need to set up the notebook’s runtime and filesystem environments. The code cells below do the following:
- Install required libraries
- Confirm that data dependencies from the GitHub repo have been downloaded. These will be under `data/document-parsing` and contain the following:
  - the PDF document that we will be working with, `fda-approved-drug.pdf` (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)
  - precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results so that readers can skip ahead to the RAG section without having to set up the required infrastructure for each solution.
- Add utility functions needed for later sections
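As a rough sketch, this setup might look something like the code below. The package list and file paths here are assumptions based on the solutions covered later in the notebook, not the repo's exact setup cells.

```python
# Install the client libraries used throughout the notebook (uncomment in a notebook cell).
# !pip install cohere hnswlib pdf2image pytesseract llama-parse boto3 google-cloud-documentai requests

import os

# Confirm that the data dependencies from the GitHub repo are present.
data_dir = "data/document-parsing"
expected_files = ["fda-approved-drug.pdf"]  # plus one precomputed parsed file per solution

for file_name in expected_files:
    path = os.path.join(data_dir, file_name)
    print(f"{path}: {'found' if os.path.exists(path) else 'missing'}")
```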
Utility Functions
Make sure to include the notebook’s utility functions in the runtime.
Document Parsing Solutions
For demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the next section to run RAG with Command-R on the pre-fetched versions. You can find all of the parsed resources at the link here.
Solution 1: Google Cloud Document AI
Document AI helps developers create high-accuracy processors to extract, classify, and split documents.
External documentation: https://cloud.google.com/document-ai
Parsing the document
The following block can be executed in one of two ways:
- Inside a Google Vertex AI environment
  - No authentication needed
- From this notebook
  - Authentication is needed
  - There are pointers inside the code on which lines to uncomment in order to make this work
Note: You can skip to the next block if you want to use the pre-existing parsed version.
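For reference, a minimal synchronous Document AI call might look like the sketch below. The project, location, and processor IDs are placeholders for your own resources, and a document of this length may require Document AI's batch (asynchronous) processing instead of the synchronous call shown here.

```python
# A minimal sketch of a synchronous Document AI request (placeholder IDs throughout).
from google.cloud import documentai

project_id = "your-project-id"      # placeholder: your GCP project
location = "us"                     # placeholder: your processor's region
processor_id = "your-processor-id"  # placeholder: an OCR/document processor you created

client = documentai.DocumentProcessorServiceClient()
processor_name = client.processor_path(project_id, location, processor_id)

with open("data/document-parsing/fda-approved-drug.pdf", "rb") as pdf_file:
    raw_document = documentai.RawDocument(content=pdf_file.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=processor_name, raw_document=raw_document)
)
parsed_text = result.document.text  # the full extracted text of the document
```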
Visualize the parsed document
Solution 2: AWS Textract
Amazon Textract is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract’s asynchronous API.
Parsing the document
We assume that you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, etc.) with valid credentials. Much of the code here is from supplemental materials created by AWS and offered here:
- https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract
- https://github.com/aws-samples/textract-paragraph-identification/tree/main
At minimum, you will need access to the following AWS resources to get started:
- Textract
- an S3 bucket containing the document(s) to process - in this case, our `example-drug-label.pdf` file
- an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.
- an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic
First, we bring in the `TextractWrapper` class provided in the AWS Code Examples repository. This class makes it simpler to interface with the Textract service.
Next, we set up the Textract and S3 clients and provide them to an instance of `TextractWrapper`.
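As a rough sketch, assuming a local copy of the AWS example code in `textract_wrapper.py` and that the wrapper's constructor takes the Textract client plus S3 and SQS resources, as in that repository:

```python
# Set up the AWS clients and hand them to the TextractWrapper helper class.
import boto3
from textract_wrapper import TextractWrapper  # assumption: copied from the AWS Code Examples repo

textract_client = boto3.client("textract")
s3_resource = boto3.resource("s3")
sqs_resource = boto3.resource("sqs")

textract_wrapper = TextractWrapper(textract_client, s3_resource, sqs_resource)
```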
We are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported asynchronously. So for our purposes here, we will only explore the asynchronous route.
Asynchronous calls follow the below process:
- Send a request to Textract with an SNS topic, S3 bucket, and the name (key) of the document inside that bucket to process. Textract returns a Job ID that can be used to track the status of the request
- Textract fetches the document from S3 and processes it
- Once the request is complete, Textract sends out a message to the SNS topic. This can be used in conjunction with other services such as Lambda or SQS for downstream processes.
- The parsed results can be fetched from Textract in chunks via the job ID.
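The first step can also be kicked off directly with boto3, roughly as sketched below; the bucket name, document key, role ARN, and topic ARN are placeholders for your own resources.

```python
# Start an asynchronous document-analysis job (placeholder resource names and ARNs).
response = textract_client.start_document_analysis(
    DocumentLocation={
        "S3Object": {"Bucket": "your-bucket", "Name": "example-drug-label.pdf"}
    },
    FeatureTypes=["TABLES"],  # assumption: also ask Textract to detect tables
    NotificationChannel={
        "RoleArn": "arn:aws:iam::123456789012:role/your-textract-role",
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:your-topic",
    },
)
job_id = response["JobId"]  # used below to fetch the parsed results
```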
Once the job completes, fetching the results from Textract returns a dictionary with the following keys:
dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])
This response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are `Blocks` and `NextToken`. `Blocks` contains all of the information that was extracted from this chunk, while `NextToken` tells us what chunk comes next, if any.
Textract returns an information-rich representation of the extracted text, such as its position on the page and its hierarchical relationships with other entities, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their `Blocks`. Lucky for us, Amazon provides some helper functions for this purpose, which we utilize below.
We feed the Job ID from before into the function `get_text_results_from_textract` to fetch all of the chunks associated with this job. Then, we pass the resulting list into `get_the_text_with_required_info` and `get_text_with_line_spacing_info` to organize the text into lines.
Finally, we can concatenate the lines into one string to pass into our downstream RAG pipeline.
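If you would rather not pull in the helper scripts, a simplified sketch of the same pagination-and-concatenation logic using only boto3 might look like the following; it keeps `LINE` blocks only and skips the line-spacing logic from the AWS samples.

```python
# Page through the Textract results via NextToken and keep only LINE blocks.
def fetch_textract_lines(textract_client, job_id):
    lines, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        chunk = textract_client.get_document_analysis(**kwargs)
        lines.extend(
            block["Text"] for block in chunk["Blocks"] if block["BlockType"] == "LINE"
        )
        next_token = chunk.get("NextToken")
        if not next_token:
            return lines

# Concatenate the lines into one string for the downstream RAG pipeline.
parsed_document = "\n".join(fetch_textract_lines(textract_client, job_id))
```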
Visualize the parsed document
Solution 3: Unstructured.io
Unstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML and Word Documents.
External documentation: https://github.com/Unstructured-IO/unstructured-api
Parsing the document
This guide assumes that an endpoint hosting this service already exists. The API is offered in two forms: a hosted version managed by Unstructured, and an open-source container image that you can deploy yourself.
Note: You can skip to the next block if you want to use the pre-existing parsed version.
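A minimal sketch of calling an unstructured-api endpoint with `requests` might look like this; the endpoint URL (and API key, if your deployment requires one) are placeholders.

```python
import requests

endpoint = "https://your-unstructured-endpoint/general/v0/general"  # placeholder URL

with open("data/document-parsing/fda-approved-drug.pdf", "rb") as pdf_file:
    response = requests.post(
        endpoint,
        files={"files": ("fda-approved-drug.pdf", pdf_file, "application/pdf")},
        # headers={"unstructured-api-key": "your-api-key"},  # only if your deployment requires it
    )

# The API returns a list of element dicts; join their text fields into one document.
elements = response.json()
parsed_document = "\n".join(element["text"] for element in elements)
```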
Visualize the parsed document
Solution 4: LlamaParse
LlamaParse is an API created by LlamaIndex to efficiently parse and represent files for retrieval and context augmentation with LlamaIndex frameworks.
External documentation: https://github.com/run-llama/llama_parse
Parsing the document
The following block uses the LlamaParse cloud offering. You can learn more and obtain an API key for the service here.
Parsing documents with LlamaParse offers two output modes, both of which we will explore and compare below:
- Text
- Markdown
Note: You can skip to the next block if you want to use the pre-existing parsed version.
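A minimal sketch of both output modes, assuming the `LLAMA_CLOUD_API_KEY` environment variable is set (the key can also be passed via the `api_key` argument):

```python
from llama_parse import LlamaParse

# One parser per output mode.
text_parser = LlamaParse(result_type="text")
markdown_parser = LlamaParse(result_type="markdown")

source_path = "data/document-parsing/fda-approved-drug.pdf"
text_documents = text_parser.load_data(source_path)
markdown_documents = markdown_parser.load_data(source_path)

# Each call returns a list of Document objects; join their text into a single string.
parsed_text = "\n".join(doc.text for doc in text_documents)
parsed_markdown = "\n".join(doc.text for doc in markdown_documents)
```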
Visualize the parsed document
Solution 5: pdf2image + pytesseract
The final parsing method we examine does not rely on cloud services; instead, it uses two libraries: `pdf2image` and `pytesseract`. `pytesseract` lets you perform OCR locally on images, but not on PDF files, so we first convert our PDF into a set of images via `pdf2image`.
Parsing the document
Now, we can process the image of each page with `pytesseract` and concatenate the results to get our parsed document.
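Putting the two libraries together takes only a few lines; note that `pdf2image` and `pytesseract` also require the poppler and tesseract system packages, respectively.

```python
from pdf2image import convert_from_path
import pytesseract

# Convert each PDF page to a PIL image, OCR it, and join the results.
pages = convert_from_path("data/document-parsing/fda-approved-drug.pdf")
parsed_document = "\n".join(pytesseract.image_to_string(page) for page in pages)
```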
Visualize the parsed document
Document Questions
We can now ask a set of simple and more complex questions to see how each parsing solution performs with Command-R. The questions are:
- What are the most common adverse reactions of Iwilfin?
  - Task: Simple information extraction
- What is the recommended dosage of IWILFIN on body surface area between 0.5 and 0.75?
  - Task: Tabular data extraction
- I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.
  - Task: Overall document summary
Data Ingestion
In order to set up our RAG implementation, we need to separate the parsed text into chunks and load the chunks to an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple implementation of indexing using the `hnswlib` library. Note that there are many different indexing solutions that are appropriate for specific production use cases.
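A minimal indexing sketch is shown below. It assumes the parsed text has already been split into a `chunks` list, and the embedding model name and index parameters are illustrative choices rather than the notebook's exact settings.

```python
import cohere
import hnswlib

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder API key

chunks = ["..."]  # assumption: list of text chunks produced from the parsed document

# Embed each chunk as a "search_document" and load the vectors into an hnswlib index.
embeddings = co.embed(
    texts=chunks, model="embed-english-v3.0", input_type="search_document"
).embeddings

index = hnswlib.Index(space="ip", dim=len(embeddings[0]))
index.init_index(max_elements=len(embeddings), ef_construction=256, M=16)
index.add_items(embeddings, list(range(len(embeddings))))
```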
Retrieval
In this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere’s reranker to reorder the candidates by relevance to the input search query.
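Continuing from the indexing sketch above, the retrieval step might look roughly like this; the model names and the k / top_n values are illustrative.

```python
query = "What are the most common adverse reactions of Iwilfin?"

# Embed the query and fetch the k nearest chunks from the index.
query_embedding = co.embed(
    texts=[query], model="embed-english-v3.0", input_type="search_query"
).embeddings[0]
labels, _ = index.knn_query(query_embedding, k=min(10, index.get_current_count()))
candidates = [chunks[i] for i in labels[0]]

# Rerank the candidates so the most relevant chunks come first.
reranked = co.rerank(
    model="rerank-english-v3.0", query=query, documents=candidates, top_n=3
)
top_chunks = [candidates[result.index] for result in reranked.results]
```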
Final Step: Call Command-R + RAG!
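Finally, we pass the top chunks to Command-R through the chat endpoint's `documents` parameter so the answer is grounded in the retrieved passages. A minimal sketch, continuing from the retrieval step above:

```python
# Ground the generation on the reranked chunks (the titles here are illustrative labels).
documents = [
    {"title": f"chunk_{i}", "snippet": chunk} for i, chunk in enumerate(top_chunks)
]

response = co.chat(
    model="command-r",
    message=query,
    documents=documents,
)
print(response.text)
```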
Head-to-head Comparisons
Run the code cells below to make head-to-head comparisons of the different parsing techniques across the different questions.