Advanced Document Parsing For Enterprises
Introduction
The bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.
In the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.
In this notebook, we will use a real-world pharmaceutical drug label to test several performant approaches to parsing PDFs. This will allow us to use Cohere’s Command-R model in a RAG setting to answer questions and fulfill requests about this label, such as “I need a succinct summary of the compound name, indication, route of administration, and mechanism of action” for a given pharmaceutical.
PDF Parsing
We will go over five options for processing PDFs, both proprietary and open source. The parsing mechanisms demonstrated in the following sections are:
- Google Cloud Document AI
- AWS Textract
- Unstructured.io
- LlamaParse
- pdf2image + pytesseract
By way of example, we will be parsing a 21-page PDF containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with our different parsings and evaluate their performance.
Getting Set Up
Before we dive into the technical weeds, we need to set up the notebook’s runtime and filesystem environments. The code cells below do the following:
- Install required libraries
- Confirm that data dependencies from the GitHub repo have been downloaded. These will be under `data/document-parsing` and contain the following:
  - the PDF document that we will be working with, `fda-approved-drug.pdf` (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)
  - precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results so that readers can skip ahead to the RAG section without having to set up the required infrastructure for each solution.
- Add utility functions needed for later sections
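As a rough sketch, this setup might look something like the code below. The package list and file paths here are assumptions based on the solutions covered later in the notebook, not the repo's exact setup cells.

```python
# Install the client libraries used throughout the notebook (uncomment in a notebook cell).
# !pip install cohere hnswlib pdf2image pytesseract llama-parse boto3 google-cloud-documentai requests

import os

# Confirm that the data dependencies from the GitHub repo are present.
data_dir = "data/document-parsing"
expected_files = ["fda-approved-drug.pdf"]  # plus one precomputed parsed file per solution

for file_name in expected_files:
    path = os.path.join(data_dir, file_name)
    print(f"{path}: {'found' if os.path.exists(path) else 'missing'}")
```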
Utility Functions
Make sure to include the notebook’s utility functions in the runtime.
Document Parsing Solutions
For demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the next section to run RAG with Command-R on the pre-fetched versions. You can find all of the parsed resources at the link here.
Solution 1: Google Cloud Document AI
Document AI helps developers create high-accuracy processors to extract, classify, and split documents.
External documentation: https://cloud.google.com/document-ai
Parsing the document
The following block can be executed in one of two ways:
- Inside a Google Vertex AI environment
  - No authentication needed
- From this notebook
  - Authentication is needed
  - There are pointers inside the code on which lines to uncomment in order to make this work
Note: You can skip to the next block if you want to use the pre-existing parsed version.
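For reference, a minimal synchronous Document AI call might look like the sketch below. The project, location, and processor IDs are placeholders for your own resources, and a document of this length may require Document AI's batch (asynchronous) processing instead of the synchronous call shown here.

```python
# A minimal sketch of a synchronous Document AI request (placeholder IDs throughout).
from google.cloud import documentai

project_id = "your-project-id"      # placeholder: your GCP project
location = "us"                     # placeholder: your processor's region
processor_id = "your-processor-id"  # placeholder: an OCR/document processor you created

client = documentai.DocumentProcessorServiceClient()
processor_name = client.processor_path(project_id, location, processor_id)

with open("data/document-parsing/fda-approved-drug.pdf", "rb") as pdf_file:
    raw_document = documentai.RawDocument(content=pdf_file.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=processor_name, raw_document=raw_document)
)
parsed_text = result.document.text  # the full extracted text of the document
```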
Visualize the parsed document
Solution 2: AWS Textract
Amazon Textract is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract’s asynchronous API.
Parsing the document
We assume that you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, etc.) with valid credentials. Much of the code here is from supplemental materials created by AWS and offered here:
- https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract
- https://github.com/aws-samples/textract-paragraph-identification/tree/main
At minimum, you will need access to the following AWS resources to get started:
- Textract
- an S3 bucket containing the document(s) to process - in this case, our `example-drug-label.pdf` file
- an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.
- an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic
First, we bring in the `TextractWrapper` class provided in the AWS Code Examples repository. This class makes it simpler to interface with the Textract service.
Next, we set up the Textract and S3 clients and provide them to an instance of `TextractWrapper`.
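As a rough sketch, assuming a local copy of the AWS example code in `textract_wrapper.py` and that the wrapper's constructor takes the Textract client plus S3 and SQS resources, as in that repository:

```python
# Set up the AWS clients and hand them to the TextractWrapper helper class.
import boto3
from textract_wrapper import TextractWrapper  # assumption: copied from the AWS Code Examples repo

textract_client = boto3.client("textract")
s3_resource = boto3.resource("s3")
sqs_resource = boto3.resource("sqs")

textract_wrapper = TextractWrapper(textract_client, s3_resource, sqs_resource)
```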
We are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported asynchronously. So for our purposes here, we will only explore the asynchronous route.
Asynchronous calls follow the below process:
- Send a request to Textract with an SNS topic, S3 bucket, and the name (key) of the document inside that bucket to process. Textract returns a Job ID that can be used to track the status of the request
- Textract fetches the document from S3 and processes it
- Once the request is complete, Textract sends out a message to the SNS topic. This can be used in conjunction with other services such as Lambda or SQS for downstream processes.
- The parsed results can be fetched from Textract in chunks via the job ID.
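The first step can also be kicked off directly with boto3, roughly as sketched below; the bucket name, document key, role ARN, and topic ARN are placeholders for your own resources.

```python
# Start an asynchronous document-analysis job (placeholder resource names and ARNs).
response = textract_client.start_document_analysis(
    DocumentLocation={
        "S3Object": {"Bucket": "your-bucket", "Name": "example-drug-label.pdf"}
    },
    FeatureTypes=["TABLES"],  # assumption: also ask Textract to detect tables
    NotificationChannel={
        "RoleArn": "arn:aws:iam::123456789012:role/your-textract-role",
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:your-topic",
    },
)
job_id = response["JobId"]  # used below to fetch the parsed results
```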
Once the job completes, fetching the results from Textract returns a dictionary with the following keys:
dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])
This response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are `Blocks` and `NextToken`. `Blocks` contains all of the information that was extracted from this chunk, while `NextToken` tells us what chunk comes next, if any.
Textract returns an information-rich representation of the extracted text, such as its position on the page and its hierarchical relationships with other entities, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their `Blocks`. Lucky for us, Amazon provides some helper functions for this purpose, which we utilize below.
We feed the Job ID from before into the function `get_text_results_from_textract` to fetch all of the chunks associated with this job. Then, we pass the resulting list into `get_the_text_with_required_info` and `get_text_with_line_spacing_info` to organize the text into lines.
Finally, we can concatenate the lines into one string to pass into our downstream RAG pipeline.
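If you would rather not pull in the helper scripts, a simplified sketch of the same pagination-and-concatenation logic using only boto3 might look like the following; it keeps `LINE` blocks only and skips the line-spacing logic from the AWS samples.

```python
# Page through the Textract results via NextToken and keep only LINE blocks.
def fetch_textract_lines(textract_client, job_id):
    lines, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        chunk = textract_client.get_document_analysis(**kwargs)
        lines.extend(
            block["Text"] for block in chunk["Blocks"] if block["BlockType"] == "LINE"
        )
        next_token = chunk.get("NextToken")
        if not next_token:
            return lines

# Concatenate the lines into one string for the downstream RAG pipeline.
parsed_document = "\n".join(fetch_textract_lines(textract_client, job_id))
```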
Visualize the parsed document
Solution 3: Unstructured.io
Unstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML and Word Documents.
External documentation: https://github.com/Unstructured-IO/unstructured-api
Parsing the document
This guide assumes that an endpoint hosting this service already exists. The API is offered in two forms: a hosted version managed by Unstructured, and an open-source container image that you can deploy yourself.
Note: You can skip to the next block if you want to use the pre-existing parsed version.
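A minimal sketch of calling an unstructured-api endpoint with `requests` might look like this; the endpoint URL (and API key, if your deployment requires one) are placeholders.

```python
import requests

endpoint = "https://your-unstructured-endpoint/general/v0/general"  # placeholder URL

with open("data/document-parsing/fda-approved-drug.pdf", "rb") as pdf_file:
    response = requests.post(
        endpoint,
        files={"files": ("fda-approved-drug.pdf", pdf_file, "application/pdf")},
        # headers={"unstructured-api-key": "your-api-key"},  # only if your deployment requires it
    )

# The API returns a list of element dicts; join their text fields into one document.
elements = response.json()
parsed_document = "\n".join(element["text"] for element in elements)
```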
Visualize the parsed document
Solution 4: LlamaParse
LlamaParse is an API created by LlamaIndex to efficiently parse and represent files for retrieval and context augmentation with LlamaIndex frameworks.
External documentation: https://github.com/run-llama/llama_parse
Parsing the document
The following block uses the LlamaParse cloud offering. You can learn more and obtain an API key for the service here.
Parsing documents with LlamaParse offers two output modes, both of which we will explore and compare below:
- Text
- Markdown
Note: You can skip to the next block if you want to use the pre-existing parsed version.
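A minimal sketch of both output modes, assuming the `LLAMA_CLOUD_API_KEY` environment variable is set (the key can also be passed via the `api_key` argument):

```python
from llama_parse import LlamaParse

# One parser per output mode.
text_parser = LlamaParse(result_type="text")
markdown_parser = LlamaParse(result_type="markdown")

source_path = "data/document-parsing/fda-approved-drug.pdf"
text_documents = text_parser.load_data(source_path)
markdown_documents = markdown_parser.load_data(source_path)

# Each call returns a list of Document objects; join their text into a single string.
parsed_text = "\n".join(doc.text for doc in text_documents)
parsed_markdown = "\n".join(doc.text for doc in markdown_documents)
```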
Visualize the parsed document
Solution 5: pdf2image + pytesseract
The final parsing method we examine does not rely on cloud services; instead, it uses two libraries: `pdf2image` and `pytesseract`. `pytesseract` lets you perform OCR locally on images, but not on PDF files, so we first convert our PDF into a set of images via `pdf2image`.
Parsing the document
Now, we can process the image of each page with `pytesseract` and concatenate the results to get our parsed document.
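Putting the two libraries together takes only a few lines; note that `pdf2image` and `pytesseract` also require the poppler and tesseract system packages, respectively.

```python
from pdf2image import convert_from_path
import pytesseract

# Convert each PDF page to a PIL image, OCR it, and join the results.
pages = convert_from_path("data/document-parsing/fda-approved-drug.pdf")
parsed_document = "\n".join(pytesseract.image_to_string(page) for page in pages)
```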
Visualize the parsed document
Document Questions
We can now ask a set of simple and more complex questions to see how each parsing solution performs with Command-R. The questions are:
- What are the most common adverse reactions of Iwilfin?
  - Task: Simple information extraction
- What is the recommended dosage of IWILFIN on body surface area between 0.5 and 0.75?
  - Task: Tabular data extraction
- I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.
  - Task: Overall document summary
Data Ingestion
In order to set up our RAG implementation, we need to separate the parsed text into chunks and load the chunks to an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple implementation of indexing using the `hnswlib` library. Note that there are many different indexing solutions that are appropriate for specific production use cases.
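A minimal indexing sketch is shown below. It assumes the parsed text has already been split into a `chunks` list, and the embedding model name and index parameters are illustrative choices rather than the notebook's exact settings.

```python
import cohere
import hnswlib

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder API key

chunks = ["..."]  # assumption: list of text chunks produced from the parsed document

# Embed each chunk as a "search_document" and load the vectors into an hnswlib index.
embeddings = co.embed(
    texts=chunks, model="embed-english-v3.0", input_type="search_document"
).embeddings

index = hnswlib.Index(space="ip", dim=len(embeddings[0]))
index.init_index(max_elements=len(embeddings), ef_construction=256, M=16)
index.add_items(embeddings, list(range(len(embeddings))))
```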
Retrieval
In this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere’s reranker to reorder the candidates by relevance to the input search query.
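Continuing from the indexing sketch above, the retrieval step might look roughly like this; the model names and the k / top_n values are illustrative.

```python
query = "What are the most common adverse reactions of Iwilfin?"

# Embed the query and fetch the k nearest chunks from the index.
query_embedding = co.embed(
    texts=[query], model="embed-english-v3.0", input_type="search_query"
).embeddings[0]
labels, _ = index.knn_query(query_embedding, k=min(10, index.get_current_count()))
candidates = [chunks[i] for i in labels[0]]

# Rerank the candidates so the most relevant chunks come first.
reranked = co.rerank(
    model="rerank-english-v3.0", query=query, documents=candidates, top_n=3
)
top_chunks = [candidates[result.index] for result in reranked.results]
```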
Final Step: Call Command-R + RAG!
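Finally, we pass the top chunks to Command-R through the chat endpoint's `documents` parameter so the answer is grounded in the retrieved passages. A minimal sketch, continuing from the retrieval step above:

```python
# Ground the generation on the reranked chunks (the titles here are illustrative labels).
documents = [
    {"title": f"chunk_{i}", "snippet": chunk} for i, chunk in enumerate(top_chunks)
]

response = co.chat(
    model="command-r",
    message=query,
    documents=documents,
)
print(response.text)
```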
Head-to-head Comparisons
Run the code cells below to make head-to-head comparisons of the different parsing techniques across the different questions.