Advanced Document Parsing For Enterprises

Giannis Chatziveroglou

Introduction

The bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.

In the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.

In this notebook, we will use a real-world pharmaceutical drug label to test out various performant approaches to parsing PDFs. This will allow us to use Cohere's Command-R model in a RAG setting to answer questions about the label, such as "I need a succinct summary of the compound name, indication, route of administration, and mechanism of action" of a given pharmaceutical.

Document Parsing Result

PDF Parsing

We will go over five options for processing PDFs, spanning both proprietary and open-source tooling: Google Cloud Document AI, AWS Textract, Unstructured.io, LlamaParse, and pdf2image with pytesseract.

By way of example, we will be parsing a 21-page PDF containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with the different parsed outputs and evaluate their performance.

Drug Label Snippet

Getting Set Up

Before we dive into the technical weeds, we need to set up the notebook’s runtime and filesystem environments. The code cells below do the following:

  • Install required libraries
  • Confirm that data dependencies from the GitHub repo have been downloaded. These will be under data/document-parsing and contain the following:
    • the PDF document that we will be working with, fda-approved-drug.pdf (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)
    • precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results so that readers can skip ahead to the RAG section without having to set up the required infrastructure for each solution
  • Add utility functions needed for later sections
PYTHON
%%capture
! sudo apt install tesseract-ocr poppler-utils
! pip install "cohere<5" fsspec hnswlib google-cloud-documentai google-cloud-storage boto3 langchain-text-splitters llama_parse pytesseract pdf2image pandas matplotlib
PYTHON
data_dir = "data/document-parsing"
source_filename = "fda-approved-drug"
extension = "pdf"
PYTHON
from pathlib import Path

sources = ["gcp", "aws", "unstructured-io", "llamaparse-text", "llamaparse-markdown", "pytesseract"]

filenames = ["{}-parsed-fda-approved-drug.txt".format(source) for source in sources]
filenames.append("fda-approved-drug.pdf")

for filename in filenames:
    file_path = Path(f"{data_dir}/{filename}")
    if not file_path.is_file():
        print(f"File {filename} not found at {data_dir}!")

Utility Functions

Make sure to include the notebook’s utility functions in the runtime.

PYTHON
def store_document(path: str, doc_content: str):
    with open(path, 'w') as f:
        f.write(doc_content)
PYTHON
import json

def insert_citations_in_order(text, citations, documents):
    """
    A helper function to pretty print citations.
    """

    citations_reference = {}
    for index, doc in enumerate(documents):
        citations_reference[index] = doc

    offset = 0
    # Process citations in the order they were provided
    for citation in citations:
        # Adjust start/end with offset
        start, end = citation['start'] + offset, citation['end'] + offset
        citation_numbers = []
        for doc_id in citation["document_ids"]:
            for citation_index, doc in citations_reference.items():
                if doc["id"] == doc_id:
                    citation_numbers.append(citation_index)
        references = "(" + ", ".join("[{}]".format(num) for num in citation_numbers) + ")"
        modification = f'{text[start:end]} {references}'
        # Replace the cited text with its version + reference placeholder
        text = text[:start] + modification + text[end:]
        # Update the offset for subsequent replacements
        offset += len(modification) - (end - start)

    # Add the citations at the bottom of the text
    text_with_citations = f'{text}'
    citations_reference = ["[{}]: {}".format(x["id"], x["text"]) for x in citations_reference.values()]

    return text_with_citations, "\n".join(citations_reference)
PYTHON
def format_docs_for_chat(documents):
    return [{"id": str(index), "text": x} for index, x in enumerate(documents)]

Document Parsing Solutions

For demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the next section to run RAG with Command-R on the pre-fetched versions. All of the parsed resources can be found under data/document-parsing.

Solution 1: Google Cloud Document AI

Document AI helps developers create high-accuracy processors to extract, classify, and split documents.

External documentation: https://cloud.google.com/document-ai

Parsing the document

The following block can be executed in one of two ways:

  • Inside a Google Vertex AI environment
    • No authentication needed
  • From this notebook
    • Authentication is needed
    • There are pointers inside the code on which lines to uncomment in order to make this work

Note: You can skip to the next block if you want to use the pre-existing parsed version.

PYTHON
1"""
2Extracted from https://cloud.google.com/document-ai/docs/samples/documentai-batch-process-document
3"""
4
5import re
6from typing import Optional
7
8from google.api_core.client_options import ClientOptions
9from google.api_core.exceptions import InternalServerError
10from google.api_core.exceptions import RetryError
11from google.cloud import documentai # type: ignore
12from google.cloud import storage
13
14project_id = ""
15location = ""
16processor_id = ""
17gcs_output_uri = ""
18# credentials_file = "populate if you are running in a non Vertex AI environment."
19gcs_input_prefix = ""
20
21
22def batch_process_documents(
23 project_id: str,
24 location: str,
25 processor_id: str,
26 gcs_output_uri: str,
27 gcs_input_prefix: str,
28 timeout: int = 400
29) -> None:
30 parsed_documents = []
31
32 # Client configs
33 opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
34 # With credentials
35 # opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com", credentials_file=credentials_file)
36
37 client = documentai.DocumentProcessorServiceClient(client_options=opts)
38 processor_name = client.processor_path(project_id, location, processor_id)
39
40 # Input storage configs
41 gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_prefix)
42 input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)
43
44 # Output storage configs
45 gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(gcs_uri=gcs_output_uri, field_mask=None)
46 output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)
47 storage_client = storage.Client()
48 # With credentials
49 # storage_client = storage.Client.from_service_account_json(json_credentials_path=credentials_file)
50
51 # Batch process docs request
52 request = documentai.BatchProcessRequest(
53 name=processor_name,
54 input_documents=input_config,
55 document_output_config=output_config,
56 )
57
58 # batch_process_documents returns a long running operation
59 operation = client.batch_process_documents(request)
60
61 # Continually polls the operation until it is complete.
62 # This could take some time for larger files
63 try:
64 print(f"Waiting for operation {operation.operation.name} to complete...")
65 operation.result(timeout=timeout)
66 except (RetryError, InternalServerError) as e:
67 print(e.message)
68
69 # Get output document information from completed operation metadata
70 metadata = documentai.BatchProcessMetadata(operation.metadata)
71 if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
72 raise ValueError(f"Batch Process Failed: {metadata.state_message}")
73
74 print("Output files:")
75 # One process per Input Document
76 for process in list(metadata.individual_process_statuses):
77 matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
78 if not matches:
79 print("Could not parse output GCS destination:", process.output_gcs_destination)
80 continue
81
82 output_bucket, output_prefix = matches.groups()
83 output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)
84
85 # Document AI may output multiple JSON files per source file
86 # (Large documents get split in multiple file "versions" doc --> parsed_doc_0 + parsed_doc_1 ...)
87 for blob in output_blobs:
88 # Document AI should only output JSON files to GCS
89 if blob.content_type != "application/json":
90 print(f"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}")
91 continue
92
93 # Download JSON file as bytes object and convert to Document Object
94 print(f"Fetching {blob.name}")
95 document = documentai.Document.from_json(blob.download_as_bytes(), ignore_unknown_fields=True)
96 # Store the filename and the parsed versioned document content as a tuple
97 parsed_documents.append((blob.name.split("/")[-1].split(".")[0], document.text))
98
99 print("Finished document parsing process.")
100 return parsed_documents
101
102# Call service
103# versioned_parsed_documents = batch_process_documents(
104# project_id=project_id,
105# location=location,
106# processor_id=processor_id,
107# gcs_output_uri=gcs_output_uri,
108# gcs_input_prefix=gcs_input_prefix
109# )
PYTHON
1"""
2Post process parsed document and store it locally.
3Make sure to run this in a Google Vertex AI environment or include a credentials file.
4"""
5
6"""
7from pathlib import Path
8from collections import defaultdict
9
10parsed_documents = []
11combined_versioned_parsed_documents = defaultdict(list)
12
13# Assemble versioned documents together ({"doc_name": [(0, doc_content_0), (1, doc_content_1), ...]}).
14for filename, doc_content in versioned_parsed_documents:
15 filename, version = "-".join(filename.split("-")[:-1]), filename.split("-")[-1]
16 combined_versioned_parsed_documents[filename].append((version, doc_content))
17
18# Sort documents by version and join the content together.
19for filename, docs in combined_versioned_parsed_documents.items():
20 doc_content = " ".join([x[1] for x in sorted(docs, key=lambda x: x[0])])
21 parsed_documents.append((filename, doc_content))
22
23# Store parsed documents in local storage.
24for filename, doc_content in parsed_documents:
25 file_path = "{}/{}-parsed-{}.txt".format(data_dir, "gcp", source_filename)
26 store_document(file_path, doc_content)
27"""

Visualize the parsed document

PYTHON
1filename = "gcp-parsed-{}.txt".format(source_filename)
2with open("{}/{}".format(data_dir, filename), "r") as doc:
3 parsed_document = doc.read()
4
5print(parsed_document[:1000])

Solution 2: AWS Textract

Amazon Textract is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract’s asynchronous API.

Parsing the document

We assume that you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, etc.) with valid credentials. Much of the code here comes from supplemental materials in the AWS Code Examples repository (https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract).

At minimum, you will need access to the following AWS resources to get started:

  • Textract
  • an S3 bucket containing the document(s) to process - in this case, our fda-approved-drug.pdf file
  • an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.
  • an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic

First, we bring in the TextractWrapper class provided in the AWS Code Examples repository. This class makes it simpler to interface with the Textract service.

PYTHON
# source: https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

"""
Purpose

Shows how to use the AWS SDK for Python (Boto3) with Amazon Textract to
detect text, form, and table elements in document images.
"""

import json
import logging
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


# snippet-start:[python.example_code.textract.TextractWrapper]
class TextractWrapper:
    """Encapsulates Textract functions."""

    def __init__(self, textract_client, s3_resource, sqs_resource):
        """
        :param textract_client: A Boto3 Textract client.
        :param s3_resource: A Boto3 Amazon S3 resource.
        :param sqs_resource: A Boto3 Amazon SQS resource.
        """
        self.textract_client = textract_client
        self.s3_resource = s3_resource
        self.sqs_resource = sqs_resource

    # snippet-end:[python.example_code.textract.TextractWrapper]

    # snippet-start:[python.example_code.textract.DetectDocumentText]
    def detect_file_text(self, *, document_file_name=None, document_bytes=None):
        """
        Detects text elements in a local image file or from in-memory byte data.
        The image must be in PNG or JPG format.

        :param document_file_name: The name of a document image file.
        :param document_bytes: In-memory byte data of a document image.
        :return: The response from Amazon Textract, including a list of blocks
                 that describe elements detected in the image.
        """
        if document_file_name is not None:
            with open(document_file_name, "rb") as document_file:
                document_bytes = document_file.read()
        try:
            response = self.textract_client.detect_document_text(
                Document={"Bytes": document_bytes}
            )
            logger.info("Detected %s blocks.", len(response["Blocks"]))
        except ClientError:
            logger.exception("Couldn't detect text.")
            raise
        else:
            return response

    # snippet-end:[python.example_code.textract.DetectDocumentText]

    # snippet-start:[python.example_code.textract.AnalyzeDocument]
    def analyze_file(
        self, feature_types, *, document_file_name=None, document_bytes=None
    ):
        """
        Detects text and additional elements, such as forms or tables, in a local image
        file or from in-memory byte data.
        The image must be in PNG or JPG format.

        :param feature_types: The types of additional document features to detect.
        :param document_file_name: The name of a document image file.
        :param document_bytes: In-memory byte data of a document image.
        :return: The response from Amazon Textract, including a list of blocks
                 that describe elements detected in the image.
        """
        if document_file_name is not None:
            with open(document_file_name, "rb") as document_file:
                document_bytes = document_file.read()
        try:
            response = self.textract_client.analyze_document(
                Document={"Bytes": document_bytes}, FeatureTypes=feature_types
            )
            logger.info("Detected %s blocks.", len(response["Blocks"]))
        except ClientError:
            logger.exception("Couldn't detect text.")
            raise
        else:
            return response

    # snippet-end:[python.example_code.textract.AnalyzeDocument]

    # snippet-start:[python.example_code.textract.helper.prepare_job]
    def prepare_job(self, bucket_name, document_name, document_bytes):
        """
        Prepares a document image for an asynchronous detection job by uploading
        the image bytes to an Amazon S3 bucket. Amazon Textract must have permission
        to read from the bucket to process the image.

        :param bucket_name: The name of the Amazon S3 bucket.
        :param document_name: The name of the image stored in Amazon S3.
        :param document_bytes: The image as byte data.
        """
        try:
            bucket = self.s3_resource.Bucket(bucket_name)
            bucket.upload_fileobj(document_bytes, document_name)
            logger.info("Uploaded %s to %s.", document_name, bucket_name)
        except ClientError:
            logger.exception("Couldn't upload %s to %s.", document_name, bucket_name)
            raise

    # snippet-end:[python.example_code.textract.helper.prepare_job]

    # snippet-start:[python.example_code.textract.helper.check_job_queue]
    def check_job_queue(self, queue_url, job_id):
        """
        Polls an Amazon SQS queue for messages that indicate a specified Textract
        job has completed.

        :param queue_url: The URL of the Amazon SQS queue to poll.
        :param job_id: The ID of the Textract job.
        :return: The status of the job.
        """
        status = None
        try:
            queue = self.sqs_resource.Queue(queue_url)
            messages = queue.receive_messages()
            if messages:
                msg_body = json.loads(messages[0].body)
                msg = json.loads(msg_body["Message"])
                if msg.get("JobId") == job_id:
                    messages[0].delete()
                    status = msg.get("Status")
                    logger.info(
                        "Got message %s with status %s.", messages[0].message_id, status
                    )
            else:
                logger.info("No messages in queue %s.", queue_url)
        except ClientError:
            logger.exception("Couldn't get messages from queue %s.", queue_url)
        else:
            return status

    # snippet-end:[python.example_code.textract.helper.check_job_queue]

    # snippet-start:[python.example_code.textract.StartDocumentTextDetection]
    def start_detection_job(
        self, bucket_name, document_file_name, sns_topic_arn, sns_role_arn
    ):
        """
        Starts an asynchronous job to detect text elements in an image stored in an
        Amazon S3 bucket. Textract publishes a notification to the specified Amazon SNS
        topic when the job completes.
        The image must be in PNG, JPG, or PDF format.

        :param bucket_name: The name of the Amazon S3 bucket that contains the image.
        :param document_file_name: The name of the document image stored in Amazon S3.
        :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic
                              where the job completion notification is published.
        :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)
                             role that can be assumed by Textract and grants permission
                             to publish to the Amazon SNS topic.
        :return: The ID of the job.
        """
        try:
            response = self.textract_client.start_document_text_detection(
                DocumentLocation={
                    "S3Object": {"Bucket": bucket_name, "Name": document_file_name}
                },
                NotificationChannel={
                    "SNSTopicArn": sns_topic_arn,
                    "RoleArn": sns_role_arn,
                },
            )
            job_id = response["JobId"]
            logger.info(
                "Started text detection job %s on %s.", job_id, document_file_name
            )
        except ClientError:
            logger.exception("Couldn't detect text in %s.", document_file_name)
            raise
        else:
            return job_id

    # snippet-end:[python.example_code.textract.StartDocumentTextDetection]

    # snippet-start:[python.example_code.textract.GetDocumentTextDetection]
    def get_detection_job(self, job_id):
        """
        Gets data for a previously started text detection job.

        :param job_id: The ID of the job to retrieve.
        :return: The job data, including a list of blocks that describe elements
                 detected in the image.
        """
        try:
            response = self.textract_client.get_document_text_detection(JobId=job_id)
            job_status = response["JobStatus"]
            logger.info("Job %s status is %s.", job_id, job_status)
        except ClientError:
            logger.exception("Couldn't get data for job %s.", job_id)
            raise
        else:
            return response

    # snippet-end:[python.example_code.textract.GetDocumentTextDetection]

    # snippet-start:[python.example_code.textract.StartDocumentAnalysis]
    def start_analysis_job(
        self,
        bucket_name,
        document_file_name,
        feature_types,
        sns_topic_arn,
        sns_role_arn,
    ):
        """
        Starts an asynchronous job to detect text and additional elements, such as
        forms or tables, in an image stored in an Amazon S3 bucket. Textract publishes
        a notification to the specified Amazon SNS topic when the job completes.
        The image must be in PNG, JPG, or PDF format.

        :param bucket_name: The name of the Amazon S3 bucket that contains the image.
        :param document_file_name: The name of the document image stored in Amazon S3.
        :param feature_types: The types of additional document features to detect.
        :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic
                              where job completion notification is published.
        :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)
                             role that can be assumed by Textract and grants permission
                             to publish to the Amazon SNS topic.
        :return: The ID of the job.
        """
        try:
            response = self.textract_client.start_document_analysis(
                DocumentLocation={
                    "S3Object": {"Bucket": bucket_name, "Name": document_file_name}
                },
                NotificationChannel={
                    "SNSTopicArn": sns_topic_arn,
                    "RoleArn": sns_role_arn,
                },
                FeatureTypes=feature_types,
            )
            job_id = response["JobId"]
            logger.info(
                "Started text analysis job %s on %s.", job_id, document_file_name
            )
        except ClientError:
            logger.exception("Couldn't analyze text in %s.", document_file_name)
            raise
        else:
            return job_id

    # snippet-end:[python.example_code.textract.StartDocumentAnalysis]

    # snippet-start:[python.example_code.textract.GetDocumentAnalysis]
    def get_analysis_job(self, job_id):
        """
        Gets data for a previously started detection job that includes additional
        elements.

        :param job_id: The ID of the job to retrieve.
        :return: The job data, including a list of blocks that describe elements
                 detected in the image.
        """
        try:
            response = self.textract_client.get_document_analysis(JobId=job_id)
            job_status = response["JobStatus"]
            logger.info("Job %s status is %s.", job_id, job_status)
        except ClientError:
            logger.exception("Couldn't get data for job %s.", job_id)
            raise
        else:
            return response


# snippet-end:[python.example_code.textract.GetDocumentAnalysis]

Next, we set up the Textract and S3 clients, and provide them to an instance of TextractWrapper.

PYTHON
import boto3

textract_client = boto3.client('textract')
s3_client = boto3.client('s3')

textractWrapper = TextractWrapper(textract_client, s3_client, None)

We are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported asynchronously. So for our purposes here, we will only explore the asynchronous route.

Asynchronous calls follow the process below:

  1. Send a request to Textract with an SNS topic, S3 bucket, and the name (key) of the document inside that bucket to process. Textract returns a Job ID that can be used to track the status of the request.
  2. Textract fetches the document from S3 and processes it
  3. Once the request is complete, Textract sends out a message to the SNS topic. This can be used in conjunction with other services such as Lambda or SQS for downstream processes.
  4. The parsed results can be fetched from Textract in chunks via the job ID.
PYTHON
bucket_name = "your-bucket-name"
sns_topic_arn = "your-sns-arn"  # this can be found under the topic you created in the Amazon SNS dashboard
sns_role_arn = "sns-role-arn"  # this is an IAM role that allows Textract to interact with SNS

file_name = "fda-approved-drug.pdf"
PYTHON
# kick off a text detection job. This returns a job ID.
job_id = textractWrapper.start_detection_job(bucket_name=bucket_name, document_file_name=file_name,
                                             sns_topic_arn=sns_topic_arn, sns_role_arn=sns_role_arn)
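
Once Textract notifies us (via the SNS topic) that the job has finished, we can start pulling the parsed output. As a minimal sketch, the wrapper's get_detection_job method (a thin wrapper around Textract's get_document_text_detection API) fetches the first chunk of results:

PYTHON
# Fetch the first chunk of results. This assumes the job has already completed
# (e.g. the SNS notification for this job ID has arrived).
response = textractWrapper.get_detection_job(job_id)
print(response.keys())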

For a completed job, each such call returns a dictionary with the following keys:

dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])

This response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are Blocks and NextToken. Blocks contains all of the information that was extracted from this chunk, while NextToken tells us what chunk comes next, if any.

Textract returns an information-rich representation of the extracted text, including each element's position on the page and its hierarchical relationships with other entities, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their Blocks. Lucky for us, Amazon provides some helper functions for this purpose, which we utilize below.

PYTHON
def get_text_results_from_textract(job_id):
    response = textract_client.get_document_text_detection(JobId=job_id)
    collection_of_textract_responses = []
    pages = [response]

    collection_of_textract_responses.append(response)

    while 'NextToken' in response:
        next_token = response['NextToken']
        response = textract_client.get_document_text_detection(JobId=job_id, NextToken=next_token)
        pages.append(response)
        collection_of_textract_responses.append(response)
    return collection_of_textract_responses

def get_the_text_with_required_info(collection_of_textract_responses):
    total_text = []
    total_text_with_info = []
    running_sequence_number = 0

    font_sizes_and_line_numbers = {}
    for page in collection_of_textract_responses:
        per_page_text = []
        blocks = page['Blocks']
        for block in blocks:
            if block['BlockType'] == 'LINE':
                block_text_dict = {}
                running_sequence_number += 1
                block_text_dict.update(text=block['Text'])
                block_text_dict.update(page=block['Page'])
                block_text_dict.update(left_indent=round(block['Geometry']['BoundingBox']['Left'], 2))
                font_height = round(block['Geometry']['BoundingBox']['Height'], 3)
                line_number = running_sequence_number
                block_text_dict.update(font_height=round(block['Geometry']['BoundingBox']['Height'], 3))
                block_text_dict.update(indent_from_top=round(block['Geometry']['BoundingBox']['Top'], 2))
                block_text_dict.update(text_width=round(block['Geometry']['BoundingBox']['Width'], 2))
                block_text_dict.update(line_number=running_sequence_number)

                if font_height in font_sizes_and_line_numbers:
                    line_numbers = font_sizes_and_line_numbers[font_height]
                    line_numbers.append(line_number)
                    font_sizes_and_line_numbers[font_height] = line_numbers
                else:
                    line_numbers = []
                    line_numbers.append(line_number)
                    font_sizes_and_line_numbers[font_height] = line_numbers

                total_text.append(block['Text'])
                per_page_text.append(block['Text'])
                total_text_with_info.append(block_text_dict)

    return total_text, total_text_with_info, font_sizes_and_line_numbers

def get_text_with_line_spacing_info(total_text_with_info):
    i = 1
    text_info_with_line_spacing_info = []
    while (i < len(total_text_with_info) - 1):
        previous_line_info = total_text_with_info[i - 1]
        current_line_info = total_text_with_info[i]
        next_line_info = total_text_with_info[i + 1]
        if current_line_info['page'] == next_line_info['page'] and previous_line_info['page'] == current_line_info['page']:
            line_spacing_after = round((next_line_info['indent_from_top'] - current_line_info['indent_from_top']), 2)
            spacing_with_prev = round((current_line_info['indent_from_top'] - previous_line_info['indent_from_top']), 2)
            current_line_info.update(line_space_before=spacing_with_prev)
            current_line_info.update(line_space_after=line_spacing_after)
            text_info_with_line_spacing_info.append(current_line_info)
        else:
            text_info_with_line_spacing_info.append(None)
        i += 1
    return text_info_with_line_spacing_info

We feed the Job ID from before into get_text_results_from_textract to fetch all of the chunks associated with this job. Then, we pass the resulting list into get_the_text_with_required_info and get_text_with_line_spacing_info to organize the text into lines, as sketched below.
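
A compact sketch of those calls, using the helper functions defined above (the variable name produced in the final step matches what the concatenation snippet below expects):

PYTHON
# Fetch every response chunk for the job, then organize the detected lines.
collection_of_textract_responses = get_text_results_from_textract(job_id)
total_text, total_text_with_info, font_sizes_and_line_numbers = get_the_text_with_required_info(
    collection_of_textract_responses
)
text_info_with_line_spacing = get_text_with_line_spacing_info(total_text_with_info)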

Finally, we can concatenate the lines into one string to pass into our downstream RAG pipeline.

PYTHON
all_text = "\n".join([line["text"] if line else "" for line in text_info_with_line_spacing])

with open(f"{data_dir}/aws-parsed-{source_filename}.txt", "w") as f:
    f.write(all_text)

Visualize the parsed document

PYTHON
1filename = "aws-parsed-{}.txt".format(source_filename)
2with open("{}/{}".format(data_dir, filename), "r") as doc:
3 parsed_document = doc.read()
4
5print(parsed_document[:1000])

Solution 3: Unstructured.io

Unstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML, and Word documents.

External documentation: https://github.com/Unstructured-IO/unstructured-api

Parsing the document

The guide assumes an endpoint exists that hosts this service. The API is offered in two forms:

  1. a hosted version
  2. an OSS docker image (a minimal self-hosted sketch follows this list)
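
If you want to self-host option 2, a minimal local setup might look like the sketch below. The image name, flags, and endpoint path are assumptions based on the unstructured-api README, so verify them against the project documentation for your environment.

PYTHON
# Assumed self-hosted setup (verify the image tag, port, and endpoint against the unstructured-api README).
! docker run -d --rm --name unstructured-api -p 8000:8000 quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

# The general-purpose partition endpoint would then typically be:
# UNSTRUCTURED_URL = "http://localhost:8000/general/v0/general"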

Note: You can skip to the next block if you want to use the pre-existing parsed version.

PYTHON
import os
import requests

UNSTRUCTURED_URL = ""  # enter service endpoint

parsed_documents = []

input_path = "{}/{}.{}".format(data_dir, source_filename, extension)
with open(input_path, 'rb') as file_data:
    response = requests.post(
        url=UNSTRUCTURED_URL,
        files={"files": ("{}.{}".format(source_filename, extension), file_data)},
        data={
            "output_format": (None, "application/json"),
            "strategy": "hi_res",
            "pdf_infer_table_structure": "true",
            "include_page_breaks": "true"
        },
        headers={"Accept": "application/json"}
    )

parsed_response = response.json()

parsed_document = " ".join([parsed_entry["text"] for parsed_entry in parsed_response])
print("Parsed {}".format(source_filename))
PYTHON
1"""
2Post process parsed document and store it locally.
3"""
4
5file_path = "{}/{}-parsed-fda-approved-drug.txt".format(data_dir, "unstructured-io")
6store_document(file_path, parsed_document)

Visualize the parsed document

PYTHON
1filename = "unstructured-io-parsed-{}.txt".format(source_filename)
2with open("{}/{}".format(data_dir, filename), "r") as doc:
3 parsed_document = doc.read()
4
5print(parsed_document[:1000])

Solution 4: LlamaParse

LlamaParse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation with LlamaIndex frameworks.

External documentation: https://github.com/run-llama/llama_parse

Parsing the document

The following block uses the LlamaParse cloud offering. You can learn more about the service and obtain an API key from LlamaIndex.

LlamaParse offers two output modes, both of which we explore and compare below:

  • Text
  • Markdown

Note: You can skip to the next block if you want to use the pre-existing parsed version.

PYTHON
import os
from llama_parse import LlamaParse

import nest_asyncio  # needed for notebook environments
nest_asyncio.apply()  # needed for notebook environments

llama_index_api_key = "{API_KEY}"
input_path = "{}/{}.{}".format(data_dir, source_filename, extension)
PYTHON
# Text mode
text_parser = LlamaParse(
    api_key=llama_index_api_key,
    result_type="text"
)

text_response = text_parser.load_data(input_path)
text_parsed_document = " ".join([parsed_entry.text for parsed_entry in text_response])

print("Parsed {} to text".format(source_filename))
PYTHON
1"""
2Post process parsed document and store it locally.
3"""
4
5file_path = "{}/{}-text-parsed-fda-approved-drug.txt".format(data_dir, "llamaparse")
6store_document(file_path, text_parsed_document)
PYTHON
# Markdown mode
markdown_parser = LlamaParse(
    api_key=llama_index_api_key,
    result_type="markdown"
)

markdown_response = markdown_parser.load_data(input_path)
markdown_parsed_document = " ".join([parsed_entry.text for parsed_entry in markdown_response])

print("Parsed {} to markdown".format(source_filename))
PYTHON
1"""
2Post process parsed document and store it locally.
3"""
4
5file_path = "{}/{}-markdown-parsed-fda-approved-drug.txt".format(data_dir, "llamaparse")
6store_document(file_path, markdown_parsed_document)

Visualize the parsed document

PYTHON
# Text parsing

filename = "llamaparse-text-parsed-{}.txt".format(source_filename)

with open("{}/{}".format(data_dir, filename), "r") as doc:
    parsed_document = doc.read()

print(parsed_document[:1000])
PYTHON
# Markdown parsing

filename = "llamaparse-markdown-parsed-fda-approved-drug.txt"
with open("{}/{}".format(data_dir, filename), "r") as doc:
    parsed_document = doc.read()

print(parsed_document[:1000])

Solution 5: pdf2image + pytesseract

The final parsing method we examine does not rely on cloud services; instead, it uses two open-source libraries: pdf2image and pytesseract. pytesseract performs OCR locally on images, but it cannot read PDF files directly, so we first convert our PDF into a set of images via pdf2image.

Parsing the document

PYTHON
from matplotlib import pyplot as plt
from pdf2image import convert_from_path
import pytesseract
PYTHON
# pdf2image extracts the PDF pages as a list of PIL.Image objects
pages = convert_from_path("{}/{}.{}".format(data_dir, source_filename, extension))
PYTHON
# we look at the first page as a sanity check:

plt.imshow(pages[0])
plt.axis('off')
plt.show()

Now, we can process the image of each page with pytesseract and concatenate the results to get our parsed document.

PYTHON
label_ocr_pytesseract = "".join([pytesseract.image_to_string(page) for page in pages])
PYTHON
print(label_ocr_pytesseract[:200])
Output
HIGHLIGHTS OF PRESCRIBING INFORMATION
These highlights do not include all the information needed to use
IWILFIN™ safely and effectively. See full prescribing information for
IWILFIN.
IWILFIN™ (eflor
PYTHON
with open(f"{data_dir}/pytesseract-parsed-{source_filename}.txt", "w") as f:
    f.write(label_ocr_pytesseract)

Visualize the parsed document

PYTHON
1filename = "pytesseract-parsed-{}.txt".format(source_filename)
2with open("{}/{}".format(data_dir, filename), "r") as doc:
3 parsed_document = doc.read()
4
5print(parsed_document[:1000])

Document Questions

We can now ask a set of simple and complex questions and see how each parsing solution performs with Command-R. The questions are:

  • What are the most common adverse reactions of Iwilfin?
    • Task: Simple information extraction
  • What is the recommended dosage of IWILFIN on body surface area between 0.5 m2 and 0.75 m2?
    • Task: Tabular data extraction
  • I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.
    • Task: Overall document summary
PYTHON
import cohere
co = cohere.Client(api_key="{API_KEY}")
PYTHON
1"""
2Document Questions
3"""
4prompt = "What are the most common adverse reactions of Iwilfin?"
5# prompt = "What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?"
6# prompt = "I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin."
7
8"""
9Choose one of the above solutions
10"""
11source = "gcp"
12# source = "aws"
13# source = "unstructured-io"
14# source = "llamaparse-text"
15# source = "llamaparse-markdown"
16# source = "pytesseract"

Data Ingestion

In order to set up our RAG implementation, we need to split the parsed text into chunks and load the chunks into an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple index built with the hnswlib library. Note that there are many different indexing solutions that are appropriate for specific production use cases.

PYTHON
1"""
2Read parsed document content and chunk data
3"""
4
5import os
6from langchain_text_splitters import RecursiveCharacterTextSplitter
7
8documents = []
9
10with open("{}/{}-parsed-fda-approved-drug.txt".format(data_dir, source), "r") as doc:
11doc_content = doc.read()
12
13"""
14Personal notes on chunking
15https://medium.com/@ayhamboucher/llm-based-context-splitter-for-large-documents-445d3f02b01b
16"""
17
18
19# Chunk doc content
20text_splitter = RecursiveCharacterTextSplitter(
21 chunk_size=512,
22 chunk_overlap=200,
23 length_function=len,
24 is_separator_regex=False
25)
26
27# Split the text into chunks with some overlap
28chunks_ = text_splitter.create_documents([doc_content])
29documents = [c.page_content for c in chunks_]
30
31print("Source document has been broken down to {} chunks".format(len(documents)))
PYTHON
1"""
2Embed document chunks
3"""
4document_embeddings = co.embed(texts=documents, model="embed-english-v3.0", input_type="search_document").embeddings
PYTHON
1"""
2Create document index and add embedded chunks
3"""
4
5import hnswlib
6
7index = hnswlib.Index(space='ip', dim=1024) # space: inner product
8index.init_index(max_elements=len(document_embeddings), ef_construction=512, M=64)
9index.add_items(document_embeddings, list(range(len(document_embeddings))))
10print("Count:", index.element_count)
Output
Count: 115

Retrieval

In this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere's reranker to reorder them by relevance to the input search query.

PYTHON
1"""
2Embed search query
3Fetch k nearest neighbors
4"""
5
6query_emb = co.embed(texts=[prompt], model='embed-english-v3.0', input_type="search_query").embeddings
7default_knn = 10
8knn = default_knn if default_knn <= index.element_count else index.element_count
9result = index.knn_query(query_emb, k=knn)
10neighbors = [(result[0][0][i], result[1][0][i]) for i in range(len(result[0][0]))]
11relevant_docs = [documents[x[0]] for x in sorted(neighbors, key=lambda x: x[1])]
PYTHON
1"""
2Rerank retrieved documents
3"""
4
5rerank_results = co.rerank(query=prompt, documents=relevant_docs, top_n=3, model='rerank-english-v2.0').results
6reranked_relevant_docs = format_docs_for_chat([x.document["text"] for x in rerank_results])

Final Step: Call Command-R + RAG!

PYTHON
1"""
2Call the /chat endpoint with command-r
3"""
4
5response = co.chat(
6 message=prompt,
7 model="command-r",
8 documents=reranked_relevant_docs
9)
10
11cited_response, citations_reference = insert_citations_in_order(response.text, response.citations, reranked_relevant_docs)
12print(cited_response)
13print("\n")
14print("References:")
15print(citations_reference)

Head-to-head Comparisons

Run the code cells below to make head-to-head comparisons of the different parsing techniques across different questions.

PYTHON
import pandas as pd
results = pd.read_csv("{}/results-table.csv".format(data_dir))
PYTHON
question = input("""
Question 1: What are the most common adverse reactions of Iwilfin?
Question 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?
Question 3: I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.

Pick which question you want to see (1,2,3): """)
references = input("Do you want to see the references as well? References are long and noisy (y/n): ")
print("\n\n")

index = {"1": 0, "2": 3, "3": 6}[question]

for src in ["gcp", "aws", "unstructured-io", "llamaparse-text", "llamaparse-markdown", "pytesseract"]:
    print("| {} |".format(src))
    print("\n")
    print(results[src][index])
    if references == "y":
        print("\n")
        print("References:")
        print(results[src][index + 1])
    print("\n")
Output
Question 1: What are the most common adverse reactions of Iwilfin?
Question 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?
Question 3: I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.
Pick which question you want to see (1,2,3): 3
Do you want to see the references as well? References are long and noisy (y/n): n
| gcp |
Compound Name: eflornithine hydrochloride ([0], [1], [2]) (IWILFIN ([1])™)
Indication: used to reduce the risk of relapse in adult and paediatric patients with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded at least partially to prior multiagent, multimodality therapy. ([1], [3], [4])
Route of Administration: IWILFIN™ tablets ([1], [3], [4]) are taken orally twice daily ([3], [4]), with doses ranging from 192 to 768 mg based on body surface area. ([3], [4])
Mechanism of Action: IWILFIN™ is an ornithine decarboxylase inhibitor. ([0], [2])
| aws |
Compound Name: eflornithine ([0], [1], [2], [3]) (IWILFIN ([0])™)
Indication: used to reduce the risk of relapse ([0], [3]) in adults ([0], [3]) and paediatric patients ([0], [3]) with high-risk neuroblastoma (HRNB) ([0], [3]) who have responded to prior therapies. ([0], [3], [4])
Route of Administration: Oral ([2], [4])
Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. ([1])
| unstructured-io |
Compound Name: Iwilfin ([1], [2], [3], [4]) (eflornithine) ([0], [2], [3], [4])
Indication: Iwilfin is indicated to reduce the risk of relapse ([1], [3]) in adult and paediatric patients ([1], [3]) with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded to prior anti-GD2 ([1]) immunotherapy ([1], [4]) and multi-modality therapy. ([1])
Route of Administration: Oral ([0], [3])
Mechanism of Action: Iwilfin is an ornithine decarboxylase inhibitor. ([1], [2], [3], [4])
| llamaparse-text |
Compound Name: IWILFIN ([2], [3]) (eflornithine) ([3])
Indication: IWILFIN is used to reduce the risk of relapse ([1], [2], [3]) in adult and paediatric patients ([1], [2], [3]) with high-risk neuroblastoma (HRNB) ([1], [2], [3]), who have responded at least partially to certain prior therapies. ([2], [3])
Route of Administration: IWILFIN is administered as a tablet. ([2])
Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. ([0], [1], [4])
| llamaparse-markdown |
Compound Name: IWILFIN ([1], [2]) (eflornithine) ([1])
Indication: IWILFIN is indicated to reduce the risk of relapse ([1], [2]) in adult and paediatric patients ([1], [2]) with high-risk neuroblastoma (HRNB) ([1], [2]), who have responded at least partially ([1], [2], [3]) to prior anti-GD2 immunotherapy ([1], [2]) and multiagent, multimodality therapy. ([1], [2], [3])
Route of Administration: Oral ([0], [1], [3], [4])
Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([1])
| pytesseract |
Compound Name: IWILFIN™ ([0], [2]) (eflornithine) ([0], [2])
Indication: IWILFIN is indicated to reduce the risk of relapse ([0], [2]) in adult and paediatric patients ([0], [2]) with high-risk neuroblastoma (HRNB) ([0], [2]), who have responded positively to prior anti-GD2 immunotherapy and multiagent, multimodality therapy. ([0], [2], [4])
Route of Administration: IWILFIN is administered orally ([0], [1], [3], [4]), in the form of a tablet. ([1])
Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([0])