Invoice Extractor

Some manual tasks can be tedious, time consuming, and downright painful. One is the task of searching for specific information across a large body of documents. No one wants to read every single one in order to extract the necessary information, especially if the formats of those documents vary. Large language models can make the process much faster and easier by learning to extract specific information from unstructured text.

The Invoice Extractor is designed to automatically extract information from invoices and receipts using a Cohere large language model. The app processes documents in two steps. It first takes an image of the invoice document and computes similarity scores against images of known invoices in order to classify the template that the invoice is based on. The app then collects all the documents that follow the same template and extracts information using the Cohere Generate endpoint.

The steps to build the Invoice Extractor are:

Step 1: Collect and prepare data
Step 2: Load the dataset
Step 3: Identify and classify the image template
Step 4: Extract information from documents

Read on for more details on each of these steps.

Building an Invoice Extraction App

Step 1: Collect and Prepare Data

To start, we’ll need to prepare a set of PDF documents: some that will serve as examples for the Cohere language model (they are included in the prompt rather than used for fine-tuning) and some that we’ll run prediction on. You can use any number of invoices that you may have available, as long as they are PDFs.

To store our training data, including documents and annotations, let’s set up a directory called vendors that follows the file tree structure below.

├── <vendor_name>
│   ├── annotations
│   │   ├── <invoice1>.json
│   │   ├── <invoice2>.json
│   │   └── <invoice3>.json
│   ├── image
│   │   └── <invoice1>.jpg
│   ├── pdf
│   │   ├── <invoice1>.pdf
│   │   ├── <invoice2>.pdf
│   │   └── <invoice3>.pdf
│   └── text
│       ├── <invoice1>.txt
│       ├── <invoice2>.txt
│       └── <invoice3>.txt

The app will generate the content under image and text (a sketch of how these files can be produced appears at the end of this step), and annotations are stored in JSON format. Below is an example annotation.

{"Address": "16 Delancey St","Tax": "2.52", "Total": "15.92", "Order": "Crazy Rich Vegan Ramen, Mentai Mochi Spring Roll", "Date": "May 17, 2022"}

For documents that we’ll use for prediction, let’s set up a directory called test_set which follows the file tree structure below.

├── images
│   ├── <predict1>.jpg
│   ├── <predict2>.jpg
│   ├── <predict3>.jpg
│   └── <predict4>.jpg
└── pdf
    ├── <predict1>.pdf
    ├── <predict2>.pdf
    ├── <predict3>.pdf
    └── <predict4>.pdf
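
The article's code assumes the image and text files already exist; it doesn't show how they are produced. Below is a minimal sketch of one way to generate them, assuming the pdf2image and pypdf packages are installed (pdf2image additionally requires the poppler utilities). The prepare_vendor_files helper and the DPI value are illustrative, not part of the app; the same rendering step can be used to create the images under test_set/images.

import os
import glob

from pdf2image import convert_from_path  # requires poppler to be installed
from pypdf import PdfReader

def prepare_vendor_files(vendor_dir):
    # Illustrative helper: render the first page of each PDF to a JPG and dump its raw text
    pdf_paths = glob.glob(os.path.join(vendor_dir, "pdf", "*.pdf"))
    os.makedirs(os.path.join(vendor_dir, "image"), exist_ok=True)
    os.makedirs(os.path.join(vendor_dir, "text"), exist_ok=True)

    for pdf_path in pdf_paths:
        name = os.path.splitext(os.path.basename(pdf_path))[0]

        # Render the first page as a JPG (used for template classification)
        page = convert_from_path(pdf_path, dpi=150, first_page=1, last_page=1)[0]
        page.save(os.path.join(vendor_dir, "image", f"{name}.jpg"), "JPEG")

        # Extract the raw text of the first page (used to build prompts)
        text = PdfReader(pdf_path).pages[0].extract_text()
        with open(os.path.join(vendor_dir, "text", f"{name}.txt"), "w") as f:
            f.write(text)

for vendor_dir in glob.glob(os.path.join("vendors", "*")):
    prepare_vendor_files(vendor_dir)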

Step 2: Load the Dataset

Once we have prepared our dataset, we will load the documents and render their images on the screen so that the user can scroll through them horizontally. When using the Invoice Extractor, a user should be able to click on one of these documents and have the app automatically extract information from it.

Next, we’ll create a Python file called app.py and use it to run the Streamlit app. We start with the imports and the Cohere client setup, then load all the documents that will be used for prediction.

import os
import glob
import json
import base64

import cohere
import streamlit as st
from pypdf import PdfReader  # PyPDF2 (v3+) also provides PdfReader
from st_clickable_images import clickable_images

# Initialize the Cohere client (paste your own API key)
co = cohere.Client("<YOUR_API_KEY>")

st.set_page_config(page_title="Template Gallery", layout="wide")

test_pdf_dir = os.path.join("test_set", "pdf")
test_image_dir = os.path.join("test_set", "images")

test_invoices = glob.glob(os.path.join(test_pdf_dir, '*'))
test_invoices.sort()

test_image_list  = []
test_image_paths = []

st.markdown("# Select Invoice to Extract!")

for test_invoice in test_invoices:
    base_name = os.path.basename(test_invoice)
    file_name = base_name.split(".")[0]

    # Load the pre-rendered image of the invoice's first page and encode it as base64
    image_path = os.path.join(test_image_dir, f"{file_name}.jpg")
    with open(image_path, "rb") as f:
        image_content = f.read()
        encoded = base64.b64encode(image_content).decode()
        test_image_list.append(f"data:image/jpeg;base64,{encoded}")
    test_image_paths.append(image_path)

test_clicked = clickable_images(test_image_list,
    titles=test_image_paths,
    div_style={"display": "flex", "justify-content": "start", "overflow-x": "auto"},
    img_style={"margin": "5px", "height": "500px"},
)

Step 3: Identify and Classify the Image Template

When extracting information from an invoice, our first step is to identify the vendor that issued the invoice. We only want to extract from invoices where the model will produce highly accurate results and filter out invoices where it will not. This way, we can ensure full automation with high confidence. We will use PyTorch, an open-source machine learning framework, for the image classification that identifies the vendor name.

We use a pretrained ResNet-18, an open-source computer vision model, to obtain embeddings of our training images. We then find the training image that is most similar to the test image, provided its similarity score exceeds a threshold. If no training image scores above the threshold, the application treats the document’s template as “new” and rejects the document.

Let's first load the image classification model.

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable

# Load the pretrained model
model = models.resnet18(pretrained=True)
# Use the model object to select the desired layer
layer = model._modules.get('avgpool')

# Set model to evaluation mode
model.eval()

scaler = transforms.Resize((224, 224))
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
to_tensor = transforms.ToTensor()

cos = nn.CosineSimilarity(dim=1, eps=1e-6)
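
The avgpool layer sits immediately before ResNet-18's final classification layer, so the forward hook we attach to it in the next snippet captures a 512-dimensional feature vector that serves as a general-purpose embedding of each invoice image.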

Next, let's compute the similarity score. We create a function, get_vendor_name, which iterates over each template image, gets the image embedding with the get_vector function, and uses image_resnet_score to compute a similarity score against the target image that we want to run prediction on. It then returns the name of the template it has classified.

from PIL import Image

# Minimum cosine similarity required to accept a template match
# (example value, not defined in the original post; tune it for your own data)
THRESHOLD = 0.9

# Returns image embedding given image path
def get_vector(image_path):
    img = Image.open(image_path).convert("RGB")
    img_tensor = Variable(normalize(to_tensor(scaler(img))).unsqueeze(0))
    img_embedding = torch.zeros(512)
    # Forward hook that copies the avgpool output into img_embedding
    def copy_data(m, i, o):
        img_embedding.copy_(o.data.reshape(o.data.size(1)))
    h = layer.register_forward_hook(copy_data)
    model(img_tensor)
    h.remove()
    return img_embedding

# Get image embedding vectors and compute cosine similarity
def image_resnet_score(img1, img2):
    img1_score = get_vector(img1)
    img2_score = get_vector(img2)
    cos_sim = cos(img1_score.unsqueeze(0), img2_score.unsqueeze(0))
    return cos_sim

# Given an image path, get the most similar template name
def get_vendor_name(image_path):
    scores = []
    vendor_names = []

    # Compute a similarity score for every template image of every vendor
    vendor_paths = glob.glob(os.path.join("vendors", "*"))
    for vendor_path in vendor_paths:
        vendor_dir = os.path.join(vendor_path, "image", "*")
        vendor_files = glob.glob(vendor_dir)
        for vendor_file in vendor_files:
            scores.append(image_resnet_score(image_path, vendor_file))
            vendor_names.append(os.path.basename(vendor_path))

    # Select the best-matching template only if its score exceeds the threshold
    max_score = max(scores)
    if max_score > THRESHOLD:
        max_ind = scores.index(max_score)
        return vendor_names[max_ind]
    return None

Step 4: Extract Information From Documents

We will use the parse_data function to load the text content of the training documents and parse_annotations to load their annotations. These functions will be used to construct prompts.

# Given the vendor name, we collect all the raw text of training data
def parse_data(vendor_name):
    text_dir = os.path.join('vendors', vendor_name, "text", "*")
    text_files = glob.glob(text_dir)
    text_files.sort()
    contents = []
    for text_file in text_files:
        with open(text_file) as f:
            content = f.read()
            contents.append(content)
    return contents

# Given the vendor name, we collect all the annotations of training data
def parse_annotations(vendor_name):
    anno_dir = os.path.join('vendors', vendor_name, "annotations", "*")
    anno_files = glob.glob(anno_dir)
    anno_files.sort()
    contents = []
    for anno_file in anno_files:
        with open(anno_file) as f:
            content = json.loads(f.read())
            contents.append(content)
    return contents

Let's put the functions we defined above together and create a function, extract_invoice, which, given the index of the clicked document, returns the extracted information. We first run image classification using the get_vendor_name function we defined above, then run the construct_prompt function to create a prompt for extraction.

The prompt is in the following format.

<invoice text content1>
Address: <address1>
=====
<invoice text content2>
Address: <address2>
=====
<invoice text content to predict>
Address:

The model has to complete the address, which in this example is the field we want to extract; the app repeats the same prompt format for every field in the annotations.

Below is the source code.

MAX_EXAMPLES = 4

# Create a prompt consisting of the raw text and annotation of the training data,
# the field we want to extract, and the raw text of the document we want to predict
def construct_prompt(texts, annotations, field, test_text):
    prompts = []
    separator = "\n=====\n"
    for text, annotation in zip(texts[:MAX_EXAMPLES], annotations[:MAX_EXAMPLES]):
        prompt = text + "\n"
        anno_prompt = f"\n{field}: "
        anno_prompt += annotation[field]
        prompt += anno_prompt
        prompts.append(prompt)
    return separator.join(prompts) + separator + test_text + f"\n\n{field}:"

def extract_invoice(idx, test_image_paths):
    # Get template name by running image classification
    template = get_vendor_name(test_image_paths[idx])

    # Reject documents whose template is not recognized (a "new" template)
    if template is None:
        st.markdown("### New template detected. Skipping extraction.")
        return None

    # Collect raw text and annotations of the training data
    texts = parse_data(template)
    annotations = parse_annotations(template)

    # Collect all fields to extract
    fields = annotations[0].keys()

    # Collect the raw text of the document to predict
    reader = PdfReader(test_invoices[idx])
    first_page = reader.pages[0]
    test_text = first_page.extract_text()

    col1, col2 = st.columns(2)
    with col1:
        st.image(test_image_paths[idx])
    with col2:
        for field in fields:
            prompt = construct_prompt(texts, annotations, field, test_text)
            response = co.generate(
                model='small',
                prompt=prompt,
                max_tokens=50,
                temperature=0.3,
                k=0,
                p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stop_sequences=["====="],
                return_likelihoods='NONE')
            st.markdown(f"### {field}:{response.generations[0].text[:-5]}")
        st.markdown("### Finished Extracting")
    return template
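
Note that MAX_EXAMPLES caps the number of few-shot examples in each prompt so that the examples plus the document to predict stay within the model's context window, and the "=====" stop sequence mirrors the separator used in the prompt; the trailing text[:-5] simply strips that stop sequence from the generated output.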

Finally, let's call this function when a document is clicked:

if test_clicked > -1:
    extract_invoice(test_clicked, test_image_paths)
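
With everything in place, the app can be launched from the project's root directory with streamlit run app.py, which opens the template gallery in the browser.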

Conclusion

And that concludes the process of creating our Invoice Extractor app to automatically extract information from invoices and receipts. To get started building your own version, create a free Cohere account.