Analysis of Form 10-K/10-Q Using Cohere and RAG
Getting Started
You may use this script to jumpstart financial analysis of 10-Ks or 10-Qs with Cohere’s Command model.
This cookbook relies on helpful tooling from LlamaIndex, as well as our Cohere SDK. If you’re familiar with LlamaIndex, it should be easy to slot this process into your own productivity flows.
Step 1: Loading a 10-K
You may run the following cells to load a 10-K that has already been preprocessed with OCR.
💡 If you’d like to run the OCR pipeline yourself, you can find more info in the section titled PDF to Text using OCR and pdf2image.
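As a rough sketch of what such a loading cell might look like, assuming the OCR output has already been saved to a plain-text file (the tesla_10k.txt path below is a placeholder for wherever your preprocessed text lives):

```python
# Load the OCR-preprocessed 10-K into a single string.
# "tesla_10k.txt" is a placeholder path; point it at your own preprocessed file.
with open("tesla_10k.txt", "r", encoding="utf-8") as f:
    form_text = f.read()

print(f"Loaded {len(form_text)} characters of 10-K text.")
```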
We’ll need to convert the text into chunks of a certain size in order for the Cohere embedding model to properly ingest them down the line.
We choose to use LlamaIndex’s SentenceSplitter in this case to get these chunks. We must pass a tokenization callable, which we can do using the transformers library.
You may also apply further transformations from the LlamaIndex repo if you so choose. Take a look at the docs for inspiration on what is possible with transformations.
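Here is a minimal sketch of the chunking step. It assumes a recent llama-index package layout (import paths differ across versions) and that a suitable Hugging Face tokenizer checkpoint, such as Cohere/Command-nightly, is available to serve as the tokenization callable:

```python
from transformers import AutoTokenizer
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter  # older versions: llama_index.node_parser

# Any Hugging Face tokenizer can provide the tokenization callable;
# "Cohere/Command-nightly" is one option if it is available on the Hub.
hf_tokenizer = AutoTokenizer.from_pretrained("Cohere/Command-nightly")

splitter = SentenceSplitter(
    chunk_size=512,                   # chunk size measured in tokens of the tokenizer below
    chunk_overlap=64,                 # overlap between consecutive chunks
    tokenizer=hf_tokenizer.tokenize,  # callable: str -> list of tokens
)

# Wrap the raw 10-K text in a LlamaIndex Document and split it into nodes (chunks).
nodes = splitter.get_nodes_from_documents([Document(text=form_text)])
print(f"Created {len(nodes)} chunks.")
```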
Step 2: Load document into a LlamaIndex vector store
Loading the document into a LlamaIndex vector store will allow us to use the Cohere embedding model and rerank model to retrieve the relevant parts of the form to pass into Command.
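A sketch of building the index with LlamaIndex’s Cohere integrations follows; the exact import paths, constructor arguments, and model names (e.g. embed-english-v3.0) depend on your llama-index and integration package versions:

```python
import os
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.postprocessor.cohere_rerank import CohereRerank

api_key = os.environ["COHERE_API_KEY"]

# Embed the chunks with a Cohere embedding model.
embed_model = CohereEmbedding(
    cohere_api_key=api_key,
    model_name="embed-english-v3.0",
    input_type="search_document",
)

# Build an in-memory vector index over the nodes produced in Step 1.
index = VectorStoreIndex(nodes, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=20)

# Rerank retrieved chunks with Cohere Rerank before sending them to Command.
reranker = CohereRerank(api_key=api_key, top_n=5)
```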
Step 3: Query generation and retrieval
In order to do RAG, we need a query or a set of queries to actually do the retrieval step. As is standard in RAG settings, we’ll use Command to generate those queries for us. Then, we’ll use those queries along with the LlamaIndex retriever we built earlier to retrieve the most relevant pieces of the 10-K.
To learn more about document mode and query generation, check out our documentation.
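A hedged example of query generation, assuming the v1 Cohere Python client, whose chat endpoint accepts a search_queries_only flag; the question string is only an illustration:

```python
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# An example question about the filing; replace with your own.
question = "What was the company's total revenue, and how did it change year over year?"

# Ask Command to produce search queries only, rather than a full answer.
response = co.chat(message=question, search_queries_only=True)
queries = [q.text for q in response.search_queries]
print(queries)
```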
Now, with the queries in hand, we search against our vector index.
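Continuing the sketch, we can retrieve candidates for each generated query, rerank them, and shape the results into the list-of-dicts format that Command’s documents parameter expects:

```python
from llama_index.core.schema import QueryBundle

# Retrieve candidate chunks for each generated query, then rerank them.
retrieved_nodes = []
for query in queries:
    candidates = retriever.retrieve(query)
    retrieved_nodes.extend(
        reranker.postprocess_nodes(candidates, query_bundle=QueryBundle(query))
    )

# Format the top chunks as documents for Command's document mode.
documents = [{"text": node.get_content()} for node in retrieved_nodes]
```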
Step 4: Make a RAG request to Command using document mode
Now that we have our nicely formatted chunks from the 10-K, we can pass them directly into Command using the Cohere SDK. By passing the chunks into the documents kwarg, we enable document mode, which performs grounded inference on the documents you pass in. You can see this for yourself by inspecting the response.citations field to check where the model is citing from.
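A minimal sketch of the document-mode call, reusing the co client, question, and documents from the previous steps:

```python
# Pass the reranked chunks into Command via the documents kwarg (document mode).
rag_response = co.chat(
    message=question,
    documents=documents,
)

print(rag_response.text)

# Each citation indicates which spans of the answer are grounded in which documents.
for citation in rag_response.citations:
    print(citation)
```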
You can learn more about the chat endpoint by checking out the API reference here.
Appendix
PDF to Text using OCR and pdf2image
This method is required for any PDFs that still need to be converted to text.
WARNING: this process can take a long time without the proper optimizations. We have provided a snippet for your use below, but use at your own risk.
To go from PDF to text with PyTesseract, there is an intermediary step of converting the PDF to an image first, then passing that image into the OCR package, as OCR is usually only available for images.
To do this, we use pdf2image, which uses poppler behind the scenes to convert the PDF into a PNG. From there, we can pass the image (which is a PIL Image object) directly into the OCR tool.