Analysis of Form 10-K/10-Q Using Cohere and RAG

Alex Barbet

Getting Started

You may use this script to jumpstart financial analysis of 10-Ks or 10-Qs with Cohere’s Command model.

This cookbook relies on helpful tooling from LlamaIndex, as well as our Cohere SDK. If you’re familiar with LlamaIndex, it should be easy to slot this process into your own productivity flows.

PYTHON
%%capture
!sudo apt install tesseract-ocr poppler-utils
!pip install "cohere<5" langchain llama-index llama-index-embeddings-cohere llama-index-postprocessor-cohere-rerank pytesseract pdf2image
PYTHON
# Due to compatibility issues, we need to import TextNode before installing unstructured
from llama_index.core.schema import TextNode

!pip install -q unstructured
PYTHON
import cohere
from getpass import getpass

# Set up Cohere client
COHERE_API_KEY = getpass("Enter your Cohere API key: ")

# Instantiate a client to communicate with Cohere's API using our Python SDK
co = cohere.Client(COHERE_API_KEY)
Output
Enter your Cohere API key: ··········

Step 1: Loading a 10-K

You may run the following cells to load a 10-K that is already available as machine-readable text from EDGAR, so no OCR step is needed here.

💡 If you’d like to run the OCR pipeline yourself, you can find more info in the section titled PDF to Text using OCR and pdf2image.

PYTHON
# Using langchain here since it provides the Unstructured data loader powered by unstructured.io
from langchain_community.document_loaders import UnstructuredURLLoader

# Load Airbnb's 10-K for fiscal year 2023 (filed in 2024)
# Feel free to fill in some other EDGAR path
url = "https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm"
loader = UnstructuredURLLoader(urls=[url], headers={"User-Agent": "cohere cohere@cohere.com"})
documents = loader.load()

edgar_10k = documents[0].page_content

# Load the document(s) as simple text nodes, to be passed to the tokenization processor
nodes = [TextNode(text=document.page_content, id_=f"doc_{i}") for i, document in enumerate(documents)]
Output
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.

We’ll need to convert the text into chunks of a fixed size so the Cohere embedding model can properly ingest them down the line.

We use LlamaIndex’s SentenceSplitter to produce these chunks. SentenceSplitter requires a tokenization callable, which we build with the transformers library so that chunk sizes are measured in Command tokens.

You may also apply further transformations from the LlamaIndex repo if you so choose. Take a look at the docs for inspiration on what is possible with transformations.
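For example, below is a minimal sketch of a custom transformation that collapses runs of whitespace before chunking. It assumes LlamaIndex's TransformComponent interface; the WhitespaceCleaner class is purely illustrative and not part of this cookbook's pipeline.

PYTHON
from llama_index.core.schema import TransformComponent

# Illustrative custom transform (hypothetical): collapse runs of whitespace in each node's text
class WhitespaceCleaner(TransformComponent):
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = " ".join(node.text.split())
        return nodes

# If desired, prepend WhitespaceCleaner() to the `transformations` list of the pipeline below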

PYTHON
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

from transformers import AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Tokenization callable so the splitter counts chunk sizes in Command R tokens
tokenizer_fn = lambda x: tokenizer(x).input_ids if len(x) > 0 else []

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=0, tokenizer=tokenizer_fn)
    ]
)

# Run the pipeline to transform the text
nodes = pipeline.run(nodes=nodes)
Output
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 0%| | 0.00/7.92k [00:00<?, ?B/s]
tokenization_cohere_fast.py: 0%| | 0.00/43.7k [00:00<?, ?B/s]
configuration_cohere.py: 0%| | 0.00/7.37k [00:00<?, ?B/s]
A new version of the following files was downloaded from https://huggingface.co/CohereForAI/c4ai-command-r-v01:
- configuration_cohere.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/CohereForAI/c4ai-command-r-v01:
- tokenization_cohere_fast.py
- configuration_cohere.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.json: 0%| | 0.00/12.8M [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/429 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
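As an optional sanity check (not part of the original flow), you can spot-check a few of the resulting chunks against the 512-token target using the same tokenizer callable:

PYTHON
# Optional: spot-check chunk sizes against the 512-token target
for node in nodes[:5]:
    print(len(tokenizer_fn(node.text)), "tokens")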

Step 2: Load document into a LlamaIndex vector store

Loading the document into a LlamaIndex vector store will allow us to use the Cohere embedding model and rerank model to retrieve the relevant parts of the form to pass into Command.

PYTHON
from llama_index.core import Settings, VectorStoreIndex
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.embeddings.cohere import CohereEmbedding

# Instantiate the embedding model
embed_model = CohereEmbedding(cohere_api_key=COHERE_API_KEY)

# Global settings
Settings.chunk_size = 512
Settings.embed_model = embed_model

# Create the vector store
index = VectorStoreIndex(nodes)

retriever = index.as_retriever(similarity_top_k=30)  # Change to whatever top_k you want

# Instantiate the reranker
rerank = CohereRerank(api_key=COHERE_API_KEY, top_n=15)

# `retrieve` uses Cohere embeddings for similarity search, then Cohere Rerank to reorder the results
retrieve = lambda query: rerank.postprocess_nodes(retriever.retrieve(query), query_str=query)
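To confirm retrieval works end to end, you can call `retrieve` on a test query and inspect the reranked results. The query string below is just an illustrative example:

PYTHON
# Optional: inspect the top reranked chunks for an illustrative test query
for scored_node in retrieve("total revenue for fiscal year 2023")[:3]:
    print(round(scored_node.score, 3), scored_node.node.text[:100].replace("\n", " "))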

Step 3: Query generation and retrieval

To do RAG, we first need a query or set of queries for the retrieval step. As is standard in RAG settings, we’ll use Command to generate those queries for us. Then we’ll run those queries through the LlamaIndex retriever we built earlier to pull out the most relevant pieces of the 10-K.

To learn more about document mode and query generation, check out our documentation.

PYTHON
PROMPT = "List the overall revenue numbers for 2021, 2022, and 2023 in the 10-K as bullet points, then explain the revenue growth trends."

# Get queries to run against our index from the Command R model
r = co.chat(PROMPT, model="command-r", search_queries_only=True)
if r.search_queries:
    queries = [q["text"] for q in r.search_queries]
else:
    print("No queries returned by the model")
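It can be helpful to print the generated queries to see how Command decomposes the prompt before retrieval (a quick inspection step, not required for the flow):

PYTHON
# Inspect the search queries Command generated from the prompt
for i, query in enumerate(queries):
    print(f"Query {i}: {query}")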

Now, with the queries in hand, we search against our vector index.

PYTHON
# Convenience function for formatting retrieved nodes as Cohere documents
def format_for_cohere_client(nodes_):
    return [
        {
            "text": node.node.text,
            "llamaindex_id": node.node.id_,
        }
        for node in nodes_
    ]


documents = []
# Retrieve a set of chunks from the vector index and append them to the list of
# documents that should be included in the final RAG step
for query in queries:
    ret_nodes = retrieve(query)
    documents.extend(format_for_cohere_client(ret_nodes))

# One final deduplication step in case multiple queries return the same chunk
documents = [dict(t, id=f"doc_{i}") for i, t in enumerate({tuple(d.items()) for d in documents})]

Step 4: Make a RAG request to Command using document mode

Now that we have our nicely formatted chunks from the 10-K, we can pass them directly to Command using the Cohere SDK. Supplying the chunks via the documents kwarg enables document mode, in which the model performs grounded inference over the documents you provide.

You can see this for yourself by inspecting the response.citations field to check where the model is citing from.

You can learn more about the chat endpoint by checking out the API reference here.

PYTHON
# Make a request to the model
response = co.chat(
    message=PROMPT,
    model="command-r",
    temperature=0.3,
    documents=documents,
    prompt_truncation="AUTO"
)

print(response.text)
Output
Here are the overall revenue numbers for the years 2021, 2022, and 2023 as bullet points:
- 2021: $5,992 million
- 2022: $8,399 million
- 2023: $9,917 million
Revenue increased by 18% in 2023 compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked, which reached 54.5 million. This, combined with higher average daily rates, resulted in a 16% increase in Gross Booking Value, which reached $10.0 billion.
The revenue growth trend demonstrates sustained strong travel demand. On a constant-currency basis, revenue increased by 17% in 2023 compared to the previous year.
Other factors influencing the company's financial performance are described outside of the revenue growth trends.
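Each entry in `response.citations` records the character span it grounds in the response text, along with the IDs of the supporting documents. Here's a quick look at the raw citation objects, using the same fields the helper below relies on:

PYTHON
# Peek at the first few raw citation objects: character span + cited document IDs
for citation in response.citations[:3]:
    print(citation["start"], citation["end"], citation["document_ids"])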
PYTHON
# Helper function for displaying the response WITH citations
def insert_citations(text: str, citations: list[dict]):
    """
    A helper function to pretty print citations.
    """
    offset = 0
    # Process citations in the order they were provided
    for citation in citations:
        # Adjust start/end with offset
        start, end = citation['start'] + offset, citation['end'] + offset
        # doc[4:] strips the 'doc_' prefix, leaving just the document number
        cited_docs = [doc[4:] for doc in citation["document_ids"]]
        # Shorten citations if they're too long for convenience
        if len(cited_docs) > 3:
            placeholder = "[" + ", ".join(cited_docs[:3]) + "...]"
        else:
            placeholder = "[" + ", ".join(cited_docs) + "]"
        # Append the citation placeholder after the cited span of text
        modification = f'{text[start:end]} {placeholder}'
        text = text[:start] + modification + text[end:]
        # Update the offset for subsequent replacements
        offset += len(modification) - (end - start)

    return text

print(insert_citations(response.text, response.citations))
Output
Here are the overall revenue numbers for the years 2021, 2022, and 2023 as bullet points:
- 2021: $5,992 million [13]
- 2022: $8,399 million [13]
- 2023: $9,917 million [13]
Revenue increased by 18% in 2023 [11] compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked [11], which reached 54.5 million. [11] This, combined with higher average daily rates [11], resulted in a 16% increase in Gross Booking Value [11], which reached $10.0 billion. [11]
The revenue growth trend demonstrates sustained strong travel demand. [11] On a constant-currency basis [11], revenue increased by 17% in 2023 [11] compared to the previous year.
Other factors [8, 14] influencing the company's financial performance are described outside of the revenue growth trends. [8, 14]

Appendix

PDF to Text using OCR and pdf2image

This method will be required for any PDFs you have that need to be converted to text.

WARNING: this process can take a long time without the proper optimizations. We have provided a snippet for your use below, but use at your own risk.

To go from PDF to text with PyTesseract, there is an intermediate step of converting the PDF into images, since OCR tools generally operate on images rather than PDFs.

To do this, we use pdf2image, which uses poppler behind the scenes to convert each PDF page into an image. From there, we can pass each page image (a PIL Image object) directly into the OCR tool.

PYTHON
import pytesseract
from pdf2image import convert_from_path

# pdf2image extracts the PDF as a list of PIL.Image objects
# Point this at a local PDF of your choosing
pages = convert_from_path("/content/uber_10k.pdf")

# Run OCR on each page image and collect the extracted text
pages = [pytesseract.image_to_string(page) for page in pages]
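From there, the OCR'd pages can be dropped into the same ingestion flow used earlier, for example by wrapping each page in a TextNode and re-running the chunking pipeline (a sketch reusing the names defined above):

PYTHON
# Wrap each OCR'd page in a TextNode and chunk it with the pipeline defined earlier
ocr_nodes = [TextNode(text=page, id_=f"ocr_page_{i}") for i, page in enumerate(pages)]
ocr_nodes = pipeline.run(nodes=ocr_nodes)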

Token count / price comparison and latency

PYTHON
def get_response(prompt, rag):
    if rag:
        # Get queries to run against our index from the Command R model
        r = co.chat(prompt, model="command-r", search_queries_only=True)
        if r.search_queries:
            queries = [q["text"] for q in r.search_queries]
        else:
            print("No queries returned by the model")

        documents = []
        # Retrieve a set of chunks from the vector index and append them to the list of
        # documents that should be included in the final RAG step
        for query in queries:
            ret_nodes = retrieve(query)
            documents.extend(format_for_cohere_client(ret_nodes))

        # One final deduplication step in case multiple queries return the same chunk
        documents = [dict(t) for t in {tuple(d.items()) for d in documents}]

        # Make a request to the model
        response = co.chat(
            message=prompt,
            model="command-r",
            temperature=0.3,
            documents=documents,
            prompt_truncation="AUTO"
        )
    else:
        response = co.chat(
            message=prompt,
            model="command-r",
            temperature=0.3,
        )

    return response
PYTHON
prompt_template = """# financial form 10-K
{tenk}

# question
{question}"""

full_context_prompt = prompt_template.format(tenk=edgar_10k, question=PROMPT)
PYTHON
r1 = get_response(PROMPT, rag=True)
r2 = get_response(full_context_prompt, rag=False)
PYTHON
def get_price(r):
    # Command R pricing: $0.50 per 1M input tokens, $1.50 per 1M output tokens
    return (r.token_count["prompt_tokens"] * 0.5 / 1e6) + (r.token_count["response_tokens"] * 1.5 / 1e6)
PYTHON
rag_price = get_price(r1)
full_context_price = get_price(r2)

print(f"RAG is {(full_context_price - rag_price) / full_context_price:.0%} cheaper than full context")
Output
RAG is 93% cheaper than full context
PYTHON
%timeit get_response(PROMPT, rag=True)
Output
14.9 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
PYTHON
%timeit get_response(full_context_prompt, rag=False)
Output
22.7 s ± 7.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)