Analysis of Form 10-K/10-Q Using Cohere and RAG

Alex Barbet

Getting Started

You may use this script to jumpstart financial analysis of 10-Ks or 10-Qs with Cohere's Command model.

This cookbook relies on helpful tooling from LlamaIndex, as well as our Cohere SDK. If you're familiar with LlamaIndex, it should be easy to slot this process into your own productivity flows.

%%capture
!sudo apt install tesseract-ocr poppler-utils
!pip install "cohere<5" langchain llama-index llama-index-embeddings-cohere llama-index-postprocessor-cohere-rerank pytesseract pdf2image
# Due to compatibility issues, we need to do imports like this
from llama_index.core.schema import TextNode

%%capture
!pip install unstructured
import cohere
from getpass import getpass

# Set up Cohere client
COHERE_API_KEY = getpass("Enter your Cohere API key: ")

# Instantiate a client to communicate with Cohere's API using our Python SDK
co = cohere.Client(COHERE_API_KEY)

Enter your Cohere API key: ··········

Step 1: Loading a 10-K

You may run the following cells to load a 10-K that has already been preprocessed with OCR.

💡 If you'd like to run the OCR pipeline yourself, you can find more info in the section titled PDF to Text using OCR and pdf2image.

# Using langchain here since they have access to the Unstructured Data Loader powered by unstructured.io
from langchain_community.document_loaders import UnstructuredURLLoader

# Load Airbnb's 10-K for fiscal year 2023 (filed in 2024)
# Feel free to fill in some other EDGAR path
url = "https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm"
loader = UnstructuredURLLoader(urls=[url], headers={"User-Agent": "cohere [email protected]"})
documents = loader.load()

edgar_10k = documents[0].page_content

# Load the document(s) as simple text nodes, to be passed to the tokenization processor
nodes = [TextNode(text=document.page_content, id_=f"doc_{i}") for i, document in enumerate(documents)]
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.

We'll need to convert the text into chunks of at most 512 tokens so the Cohere embedding model can properly ingest them down the line.

We use LlamaIndex's SentenceSplitter to produce these chunks, passing it a tokenization callable that we build with the transformers library.

You may also apply further transformations from the LlamaIndex repo if you so choose. Take a look at the docs for inspiration on what is possible with transformations.

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

from transformers import AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# TODO: replace with a HF implementation so this is much faster. We'll
# presumably release it when we open-source the model
tokenizer_fn = lambda x: tokenizer(x).input_ids if len(x) > 0 else []

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=0, tokenizer=tokenizer_fn)
    ]
)

# Run the pipeline to transform the text
nodes = pipeline.run(nodes=nodes)
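
As mentioned above, the ingestion pipeline accepts a chain of transformations, not just a single splitter. Here is a minimal sketch of how an extra step would slot in, using a hypothetical WhitespaceCleaner transform that is not part of the original flow:

from llama_index.core.schema import TransformComponent

class WhitespaceCleaner(TransformComponent):
    # Hypothetical extra transformation: collapse runs of whitespace in each node
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = " ".join(node.text.split())
        return nodes

# Chain the cleaner ahead of the sentence splitter
extended_pipeline = IngestionPipeline(
    transformations=[
        WhitespaceCleaner(),
        SentenceSplitter(chunk_size=512, chunk_overlap=0, tokenizer=tokenizer_fn),
    ]
)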

Step 2: Load document into a LlamaIndex vector store

Loading the document into a LlamaIndex vector store will allow us to use the Cohere embedding model and rerank model to retrieve the relevant parts of the form to pass into Command.

from llama_index.core import Settings, VectorStoreIndex

from llama_index.postprocessor.cohere_rerank import CohereRerank

from llama_index.embeddings.cohere import CohereEmbedding

# Instantiate the embedding model
embed_model = CohereEmbedding(cohere_api_key=COHERE_API_KEY)

# Global settings
Settings.chunk_size = 512
Settings.embed_model = embed_model

# Create the vector store
index = VectorStoreIndex(nodes)

retriever = index.as_retriever(similarity_top_k=30) # Change to whatever top_k you want

# Instantiate the reranker
rerank = CohereRerank(api_key=COHERE_API_KEY, top_n=15)

# The `retrieve` function uses Cohere embeddings for similarity search, then Cohere Rerank to reorder the results
retrieve = lambda query: rerank.postprocess_nodes(retriever.retrieve(query), query_str=query)
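
As a quick sanity check, you can call `retrieve` directly with an ad-hoc query (the query below is just an example) and inspect the reranked chunks; the scores come from the rerank model:

# Example only: inspect the top reranked chunks for a sample query
sample_nodes = retrieve("What was Airbnb's total revenue in 2023?")
for n in sample_nodes[:3]:
    print(round(n.score, 3), n.node.text[:100].replace("\n", " "))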

Step 3: Query generation and retrieval

In order to do RAG, we need a query or a set of queries to actually do the retrieval step. As is standard in RAG settings, we'll use Command to generate those queries for us. Then, we'll use those queries along with the LlamaIndex retriever we built earlier to retrieve the most relevant pieces of the 10-K.

To learn more about document mode and query generation, check out our documentation.

PROMPT = "List the overall revenue numbers for 2021, 2022, and 2023 in the 10-K as bullet points, then explain the revenue growth trends."

# Get queries to run against our index from the Command R model
r = co.chat(PROMPT, model="command-r", search_queries_only=True)
if r.search_queries:
    queries = [q["text"] for q in r.search_queries]
else:
    print("No queries returned by the model")

Now, with the queries in hand, we search against our vector index.

# Convenience function for formatting documents
def format_for_cohere_client(nodes_):
    return [
        {
            "text": node.node.text,
            "llamaindex_id": node.node.id_,
        }
        for node
        in nodes_
    ]


documents = []
# Retrieve a set of chunks from the vector index and append them to the list of
# documents that should be included in the final RAG step
for query in queries:
    ret_nodes = retrieve(query)
    documents.extend(format_for_cohere_client(ret_nodes))

# One final deduplication step in case multiple queries return the same chunk
documents = [dict(t, id=f"doc_{i}") for i, t in enumerate({tuple(d.items()) for d in documents})]

Step 4: Make a RAG request to Command using document mode

Now that we have our nicely formatted chunks from the 10-K, we can pass them directly into Command using the Cohere SDK. By passing the chunks into the documents kwarg, we enable document mode, which will perform grounded inference on the documents you pass in.

You can see this for yourself by inspecting the response.citations field to check where the model is citing from.

You can learn more about the chat endpoint by checking out the API reference here.

# Make a request to the model
response = co.chat(
    message=PROMPT,
    model="command-r",
    temperature=0.3,
    documents=documents,
    prompt_truncation="AUTO"
)

print(response.text)
Here are the overall revenue numbers for the years 2021, 2022, and 2023 as bullet points:
- 2021: $5,992 million
- 2022: $8,399 million
- 2023: $9,917 million

Revenue increased by 18% in 2023 compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked, which reached 54.5 million. This, combined with higher average daily rates, resulted in a 16% increase in Gross Booking Value, which reached $10.0 billion. 

The revenue growth trend demonstrates sustained strong travel demand. On a constant-currency basis, revenue increased by 17% in 2023 compared to the previous year.

Other factors influencing the company's financial performance are described outside of the revenue growth trends.
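
Before formatting the citations, it can help to look at one raw citation object. Each entry marks a character span in the response (start, end, text) plus the document_ids it is grounded in:

# Inspect the first raw citation returned by the API
print(response.citations[0])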
# Helper function for displaying response WITH citations
def insert_citations(text: str, citations: list[dict]):
    """
    A helper function to pretty print citations.
    """
    offset = 0
    # Process citations in the order they were provided
    for citation in citations:
        # Adjust start/end with offset
        start, end = citation['start'] + offset, citation['end'] + offset
        # doc[4:] strips the 'doc_' prefix, leaving just the document number
        cited_docs = [doc[4:] for doc in citation["document_ids"]]
        # Shorten the list of cited documents if it's too long, for readability
        if len(cited_docs) > 3:
            placeholder = "[" + ", ".join(cited_docs[:3]) + "...]"
        else:
            placeholder = "[" + ", ".join(cited_docs) + "]"
        modification = f'{text[start:end]} {placeholder}'
        # Append the citation placeholder to the cited span of text
        text = text[:start] + modification + text[end:]
        # Update the offset for subsequent replacements
        offset += len(modification) - (end - start)

    return text

print(insert_citations(response.text, response.citations))
Here are the overall revenue numbers for the years 2021, 2022, and 2023 as bullet points:
- 2021: $5,992 million [13]
- 2022: $8,399 million [13]
- 2023: $9,917 million [13]

Revenue increased by 18% in 2023 [11] compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked [11], which reached 54.5 million. [11] This, combined with higher average daily rates [11], resulted in a 16% increase in Gross Booking Value [11], which reached $10.0 billion. [11] 

The revenue growth trend demonstrates sustained strong travel demand. [11] On a constant-currency basis [11], revenue increased by 17% in 2023 [11] compared to the previous year.

Other factors [8, 14] influencing the company's financial performance are described outside of the revenue growth trends. [8, 14]
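
To see what text sits behind a given citation id, you can look it up in the `documents` list we built earlier. The "doc_11" id below is just an example; use any id that appears in your own output:

# Map citation ids back to the underlying chunk text
doc_lookup = {d["id"]: d["text"] for d in documents}
print(doc_lookup["doc_11"][:500])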

Appendix

PDF to Text using OCR and pdf2image

This method will be required for any PDFs you have that need to be converted to text.

WARNING: this process can take a long time without the proper optimizations. We have provided a snippet for your use below, but use at your own risk.

To go from PDF to text with PyTesseract, there is an intermediary step of converting the PDF to an image first, then passing that image into the OCR package, as OCR is usually only available for images.

To do this, we use pdf2image, which uses poppler behind the scenes to convert the PDF into a PNG. From there, we can pass the image (which is a PIL Image object) directly into the OCR tool.

import pytesseract
from pdf2image import convert_from_path

# pdf2image converts the PDF into a list of PIL.Image objects (one per page)
# TODO: host this PDF somewhere
pages = convert_from_path("/content/uber_10k.pdf")

# Run OCR on each page image and collect the extracted text
pages = [pytesseract.image_to_string(page) for page in pages]
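
From here, the OCR'd text can be dropped into the same flow as Step 1. A minimal sketch, assuming you reuse the `pipeline` defined earlier:

# Wrap each OCR'd page as a TextNode and chunk it with the same ingestion pipeline
ocr_nodes = [TextNode(text=page_text, id_=f"ocr_page_{i}") for i, page_text in enumerate(pages)]
ocr_nodes = pipeline.run(nodes=ocr_nodes)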

Token count / price comparison and latency

def get_response(prompt, rag):
    if rag:
        # Get queries to run against our index from the Command R model
        r = co.chat(prompt, model="command-r", search_queries_only=True)
        if r.search_queries:
            queries = [q["text"] for q in r.search_queries]
        else:
            print("No queries returned by the model")

        documents = []
        # Retrieve a set of chunks from the vector index and append them to the list of
        # documents that should be included in the final RAG step
        for query in queries:
            ret_nodes = retrieve(query)
            documents.extend(format_for_cohere_client(ret_nodes))

        # One final deduplication step in case multiple queries return the same chunk
        documents = [dict(t) for t in {tuple(d.items()) for d in documents}]

        # Make a request to the model
        response = co.chat(
            message=prompt,
            model="command-r",
            temperature=0.3,
            documents=documents,
            prompt_truncation="AUTO"
        )
    else:
        response = co.chat(
            message=prompt,
            model="command-r",
            temperature=0.3,
        )

    return response
prompt_template = """# financial form 10-K
{tenk}

# question
{question}"""

full_context_prompt = prompt_template.format(tenk=edgar_10k, question=PROMPT)
r1 = get_response(PROMPT, rag=True)
r2 = get_response(full_context_prompt, rag=False)
def get_price(r):
    # Command R pricing: $0.50 per 1M input tokens, $1.50 per 1M output tokens
    return (r.token_count["prompt_tokens"] * 0.5 / 1e6) + (r.token_count["response_tokens"] * 1.5 / 1e6)
rag_price = get_price(r1)
full_context_price = get_price(r2)

print(f"RAG is {(full_context_price - rag_price) / full_context_price:.0%} cheaper than full context")
RAG is 93% cheaper than full context
%timeit get_response(PROMPT, rag=True)
14.9 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit get_response(full_context_prompt, rag=False)
22.7 s ± 7.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)