Analysis of Form 10-K/10-Q Using Cohere and RAG

Alex Barbet

Getting Started

You may use this script to jumpstart financial analysis of 10-Ks or 10-Qs with Cohere’s Command model.

This cookbook relies on helpful tooling from LlamaIndex, as well as our Cohere SDK. If you’re familiar with LlamaIndex, it should be easy to slot this process into your own productivity flows.

PYTHON
%%capture
!sudo apt install tesseract-ocr poppler-utils
!pip install "cohere<5" langchain llama-index llama-index-embeddings-cohere llama-index-postprocessor-cohere-rerank pytesseract pdf2image
PYTHON
# Due to compatibility issues, we need to import TextNode before installing unstructured
from llama_index.core.schema import TextNode

!pip install -q unstructured
PYTHON
import cohere
from getpass import getpass

# Set up Cohere client
COHERE_API_KEY = getpass("Enter your Cohere API key: ")

# Instantiate a client to communicate with Cohere's API using our Python SDK
co = cohere.Client(COHERE_API_KEY)
Output
Enter your Cohere API key: ··········

Step 1: Loading a 10-K

You may run the following cells to load a 10-K that is already available as machine-readable text from EDGAR, so no OCR step is needed here.

💡 If you’d like to run the OCR pipeline yourself, you can find more info in the section titled PDF to Text using OCR and pdf2image.

PYTHON
# Using langchain here since it provides the Unstructured data loader powered by unstructured.io
from langchain_community.document_loaders import UnstructuredURLLoader

# Load Airbnb's 10-K for fiscal year 2023 (filed in 2024)
# Feel free to fill in some other EDGAR path
url = "https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm"
loader = UnstructuredURLLoader(urls=[url], headers={"User-Agent": "cohere cohere@cohere.com"})
documents = loader.load()

edgar_10k = documents[0].page_content

# Load the document(s) as simple text nodes, to be passed to the tokenization processor
nodes = [TextNode(text=document.page_content, id_=f"doc_{i}") for i, document in enumerate(documents)]
Output
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.

We’ll need to convert the text into chunks of a fixed size so the Cohere embedding model can properly ingest them down the line.

We use LlamaIndex’s SentenceSplitter to produce these chunks. SentenceSplitter requires a tokenization callable, which we build with the transformers library so that chunk sizes are measured in Command tokens.

You may also apply further transformations from the LlamaIndex repo if you so choose. Take a look at the docs for inspiration on what is possible with transformations.
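For example, below is a minimal sketch of a custom transformation that collapses runs of whitespace before chunking. It assumes LlamaIndex's TransformComponent interface; the WhitespaceCleaner class is purely illustrative and not part of this cookbook's pipeline.

PYTHON
from llama_index.core.schema import TransformComponent

# Illustrative custom transform (hypothetical): collapse runs of whitespace in each node's text
class WhitespaceCleaner(TransformComponent):
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = " ".join(node.text.split())
        return nodes

# If desired, prepend WhitespaceCleaner() to the `transformations` list of the pipeline below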

PYTHON
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

from transformers import AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Tokenization callable so the splitter counts chunk sizes in Command R tokens
tokenizer_fn = lambda x: tokenizer(x).input_ids if len(x) > 0 else []

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=0, tokenizer=tokenizer_fn)
    ]
)

# Run the pipeline to transform the text
nodes = pipeline.run(nodes=nodes)
Output
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
tokenizer_config.json: 0%| | 0.00/7.92k [00:00<?, ?B/s]
tokenization_cohere_fast.py: 0%| | 0.00/43.7k [00:00<?, ?B/s]
configuration_cohere.py: 0%| | 0.00/7.37k [00:00<?, ?B/s]
A new version of the following files was downloaded from https://huggingface.co/CohereForAI/c4ai-command-r-v01:
- configuration_cohere.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/CohereForAI/c4ai-command-r-v01:
- tokenization_cohere_fast.py
- configuration_cohere.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.json: 0%| | 0.00/12.8M [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/429 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
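As an optional sanity check (not part of the original flow), you can spot-check a few of the resulting chunks against the 512-token target using the same tokenizer callable:

PYTHON
# Optional: spot-check chunk sizes against the 512-token target
for node in nodes[:5]:
    print(len(tokenizer_fn(node.text)), "tokens")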

Step 2: Load document into a LlamaIndex vector store

Loading the document into a LlamaIndex vector store will allow us to use the Cohere embedding model and rerank model to retrieve the relevant parts of the form to pass into Command.

PYTHON
from llama_index.core import Settings, VectorStoreIndex
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.embeddings.cohere import CohereEmbedding

# Instantiate the embedding model
embed_model = CohereEmbedding(cohere_api_key=COHERE_API_KEY)

# Global settings
Settings.chunk_size = 512
Settings.embed_model = embed_model

# Create the vector store
index = VectorStoreIndex(nodes)

retriever = index.as_retriever(similarity_top_k=30)  # Change to whatever top_k you want

# Instantiate the reranker
rerank = CohereRerank(api_key=COHERE_API_KEY, top_n=15)

# `retrieve` uses Cohere embeddings for similarity search, then Cohere Rerank to reorder the results
retrieve = lambda query: rerank.postprocess_nodes(retriever.retrieve(query), query_str=query)
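To confirm retrieval works end to end, you can call `retrieve` on a test query and inspect the reranked results. The query string below is just an illustrative example:

PYTHON
# Optional: inspect the top reranked chunks for an illustrative test query
for scored_node in retrieve("total revenue for fiscal year 2023")[:3]:
    print(round(scored_node.score, 3), scored_node.node.text[:100].replace("\n", " "))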

Step 3: Query generation and retrieval

To do RAG, we first need a query or set of queries for the retrieval step. As is standard in RAG settings, we’ll use Command to generate those queries for us. Then we’ll run those queries through the LlamaIndex retriever we built earlier to pull out the most relevant pieces of the 10-K.

To learn more about document mode and query generation, check out our documentation.

PYTHON
PROMPT = "List the overall revenue numbers for 2021, 2022, and 2023 in the 10-K as bullet points, then explain the revenue growth trends."

# Get queries to run against our index from the Command R model
r = co.chat(PROMPT, model="command-r", search_queries_only=True)
if r.search_queries:
    queries = [q["text"] for q in r.search_queries]
else:
    print("No queries returned by the model")
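It can be helpful to print the generated queries to see how Command decomposes the prompt before retrieval (a quick inspection step, not required for the flow):

PYTHON
# Inspect the search queries Command generated from the prompt
for i, query in enumerate(queries):
    print(f"Query {i}: {query}")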

Now, with the queries in hand, we search against our vector index.

PYTHON
# Convenience function for formatting retrieved nodes as Cohere documents
def format_for_cohere_client(nodes_):
    return [
        {
            "text": node.node.text,
            "llamaindex_id": node.node.id_,
        }
        for node in nodes_
    ]


documents = []
# Retrieve a set of chunks from the vector index and append them to the list of
# documents that should be included in the final RAG step
for query in queries:
    ret_nodes = retrieve(query)
    documents.extend(format_for_cohere_client(ret_nodes))

# One final deduplication step in case multiple queries return the same chunk
documents = [dict(t, id=f"doc_{i}") for i, t in enumerate({tuple(d.items()) for d in documents})]

Step 4: Make a RAG request to Command using document mode

Now that we have our nicely formatted chunks from the 10-K, we can pass them directly to Command using the Cohere SDK. Supplying the chunks via the documents kwarg enables document mode, in which the model performs grounded inference over the documents you provide.

You can see this for yourself by inspecting the response.citations field to check where the model is citing from.

You can learn more about the chat endpoint by checking out the API reference here.

PYTHON
# Make a request to the model
response = co.chat(
    message=PROMPT,
    model="command-r",
    temperature=0.3,
    documents=documents,
    prompt_truncation="AUTO"
)

print(response.text)
Output
Here are the overall revenue numbers for the years 2021, 2022, and 2023 as bullet points:
- 2021: $5,992 million
- 2022: $8,399 million
- 2023: $9,917 million
Revenue increased by 18% in 2023 compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked, which reached 54.5 million. This, combined with higher average daily rates, resulted in a 16% increase in Gross Booking Value, which reached $10.0 billion.
The revenue growth trend demonstrates sustained strong travel demand. On a constant-currency basis, revenue increased by 17% in 2023 compared to the previous year.
Other factors influencing the company's financial performance are described outside of the revenue growth trends.
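Each entry in `response.citations` records the character span it grounds in the response text, along with the IDs of the supporting documents. Here's a quick look at the raw citation objects, using the same fields the helper below relies on:

PYTHON
# Peek at the first few raw citation objects: character span + cited document IDs
for citation in response.citations[:3]:
    print(citation["start"], citation["end"], citation["document_ids"])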
PYTHON
# Helper function for displaying the response WITH citations
def insert_citations(text: str, citations: list[dict]):
    """
    A helper function to pretty print citations.
    """
    offset = 0
    # Process citations in the order they were provided
    for citation in citations:
        # Adjust start/end with offset
        start, end = citation['start'] + offset, citation['end'] + offset
        # doc[4:] strips the 'doc_' prefix, leaving just the document number
        cited_docs = [doc[4:] for doc in citation["document_ids"]]
        # Shorten citations if they're too long for convenience
        if len(cited_docs) > 3:
            placeholder = "[" + ", ".join(cited_docs[:3]) + "...]"
        else:
            placeholder = "[" + ", ".join(cited_docs) + "]"
        # Append the citation placeholder after the cited span of text
        modification = f'{text[start:end]} {placeholder}'
        text = text[:start] + modification + text[end:]
        # Update the offset for subsequent replacements
        offset += len(modification) - (end - start)

    return text

print(insert_citations(response.text, response.citations))
Output
Here are the overall revenue numbers for the years 2021, 2022, and 2023 as bullet points:
- 2021: $5,992 million [13]
- 2022: $8,399 million [13]
- 2023: $9,917 million [13]
Revenue increased by 18% in 2023 [11] compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked [11], which reached 54.5 million. [11] This, combined with higher average daily rates [11], resulted in a 16% increase in Gross Booking Value [11], which reached $10.0 billion. [11]
The revenue growth trend demonstrates sustained strong travel demand. [11] On a constant-currency basis [11], revenue increased by 17% in 2023 [11] compared to the previous year.
Other factors [8, 14] influencing the company's financial performance are described outside of the revenue growth trends. [8, 14]

Appendix

PDF to Text using OCR and pdf2image

This method will be required for any PDFs you have that need to be converted to text.

WARNING: this process can take a long time without the proper optimizations. We have provided a snippet for your use below, but use at your own risk.

To go from PDF to text with PyTesseract, there is an intermediate step of converting the PDF into images, since OCR tools generally operate on images rather than PDFs.

To do this, we use pdf2image, which uses poppler behind the scenes to convert each PDF page into an image. From there, we can pass each page image (a PIL Image object) directly into the OCR tool.

PYTHON
import pytesseract
from pdf2image import convert_from_path

# pdf2image extracts the PDF as a list of PIL.Image objects
# Point this at a local PDF of your choosing
pages = convert_from_path("/content/uber_10k.pdf")

# Run OCR on each page image and collect the extracted text
pages = [pytesseract.image_to_string(page) for page in pages]
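From there, the OCR'd pages can be dropped into the same ingestion flow used earlier, for example by wrapping each page in a TextNode and re-running the chunking pipeline (a sketch reusing the names defined above):

PYTHON
# Wrap each OCR'd page in a TextNode and chunk it with the pipeline defined earlier
ocr_nodes = [TextNode(text=page, id_=f"ocr_page_{i}") for i, page in enumerate(pages)]
ocr_nodes = pipeline.run(nodes=ocr_nodes)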

Token count / price comparison and latency

PYTHON
def get_response(prompt, rag):
    if rag:
        # Get queries to run against our index from the Command R model
        r = co.chat(prompt, model="command-r", search_queries_only=True)
        if r.search_queries:
            queries = [q["text"] for q in r.search_queries]
        else:
            print("No queries returned by the model")

        documents = []
        # Retrieve a set of chunks from the vector index and append them to the list of
        # documents that should be included in the final RAG step
        for query in queries:
            ret_nodes = retrieve(query)
            documents.extend(format_for_cohere_client(ret_nodes))

        # One final deduplication step in case multiple queries return the same chunk
        documents = [dict(t) for t in {tuple(d.items()) for d in documents}]

        # Make a request to the model
        response = co.chat(
            message=prompt,
            model="command-r",
            temperature=0.3,
            documents=documents,
            prompt_truncation="AUTO"
        )
    else:
        response = co.chat(
            message=prompt,
            model="command-r",
            temperature=0.3,
        )

    return response
PYTHON
prompt_template = """# financial form 10-K
{tenk}

# question
{question}"""

full_context_prompt = prompt_template.format(tenk=edgar_10k, question=PROMPT)
PYTHON
r1 = get_response(PROMPT, rag=True)
r2 = get_response(full_context_prompt, rag=False)
PYTHON
def get_price(r):
    # Command R pricing: $0.50 per 1M input tokens, $1.50 per 1M output tokens
    return (r.token_count["prompt_tokens"] * 0.5 / 1e6) + (r.token_count["response_tokens"] * 1.5 / 1e6)
PYTHON
rag_price = get_price(r1)
full_context_price = get_price(r2)

print(f"RAG is {(full_context_price - rag_price) / full_context_price:.0%} cheaper than full context")
Output
RAG is 93% cheaper than full context
PYTHON
%timeit get_response(PROMPT, rag=True)
Output
14.9 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
PYTHON
%timeit get_response(full_context_prompt, rag=False)
Output
22.7 s ± 7.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)