Chunking Strategies

Ania Bialas

PYTHON

%%capture
!pip install cohere
!pip install -qU langchain-text-splitters
!pip install llama-index-embeddings-cohere
!pip install llama-index-postprocessor-cohere-rerank

PYTHON

import requests
from typing import List

from bs4 import BeautifulSoup

import cohere
from getpass import getpass
from IPython.display import HTML, display

from langchain_text_splitters import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

from llama_index.core import Document
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.postprocessor.cohere_rerank import CohereRerank
# ServiceContext was imported here originally but is never used in this cookbook
from llama_index.core import VectorStoreIndex

PYTHON

co_model = 'command-r'
co_api_key = getpass("Enter Cohere API key: ")
co = cohere.Client(api_key=co_api_key)

Output

Enter Cohere API key: ··········

Introduction

Chunking is an essential component of any RAG-based system. This cookbook aims to demonstrate how different chunking strategies affect the results of LLM-generated output. There are multiple considerations that need to be taken into account when designing a chunking strategy. Therefore, we begin by providing a framework for these strategies and then jump into a practical example. We will focus our example on call transcripts, which pose a unique challenge because of their rich content and the frequent change of speakers throughout the text.

Chunking strategies framework

Document splitting

By document splitting, we mean deciding on the conditions under which we will break the text. At this stage, we should ask, “Are there any parts of consecutive text we want to ensure we do not break?”. If the answer is “no”, then content-independent splitting strategies are helpful. On the other hand, in scenarios like transcripts or meeting notes, we probably would like to keep the content of one speaker together, which might require us to deploy content-dependent strategies.

Content-independent splitting strategies

We split the document based on some content-independent conditions. Among the most popular ones are:

splitting by the number of characters,
splitting by sentence,
splitting by a given character, for example, \n for paragraphs.

The advantage of this approach is that we do not need to make any assumptions about the text. However, some considerations remain, like whether we want to preserve some semantic structure, for example, sentences or paragraphs. Sentence splitting is better suited if we are looking for small chunks to ensure accuracy. Conversely, paragraphs preserve more context and might be more useful for open-ended questions.
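To make the three options above concrete, here is a minimal sketch in plain Python. The sample text and the helper names (split_by_length, split_by_sentence, split_by_separator) are hypothetical and only serve as an illustration; the sentence splitter is deliberately naive and would need a proper tokenizer for real transcripts.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = (
    "Revenue grew this quarter. Margins were stable.\n"
    "Deliveries rose as well.\n"
    "Next, we turn to questions."
)


def split_by_length(text, size):
    """Cut the text every `size` characters, ignoring its structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentence(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_by_separator(text, separator="\n"):
    """Split on a given character, here a newline marking paragraphs."""
    return [p for p in text.split(separator) if p]


fixed = split_by_length(sample, 40)
sentences = split_by_sentence(sample)
paragraphs = split_by_separator(sample)

for name, pieces in [("length", fixed), ("sentence", sentences), ("separator", paragraphs)]:
    print(name, len(pieces), pieces)
```

Note that the fixed-length split cuts through words and sentences, while the other two respect the structure of the text at the cost of uneven piece sizes.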

Content-dependent splitting strategies

On the other hand, there are scenarios in which we care about preserving some text structure. Then, we develop custom splitting strategies based on the document’s content. A prime example is call transcripts. In such scenarios, we aim to ensure that one person’s speech is fully contained within a chunk.
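As a rough sketch of what a content-dependent split can look like, the snippet below groups a toy transcript into speaker turns. The transcript, the set of speaker names, and the function split_by_speaker are all hypothetical; the sketch assumes that each speaker name stands alone on its own line and is known in advance.

```python
# Hypothetical toy transcript: each speaker name stands alone on a line.
transcript = (
    "Alice Smith\n"
    "Welcome to the call.\n"
    "We will start with prepared remarks.\n"
    "Bob Jones\n"
    "Thank you, Alice.\n"
)
speakers = {"Alice Smith", "Bob Jones"}  # assumed to be known in advance


def split_by_speaker(text, speaker_names):
    """Group consecutive lines into (speaker, speech) turns."""
    turns = []
    for line in text.splitlines():
        if line in speaker_names:
            # A speaker line opens a new turn.
            turns.append({"speaker": line, "lines": []})
        elif turns:
            # Any other line belongs to the current speaker.
            turns[-1]["lines"].append(line)
    return [(t["speaker"], " ".join(t["lines"])) for t in turns]


turns = split_by_speaker(transcript, speakers)
for speaker, speech in turns:
    print(f"{speaker}: {speech}")
```

Each turn keeps the speaker and everything they said together, so a chunk built from it can always answer "who said this?".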

Creating chunks from the document splits

After the document is split, we need to decide on the desired size of our chunks (the split only defines how we break the document, but we can create bigger chunks from multiple splits).

Smaller chunks support more accurate retrieval. However, they might lack context. On the other hand, larger chunks offer more context, but they reduce the effectiveness of the retrieval. It is important to experiment with different settings to find the optimal balance.
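One simple way to build chunks from splits is to merge consecutive splits greedily until a size limit is reached. The sketch below assumes character counts as the size measure; merge_splits and the sample sentences are hypothetical names and data, not part of any library.

```python
def merge_splits(splits, max_chars, joiner=" "):
    """Greedily merge consecutive splits into chunks of at most max_chars.

    A single split that is longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for piece in splits:
        candidate = piece if not current else current + joiner + piece
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            # The next piece does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sentences = ["Revenue grew.", "Margins were stable.", "Deliveries rose.", "Costs fell."]

small_chunks = merge_splits(sentences, max_chars=20)
large_chunks = merge_splits(sentences, max_chars=40)
print(small_chunks)
print(large_chunks)
```

With the smaller limit every sentence becomes its own chunk, which is precise but carries little context. With the larger limit neighbouring sentences travel together.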

Overlapping chunks

Overlapping chunks is a useful technique to have in the toolbox. Especially when we employ content-independent splitting strategies, it helps us mitigate some of the pitfalls of breaking the document without fully understanding the text. Overlapping ensures that consecutive chunks share a buffer of text, so even if an important piece of information is cut at a chunk boundary, it is more probable that the full information will be captured in the neighbouring chunk. The disadvantage of this method is that it creates redundancy.
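The effect of overlap can be sketched with a character-level sliding window. The function sliding_chunks and the sample sentence are hypothetical; the chunk size and overlap are chosen so that the name is cut at a chunk boundary when no overlap is used.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Fixed-size chunks in which each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


sentence = "The speaker is Elon Musk and he mentions Nolan."

no_overlap = sliding_chunks(sentence, chunk_size=20, overlap=0)
with_overlap = sliding_chunks(sentence, chunk_size=20, overlap=8)

print(no_overlap)
print(with_overlap)
print(any("Elon Musk" in chunk for chunk in no_overlap))
print(any("Elon Musk" in chunk for chunk in with_overlap))
```

The overlapping variant produces more chunks for the same text, which is the redundancy mentioned above.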

Getting started

Designing a robust chunking strategy is as much a science as an art. There are no straightforward answers; the most effective strategies often emerge through experimentation. Therefore, let’s dive straight into an example to illustrate this concept.

Utils

PYTHON

def set_css():
  display(HTML('''

  '''))
get_ipython().events.register('pre_run_cell', set_css)

set_css()

PYTHON

def insert_citations(text: str, citations: List[dict]):
    """
    A helper function to pretty print citations.
    """
    offset = 0
    # Process citations in the order they were provided
    for citation in citations:
        # Adjust start/end with offset
        start, end = citation['start'] + offset, citation['end'] + offset
        placeholder = "[" + ", ".join(doc[4:] for doc in citation["document_ids"]) + "]"
        # ^ doc[4:] removes the 'doc_' prefix, and leaves the index of the cited document
        modification = f'{text[start:end]} {placeholder}'
        # Replace the cited text with the cited text followed by the placeholder
        text = text[:start] + modification + text[end:]
        # Update the offset for subsequent replacements
        offset += len(modification) - (end - start)

    return text

def build_retriever(documents, top_n=5):
  # Create the embedding model
  embed_model = CohereEmbedding(
      cohere_api_key=co_api_key,
      model_name="embed-english-v3.0",
      input_type="search_query",
  )

  # Build a vector index over the chunked documents
  index = VectorStoreIndex.from_documents(
      documents,
      embed_model=embed_model
  )

  # Create a cohere reranker
  cohere_rerank = CohereRerank(api_key=co_api_key)

  # Create the retriever
  retriever = index.as_retriever(node_postprocessors=[cohere_rerank], similarity_top_k=top_n)
  return retriever

Load the data

In this example we will work with the Tesla Q4 2023 earnings call transcript.

PYTHON

# Download the Tesla Q4 2023 earnings call transcript
url_path = 'https://www.fool.com/earnings/call-transcripts/2024/01/24/tesla-tsla-q4-2023-earnings-call-transcript/'
response = requests.get(url_path)
soup = BeautifulSoup(response.content, 'html.parser')

# Keep the paragraphs of the article body, skipping the first two
target_divs = soup.find("div", {"class": "article-body"}).find_all("p")[2:]
print('Length of the script: ', len(target_divs))

print()
print('Example of processed text:')
text = '\n\n'.join([div.get_text() for div in target_divs])
print(text[:500])

Output

Length of the script:  385
Example of processed text:
Martin Viecha
Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.
During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and

Example 1: Chunking using content-independent strategies

Let’s begin with a simple content-independent strategy. We aim to answer the question “Who mentions Jonathan Nolan?”. We chose this question because it is easily verifiable and because answering it requires identifying the speaker. The answer can be found in the downloaded transcript; here is the relevant passage:

Elon Musk -- Chief Executive Officer and Product Architect
Yeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.

PYTHON

# Define the question
question = "Who mentions Jonathan Nolan?"

In this case, we are more concerned about accuracy than a verbose answer, so we focus on keeping the chunks small. To ensure that the desired size is not exceeded, we split recursively on a list of separators, in our case ["\n\n", "\n", " ", ""], moving on to the next separator whenever a piece is still too large.

We employ the RecursiveCharacterTextSplitter from LangChain for this task.
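To build some intuition for recursive splitting, here is a simplified sketch of the idea in plain Python. It is an illustration only and not LangChain's implementation: the function recursive_split and the sample text are hypothetical, and details such as overlap handling are left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first and recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nA much longer second paragraph that will not fit."
pieces = recursive_split(sample, chunk_size=30)
print(pieces)
```

The short first paragraph survives intact, while the long one is broken at spaces. No chunk exceeds the limit, but nothing prevents a speaker's name from landing in a different chunk than their words.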

PYTHON

1 # Define the chunking function
2 def get_chunks(text, chunk_size, chunk_overlap):
3   text_splitter = RecursiveCharacterTextSplitter(
4     chunk_size=chunk_size,
5     chunk_overlap=chunk_overlap,
6     length_function=len,
7     is_separator_regex=False,
8   )
9 
10   documents = text_splitter.create_documents([text])
11   documents = [Document(text=doc.page_content) for doc in documents]
12 
13   return documents

Experiment 1 - no overlap

In our first experiment we define the chunk size as 500 and allow no overlap between consecutive chunks.

Subsequently, we implement the standard RAG pipeline. We feed the chunks into a retriever, select the top_n chunks most pertinent to the query, and supply them as context to the generation model. Throughout this pipeline, we leverage Cohere’s endpoints, specifically, co.embed, co.rerank, and finally, co.chat.

PYTHON

chunk_size = 500
chunk_overlap = 0
documents = get_chunks(text, chunk_size, chunk_overlap)
retriever = build_retriever(documents)

# Retrieve the chunks most relevant to the question
source_nodes = retriever.retrieve(question)
print('Number of documents: ', len(source_nodes))
source_nodes = [{"text": ni.get_content()} for ni in source_nodes]

# Generate an answer grounded in the retrieved chunks
response = co.chat(
  message=question,
  documents=source_nodes,
  model=co_model
)
print(response.text)

Output

Number of documents:  5
An unknown speaker mentions Jonathan Nolan in a conversation about the creators of Westworld. They mention that Jonathan Nolan and Lisa Joy Nolan are friends of theirs, and that they have invited them to visit the lab.

A notable feature of co.chat is its ability to ground the model’s answer within the context. This means we can identify which chunks were used to generate the answer. Below, we show the previous output of the model together with the citation reference, where [num] represents the index of the chunk.
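As a small illustration of how such citations can be read, the sketch below maps each cited span back to the indices of the chunks that support it. The answer string and the citation dictionaries are hypothetical toy data, shaped like the input that the insert_citations helper above expects (start, end, and document_ids with a 'doc_' prefix); the objects actually returned by the API may be shaped differently.

```python
# Hypothetical toy data, shaped like the input of the insert_citations helper.
answer = "Elon Musk mentions Jonathan Nolan."
citations = [
    {"start": 0, "end": 9, "document_ids": ["doc_0"]},
    {"start": 19, "end": 33, "document_ids": ["doc_0", "doc_2"]},
]

spans = []
for citation in citations:
    # start and end are character offsets into the answer text
    cited_text = answer[citation["start"]:citation["end"]]
    # strip the 'doc_' prefix to recover the index of each supporting chunk
    chunk_indices = [int(doc_id[len("doc_"):]) for doc_id in citation["document_ids"]]
    spans.append((cited_text, chunk_indices))

for cited_text, chunk_indices in spans:
    print(repr(cited_text), "is supported by chunks", chunk_indices)
```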

PYTHON

print(insert_citations(response.text, response.citations))

Output

An unknown speaker [0] mentions Jonathan Nolan in a conversation about the creators of Westworld. [0] They mention that Jonathan Nolan and Lisa Joy Nolan [0] are friends [0] of theirs, and that they have invited them to visit the lab. [0]

Indeed, by printing the cited chunk, we can confirm that the text was divided in a way that prevented the generation model from providing the correct response. Notably, the speaker’s name is not included in the context, which is why the model refers to an unknown speaker.

PYTHON

print(source_nodes[0])

Output

{'text': "Yeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\n\nYeah.\n\nUnknown speaker\n\nWe're not entering Westworld anytime soon."}

Experiment 2 - allow overlap

In the previous experiment, we discovered that the chunks were generated in a way that made it impossible to generate the correct answer. The name of the speaker was not included in the relevant chunk.

To mitigate this issue, this time we allow overlap between consecutive chunks.

PYTHON

chunk_size = 500
chunk_overlap = 100
documents = get_chunks(text, chunk_size, chunk_overlap)
retriever = build_retriever(documents)

# Retrieve the chunks most relevant to the question
source_nodes = retriever.retrieve(question)
print('Number of documents: ', len(source_nodes))
source_nodes = [{"text": ni.get_content()} for ni in source_nodes]

# Generate an answer grounded in the retrieved chunks
response = co.chat(
  message=question,
  documents=source_nodes,
  model=co_model
)
print(response.text)

Output

Number of documents:  5
Elon Musk mentions Jonathan Nolan. Musk is the CEO and Product Architect of the lab that resembles the set of Westworld, a show created by Jonathan Nolan and Lisa Joy Nolan.

Again, we can print the text along with the citations.

PYTHON

print(insert_citations(response.text, response.citations))

Output

Elon Musk [0] mentions Jonathan Nolan. Musk is the CEO and Product Architect [0] of the lab [0] that resembles the set of Westworld [0], a show created by Jonathan Nolan [0] and Lisa Joy Nolan. [0]

We can also inspect the chunk that was used as context to answer the query.

PYTHON

source_nodes[0]

Output

{'text': "Yeah, not the best reference.\n\nElon Musk -- Chief Executive Officer and Product Architect\n\nYeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\n\nYeah."}

As we can see, by allowing overlap we managed to get the correct answer to our question.

Example 2: Chunking using content-dependent strategies

In the previous example, we showed how using or omitting overlap can affect a model’s performance, particularly in documents such as call transcripts where subjects change frequently. Ensuring that each chunk contains all relevant information is crucial. While we managed to retrieve the correct information by introducing overlap into the chunking strategy, this might still not be the optimal approach for transcripts with longer speaker turns.

Therefore, in this experiment, we will adopt a content-dependent strategy.

Our proposed approach entails segmenting the text whenever a new speaker begins speaking, which requires preprocessing the text accordingly.

Preprocess the text

Firstly, let’s observe that in the HTML text, each time the speaker changes, their name is enclosed within <p><strong>Name</strong></p> tags, denoting the speaker’s name in bold letters.

To facilitate our text chunking process, we’ll use the above observation and introduce a unique character sequence ###, which we’ll utilize as a marker for splitting the text.

PYTHON

print('HTML text')
print(target_divs[:3])
print('-------------------\n')

text_custom = []
for div in target_divs:
  if div.get_text() is None:
    continue
  # A paragraph that opens with <strong> holds a speaker name: mark it with ###
  if str(div).startswith('<p><strong>'):
    text_custom.append(f'### {div.get_text()}')
  else:
    text_custom.append(div.get_text())

text_custom = '\n'.join(text_custom)
print(text_custom[:500])

Output

HTML text
[<p><strong>Martin Viecha</strong></p>, <p>Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&amp;A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.</p>, <p>During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and expectations as of today. Actual events or results could differ materially due to a number of risks and uncertainties, including those mentioned in our most recent filings with the SEC. [Operator instructions] But before we jump into Q&amp;A, Elon has some opening remarks.</p>]
-------------------
### Martin Viecha
Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.
During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions an

In this approach, we prioritize splitting the text at the appropriate separator, ###. For this we use CharacterTextSplitter from LangChain, which splits only at the given separator. Because we aim to keep each speaker’s turn intact, and our analysis of the text suggests that most turns exceed a length of 500, we increase the chunk size to 1000. A turn that is longer than the chunk size is still kept whole, and the splitter logs a warning for it.

PYTHON

separator = "###"
chunk_size = 1000
chunk_overlap = 0

text_splitter = CharacterTextSplitter(
    separator=separator,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
    is_separator_regex=False,
)

documents = text_splitter.create_documents([text_custom])
documents = [Document(text=doc.page_content) for doc in documents]

retriever = build_retriever(documents)

# Retrieve the chunks most relevant to the question
source_nodes = retriever.retrieve(question)
print('Number of documents: ', len(source_nodes))
source_nodes = [{"text": ni.get_content()} for ni in source_nodes]

# Generate an answer grounded in the retrieved chunks
response = co.chat(
  message=question,
  documents=source_nodes,
  model=co_model
)
print(response.text)

Output

WARNING:langchain_text_splitters.base:Created a chunk of size 5946, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 4092, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1782, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1392, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 2046, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1152, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1304, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1295, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 2090, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1251, which is longer than the specified 1000
Number of documents:  5
Elon Musk mentions Jonathan Nolan. Musk is friends with the creators of Westworld, Jonathan Nolan and Lisa Joy Nolan.
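The "longer than the specified 1000" warnings in the output appear because the text is cut only at the ### marker, so a single speaker turn that exceeds the chunk size cannot be broken further. The sketch below imitates this behaviour on toy data; split_on_separator and the toy transcript are hypothetical and simplified, not LangChain's code.

```python
def split_on_separator(text, separator, chunk_size):
    """Cut only at `separator`, merging neighbouring pieces while they fit in chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece.strip():
            continue  # skip the empty piece before the first marker
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current.strip())
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current.strip())
    return chunks


# Hypothetical toy transcript: speaker B's turn is longer than the chunk size.
toy = "### A\nshort turn\n### B\n" + "x" * 50 + "\n### C\nshort again\n"
limit = 30

toy_chunks = split_on_separator(toy, "###", limit)
oversized = [len(chunk) for chunk in toy_chunks if len(chunk) > limit]
print(len(toy_chunks), "chunks, oversized lengths:", oversized)
```

Keeping long turns whole is the intended trade-off here: the speaker and their words stay together, at the price of chunks that are larger than requested.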

Below we validate the answer using citations.

PYTHON

print(insert_citations(response.text, response.citations))

Output

Elon Musk [0] mentions Jonathan Nolan. [0] Musk is friends [0] with the creators of Westworld [0], Jonathan Nolan [0] and Lisa Joy Nolan. [0]

PYTHON

source_nodes[0]

Output

{'text': "Elon Musk -- Chief Executive Officer and Product Architect\nYeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\nYeah.\n### Unknown speaker\nWe're not entering Westworld anytime soon.\n### Elon Musk -- Chief Executive Officer and Product Architect\nRight, right. Yeah. I take -- take safety very very seriously.\n### Martin Viecha\nThank you. The next question from Norman is: How many Cybertruck orders are in the queue? And when do you anticipate to be able to fulfill existing orders?"}

Discussion

This example highlights some of the concerns that arise when implementing chunking strategies. This is a field of ongoing research, and many surveys have been published on domain-specific applications. For example, this paper examines different chunking strategies in finance.
