Chunking Strategies
Introduction
Chunking is an essential component of any RAG-based system. This cookbook aims to demonstrate how different chunking strategies affect the results of LLM-generated output. There are multiple considerations to take into account when designing a chunking strategy. Therefore, we begin by providing a framework for these strategies and then jump into a practical example. We will focus our example on call transcripts, which pose a unique challenge because of their rich content and the frequent changes of speaker throughout the text.
Chunking strategies framework
Document splitting
By document splitting, we mean deciding on the conditions under which we will break the text. At this stage, we should ask, “Are there any parts of consecutive text we want to ensure we do not break?” If the answer is “no”, then content-independent splitting strategies are helpful. On the other hand, in scenarios like transcripts or meeting notes, we probably want to keep each speaker’s content together, which might require us to deploy content-dependent strategies.
Content-independent splitting strategies
We split the document based on content-independent conditions; among the most popular are:
- splitting by the number of characters,
- splitting by sentence,
- splitting by a given character, for example, `\n` for paragraphs.
The advantage of this approach is that we do not need to make any assumptions about the text. However, some considerations remain, such as whether we want to preserve semantic structure, for example, sentences or paragraphs. Sentence splitting is better suited if we are looking for small chunks to ensure accuracy. Conversely, paragraph splitting preserves more context and might be more useful for open-ended questions.
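To make the contrast concrete, here is a minimal sketch in plain Python, assuming the transcript has already been loaded into a string (the file name is hypothetical):

```python
# Hypothetical input: the transcript loaded as one string.
with open("transcript.txt", encoding="utf-8") as f:
    text = f.read()

# Content-independent option 1: fixed-size character windows.
char_splits = [text[i : i + 500] for i in range(0, len(text), 500)]

# Content-independent option 2: split on a given character, here "\n\n"
# (blank lines), which typically corresponds to paragraphs.
paragraph_splits = [p for p in text.split("\n\n") if p.strip()]
```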
Content-dependent splitting strategies
On the other hand, there are scenarios in which we care about preserving some text structure. Then, we develop custom splitting strategies based on the document’s content. A prime example is call transcripts. In such scenarios, we aim to ensure that one person’s speech is fully contained within a chunk.
Creating chunks from the document splits
After the document is split, we need to decide on the desired size of our chunks. The splitting step only defines where we break the document; we can still combine multiple consecutive splits into larger chunks.
Smaller chunks support more accurate retrieval. However, they might lack context. On the other hand, larger chunks offer more context, but they reduce the effectiveness of the retrieval. It is important to experiment with different settings to find the optimal balance.
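As a sketch of this step, a simple greedy merge packs consecutive splits into chunks up to a target size (the helper and the 1000-character limit are illustrative, not part of the original pipeline):

```python
def merge_splits(splits: list[str], max_chars: int = 1000) -> list[str]:
    """Greedily pack consecutive splits into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for split in splits:
        if current and len(current) + len(split) + 1 > max_chars:
            chunks.append(current)
            current = split
        else:
            current = f"{current} {split}".strip()
    if current:
        chunks.append(current)
    return chunks
```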
Overlapping chunks
Overlapping chunks is a useful technique to have in the toolbox. Especially when we employ content-independent splitting strategies, it helps us mitigate some of the pitfalls of breaking the document without fully understanding the text. Overlapping guarantees that there is always some buffer between the chunks, and even if an important piece of information might be split in the original splitting strategy, it is more probable that the full information will be captured in the next chunk. The disadvantage of this method is that it creates redundancy.
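For fixed-size chunks, overlap can be implemented as a sliding window whose stride is smaller than the window; the sizes below are illustrative:

```python
def overlapping_chunks(text: str, window: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunks where consecutive chunks share `overlap` characters."""
    stride = window - overlap
    return [text[i : i + window] for i in range(0, len(text), stride)]
```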
Getting started
Designing a robust chunking strategy is as much a science as an art. There are no straightforward answers; the most effective strategies often emerge through experimentation. Therefore, let’s dive straight into an example to illustrate this concept.
Utils
Load the data
In this example, we will work with a 2023 Tesla earnings call transcript.
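A sketch of the loading step, assuming the transcript is fetched as an HTML page with `requests` and parsed with BeautifulSoup (the URL is a placeholder, not the actual source):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute the actual location of the transcript.
url = "https://example.com/tesla-2023-earnings-call-transcript"
html = requests.get(url).text

# Keep the raw HTML for later (Example 2 relies on the speaker markup),
# and extract a plain-text version for content-independent chunking.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator="\n")
```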
Example 1: Chunking using content-independent strategies
Let’s begin with a simple content-independent strategy. We aim to answer the question, “Who mentions Jonathan Nolan?”. We chose this question because it is easily verifiable and requires identifying the speaker. The answer can be found in the downloaded transcript; here is the relevant passage:
In this case, we are more concerned about accuracy than a verbose answer, so we focus on keeping the chunks small. To ensure that the desired size is not exceeded, we recursively split the text using an ordered list of separators, in our case `["\n\n", "\n", " ", ""]`.
We employ the `RecursiveCharacterTextSplitter` from LangChain for this task.
Experiment 1 - no overlap
In our first experiment, we define the chunk size as 500 and allow no overlap between consecutive chunks.
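A sketch of this configuration (the import path may differ across LangChain versions; newer releases expose the splitter via `langchain_text_splitters`):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in order until chunks fit
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=0,   # experiment 1: no overlap
)
chunks = splitter.split_text(text)
```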
Subsequently, we implement the standard RAG pipeline. We feed the chunks into a retriever, select the `top_n` chunks most pertinent to the query, and supply them as context to the generation model. Throughout this pipeline, we leverage Cohere’s endpoints, specifically `co.embed`, `co.rerank`, and finally `co.chat`.
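A condensed sketch of this pipeline with the Cohere Python SDK; the model names, the `top_n` value, and the simple dot-product retrieval step are illustrative choices, so check the current Cohere documentation before relying on them:

```python
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

query = "Who mentions Jonathan Nolan?"

# 1. Embed the chunks and the query (model name is illustrative).
doc_embs = np.array(
    co.embed(texts=chunks, model="embed-english-v3.0",
             input_type="search_document").embeddings
)
query_emb = np.array(
    co.embed(texts=[query], model="embed-english-v3.0",
             input_type="search_query").embeddings[0]
)

# 2. Retrieve candidates by similarity, then rerank and keep the top_n.
scores = doc_embs @ query_emb
candidates = [chunks[i] for i in scores.argsort()[::-1][:20]]
reranked = co.rerank(query=query, documents=candidates,
                     top_n=3, model="rerank-english-v3.0")
top_chunks = [candidates[r.index] for r in reranked.results]

# 3. Generate a grounded answer from the retrieved chunks.
response = co.chat(
    message=query,
    documents=[{"text": chunk} for chunk in top_chunks],
)
print(response.text)
```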
A notable feature of `co.chat` is its ability to ground the model’s answer in the provided context, meaning we can identify which chunks were used to generate the answer. Below, we show the model’s previous output together with the citation references, where `[num]` represents the index of the cited chunk.
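For illustration, the citation objects returned by `co.chat` can be printed alongside the answer (attribute names follow the Cohere SDK response object; treat this as a sketch):

```python
print(response.text)

# Each citation covers a span of the generated answer and lists the
# document ids of the chunks that support it.
for citation in response.citations:
    print(f"[{citation.start}:{citation.end}] {citation.text!r} "
          f"-> {citation.document_ids}")
```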
Indeed, by printing the cited chunk, we can validate that the text was divided in a way that prevented the generation model from providing the correct response. Notably, the speaker’s name is not included in the context, which is why the model refers to an `unknown speaker`.
Experiment 2 - allow overlap
In the previous experiment, we discovered that the chunks were created in a way that made it impossible to produce the correct answer: the speaker’s name was not included in the relevant chunk. Therefore, to mitigate this issue, this time we allow overlap between consecutive chunks.
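The only change from experiment 1 is a non-zero `chunk_overlap`; 100 characters is an illustrative value:

```python
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=100,  # consecutive chunks now share up to 100 characters
)
chunks = splitter.split_text(text)
```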
Again, we can print the text along with the citations.
We can then investigate the chunks that were used as context to answer the query.
As we can see, by allowing overlap we managed to get the correct answer to our question.
Example 2: Chunking using content-dependent strategies
In the previous experiment, we provided an example of how using or not using overlapping can affect a model’s performance, particularly in documents such as call transcripts where subjects change frequently. Ensuring that each chunk contains all relevant information is crucial. While we managed to retrieve the correct information by introducing overlapping into the chunking strategy, this might still not be the optimal approach for transcripts with longer speaker speeches.
Therefore, in this experiment, we will adopt a content-dependent strategy.
Our proposed approach entails segmenting the text whenever a new speaker begins speaking, which requires preprocessing the text accordingly.
Preprocess the text
Firstly, let’s observe that in the HTML text, each time the speaker changes, their name is enclosed within `<p><strong>Name</strong></p>` tags, rendering the name in bold.
To facilitate our text chunking process, we’ll use this observation and introduce a unique character sequence, `###`, which we’ll use as a marker for splitting the text.
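A sketch of this preprocessing step with a regular expression over the raw HTML (it assumes the `html` string from the loading step and the speaker markup described above):

```python
import re
from bs4 import BeautifulSoup

# Prefix every bolded speaker name with the "###" marker.
marked_html = re.sub(
    r"<p><strong>(.*?)</strong></p>",
    r"<p>### \1</p>",
    html,
)
text = BeautifulSoup(marked_html, "html.parser").get_text(separator="\n")
```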
In this approach, we prioritize splitting the text at our separator, `###`. To guarantee this behavior, we’ll use the `CharacterTextSplitter` from LangChain. Since we aim to keep entire speaker turns intact, and our analysis of the text suggests that most of them exceed 500 characters, we’ll increase the chunk size to 1000.
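A sketch of the splitter configuration (again, the import path may vary across LangChain versions):

```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="###",   # split only where a new speaker begins
    chunk_size=1000,   # larger chunks to keep whole speaker turns together
    chunk_overlap=0,
)
chunks = splitter.split_text(text)
```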
Below we validate the answer using citations.
Discussion
This example highlights some of the concerns that arise when implementing chunking strategies. Chunking is a field of ongoing research, and studies of domain-specific applications continue to appear; for example, this paper examines different chunking strategies in finance.