RAG With Chat, Embed, and Rerank
This notebook shows how to build a RAG-powered chatbot with Cohere’s Chat endpoint. The chatbot can extract relevant information from external documents and produce verifiable, inline citations in its responses.
This application will use several Cohere API endpoints:
- Chat: For handling the main logic of the chatbot, including turning a user message into queries, generating responses, and producing citations
- Embed: For turning textual documents into their embeddings representation, later to be used in retrieval (we’ll use the latest, state-of-the-art Embed v3 model)
- Rerank: For reranking the retrieved documents according to their relevance to a query
The diagram below provides an overview of what we’ll build.
Here is a summary of the steps involved.
Initial phase:
- Step 0: Ingest the documents – get documents, chunk, embed, and index.
For each user-chatbot interaction:
- Step 1: Get the user message
- Step 2: Call the Chat endpoint in query-generation mode
- If at least one query is generated
- Step 3: Retrieve and rerank relevant documents
- Step 4: Call the Chat endpoint in document mode to generate a grounded response with citations
- If no query is generated
- Step 4: Call the Chat endpoint in normal mode to generate a response
First, we define the list of documents we want to ingest and make available for retrieval. As an example, we’ll use the contents from the first module of Cohere’s LLM University: What are Large Language Models?.
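For example, the list might look like the following sketch (the titles and URLs are illustrative placeholders; substitute the pages you want to ingest):

```python
# Each raw document is a dictionary with the page's title and URL.
# The entries below are illustrative examples from Cohere's documentation.
raw_documents = [
    {
        "title": "Text Embeddings",
        "url": "https://docs.cohere.com/docs/text-embeddings",
    },
    {
        "title": "Similarity Between Words and Sentences",
        "url": "https://docs.cohere.com/docs/similarity-between-words-and-sentences",
    },
    {
        "title": "The Attention Mechanism",
        "url": "https://docs.cohere.com/docs/the-attention-mechanism",
    },
    {
        "title": "Transformer Models",
        "url": "https://docs.cohere.com/docs/transformer-models",
    },
]
```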
In practical applications, the number of documents is usually vast, so we need to be able to search them efficiently. This involves breaking the documents into chunks, generating embeddings, and indexing those embeddings, as shown in the image below.
We implement this in the `Vectorstore` class below, which takes the `raw_documents` list as input. Three methods are immediately called when creating an object of the `Vectorstore` class:
`load_and_chunk()`
This method uses the `partition_html()` function from the `unstructured` library to load the documents from their URLs and break them into smaller chunks. Each chunk is turned into a dictionary object with three fields:
- `title` - the web page's title,
- `text` - the textual content of the chunk, and
- `url` - the web page's URL.
`embed()`
This method uses Cohere's `embed-english-v3.0` model to generate embeddings of the chunked documents. Since our documents will be used for retrieval, we set `input_type="search_document"`. We send the documents to the Embed endpoint in batches, because the endpoint has a limit of 96 documents per call.
`index()`
This method uses the `hnswlib` package to index the document chunk embeddings, which ensures efficient similarity search during retrieval. Note that `hnswlib` is a vector library, and we have chosen it for its simplicity.
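A condensed sketch of how the class could be implemented (minus the `retrieve()` method, which is covered in the next section) is shown below. It assumes a client created with `cohere.Client()`; the use of `chunk_by_title` from the `unstructured` library, the batch size, and the index parameters are illustrative choices.

```python
import cohere
import hnswlib
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

co = cohere.Client("COHERE_API_KEY")  # placeholder API key

class Vectorstore:
    def __init__(self, raw_documents):
        self.raw_documents = raw_documents
        self.docs = []          # chunked documents: dicts with title, text, url
        self.docs_embs = []     # embeddings of the chunks
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()

    def load_and_chunk(self):
        # Load each web page and split its contents into smaller chunks.
        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self):
        # Embed the chunks in batches, staying under the 96-document limit per call.
        batch_size = 90
        for i in range(0, len(self.docs), batch_size):
            batch = self.docs[i : i + batch_size]
            texts = [doc["text"] for doc in batch]
            response = co.embed(
                texts=texts,
                model="embed-english-v3.0",
                input_type="search_document",
            )
            self.docs_embs.extend(response.embeddings)

    def index(self):
        # Build an hnswlib index over the chunk embeddings for fast similarity search.
        self.idx = hnswlib.Index(space="ip", dim=len(self.docs_embs[0]))
        self.idx.init_index(max_elements=len(self.docs_embs), ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))
        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")
```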
In the code cell below, we initialize an instance of the `Vectorstore` class and pass in the `raw_documents` list as input.
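For instance:

```python
# Build the vector store: this loads, chunks, embeds, and indexes the documents.
vectorstore = Vectorstore(raw_documents)
```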
The `Vectorstore` class also has a `retrieve()` method, which we'll use to retrieve relevant document chunks given a query (as in Step 3 in the diagram shared at the beginning of this notebook). This method has two components: (1) dense retrieval, and (2) reranking.
Dense retrieval
First, we embed the query using the same `embed-english-v3.0` model we used to embed the document chunks, but this time we set `input_type="search_query"`.
Search is performed by the `knn_query()` method from the `hnswlib` library. Given a query, it returns the document chunks most similar to the query. We define the number of document chunks to return using the attribute `self.retrieve_top_k=10`.
Reranking
After semantic search, we implement a reranking step. While our semantic search component is already highly capable of retrieving relevant sources, the Rerank endpoint provides an additional boost to the quality of the search results, especially for complex and domain-specific queries. It takes the search results and sorts them according to their relevance to the query.
We call the Rerank endpoint with the `co.rerank()` method and define the number of top reranked document chunks to retrieve using the attribute `self.rerank_top_k=3`. The model we use is `rerank-english-v2.0`.
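A sketch of the `retrieve()` method is shown below; it belongs inside the `Vectorstore` class alongside the methods above, and the exact shape of the rerank response (`results`, `index`) is an assumption based on the Python SDK.

```python
    def retrieve(self, query):
        # Dense retrieval: embed the query and find the most similar chunk embeddings.
        query_emb = co.embed(
            texts=[query],
            model="embed-english-v3.0",
            input_type="search_query",
        ).embeddings

        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking: reorder the retrieved chunks by relevance to the query.
        docs_to_rerank = [self.docs[doc_id]["text"] for doc_id in doc_ids]
        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v2.0",
        )

        # Return the top reranked chunks in their original dictionary form.
        chunks_retrieved = [self.docs[doc_ids[r.index]] for r in rerank_results.results]
        return chunks_retrieved
```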
This method returns the top retrieved document chunks, `chunks_retrieved`, so that they can be passed to the chatbot.
In the code cell below, we check the document chunks that are retrieved for the query `"multi-head attention definition"`.
Test Retrieval
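For example (the exact chunks returned depend on the documents ingested):

```python
# Retrieve the top reranked chunks for a sample query and inspect them.
results = vectorstore.retrieve("multi-head attention definition")

for doc in results:
    print(doc["title"])
    print(doc["text"][:200], "...")
    print(doc["url"], "\n")
```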
Next, we implement a class to handle the interaction between the user and the chatbot. It takes an instance of the `Vectorstore` class as input.
The `run()` method will be used to run the chatbot application. It begins with the logic for getting the user message, along with a way for the user to end the conversation.
Based on the user message, the chatbot needs to decide whether it needs to consult external information before responding. If so, the chatbot determines an optimal set of search queries to use for retrieval. When we call `co.chat()` with `search_queries_only=True`, the Chat endpoint handles this for us automatically. The generated queries can be accessed from the `search_queries` field of the object that is returned.
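As a standalone sketch (the way each item in `search_queries` is accessed follows the Python SDK version used in this notebook; treat it as an assumption if your SDK differs):

```python
# Ask the Chat endpoint only for search queries, without generating a reply.
message = "How does multi-head attention work?"
response = co.chat(message=message, search_queries_only=True)

if response.search_queries:
    queries = [q["text"] for q in response.search_queries]
    print("Queries to run:", queries)
else:
    print("No retrieval needed for this message.")
```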
What happens next depends on how many queries are returned.
- If queries are returned, we call the `retrieve()` method of the `Vectorstore` object for the retrieval step. The retrieved document chunks are then passed to the Chat endpoint by adding a `documents` parameter when we call `co.chat()` again.
- Otherwise, if no queries are returned, we call the Chat endpoint another time, passing the user message without adding any documents to the call.
In either case, we also pass the `conversation_id` parameter, which retains the interactions between the user and the chatbot in the same conversation thread. We also enable the `stream` parameter so we can stream the chatbot response.
We then print the chatbot's response. If external information was used to generate the response, we also display the citations.
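A condensed sketch of the `Chatbot` class is below. It assumes the streaming interface of the Python SDK used here, where events expose an `event_type` of `"text-generation"` or `"citation-generation"`; the `"quit"` keyword for ending the conversation is an illustrative choice.

```python
import uuid

class Chatbot:
    def __init__(self, vectorstore):
        self.vectorstore = vectorstore
        self.conversation_id = str(uuid.uuid4())

    def run(self):
        while True:
            # Step 1: get the user message; typing "quit" ends the conversation.
            message = input("User: ")
            if message.lower() == "quit":
                print("Ending chat.")
                break

            # Step 2: ask the Chat endpoint whether retrieval is needed,
            # and if so, which search queries to run.
            response = co.chat(message=message, search_queries_only=True)

            if response.search_queries:
                # Step 3: retrieve and rerank document chunks for each query.
                documents = []
                for query in response.search_queries:
                    documents.extend(self.vectorstore.retrieve(query["text"]))

                # Step 4: generate a grounded, cited response from the documents.
                stream = co.chat(
                    message=message,
                    documents=documents,
                    conversation_id=self.conversation_id,
                    stream=True,
                )
            else:
                # Step 4 (no retrieval needed): respond directly.
                stream = co.chat(
                    message=message,
                    conversation_id=self.conversation_id,
                    stream=True,
                )

            # Print the streamed response, followed by any citations.
            citations = []
            print("Chatbot: ", end="")
            for event in stream:
                if event.event_type == "text-generation":
                    print(event.text, end="")
                elif event.event_type == "citation-generation":
                    citations.extend(event.citations)
            print("\n")
            for citation in citations:
                print(citation)
```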
We can now run the chatbot. For this, we create an instance of `Chatbot` and run it by invoking the `run()` method.
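For example:

```python
# Create the chatbot with the vector store built earlier and start the chat loop.
chatbot = Chatbot(vectorstore)
chatbot.run()
```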
The format of each citation is:
- `start`: The starting point of a span where one or more documents are referenced
- `end`: The ending point of a span where one or more documents are referenced
- `text`: The text representing this span
- `document_ids`: The IDs of the documents being referenced (`doc_0` being the ID of the first document passed to the `documents` parameter in the endpoint call, and so on)
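For illustration, a single citation might look something like this (the values below are made up, not actual output):

```python
# Hypothetical example of one citation object; the values are illustrative only.
{
    "start": 14,
    "end": 42,
    "text": "uses several attention heads in parallel",
    "document_ids": ["doc_0", "doc_2"],
}
```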