Master Reranking with Cohere Models

Open in Colab

Reranking is a technique that provides a semantic boost to the search quality of any keyword or vector search system, and is especially useful in RAG systems.

We can rerank results from semantic search as well as any other search systems such as lexical search. This means that companies can retain an existing keyword-based (also called “lexical”) or semantic search system for the first-stage retrieval and integrate the Rerank endpoint in the second-stage reranking.

In this tutorial, you’ll learn about:

Reranking lexical/semantic search results
Reranking semi-structured data
Reranking tabular data
Multilingual reranking

You’ll learn these by building an onboarding assistant for new hires.

Setup

To get started, first we need to install the cohere library and create a Cohere client.

PYTHON

1 # pip install cohere
2 
3 import cohere
4 
5 # Get your free API key: https://dashboard.cohere.com/api-keys
6 co = cohere.ClientV2(api_key="COHERE_API_KEY")

Reranking lexical/semantic search results

Rerank requires just a single line of code to implement.

Suppose we have a list of search results of an FAQ list, which can come from semantic, lexical, or any other types of search systems. But this list may not be optimally ranked for relevance to the user query.

This is where Rerank can help. We call the endpoint using co.rerank() and pass the following arguments:

query: The user query
documents: The list of documents
top_n: The top reranked documents to select
model: We choose Rerank English 3

PYTHON

1 # Define the documents
2 faqs = [
3     {
4         "text": "Reimbursing Travel Expenses: Easily manage your travel expenses by submitting them through our finance tool. Approvals are prompt and straightforward."
5     },
6     {
7         "text": "Working from Abroad: Working remotely from another country is possible. Simply coordinate with your manager and ensure your availability during core hours."
8     },
9     {
10         "text": "Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance."
11     },
12     {
13         "text": "Performance Reviews Frequency: We conduct informal check-ins every quarter and formal performance reviews twice a year."
14     },
15 ]

PYTHON

1 # Add the user query
2 query = "Are there fitness-related perks?"
3 
4 # Rerank the documents
5 results = co.rerank(
6     model="rerank-v3.5",
7     query=query,
8     documents=faqs,
9     top_n=1,
10 )
11 
12 print(results)

id='2fa5bc0d-28aa-4c99-8355-7de78dbf3c86' results=[RerankResponseResultsItem(document=None, index=2, relevance_score=0.01798621), RerankResponseResultsItem(document=None, index=3, relevance_score=8.463939e-06)] meta=ApiMeta(api_version=ApiMetaApiVersion(version='1', is_deprecated=None, is_experimental=None), billed_units=ApiMetaBilledUnits(input_tokens=None, output_tokens=None, search_units=1.0, classifications=None), tokens=None, warnings=None)

PYTHON

1 # Display the reranking results
2 def return_results(results, documents):
3     for idx, result in enumerate(results.results):
4         print(f"Rank: {idx+1}")
5         print(f"Score: {result.relevance_score}")
6         print(f"Document: {documents[result.index]}\n")
7 
8 
9 return_results(results, faqs_short)

Rank: 1
Score: 0.01798621
Document: {'text': 'Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance.'}
Rank: 2
Score: 8.463939e-06
Document: {'text': 'Performance Reviews Frequency: We conduct informal check-ins every quarter and formal performance reviews twice a year.'}

Reranking semi-structured data

The Rerank 3 model supports multi-aspect and semi-structured data like emails, invoices, JSON documents, code, and tables. By setting the rank fields, you can select which fields the model should consider for reranking.

In the following example, we’ll use an email data example. It is a semi-stuctured data that contains a number of fields – from, to, date, subject, and text.

Suppose the new hire now wants to search for any emails about check-in sessions. Let’s pretend we have a list of 5 emails retrieved from the email provider’s API.

To perform reranking over semi-structured data, we serialize the documents to YAML format, which prepares the data in the format required for reranking. Then, we pass the YAML formatted documents to the Rerank endpoint.

PYTHON

1 # Define the documents
2 emails = [
3     {
4         "from": "hr@co1t.com",
5         "to": "david@co1t.com",
6         "date": "2024-06-24",
7         "subject": "A Warm Welcome to Co1t!",
8         "text": "We are delighted to welcome you to the team! As you embark on your journey with us, you'll find attached an agenda to guide you through your first week.",
9     },
10     {
11         "from": "it@co1t.com",
12         "to": "david@co1t.com",
13         "date": "2024-06-24",
14         "subject": "Setting Up Your IT Needs",
15         "text": "Greetings! To ensure a seamless start, please refer to the attached comprehensive guide, which will assist you in setting up all your work accounts.",
16     },
17     {
18         "from": "john@co1t.com",
19         "to": "david@co1t.com",
20         "date": "2024-06-24",
21         "subject": "First Week Check-In",
22         "text": "Hello! I hope you're settling in well. Let's connect briefly tomorrow to discuss how your first week has been going. Also, make sure to join us for a welcoming lunch this Thursday at noon—it's a great opportunity to get to know your colleagues!",
23     },
24 ]

PYTHON

1 # Convert the documents to YAML format
2 yaml_docs = [yaml.dump(doc, sort_keys=False) for doc in emails]
3 
4 # Add the user query
5 query = "Any email about check ins?"
6 
7 # Rerank the documents
8 results = co.rerank(
9     model="rerank-v3.5",
10     query=query,
11     documents=yaml_docs,
12     top_n=2,
13 )
14 
15 return_results(results, emails)

Rank: 1
Score: 0.1979091
Document: {'from': 'john@co1t.com', 'to': 'david@co1t.com', 'date': '2024-06-24', 'subject': 'First Week Check-In', 'text': "Hello! I hope you're settling in well. Let's connect briefly tomorrow to discuss how your first week has been going. Also, make sure to join us for a welcoming lunch this Thursday at noon—it's a great opportunity to get to know your colleagues!"}
Rank: 2
Score: 9.535461e-05
Document: {'from': 'hr@co1t.com', 'to': 'david@co1t.com', 'date': '2024-06-24', 'subject': 'A Warm Welcome to Co1t!', 'text': "We are delighted to welcome you to the team! As you embark on your journey with us, you'll find attached an agenda to guide you through your first week."}

Reranking tabular data

Many enterprises rely on tabular data, such as relational databases, CSVs, and Excel. To perform reranking, you can transform a dataframe into a list of JSON records and use Rerank 3’s JSON capabilities to rank them. We follow the same steps in the previous example, where we convert the data into YAML format before passing it to the Rerank endpoint.

Here’s an example of reranking a CSV file that contains employee information.

PYTHON

1 import pandas as pd
2 from io import StringIO
3 
4 # Create a demo CSV file
5 data = """name,role,join_date,email,status
6 Rebecca Lee,Senior Software Engineer,2024-07-01,rebecca@co1t.com,Full-time
7 Emma Williams,Product Designer,2024-06-15,emma@co1t.com,Full-time
8 Michael Jones,Marketing Manager,2024-05-20,michael@co1t.com,Full-time
9 Amelia Thompson,Sales Representative,2024-05-20,amelia@co1t.com,Part-time
10 Ethan Davis,Product Designer,2024-05-25,ethan@co1t.com,Contractor"""
11 data_csv = StringIO(data)
12 
13 # Load the CSV file
14 df = pd.read_csv(data_csv)
15 df.head(1)

Here’s what the table looks like:

name	role	join_date	email	status
Rebecca Lee	Senior Software Engineer	2024-07-01	rebecca@co1t.com	Full-time

Below, we’ll get results from the Rerank endpoint:

PYTHON

1 # Define the documents
2 employees = df.to_dict("records")
3 
4 # Convert the documents to YAML format
5 yaml_docs = [yaml.dump(doc, sort_keys=False) for doc in employees]
6 
7 # Add the user query
8 query = "Any full-time product designers who joined recently?"
9 
10 # Rerank the documents
11 results = co.rerank(
12     model="rerank-v3.5",
13     query=query,
14     documents=yaml_docs,
15     top_n=1,
16 )
17 return_results(results, employees)

Rank: 1
Score: 0.986828
Document: {'name': 'Emma Williams', 'role': 'Product Designer', 'join_date': '2024-06-15', 'email': 'emma@co1t.com', 'status': 'Full-time'}

Multilingual reranking

The Rerank models (rerank-v3.5 and rerank-multilingual-v3.0) support 100+ languages. This means you can perform semantic search on texts in different languages.

In the example below, we repeat the steps of performing reranking with one difference – changing the model type to a multilingual one. Here, we use the rerank-v3.5 model. Here, we are reranking the FAQ list using an Arabic query.

PYTHON

1 # Define the query
2 query = "هل هناك مزايا تتعلق باللياقة البدنية؟"  # Are there fitness benefits?
3 
4 # Rerank the documents
5 results = co.rerank(
6     model="rerank-v3.5",
7     query=query,
8     documents=faqs,
9     top_n=1,
10 )
11 
12 return_results(results, faqs)

Rank: 1
Score: 0.42232594
Document: {'text': 'Health and Wellness Benefits: We care about your well-being and offer gym memberships, on-site yoga classes, and comprehensive health insurance.'}
Rank: 2
Score: 0.00025118678
Document: {'text': 'Performance Reviews Frequency: We conduct informal check-ins every quarter and formal performance reviews twice a year.'}

Conclusion

In this tutorial, you learned about:

How to rerank lexical/semantic search results
How to rerank semi-structured data
How to rerank tabular data
How to perform multilingual reranking

We have now seen two critical components of a powerful search system - semantic search, or dense retrieval (Part 4) and reranking (Part 5). These building blocks are essential for implementing RAG solutions.

In Part 6, you will learn how to implement RAG.