
Preparing the Rerank Fine-tuning Data

This section walks through how to prepare your data for fine-tuning a Rerank model.

Data format

First, ensure your data is in JSONL format, with one JSON object per line. Each object has three fields, of which the first two are required:

  • query: This contains the question or target.
  • relevant_passages: This contains a list of documents or passages that contain information that answers the query.
  • hard_negatives: This contains examples that appear relevant to the query but ultimately are not, because they don’t contain the answer. They differ from easy negatives, which are totally unrelated to the query. Hard negatives are optional, but providing them leads to improvements in overall performance. We believe roughly five hard negatives per query lead to meaningful improvement, so include that many if you’re able to.

Here are a few example lines from a dataset that could be used to train a model that finds the paraphrased question most relevant to a target question.

JSON
{"query": "What are your views on the supreme court's decision to make playing national anthem mandatory in cinema halls?", "relevant_passages": ["What are your views on Supreme Court decision of must National Anthem before movies?"], "hard_negatives": ["Is the decision of SC justified by not allowing national anthem inside courts but making it compulsory at cinema halls?", "Why has the supreme court of India ordered that cinemas play the national anthem before the screening of all movies? Is it justified?", "Is it a good decision by SC to play National Anthem in the theater before screening movie?", "Why is the national anthem being played in theaters?", "What does Balaji Vishwanathan think about the compulsory national anthem rule?"]}
{"query": "Will Google's virtual monopoly in web search ever end? When?", "relevant_passages": ["Is Google's search monopoly capable of being disrupted?"], "hard_negatives": ["Who is capable of ending Google's monopoly in search?", "What is the future of Google?", "When will the Facebook era end?", "When will Facebook stop being the most popular?", "What happened to Google Search?"]}
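
Records in this shape can be written with Python's standard library. The following is a minimal sketch, not part of the Cohere SDK: the helper name `write_rerank_jsonl`, the sample records, and the choice to write an empty `hard_negatives` list when none are available are all assumptions made for illustration.

```python
import json
import os
import tempfile


def write_rerank_jsonl(records, path):
    """Write one JSON object per line (JSONL), encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            row = {
                "query": rec["query"],
                "relevant_passages": list(rec["relevant_passages"]),
                # Assumption: an empty list stands in for "no hard negatives".
                "hard_negatives": list(rec.get("hard_negatives", [])),
            }
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Made-up records, for illustration only
records = [
    {
        "query": "How do I reset my password?",
        "relevant_passages": ["Steps to reset a forgotten password"],
        "hard_negatives": ["How do I change my username?"],
    },
    {
        "query": "What is the refund policy?",
        "relevant_passages": ["Refunds are available within 30 days"],
    },
]

out_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_rerank_jsonl(records, out_path)
```

Using `json.dumps` for each line, rather than building strings by hand, takes care of quoting and escaping inside queries and passages.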

Data Requirements

To pass the validation tests Cohere performs on uploaded data, ensure that:

  • Every query has at least one entry in relevant_passages.
  • Your dataset contains at least 256 unique queries in total.
  • Your data is encoded in UTF-8.
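
These requirements can be checked locally before uploading. The sketch below is an illustration only and does not reproduce Cohere's actual validation logic: the function name `check_rerank_file` and the wording of the messages are hypothetical, and only the three rules listed above are checked.

```python
import json
import os
import tempfile

MIN_UNIQUE_QUERIES = 256  # threshold taken from the requirements above


def check_rerank_file(path):
    """Return a list of problems found; an empty list means all checks passed."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    problems = []
    queries = set()
    for i, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, such as a trailing newline
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {i}: not a JSON object")
            continue
        query = row.get("query")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"line {i}: missing query")
            continue
        queries.add(query)
        passages = row.get("relevant_passages")
        if not isinstance(passages, list) or not passages:
            problems.append(f"line {i}: no relevant_passages")

    if len(queries) < MIN_UNIQUE_QUERIES:
        problems.append(
            f"only {len(queries)} unique queries; need at least {MIN_UNIQUE_QUERIES}"
        )
    return problems


# Usage with a synthetic file of 256 distinct queries
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.jsonl")
with open(good_path, "w", encoding="utf-8") as f:
    for n in range(256):
        row = {"query": f"question {n}", "relevant_passages": [f"answer {n}"]}
        f.write(json.dumps(row) + "\n")

print(check_rerank_file(good_path))
```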

Evaluation Datasets

Evaluation data is used to calculate metrics that measure the performance of your fine-tuned model. You can either provide an evaluation dataset yourself, or let us split your training file into separate train and evaluation datasets.
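
If you choose to build the evaluation set yourself, one possible approach is to split by query so that no query appears in both files. This is a sketch under assumptions: the function name `split_train_eval`, the 20% evaluation fraction, and the fixed seed are arbitrary illustrative choices, not Cohere recommendations.

```python
import random


def split_train_eval(rows, eval_fraction=0.2, seed=0):
    """Split rows into (train, evaluation) with no query shared between them."""
    # Sort before shuffling so the split is reproducible for a given seed
    queries = sorted({row["query"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(queries)

    # Hold out at least one query for evaluation
    n_eval = max(1, int(len(queries) * eval_fraction))
    eval_queries = set(queries[:n_eval])

    train = [r for r in rows if r["query"] not in eval_queries]
    evaluation = [r for r in rows if r["query"] in eval_queries]
    return train, evaluation


# Made-up rows, for illustration only
rows = [
    {"query": f"question {n}", "relevant_passages": [f"answer {n}"]}
    for n in range(10)
]
train_rows, eval_rows = split_train_eval(rows)
print(len(train_rows), len(eval_rows))
```

Each resulting list can then be written to its own JSONL file and uploaded as the training and evaluation data respectively.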

Create a Dataset with the Python SDK

If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to create datasets for fine-tuning via our Python SDK. Before you start, we recommend that you read about the dataset API. Below are some code samples showing how to create datasets via the SDK:

PYTHON
import cohere

# instantiate the Cohere client
co = cohere.ClientV2("YOUR_API_KEY")

# create a dataset from a training file only
rerank_dataset = co.datasets.create(
    name="rerank-dataset",
    data=open("path/to/train.jsonl", "rb"),
    type="reranker-finetune-input",
)
# wait for validation to finish, then print the result
print(co.wait(rerank_dataset))

# create a dataset with a separate evaluation file
rerank_dataset_with_eval = co.datasets.create(
    name="rerank-dataset-with-eval",
    data=open("path/to/train.jsonl", "rb"),
    eval_data=open("path/to/eval.jsonl", "rb"),
    type="reranker-finetune-input",
)
print(co.wait(rerank_dataset_with_eval))