Preparing the Rerank Fine-tuning Data
In this section, we walk through how to prepare your data for Rerank fine-tuning.
Data format
First, ensure your data is in JSONL format. Each line is a JSON object with the following fields; query and relevant_passages are required, and hard_negatives is optional:
- query: This contains the question or target.
- relevant_passages: This contains a list of documents or passages that contain information that answers the query.
- hard_negatives: This contains examples that appear to be relevant to the query but ultimately are not because they don’t contain the answer. They differ from easy negatives, which are totally unrelated to the query. Hard negatives are optional, but providing them leads to improvements in overall performance. We believe roughly five hard negatives lead to meaningful improvement, so include that many if you’re able to.
Here are a few example lines from a dataset that could be used to train a model that finds the paraphrased question most relevant to a target question.
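As a rough illustration of the format, the sketch below builds two records and serializes them as JSONL. The questions are invented for this sketch and are not taken from a real dataset; the helper name to_jsonl is also hypothetical.

```python
import json

# Illustrative records only: these questions are invented for this sketch.
records = [
    {
        "query": "How do I reset my password?",
        "relevant_passages": ["What are the steps to change a forgotten password?"],
        "hard_negatives": [
            "How do I change my username?",
            "How do I reset my router?",
        ],
    },
    {
        "query": "What is the capital of France?",
        "relevant_passages": ["Which city is France's capital?"],
        # hard_negatives is optional and is omitted for this record.
    },
]


def to_jsonl(rows):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Each printed line is a standalone JSON object, which is what distinguishes JSONL from a single JSON array.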
Data Requirements
To pass the validation tests Cohere performs on uploaded data, ensure that:
- There is at least one relevant_passage for every query.
- Your dataset contains at least 256 unique queries in total.
- Your data is encoded in UTF-8.
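Before uploading, it can help to run a local pre-check against the requirements listed above. The following is a minimal sketch, not Cohere's actual validator: the function name check_rerank_dataset and the wording of its messages are assumptions made for illustration.

```python
import json

MIN_UNIQUE_QUERIES = 256  # threshold stated in the requirements above


def check_rerank_dataset(raw_bytes):
    """Return a list of problems found in JSONL training data given as bytes."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return ["data is not valid UTF-8"]

    problems = []
    queries = set()
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(row, dict) or not isinstance(row.get("query"), str):
            problems.append(f"line {number}: missing query")
            continue
        queries.add(row["query"])
        if not row.get("relevant_passages"):
            problems.append(f"line {number}: no relevant_passages")

    if len(queries) < MIN_UNIQUE_QUERIES:
        problems.append(
            f"only {len(queries)} unique queries; need at least {MIN_UNIQUE_QUERIES}"
        )
    return problems
```

An empty list means the file passed these local checks; the upload may still be rejected for reasons this sketch does not cover.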
Evaluation Datasets
Evaluation data is used to calculate metrics that show how well your fine-tuned model performs. You can either supply an evaluation dataset yourself or let us split your training file into separate train and evaluation datasets.
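If you choose to create the evaluation dataset yourself, one simple approach is a seeded random split of your records. The sketch below assumes each query appears in only one record; the function name split_train_eval and the 20% default are arbitrary choices for illustration, not Cohere recommendations.

```python
import random


def split_train_eval(rows, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split off an evaluation set."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Note that the data requirements above still apply to the training portion that remains after the split.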
Create a Dataset with the Python SDK
If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to create datasets for fine-tuning via our Python SDK. Before you start, we recommend that you read about the dataset API. Below you will find some code samples showing how to create datasets via the SDK:
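The sketch below wraps the upload in a small helper. It assumes the client exposes datasets.create with name, data, type, and an optional eval_data parameter, and that the dataset type for Rerank is "reranker-finetune-input"; verify both against the dataset API reference before relying on them. The helper name create_rerank_dataset and the file paths are hypothetical.

```python
def create_rerank_dataset(client, name, train_path, eval_path=None):
    """Upload a Rerank fine-tuning dataset through a Cohere-style client.

    Assumption: client.datasets.create accepts name, data, type, and an
    optional eval_data. Check these against the dataset API reference.
    """
    with open(train_path, "rb") as train_file:
        if eval_path is None:
            # No evaluation file: only the training data is uploaded.
            return client.datasets.create(
                name=name,
                data=train_file,
                type="reranker-finetune-input",
            )
        with open(eval_path, "rb") as eval_file:
            # Upload your own evaluation data alongside the training data.
            return client.datasets.create(
                name=name,
                data=train_file,
                eval_data=eval_file,
                type="reranker-finetune-input",
            )


# Example usage (requires the cohere package and an API key):
#   import cohere
#   co = cohere.Client("YOUR_API_KEY")
#   dataset = create_rerank_dataset(co, "rerank-dataset", "path/to/train.jsonl")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before making a real upload.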