Preparing the Generate Fine-tuning Data

In this section, we will walk through how to prepare your data for fine-tuning a Generate model.

To be able to start a fine-tuning job, your dataset must contain at least 32 examples.

Data format

Data is stored as tuples, where each tuple consists of a prompt and its corresponding completion. Here's an example:

  • prompt: What is the capital of France?
  • completion: The capital of France is Paris.

You can save your prompt-completion data in either a .jsonl or .csv format.

jsonl

{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}

csv

prompt,completion
What is the capital of France?,The capital of France is Paris.
What is the smallest state in the USA?,The smallest state in the USA is Rhode Island.
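
To produce the .csv variant programmatically, Python's csv module is a safe choice because it quotes any field that itself contains a comma (a sketch; the train.csv file name is our own):

import csv

examples = [
    {"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."},
    {"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."},
]

# DictWriter quotes fields that contain commas, so completions stay intact
with open("train.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "completion"])
    writer.writeheader()
    writer.writerows(examples)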

Cleaning your Dataset

For optimal results, we suggest cleaning your dataset before beginning the fine-tuning process.

A list of things you might want to fix includes (a small script that checks for these appears after the list):

  • Making sure that your dataset does not contain duplicate examples.
  • Making sure that your examples are utf-8 encoded.
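
The following is a minimal pre-flight check along those lines (a sketch; it assumes your data lives in train.jsonl and reports problems rather than fixing them):

import json

seen = set()
valid_examples = []

with open("train.jsonl", "rb") as f:
    for line_number, raw_line in enumerate(f, start=1):
        # examples must be utf-8 encoded
        try:
            line = raw_line.decode("utf-8")
        except UnicodeDecodeError:
            print(f"line {line_number}: not valid utf-8, skipping")
            continue
        example = json.loads(line)
        # drop exact duplicate examples
        key = (example["prompt"], example["completion"])
        if key in seen:
            print(f"line {line_number}: duplicate, skipping")
            continue
        seen.add(key)
        valid_examples.append(example)

# a fine-tuning job needs at least 32 examples to start
print(f"{len(valid_examples)} valid examples (minimum required: 32)")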

If some of your examples don't pass our validation checks, we'll filter them out so that your fine-tuning job can start without interruption. As long as you have a sufficient number of valid training examples, you're good to go.

Evaluation Datasets

Evaluation data is used to determine the performance of your fine-tuned model based on a number of metrics. You can provide a validation dataset yourself; if you prefer not to, we will split your training file into separate train and evaluation datasets.
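
If you would rather control the split yourself, one simple approach is to hold back a fraction of your training examples as an evaluation set (a sketch; the 90/10 split and file names are arbitrary choices of ours):

import random

with open("train.jsonl", encoding="utf-8") as f:
    lines = f.readlines()

# shuffle, then hold out 10% of the examples for evaluation
random.seed(0)
random.shuffle(lines)
split = int(len(lines) * 0.9)

with open("train_split.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[:split])
with open("eval_split.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[split:])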

Adding a Dataset with the Python SDK

If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to add datasets via our Python SDK so that they can be accessed and used later. Before you start, we recommend reading about the dataset API.

Below are some code samples showing how to create datasets via the SDK:

import cohere

# instantiate the Cohere client
co = cohere.Client("YOUR_API_KEY")

prompt_completion_dataset = co.datasets.create(name="prompt-completion-dataset",
                                              data=open("path/to/train.csv", "rb"),
                                              type="prompt-completion-finetune-input")

print(co.wait(prompt_completion_dataset))

# add an evaluation dataset

prompt_completion_dataset_with_eval = co.datasets.create(name="prompt-completion-dataset-with-eval",
                                                        data=open("path/to/train.jsonl", "rb"),
                                                        eval_data=open("path/to/eval.jsonl", "rb"),
                                                        dataset_type="prompt-completion-finetune-input")

print(co.wait(prompt_completion_dataset_with_eval))

# if the validation fails, you can check the dataset object to see what the error was
print(prompt_completion_dataset)
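
Note that co.wait waits for the dataset to finish validating before returning it, so the object you print afterwards reflects the final validation outcome rather than an in-progress state.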