Preparing the Chat Fine-tuning Data

In this section, we will walk through how you can prepare your data for fine-tuning one of the Command family of models for Chat.

Data format

First, ensure your data is in JSONL format. Each example should have the following structure:

messages: This contains the list of messages in the conversation.

A message consists of the following parts:

role: The current speaker. You can pick from System, User, or Chatbot.
content: This contains the content of the message.

Here is a chat example that trains a chatbot to answer questions. Notice that, for the sake of readability, the example spans multiple lines. In your dataset, make sure that each line contains one whole example.

JSON

{
  "messages": [
    {
      "role": "System",
      "content": "You are a large language model trained by Cohere."
    },
    {
      "role": "User",
      "content": "Hi! What were Time magazines top 10 cover stories in the last 10 years?"
    },
    {
      "role": "Chatbot",
      "content": "Time magazines top 10 cover stories in the last 10 years were:\\n\\n1. Volodymyr Zelenskyy\\n2. Elon Musk\\n3. Martin Luther King Jr.\\n4. How Earth Survived\\n5. Her Lasting Impact\\n6. Nothing to See Here\\n7. Meltdown\\n8. Deal With It\\n9. The Top of America\\n10. Bitter Pill"
    },
    {
      "role": "User",
      "content": "Who is Volodymyr Zelenskyy?"
    },
    {
      "role": "Chatbot",
      "content": "Volodymyr Zelenskyy is a Ukrainian politician who was elected President of Ukraine in 2019. He previously served as the Minister of Internal Affairs in the government of Prime Minister Volodymyr Groysman."
    },
    {
      "role": "User",
      "content": "Thank you!"
    }
  ]
}
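The one-example-per-line requirement can be met by serializing each conversation without indentation. The following is a minimal sketch using only the Python standard library; the helper names `to_jsonl_line` and `write_jsonl` are illustrative and are not part of the Cohere SDK.

```python
import json


def to_jsonl_line(conversation):
    """Serialize one conversation dict as a single-line JSON string."""
    # Without `indent`, json.dumps emits one line; newlines inside message
    # content are escaped, so they cannot break the one-example-per-line rule.
    return json.dumps(conversation, ensure_ascii=False)


def write_jsonl(conversations, path):
    """Write one conversation per line to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8") as f:
        for conversation in conversations:
            f.write(to_jsonl_line(conversation) + "\n")


example = {
    "messages": [
        {"role": "System", "content": "You are a large language model trained by Cohere."},
        {"role": "User", "content": "Who is Volodymyr Zelenskyy?"},
        {"role": "Chatbot", "content": "First line of the answer.\nSecond line of the answer."},
    ]
}

line = to_jsonl_line(example)
```

Calling `write_jsonl([example], "train.jsonl")` would then produce a file with exactly one line per conversation.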

Data Requirements

To pass the validation tests Cohere performs on uploaded data, ensure that:

You have the proper roles. There are only three acceptable values for the role field: System, Chatbot or User. There should be at least one instance of Chatbot and User in each conversation. If your dataset includes other roles, an error will be thrown.
A system instruction should be uploaded as the first message in the conversation, with role: System. All other messages with role: System will be treated as speakers in the conversation.
Each turn in the conversation should be within the training context length of 16384 tokens to avoid being dropped from the dataset. We explain what a turn is in the “Chat Customization Best Practices” section below.
Your data is encoded in UTF-8.
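Some of these requirements can be checked locally before uploading. The sketch below covers only the role rules listed above; it is an illustration, not Cohere's actual validation logic, and the function name `validate_conversation` is hypothetical. Token-length checks are left out because they depend on the model's tokenizer.

```python
ALLOWED_ROLES = {"System", "User", "Chatbot"}


def validate_conversation(conversation):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    messages = conversation.get("messages", [])
    roles = [message.get("role") for message in messages]

    # Only System, User and Chatbot are accepted role values.
    unknown = sorted({str(role) for role in roles if role not in ALLOWED_ROLES})
    if unknown:
        problems.append(f"unsupported roles: {unknown}")

    # Each conversation needs at least one User and one Chatbot message.
    for required in ("User", "Chatbot"):
        if required not in roles:
            problems.append(f"missing required role: {required}")

    # A System message after the first position is not rejected, but it is
    # treated as a speaker rather than as the system instruction.
    if "System" in roles[1:]:
        problems.append("note: System message after the first position is treated as a speaker")

    return problems
```

Running this over every line of the file before creating a dataset gives earlier feedback than waiting for the upload to be validated.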

Evaluation Datasets

Evaluation data is used to calculate metrics that reflect the performance of your fine-tuned model. You can provide a validation dataset yourself, or let us split your training file into separate train and evaluation datasets.

Create a Dataset with the Python SDK

If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to create datasets for fine-tuning via our Python SDK. Before you start, we recommend that you read about datasets. Please also see ‘Data Formatting and Requirements’ in ‘Using the Python SDK’ in the next chapter for a full table of expected validation errors. Below you will find some code samples showing how to create datasets via the SDK:

PYTHON

import cohere

# instantiate the Cohere client
co = cohere.Client("YOUR_API_KEY")

# upload a training file only; the evaluation split is created for you
chat_dataset = co.datasets.create(
    name="chat-dataset",
    data=open("path/to/train.jsonl", "rb"),
    type="chat-finetune-input",
)
# wait until the dataset has been processed, then print the result
print(co.wait(chat_dataset))

# upload a training file together with your own evaluation file
chat_dataset_with_eval = co.datasets.create(
    name="chat-dataset-with-eval",
    data=open("path/to/train.jsonl", "rb"),
    eval_data=open("path/to/eval.jsonl", "rb"),
    type="chat-finetune-input",
)
print(co.wait(chat_dataset_with_eval))

Chat Customization Best Practices

A turn includes all messages up to the Chatbot speaker. The following conversation has two turns:

JSON

{
  "messages": [
    {
      "role": "System",
      "content": "You are a chatbot trained to answer to my every question."
    },
    {
      "role": "User",
      "content": "Hello"
    },
    {
      "role": "Chatbot",
      "content": "Greetings! How can I help you?"
    },
    {
      "role": "User",
      "content": "What makes a good running route?"
    },
    {
      "role": "Chatbot",
      "content": "A sidewalk-lined road is ideal so that you’re up and off the road away from vehicular traffic."
    }
  ]
}
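To make the definition of a turn concrete, here is a minimal sketch that groups a conversation's messages into turns. It assumes that a leading System message is kept apart from the turns, and that trailing messages with no Chatbot reply do not count as a complete turn; both are assumptions made for illustration, and `split_into_turns` is a hypothetical helper, not an SDK function.

```python
def split_into_turns(messages):
    """Group messages into turns, each ending with a Chatbot message.

    Returns (system_message, turns, remainder), where remainder holds any
    trailing messages that are not followed by a Chatbot reply.
    """
    system_message = None
    if messages and messages[0]["role"] == "System":
        system_message, messages = messages[0], messages[1:]

    turns, current = [], []
    for message in messages:
        current.append(message)
        if message["role"] == "Chatbot":
            # A Chatbot message closes the current turn.
            turns.append(current)
            current = []
    return system_message, turns, current


conversation = [
    {"role": "System", "content": "You are a chatbot trained to answer to my every question."},
    {"role": "User", "content": "Hello"},
    {"role": "Chatbot", "content": "Greetings! How can I help you?"},
    {"role": "User", "content": "What makes a good running route?"},
    {"role": "Chatbot", "content": "A sidewalk-lined road is ideal."},
]

system_message, turns, remainder = split_into_turns(conversation)
```

For this conversation, `turns` holds two lists of two messages each and `remainder` is empty.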

A few things to bear in mind:

The preamble (the system instruction) is always kept within the context window. This means that the preamble and all turns within the context window should be within 16384 tokens.
To check how many tokens your data contains, you can use the co.tokenize() API.
If any turns are above the context length of 16384 tokens, we will drop them from the training data.
If an evaluation file is not uploaded, we will make our best effort to automatically split your uploaded conversations into an 80/20 split. For example, if you upload a training dataset containing only the minimum of two conversations, we’ll randomly put one of them in the training set, and the other in the evaluation set.
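If you would rather control the split yourself and upload a separate evaluation file, a random 80/20 split can be sketched as follows. This is an illustration only: how Cohere performs its automatic split is not specified here, and the function name, the `seed` parameter, and the rounding rule are assumptions.

```python
import random


def split_train_eval(conversations, eval_fraction=0.2, seed=0):
    """Randomly split conversations into (train, eval) lists."""
    if len(conversations) < 2:
        raise ValueError("need at least two conversations to split")

    shuffled = list(conversations)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    # Keep at least one example on each side of the split.
    n_eval = round(len(shuffled) * eval_fraction)
    n_eval = min(len(shuffled) - 1, max(1, n_eval))
    return shuffled[n_eval:], shuffled[:n_eval]
```

With two conversations, this places one in each set, which matches the minimum case described above.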