Preparing the Chat Fine-tuning Data
In this section, we will walk through how you can prepare your data for fine-tuning one of the Command family of models for Chat.
Data format
First, ensure your data is in jsonl format. It should have the following structure:
- messages: This contains a list of the messages in the conversation.
A message consists of the following parts:
- role: The current speaker. You can pick from System, User, or Chatbot.
- content: This contains the content of the message.
Here is a chat example that trains a chatbot to answer questions. Notice that, for the sake of readability, the document spans multiple lines. For your dataset, make sure that each line contains one whole example.
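As an illustration, the following minimal Python sketch builds one conversation and serializes it onto a single line, as the jsonl format requires. The conversation text is invented for this example; only the messages, role and content keys and the three role values come from the description above.

```python
import json

# Hypothetical conversation; the message text is invented for illustration.
example = {
    "messages": [
        {"role": "System", "content": "You are a helpful assistant that answers questions concisely."},
        {"role": "User", "content": "What is the capital of France?"},
        {"role": "Chatbot", "content": "The capital of France is Paris."},
    ]
}


def to_jsonl_line(conversation: dict) -> str:
    """Serialize one conversation as a single line with no raw newlines."""
    # json.dumps escapes newlines inside strings, so the result stays on one line.
    return json.dumps(conversation, ensure_ascii=False)


line = to_jsonl_line(example)
print(line)
```

Writing one such line per conversation to a file opened with encoding="utf-8" gives a dataset in the shape described in this section.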
Data Requirements
To pass the validation tests Cohere performs on uploaded data, ensure that:
- You have the proper roles. There are only three acceptable values for the role field: System, Chatbot or User. There should be at least one instance of Chatbot and User in each conversation. If your dataset includes other roles, an error will be thrown.
- A preamble should be uploaded as the first message in the conversation, with role: System. All other messages with role: System will be treated as speakers in the conversation.
- Each turn in the conversation should be within the training context length of 16384 tokens to avoid being dropped from the dataset. We explain a turn in the “Chat Customization Best Practices” section below.
- Your data is encoded in UTF-8.
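Some of these requirements can be checked locally before uploading. The sketch below is not Cohere's validator; it is a hypothetical pre-check (the names check_conversation and check_jsonl_file are made up here) that covers only the role rules and the UTF-8 requirement listed above. It does not count tokens.

```python
import json

ALLOWED_ROLES = {"System", "User", "Chatbot"}


def check_conversation(conversation) -> list:
    """Return a list of problems found in one conversation (empty if none)."""
    if not isinstance(conversation, dict):
        return ["each line must be a JSON object"]
    messages = conversation.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    roles = []
    for index, message in enumerate(messages):
        role = message.get("role") if isinstance(message, dict) else None
        roles.append(role)
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {role!r}")
    # Each conversation needs at least one User and one Chatbot message.
    for required in ("User", "Chatbot"):
        if required not in roles:
            problems.append(f"no message with role {required!r}")
    return problems


def check_jsonl_file(path: str) -> dict:
    """Map 1-based line numbers to problems; reading as UTF-8 surfaces encoding errors."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                problems = check_conversation(json.loads(line))
            except json.JSONDecodeError as error:
                problems = [f"invalid JSON: {error.msg}"]
            if problems:
                report[number] = problems
    return report
```

An empty report from check_jsonl_file only means these particular checks passed; the upload may still be rejected for other reasons.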
Evaluation Datasets
Evaluation data is used to calculate metrics that describe the performance of your fine-tuned model. You can either provide a validation dataset yourself or let us split your training file into separate train and evaluation datasets.
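If you prefer to create the evaluation set yourself, one simple approach is a seeded random split. The sketch below is an assumption-laden illustration, not the procedure Cohere applies: the function name, the 20% default and the fixed seed are all choices made for this example.

```python
import random


def split_train_eval(conversations: list, eval_fraction: float = 0.2, seed: int = 0):
    """Shuffle and split conversations, keeping at least one item on each side."""
    if len(conversations) < 2:
        raise ValueError("need at least two conversations to split")
    shuffled = list(conversations)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    eval_size = max(1, round(len(shuffled) * eval_fraction))
    eval_size = min(eval_size, len(shuffled) - 1)
    return shuffled[eval_size:], shuffled[:eval_size]


train, evaluation = split_train_eval(list(range(10)))
print(len(train), len(evaluation))
```

Each element can be a parsed conversation or a raw jsonl line; the two returned lists can then be written to separate files.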
Create a Dataset with the Python SDK
If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to create datasets for fine-tuning via our Python SDK. Before you start, we recommend that you read about datasets. Please also see ‘Data Formatting and Requirements’ in ‘Using the Python SDK’ in the next chapter for a full table of expected validation errors. Below you will find some code samples on how to create datasets via the SDK:
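The original samples are not reproduced here, so the following is a hedged sketch rather than the official snippet. It assumes the SDK client exposes datasets.create with name, data, eval_data and type arguments, and that "chat-finetune-input" is the dataset type for chat fine-tuning; check both against the SDK reference for your installed version. The helper name create_chat_dataset and the file paths are hypothetical.

```python
from contextlib import ExitStack
from typing import Any, Optional


def create_chat_dataset(client: Any, name: str, train_path: str,
                        eval_path: Optional[str] = None) -> Any:
    """Upload a chat fine-tuning dataset, optionally with an evaluation file.

    Assumes `client.datasets.create` accepts these keyword arguments.
    """
    with ExitStack() as stack:
        kwargs = {
            "name": name,
            "data": stack.enter_context(open(train_path, "rb")),
            "type": "chat-finetune-input",  # assumed dataset type name
        }
        if eval_path is not None:
            kwargs["eval_data"] = stack.enter_context(open(eval_path, "rb"))
        return client.datasets.create(**kwargs)


# Usage sketch (requires the `cohere` package and an API key; not run here):
# import cohere
# co = cohere.Client("YOUR_API_KEY")
# response = create_chat_dataset(co, "chat-dataset", "path/to/train.jsonl")
# response = create_chat_dataset(co, "chat-dataset-with-eval",
#                                "path/to/train.jsonl", "path/to/eval.jsonl")
```

Passing the client in as an argument keeps the helper independent of how the client is constructed.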
Chat Customization Best Practices
A turn includes all messages up to the Chatbot speaker. The following conversation has two turns:
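To make the definition concrete, the sketch below groups a made-up conversation into turns by closing a turn at each Chatbot message. The conversation text and the helper name split_into_turns are invented for this example. How a leading System preamble is counted is not spelled out here, so the example leaves it out.

```python
def split_into_turns(messages: list) -> list:
    """Group messages into turns; each turn ends with a Chatbot message."""
    turns, current = [], []
    for message in messages:
        current.append(message)
        if message["role"] == "Chatbot":
            turns.append(current)
            current = []
    # Trailing messages without a Chatbot reply are not counted as a turn here.
    return turns


# Hypothetical conversation with two turns.
conversation = [
    {"role": "User", "content": "Hello"},
    {"role": "Chatbot", "content": "Hi! How can I help you?"},
    {"role": "User", "content": "What makes a good running route?"},
    {"role": "Chatbot", "content": "A flat, well-lit path with little traffic works well."},
]

turns = split_into_turns(conversation)
print(len(turns))
```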
A few things to bear in mind:
- The preamble is always kept within the context window. This means that the preamble and all turns within the context window should be within 16384 tokens.
- To check how many tokens your data contains, you can use the co.tokenize() API.
- If any turns are above the context length of 16384 tokens, we will drop them from the training data.
- If an evaluation file is not uploaded, we will make our best effort to automatically split your uploaded conversations 80/20 into training and evaluation sets. For example, if you upload a training dataset containing only the minimum of two conversations, we’ll randomly put one of them in the training set and the other in the evaluation set.
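The token limit can be checked approximately before uploading. The sketch below flags turns whose token count, together with the preamble's, exceeds 16384. It is a conservative reading of the points above, not Cohere's exact accounting, which may also count role markers or special tokens. The token counter is passed in as a function; whitespace_count is a crude stand-in for demonstration, and in practice you would substitute a counter built on co.tokenize().

```python
from typing import Callable, List

MAX_TOKENS = 16384  # training context length stated in this section


def oversized_turns(preamble: str, turns: List[List[dict]],
                    count_tokens: Callable[[str], int],
                    limit: int = MAX_TOKENS) -> List[int]:
    """Return indices of turns whose tokens, plus the preamble's, exceed the limit."""
    preamble_tokens = count_tokens(preamble) if preamble else 0
    flagged = []
    for index, turn in enumerate(turns):
        turn_tokens = sum(count_tokens(message["content"]) for message in turn)
        if preamble_tokens + turn_tokens > limit:
            flagged.append(index)
    return flagged


def whitespace_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, not real model tokens."""
    return len(text.split())
```

Because real tokenizers usually produce more tokens than there are words, treat a pass with the stand-in counter as a rough lower bound only.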