Preparing the Classify Fine-tuning data
In this section, we will walk through how you can prepare your data for fine-tuning models for Classification.
For classification fine-tunes we can choose between two types of datasets:
- Single-label data
- Multi-label data
To be able to start a fine-tune you need at least 40 examples. Each label needs to have at least 5 examples and there should be at least 2 unique labels.
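The minimums above can be checked locally before uploading. The following is a minimal sketch, assuming each single-label example is a dict with `text` and `label` keys; the function name `check_minimums` and the constant names are hypothetical and not part of any SDK.

```python
from collections import Counter

# Thresholds taken from the requirements stated above.
MIN_EXAMPLES = 40
MIN_PER_LABEL = 5
MIN_UNIQUE_LABELS = 2


def check_minimums(examples):
    """Return a list of problems found; an empty list means the minimums are met.

    `examples` is assumed to be a list of dicts with "text" and "label" keys.
    """
    problems = []
    if len(examples) < MIN_EXAMPLES:
        problems.append(f"need at least {MIN_EXAMPLES} examples, found {len(examples)}")

    # Count how many examples carry each label.
    counts = Counter(example["label"] for example in examples)
    if len(counts) < MIN_UNIQUE_LABELS:
        problems.append(f"need at least {MIN_UNIQUE_LABELS} unique labels, found {len(counts)}")

    for label, count in sorted(counts.items()):
        if count < MIN_PER_LABEL:
            problems.append(f"label {label!r} has {count} examples, need at least {MIN_PER_LABEL}")
    return problems
```

Returning a list of messages rather than raising on the first failure lets you see every shortfall in one pass.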
Single-label Data
Single-label data consists of a text and a label. Here’s an example:
- text: This movie offers that rare combination of entertainment and education
- label: positive
Note that both text and label are required fields. Single-label data can be saved in either .jsonl or .csv format.
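As an illustration of the two formats, here is a minimal sketch that serializes single-label examples to JSONL and to CSV using only the Python standard library. The helper names `to_jsonl` and `to_csv` are hypothetical, and whether the CSV file should carry a header row is an assumption; check the dataset API reference before relying on it.

```python
import csv
import io
import json


def _require_fields(row):
    # Both fields are required for single-label data.
    if not row.get("text") or not row.get("label"):
        raise ValueError(f"example is missing text or label: {row!r}")


def to_jsonl(rows):
    """Serialize examples as JSONL: one JSON object per line."""
    lines = []
    for row in rows:
        _require_fields(row)
        record = {"text": row["text"], "label": row["label"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def to_csv(rows, header=True):
    """Serialize examples as CSV with a text column and a label column.

    The header row is an assumption and can be turned off.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    if header:
        writer.writerow(["text", "label"])
    for row in rows:
        _require_fields(row)
        writer.writerow([row["text"], row["label"]])
    return buffer.getvalue()
```

When writing the result to disk, open the file with `encoding="utf-8"` so the output matches the encoding requirement.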
Multi-label Data
Multi-label data differs from single-label data in the following ways:
- We only accept jsonl format
- An example might have more than one label
- An example might also have 0 labels
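The differences listed above can be sketched as follows. This example assumes that a multi-label record stores its labels as a list of strings under the `label` key, with an empty list meaning no labels; that field layout and the helper name `to_multilabel_jsonl` are assumptions, so confirm the exact schema against the dataset API reference.

```python
import json


def to_multilabel_jsonl(rows):
    """Serialize multi-label examples as JSONL.

    Each row is assumed to look like {"text": str, "label": [str, ...]}.
    An empty list represents an example with 0 labels.
    """
    lines = []
    for row in rows:
        labels = row["label"]
        if not isinstance(labels, list) or not all(isinstance(item, str) for item in labels):
            raise ValueError(f"label must be a list of strings: {row!r}")
        record = {"text": row["text"], "label": labels}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```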
Clean your Dataset
To achieve optimal results, we suggest cleaning your dataset before beginning the fine-tuning process. Here are some things you might want to fix:
- Make sure that your dataset does not contain duplicate examples.
- Make sure that your examples are UTF-8 encoded.
If some of your examples don’t pass our validation checks, we’ll filter them out so that your fine-tuning job can start without interruption. As long as you have a sufficient number of valid training examples, you’re good to go.
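The two cleanup steps listed above can be done locally before upload. Below is a minimal sketch, assuming the raw contents of a JSONL file are available as bytes; the function name `clean_jsonl_bytes` is hypothetical, and this is not the validation the service itself runs.

```python
import json


def clean_jsonl_bytes(raw):
    """Drop lines that are not valid UTF-8 JSON and drop duplicate examples.

    Returns (kept_records, dropped_count).
    """
    kept = []
    seen = set()
    dropped = 0
    for line in raw.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            # Strict decoding raises on byte sequences that are not valid UTF-8.
            record = json.loads(line.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            dropped += 1
            continue
        # Serialize the label so that list-valued labels are hashable too.
        key = (record.get("text"), json.dumps(record.get("label"), sort_keys=True))
        if key in seen:
            dropped += 1
            continue
        seen.add(key)
        kept.append(record)
    return kept, dropped
```

Here two examples count as duplicates only when both text and label match exactly; you may prefer a looser rule, such as comparing text after stripping whitespace.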
Evaluation Datasets
Evaluation data is used to calculate metrics that measure the performance of your fine-tuned model. You can provide a validation dataset yourself, or you can let us split your training file into separate train and evaluation datasets on our end.
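If you prefer to create the evaluation set yourself, a simple random hold-out is one way to do it. The sketch below is an illustration only: the function name `split_train_eval`, the default 20% hold-out fraction, and the fixed seed are assumptions, and it does not reproduce how the automatic split is performed on our end.

```python
import random


def split_train_eval(examples, eval_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for evaluation.

    Returns (train_examples, eval_examples).
    """
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

A plain random split does not guarantee that every label appears in both parts, so it is worth checking the label counts of each part afterwards.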
Create a Dataset with the Python SDK
If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to create datasets for fine-tuning via our Python SDK. Before you start, we recommend that you read about the dataset API. Below you will find some code samples on how to create datasets via the SDK: