Preparing the Classify Fine-tuning data
In this section, we will walk through how you can prepare your data for fine-tuning models for Classification.
For classification fine-tunes, you can choose between two types of datasets: single-label and multi-label.
To start a fine-tune, you need at least 40 examples. Each label needs at least 5 examples, and there must be at least 2 unique labels.
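The minimums above can be checked locally before uploading anything. The following is a minimal sketch, not the service's own validator: the function name `check_minimums` is hypothetical, and it assumes single-label examples held in memory as dictionaries with `text` and `label` keys.

```python
from collections import Counter

# Thresholds taken from the requirements stated above.
MIN_EXAMPLES = 40
MIN_PER_LABEL = 5
MIN_UNIQUE_LABELS = 2


def check_minimums(examples):
    """Return a list of problems; an empty list means the minimums are met.

    `examples` is assumed to be a list of {"text": ..., "label": ...} dicts.
    """
    problems = []
    if len(examples) < MIN_EXAMPLES:
        problems.append(
            f"need at least {MIN_EXAMPLES} examples, found {len(examples)}"
        )

    # Count how many examples each label has.
    counts = Counter(ex["label"] for ex in examples)
    if len(counts) < MIN_UNIQUE_LABELS:
        problems.append(
            f"need at least {MIN_UNIQUE_LABELS} unique labels, found {len(counts)}"
        )
    for label, n in sorted(counts.items()):
        if n < MIN_PER_LABEL:
            problems.append(
                f"label {label!r} has {n} examples, needs at least {MIN_PER_LABEL}"
            )
    return problems
```

Running this on the training set before creating a dataset gives an early, readable report of which requirement is not yet satisfied.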
Single-label data consists of a text and a label. Here’s an example:
Note that both text and label are required fields. You can save single-label data in either .jsonl or .csv format.
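As an illustration of the two file formats, here is a small sketch that writes the same single-label examples to a .jsonl file and a .csv file using only the Python standard library. The example texts, label names, and helper names (`write_jsonl`, `write_csv`) are made up for illustration. Writing a `text,label` header row in the .csv file is an assumption; check the expected layout before uploading.

```python
import csv
import json

# Made-up single-label examples; both fields are required.
examples = [
    {"text": "The package arrived two weeks late", "label": "complaint"},
    {"text": "Thanks for the quick reply", "label": "praise"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line, each with a text and a label."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            record = {"text": row["text"], "label": row["label"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def write_csv(rows, path):
    """Write text,label rows; the header row is an assumption of this sketch."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        for row in rows:
            writer.writerow({"text": row["text"], "label": row["label"]})
```

Using the `csv` module rather than joining strings by hand keeps texts that contain commas or quotes correctly escaped.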
Multi-label data differs from single-label data in the following ways:
jsonl format
To achieve optimal results, we suggest cleaning your dataset before beginning the fine-tuning process. Here are some things you might want to fix:
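A cleaning pass can be sketched in a few lines. The specific checks here are assumptions chosen for illustration (trimming whitespace, dropping examples with an empty text or label, and removing exact duplicates); they are not a statement of what the validation checks actually look for. The helper name `clean_examples` is hypothetical, and the sketch handles single-label data where the label is a string.

```python
def clean_examples(examples):
    """Return a cleaned copy of single-label examples.

    Assumed cleaning steps (illustrative only):
    - strip surrounding whitespace from text and label
    - drop examples whose text or label is missing or empty
    - drop exact duplicate (text, label) pairs, keeping the first
    """
    seen = set()
    cleaned = []
    for ex in examples:
        text = (ex.get("text") or "").strip()
        label = (ex.get("label") or "").strip()
        if not text or not label:
            continue  # a required field is missing
        key = (text, label)
        if key in seen:
            continue  # exact duplicate of an earlier example
        seen.add(key)
        cleaned.append({"text": text, "label": label})
    return cleaned
```

Because cleaning can remove examples, it is worth re-checking the minimum example counts afterwards.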
If some of your examples don’t pass our validation checks, we’ll filter them out so that your fine-tuning job can start without interruption. As long as you have a sufficient number of valid training examples, you’re good to go.
Evaluation data is used to calculate metrics that describe the performance of your fine-tuned model. You can provide a validation dataset yourself, or you can let us split your training file into separate train and evaluation datasets.
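If you prefer to build the evaluation set yourself, one common approach is a per-label (stratified) split, so that every label appears in both files. The sketch below is one way to do that and is not the split performed on the service side; the function name `split_train_eval`, the 20% evaluation fraction, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def split_train_eval(examples, eval_fraction=0.2, seed=0):
    """Split single-label examples into (train, evaluation) lists per label.

    eval_fraction and seed are illustrative defaults, not documented values.
    """
    rng = random.Random(seed)

    # Group examples by label so each label is split separately.
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)

    train, evaluation = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        if len(group) > 1:
            # Hold out at least one example per label for evaluation.
            n_eval = max(1, round(len(group) * eval_fraction))
        else:
            n_eval = 0  # a single example stays in the training set
        evaluation.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evaluation
```

Keep in mind that the examples moved to the evaluation set no longer count toward the training minimums.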
If you intend to fine-tune through our UI, you can skip to the next chapter. Otherwise, continue reading to learn how to create datasets for fine-tuning via our Python SDK. Before you start, we recommend that you read about the dataset API. Below you will find some code samples on how to create datasets via the SDK: