There are several things to take into account to achieve the best fine-tuned model for classification, all of which come down to giving the model higher-quality data.
- Text cleaning: Improving the quality of the data is often the best investment you can make when solving a problem with machine learning. If the text contains symbols, URLs, or HTML code that are not needed for the task, for example, make sure to remove them from the training file (and from the text you later send to the trained model).
- Number of examples: The minimum number of labeled examples is 40, but the more examples you can include, the better.
- Number of examples per class: In addition to the overall number of examples, it's important to have many examples of each class in the dataset. You should include at least five examples per label.
- Mix of examples in dataset: We recommend that you have a balanced (roughly equal) number of examples per class in your dataset. Imbalanced data is a well-known problem in machine learning, and can lead to sub-optimal results.
- Length of texts: The context size for text is currently 512 tokens. Any tokens beyond this limit are truncated.
- Deduplication: Ensure that each labeled example in your dataset is unique.
- High-quality test set: In the data upload step, upload a separate test set of examples that you want the model benchmarked on. These can be examples that were manually written or verified.
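The text-cleaning step above can be sketched in code. This is a minimal example using only the Python standard library; the exact symbols you strip depend on your task, and `clean_text` is a hypothetical helper name, not part of any SDK.

```python
import html
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, and extra whitespace from a training example."""
    text = html.unescape(text)                           # decode entities like &amp;
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return text

print(clean_text("Read <b>more</b> at https://example.com now"))
# → Read more at now
```

Remember to apply the same cleaning to the text you later send to the trained model, so that training and inference inputs match.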
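The dataset guidelines above (at least 40 examples overall, at least five per label, a roughly balanced mix, and no duplicates) can be checked before upload. A minimal sketch, assuming your examples are `(text, label)` tuples; `validate_dataset` and its imbalance heuristic (largest class more than twice the smallest) are illustrative, not an official rule.

```python
from collections import Counter

def validate_dataset(examples):
    """Return a list of warnings for (text, label) pairs; empty means all checks passed."""
    warnings = []
    if len(examples) < 40:
        warnings.append(f"only {len(examples)} examples; at least 40 are required")

    counts = Counter(label for _, label in examples)
    for label, n in counts.items():
        if n < 5:
            warnings.append(f"label '{label}' has only {n} examples; include at least 5")

    # Illustrative balance heuristic: flag if the largest class is more than
    # twice the size of the smallest.
    if counts and max(counts.values()) > 2 * min(counts.values()):
        warnings.append("classes are imbalanced; aim for a roughly equal mix")

    if len({text for text, _ in examples}) < len(examples):
        warnings.append("duplicate texts found; each labeled example must be unique")

    return warnings
```

Running this on your dataset before uploading catches the most common data issues early, when they are cheapest to fix.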
We have a dedicated guide for troubleshooting fine-tuned models which is consistent for all the different model types and endpoints. Check it out here.