There are several things to take into account to achieve the best fine-tuned model for classification, all of which come down to giving the model higher-quality data.
- Text cleaning: Improving the quality of the data is often the best investment you can make when solving a problem with machine learning. If the text contains symbols, URLs, or HTML code that are not needed for the task, for example, make sure to remove them from the training file (and from the text you later send to the trained model).
- Number of examples: The minimum number of labeled examples is 40, but the more examples you can include, the better.
- Number of examples per class: In addition to the overall number of examples, it's important to have many examples of each class in the dataset. You should include at least five examples per label.
- Mix of examples in dataset: We recommend that you have a balanced (roughly equal) number of examples per class in your dataset. Imbalanced data is a well-known problem in machine learning, and can lead to sub-optimal results.
- Length of texts: The context size for text is currently 512 tokens. Any tokens beyond this limit are truncated.
- Deduplication: Ensure that each labeled example in your dataset is unique.
- High-quality test set: In the data upload step, upload a separate test set of examples that you want the model benchmarked on. These can be examples that were manually written or verified.
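The text-cleaning step above can be sketched in code. This is a minimal example using only the Python standard library; the exact symbols you strip depend on your task, and `clean_text` is a hypothetical helper name, not part of any SDK.

```python
import html
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, and extra whitespace from a training example."""
    text = html.unescape(text)                           # decode entities like &amp;
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return text

print(clean_text("Read <b>more</b> at https://example.com now"))
# → Read more at now
```

Remember to apply the same cleaning to the text you later send to the trained model, so that training and inference inputs match.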
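The dataset guidelines above (at least 40 examples overall, at least five per label, a roughly balanced mix, and no duplicates) can be checked before upload. A minimal sketch, assuming your examples are `(text, label)` tuples; `validate_dataset` and its imbalance heuristic (largest class more than twice the smallest) are illustrative, not an official rule.

```python
from collections import Counter

def validate_dataset(examples):
    """Return a list of warnings for (text, label) pairs; empty means all checks passed."""
    warnings = []
    if len(examples) < 40:
        warnings.append(f"only {len(examples)} examples; at least 40 are required")

    counts = Counter(label for _, label in examples)
    for label, n in counts.items():
        if n < 5:
            warnings.append(f"label '{label}' has only {n} examples; include at least 5")

    # Illustrative balance heuristic: flag if the largest class is more than
    # twice the size of the smallest.
    if counts and max(counts.values()) > 2 * min(counts.values()):
        warnings.append("classes are imbalanced; aim for a roughly equal mix")

    if len({text for text, _ in examples}) < len(examples):
        warnings.append("duplicate texts found; each labeled example must be unique")

    return warnings
```

Running this on your dataset before uploading catches the most common data issues early, when they are cheapest to fix.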
We have a dedicated guide for troubleshooting fine-tuned models which is consistent for all the different model types and endpoints. Check it out here.