Training a Representation Model
In this article, we look at training a representation model, which covers both the Embed and Classify endpoints.
See here if you'd like to get an overview of training a generation model.
A Text Classification Example
Text classification is one of the most common language understanding tasks. A lot of business use cases can be mapped to text classification. Examples include:
- Evaluating the tone and sentiment of an incoming customer message (e.g. classes: “positive” and “negative”)
- Routing incoming customer messages to the appropriate agent (e.g. classes: “billing”, “tech support”, “other”)
- Evaluating if a user comment needs to be flagged for moderator attention (e.g. classes: “flag for moderation”, “neutral”)
In this article, we'll train a representation model for sentiment classification.
Why Train a Representation Model
Training leads to the best classification results a language model can achieve. That said, untrained baseline embeddings can perform well on many tasks (see the text classification article for an example of how to train a sentiment classifier on top of baseline embedding models). But if we need that extra boost in performance, training turns our LLM into a specialist for the task we care about.
How to Train a Representation Model
The training file is a Comma Separated Values (CSV) file with a column for text and another for the number of the class. The CSV file can be prepared in Excel or in plain text format, with one example per line, like this:
My order was late, 0
Shipping was fast!, 1
Order arrived on time, 1
Items are always sold out, 1
That CSV file is then what you upload in the Representation training dialog box in the Playground.
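If the examples already live in code rather than in a spreadsheet, the file can also be generated programmatically. The snippet below is a minimal sketch using Python's standard csv module; the example rows and the function name write_training_csv are illustrative assumptions, not part of the platform.

```python
import csv
import io

# Hypothetical examples: (text, class number). The last row contains a comma.
EXAMPLES = [
    ("My order was late", 0),
    ("Shipping was fast!", 1),
    ("Order arrived on time", 1),
    ("Great price, great service", 1),
]

def write_training_csv(rows, handle):
    """Write (text, label) pairs as two-column CSV rows."""
    writer = csv.writer(handle)
    for text, label in rows:
        writer.writerow([text, label])

# Write to an in-memory buffer here; for a real file, use
# open("train.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_training_csv(EXAMPLES, buffer)
print(buffer.getvalue())
```

Using a CSV writer rather than joining strings by hand matters when a text contains a comma: the writer quotes that field so the row still has exactly two columns.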
What Training a Representation Model Does
A representation LLM is excellent at generating sentence embeddings (lists of numbers that capture the meaning of the sentences). These embeddings are great at indicating how similar sentences are to each other. We can plot them to explore their similarities and differences (points that are close together have similar embeddings).
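To make "similar embeddings" concrete, here is a small sketch that compares vectors with cosine similarity, a common (but not the only) way to measure how close two embeddings are. The three-dimensional vectors are made up for illustration; real embeddings have many more dimensions and come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings" for three customer messages.
shipping_fast = [0.9, 0.1, 0.2]   # "Shipping was fast!"
shipping_late = [0.8, 0.2, 0.1]   # "My order was late"
sold_out = [0.1, 0.9, 0.3]        # "Items are always sold out"

same_topic = cosine_similarity(shipping_fast, shipping_late)
other_topic = cosine_similarity(shipping_fast, sold_out)
print(f"shipping vs shipping: {same_topic:.2f}")
print(f"shipping vs stock:    {other_topic:.2f}")
```

In this toy setup the two shipping messages score higher than the shipping and stock pair, even though their sentiment differs. That is the behaviour of a baseline embedding that cares about topic more than sentiment.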
Consider a case where we have five customer messages. Visualizing their embeddings can look like this:
Such an embedding captures semantic similarity – so for example, messages about shipping are close to each other on the left.
If we want to build the best sentiment classifier, however, then we need our embedding model to care about sentiment more than it cares about semantic similarity.
If we colour the points depending on their sentiment, it could look like this:
Successfully training a representation model on customer sentiment leads to a model which embeds sentences in this fashion:
After this training, the embeddings of positive comments are similar to each other and distinct from those of negative comments. This leads to better sentiment classification results.
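One way to see why well-separated embeddings help is to put a very simple classifier on top of them. The sketch below uses a nearest-centroid rule, chosen here purely for illustration (it is not necessarily what the Classify endpoint does), on made-up two-dimensional embeddings that stand in for the output of a trained model.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid(vector, centroids):
    """Return the label whose centroid is closest to the vector."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Made-up embeddings, grouped by sentiment as a trained model might produce.
trained_embeddings = {
    "positive": [[0.9, 0.8], [0.8, 0.9]],
    "negative": [[-0.9, -0.8], [-0.8, -0.7]],
}
centroids = {label: centroid(vs) for label, vs in trained_embeddings.items()}

print(nearest_centroid([0.7, 0.6], centroids))
print(nearest_centroid([-0.5, -0.9], centroids))
```

When the two sentiment groups sit far apart, even this basic rule assigns new points to the right class; when they overlap, as they can with baseline embeddings, the same rule makes more mistakes.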
Tips to improve embedding/training quality
There are several things to take into account to achieve the best trained embeddings:
- Text cleaning: Improving the quality of the data is often the best investment in problem solving with machine learning. If the text contains, for example, symbols, URLs, or HTML code that are not needed for a specific task, make sure to remove them from the training file (and from the text you later send to the trained model).
- Number of examples: The minimum number of labeled examples is 250, though we advise having at least 500 to achieve good training results. The more examples the better.
- Number of examples per class: In addition to the overall number of examples, it's important to have many examples of each class in the dataset.
- Mix of examples in dataset: We recommend that you have a balanced (roughly equal) number of examples per class in your dataset.
- Length of texts: The context size for text is currently 512 tokens. Subsequent tokens are truncated.
- Deduplication: Ensure that each labelled example in your dataset is unique.
- High quality test set: In the data upload step, upload a separate test set of examples that you want to see the model benchmarked on. These can be examples that were manually written or verified.
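Several of these tips can be checked before uploading anything. The following is a rough sketch, assuming the dataset is a list of (text, label) pairs: it strips URLs and HTML tags, removes duplicate examples, and warns when the dataset is below the sizes mentioned above. The regular expressions and the "largest class more than twice the smallest" balance threshold are assumptions chosen for illustration, not platform rules.

```python
import html
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")

def clean_text(text):
    """Remove HTML tags and URLs, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())

def prepare_examples(rows):
    """Clean each text and keep only the first copy of each (text, label)."""
    seen = set()
    cleaned = []
    for text, label in rows:
        text = clean_text(text)
        key = (text, label)
        if text and key not in seen:
            seen.add(key)
            cleaned.append((text, label))
    return cleaned

def check_dataset(rows, minimum=250, recommended=500):
    """Return per-class counts and a list of human-readable warnings."""
    counts = Counter(label for _, label in rows)
    warnings = []
    if len(rows) < minimum:
        warnings.append(f"only {len(rows)} examples; the minimum is {minimum}")
    elif len(rows) < recommended:
        warnings.append(f"{len(rows)} examples; {recommended} or more is advised")
    # Assumed heuristic: flag if the largest class is over 2x the smallest.
    if counts and max(counts.values()) > 2 * min(counts.values()):
        warnings.append("classes look unbalanced")
    return counts, warnings

raw = [
    ("Fast shipping <br> see https://example.com/track", 1),
    ("Fast shipping see", 1),
    ("My order was late", 0),
]
examples = prepare_examples(raw)
print(examples)
print(check_dataset(examples))
```

Whatever cleaning is applied to the training file should also be applied to the text sent to the trained model later, so it makes sense to keep a function like clean_text in one shared place.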
Training a Representation Model: Step-by-step
Training a representation model consists of a few simple steps. Let’s go through them.
On the Cohere dashboard, go to the models page and click on "Create a custom model".
Choose the Embed or Classify Option
Click on the tile that says "Classify" or "Embed".
Both the Classify and Embed options train a custom representation model.
Upload Your Data
Upload your training dataset by going to ‘Training data’ and clicking on the upload file button. Your data should be in CSV format with exactly two columns: the first column contains the examples and the second contains the labels.
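Before uploading, it can help to confirm that every row really has exactly two columns. The sketch below is one way to do that with Python's csv module; the function name find_malformed_rows and the sample data are hypothetical.

```python
import csv
import io

def find_malformed_rows(handle):
    """Return (row_number, row) for rows without exactly two non-empty cells."""
    problems = []
    for row_number, row in enumerate(csv.reader(handle), start=1):
        if len(row) != 2 or not row[0].strip() or not row[1].strip():
            problems.append((row_number, row))
    return problems

# The second row has an unquoted comma in the text, so it parses as 3 columns.
sample = io.StringIO(
    "My order was late,0\n"
    "Great price, great service,1\n"
    '"Shipping was fast!",1\n'
)
malformed = find_malformed_rows(sample)
print(malformed)
```

An unquoted comma inside the text is the most likely cause of a row with too many columns; quoting the text field fixes it.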
Optionally, you can upload a validation dataset. It will not be used during training; instead, it will be used for evaluating the model’s performance after training. To do so, go to ‘Upload validation set (optional)’ and repeat the same steps you just did with the training dataset. If you don’t upload a validation dataset, the platform will automatically set aside a validation dataset from the training dataset.
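If you would rather control the split yourself, a validation set can be held out locally before uploading. This is a minimal sketch of a random hold-out split; the 20% fraction and the fixed seed are assumptions, and the platform's own automatic split may work differently.

```python
import random

def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical dataset of ten (text, label) pairs.
rows = [(f"example {i}", i % 2) for i in range(10)]
train_rows, validation_rows = split_train_validation(rows)
print(len(train_rows), len(validation_rows))
```

Fixing the seed keeps the split reproducible, so that later training runs are evaluated on the same held-out examples.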
At this point, if there are labels with fewer than 5 unique examples, we will remove those labels from your training set.
As shown above, the label 'Area' had fewer than 5 examples so it has been removed from the training set.
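To find out in advance which labels would be removed, the same rule can be approximated locally. The sketch below assumes that "unique" means distinct text strings for a label; the platform's exact definition may differ, and the function name drop_rare_labels is hypothetical.

```python
from collections import defaultdict

def drop_rare_labels(rows, min_unique=5):
    """Drop every label that has fewer than min_unique distinct texts."""
    unique_texts = defaultdict(set)
    for text, label in rows:
        unique_texts[label].add(text)
    kept_labels = {
        label for label, texts in unique_texts.items()
        if len(texts) >= min_unique
    }
    kept = [(text, label) for text, label in rows if label in kept_labels]
    dropped = sorted(set(unique_texts) - kept_labels)
    return kept, dropped

# Hypothetical data: 'Price' has five unique examples, 'Area' only two.
rows = [(f"Price example {i}", "Price") for i in range(5)]
rows += [("Area example 0", "Area"), ("Area example 1", "Area")]
kept, dropped = drop_rare_labels(rows)
print(len(kept), dropped)
```

If a label you care about shows up in the dropped list, add more distinct examples for it before uploading.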
Once done, click on ‘Next’.
Preview Your Data
The preview window will show a few samples of your training dataset, and if you uploaded it, your validation dataset.
Toggle between the tabs 'Training' and 'Validation' to see a sample of your respective datasets.
At the bottom of this page, the distribution of labels in each respective dataset is shown.
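The same kind of label distribution can be computed locally, for example to compare the training and validation sets before uploading. This is a small sketch assuming each dataset is a list of (text, label) pairs; the data shown is made up.

```python
from collections import Counter

def label_distribution(rows):
    """Map each label to (count, share of the dataset)."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}

# Hypothetical dataset with three positive (1) and one negative (0) example.
rows = [
    ("Shipping was fast!", 1),
    ("Order arrived on time", 1),
    ("Great service", 1),
    ("My order was late", 0),
]
distribution = label_distribution(rows)
for label, (count, share) in sorted(distribution.items()):
    print(f"label {label}: {count} examples ({share:.0%})")
```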
If you are happy with how the samples look, click on 'Continue'.
Now, everything is set for training to begin. Click on 'Start training' to proceed.
We can’t wait to see what you start building! Share your projects or find support on our Discord.