Content Moderation
This guide uses the Classify endpoint. You can find more information about the endpoint here.
The Classify endpoint streamlines the task of running text classification. You can deploy different kinds of content moderation use cases according to your needs, all through a single endpoint.
As online communities continue to grow, content moderators need a way to moderate user-generated content at scale. To appreciate the wide-ranging need for content moderation, we can refer to the paper A Unified Typology of Harmful Content by Banko et al. It provides a unified typology of harmful content generated within online communities and a comprehensive list of examples, which can be grouped into four types:
- Hate and Harassment
- Self-Inflicted Harm
- Ideological Harm
- Exploitation
There are publicly available datasets within the content moderation space which you can experiment with, for example:
- Social Media Toxicity dataset from Surge AI
- Wikipedia Comments dataset by Jigsaw/Conversation AI
- Civil Comments dataset by Jigsaw/Conversation AI
- Hate Speech Dataset by Derczynski et al.
A Quick Walkthrough
Here we take a quick look at performing toxicity detection using the Classify endpoint of the Cohere API. In this example, our task is to classify a list of example social media comments as either toxic or benign.
LLMs work by conditioning on examples of what we want their outputs to look like. In our case, we’ll provide a few labeled data points, where each data point contains a comment's text and the associated toxicity label. We then feed the model the inputs we want to classify, and it returns the predicted class each input belongs to.
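Before moving to the Playground, here is a minimal sketch (plain Python, with hypothetical variable names) of what such labeled data points and unlabeled inputs might look like:

```python
# Hypothetical illustration: each labeled example pairs a comment's text with its toxicity label.
labeled_examples = [
    ("yo how are you", "benign"),
    ("I will honestly kill you", "toxic"),
    # ... a few more (text, label) pairs per class
]

# The inputs we want classified are plain strings; the model predicts a label for each.
inputs_to_classify = [
    "this game sucks, you suck",
    "good work team",
]
```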
We’ll use the Cohere Playground, which is an interface that helps you quickly prototype and experiment with LLMs.
First, we choose the model we want to use and enter the labeled examples. The model will work fine with as few as 5 examples per class, but in general, the more data, the better. In this example, we’ll provide 5 examples for each class: toxic and benign.
Here’s a better look at all ten examples:
| Number | Text | Label |
|---|---|---|
| 1 | yo how are you | benign |
| 2 | PUDGE MID! | benign |
| 3 | I WILL REMEMBER THIS FOREVER | benign |
| 4 | I think I saw it first | benign |
| 5 | bring me a potion | benign |
| 6 | I will honestly kill you | toxic |
| 7 | get rekt moron | toxic |
| 8 | go to hell | toxic |
| 9 | f a g o t | toxic |
| 10 | you are hot trash | toxic |
Next, we enter the list of inputs we want to classify and run the classification. Here we have five inputs.
Here’s a better look at all five inputs and outcomes:
| Number | Text | Label (Actual) | Label (Predicted) |
|---|---|---|---|
| 1 | this game sucks, you suck | toxic | toxic |
| 2 | put your neck in a noose | toxic | toxic |
| 3 | buy the black potion | benign | benign |
| 4 | top mia | benign | benign |
| 5 | good work team | benign | benign |
In this small example, the model got all classifications correct. We can then generate the equivalent code to access the Classify endpoint by exporting the code from the Playground.
The following are the corresponding code snippets for the API call in Python, Node.js, Go, and cURL. From here, we can build out the content moderation solution according to our scale and integration needs.
```python
import cohere
from cohere.responses.classify import Example

co = cohere.Client('{apiKey}')

# Classify the five inputs using the ten labeled examples from the walkthrough
response = co.classify(
    inputs=["this game sucks, you suck", "put your neck in a noose", "buy the black potion", "top mia", "good work team"],
    examples=[Example("yo how are you", "benign"), Example("PUDGE MID!", "benign"), Example("I WILL REMEMBER THIS FOREVER", "benign"), Example("I think I saw it first", "benign"), Example("bring me a potion", "benign"), Example("I will honestly kill you", "toxic"), Example("get rekt moron", "toxic"), Example("go to hell", "toxic"), Example("f*a*g*o*t", "toxic"), Example("you are hot trash", "toxic")],
)

print('The confidence levels of the labels are: {}'.format(response.classifications))
```
```javascript
const cohere = require('cohere-ai');
cohere.init('{apiKey}');

(async () => {
  // Classify the five inputs using the ten labeled examples from the walkthrough
  const response = await cohere.classify({
    model: 'small',
    inputs: ["this game sucks, you suck", "put your neck in a noose", "buy the black potion", "top mia", "good work team"],
    examples: [{"text": "yo how are you", "label": "benign"}, {"text": "PUDGE MID!", "label": "benign"}, {"text": "I WILL REMEMBER THIS FOREVER", "label": "benign"}, {"text": "I think I saw it first", "label": "benign"}, {"text": "bring me a potion", "label": "benign"}, {"text": "I will honestly kill you", "label": "toxic"}, {"text": "get rekt moron", "label": "toxic"}, {"text": "go to hell", "label": "toxic"}, {"text": "f*a*g*o*t", "label": "toxic"}, {"text": "you are hot trash", "label": "toxic"}]
  });
  console.log(`The confidence levels of the labels are ${JSON.stringify(response.body.classifications)}`);
})();
```
```go
package main

import (
	"fmt"

	cohere "github.com/cohere-ai/cohere-go"
)

func main() {
	co, err := cohere.CreateClient("{apiKey}")
	if err != nil {
		fmt.Println(err)
		return
	}

	// Classify the five inputs using the ten labeled examples from the walkthrough
	response, err := co.Classify(cohere.ClassifyOptions{
		Model:    "small",
		Inputs:   []string{`this game sucks, you suck`, `put your neck in a noose`, `buy the black potion`, `top mia`, `good work team`},
		Examples: []cohere.Example{{Text: `yo how are you`, Label: `benign`}, {Text: `PUDGE MID!`, Label: `benign`}, {Text: `I WILL REMEMBER THIS FOREVER`, Label: `benign`}, {Text: `I think I saw it first`, Label: `benign`}, {Text: `bring me a potion`, Label: `benign`}, {Text: `I will honestly kill you`, Label: `toxic`}, {Text: `get rekt moron`, Label: `toxic`}, {Text: `go to hell`, Label: `toxic`}, {Text: `f*a*g*o*t`, Label: `toxic`}, {Text: `you are hot trash`, Label: `toxic`}},
	})
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("The confidence levels of the labels are:", response.Classifications)
}
```
```bash
curl --location --request POST 'https://api.cohere.ai/classify' \
  --header 'Authorization: BEARER {apiKey}' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "model": "small",
    "inputs": ["this game sucks, you suck", "put your neck in a noose", "buy the black potion", "top mia", "good work team"],
    "examples": [{"text": "yo how are you", "label": "benign"}, {"text": "PUDGE MID!", "label": "benign"}, {"text": "I WILL REMEMBER THIS FOREVER", "label": "benign"}, {"text": "I think I saw it first", "label": "benign"}, {"text": "bring me a potion", "label": "benign"}, {"text": "I will honestly kill you", "label": "toxic"}, {"text": "get rekt moron", "label": "toxic"}, {"text": "go to hell", "label": "toxic"}, {"text": "f*a*g*o*t", "label": "toxic"}, {"text": "you are hot trash", "label": "toxic"}]
  }'
```
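Whichever language you use, the response contains one classification per input. Taking the Python snippet above as a starting point, a rough sketch of turning those classifications into a moderation decision might look like the following; it assumes each classification object exposes input, prediction, and confidence fields, which may differ across SDK versions.

```python
# Minimal sketch: consuming the response from the Python snippet above.
# Assumes each classification exposes `input`, `prediction`, and `confidence`
# attributes; check your SDK version's documentation for the exact field names.
for c in response.classifications:
    print(f"{c.input!r} -> {c.prediction} (confidence: {c.confidence:.2f})")
    if c.prediction == "toxic":
        # Hypothetical downstream action: flag the comment for human review.
        pass
```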
Next Steps
To get the best classification performance, you will likely need to perform custom training, which is a method for customizing an LLM with your own dataset. This is especially true for a content moderation task, where no two communities are the same and where the nature of the content is always evolving. The model needs to capture the nuances of the content within a given community at a given time, and custom model training is a way to do that.
The Cohere platform lets you train a model using a dataset you provide. Refer to this article for a step-by-step guide.
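The exact dataset format and upload flow are covered in that guide. Purely as an illustration, community-specific labeled data could be assembled into a simple file of text and label pairs (here JSONL, an assumption rather than a required format) before being uploaded for training:

```python
# Illustrative sketch only: collecting community-specific labeled data as JSONL.
# The format actually expected by the Cohere platform may differ; follow the guide above.
import json

moderation_data = [
    {"text": "yo how are you", "label": "benign"},
    {"text": "I will honestly kill you", "label": "toxic"},
    # ... many more examples drawn from your own community's content
]

with open("moderation_dataset.jsonl", "w") as f:
    for row in moderation_data:
        f.write(json.dumps(row) + "\n")
```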
In summary, Cohere’s LLM API empowers developers to build content moderation systems at scale without having to worry about building and deploying machine learning models in-house. In particular, teams can perform text classification tasks via the Classify endpoint. Try it now!