# Text Classification Using Embeddings

> This page shows how to build a text classification model using text embeddings.

This notebook shows how to build a classifier using Cohere's embeddings.

<img alt="first we embed the text in the dataset, then we use that to train a classifier" src="https://github.com/cohere-ai/cohere-developer-experience/raw/main/notebooks/images/simple-classifier-embeddings.png" />

The example classification task here will be sentiment analysis of film reviews. We'll train a simple classifier to detect whether a film review is negative (class 0) or positive (class 1).

We'll go through the following steps:

1. Get the dataset
2. Get the embeddings of the reviews (for both the training set and the test set)
3. Train a classifier using the training set
4. Evaluate the performance of the classifier on the testing set

If you're running an older version of the SDK you'll want to upgrade it, like this:

```python PYTHON
#!pip install --upgrade cohere
```

## 1. Get the dataset

```python PYTHON
import cohere
from sklearn.model_selection import train_test_split

import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
```

```python PYTHON
df.head()
```

<table border="1" class="dataframe fern-table">
  <thead>
    <tr>
      <th />

      <th>
        0
      </th>

      <th>
        1
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th>
        0
      </th>

      <td>
        a stirring , funny and finally transporting re imagining of beauty and
        the beast and 1930s horror films
      </td>

      <td>
        1
      </td>
    </tr>

    <tr>
      <th>
        1
      </th>

      <td>
        apparently reassembled from the cutting room floor of any given
        daytime soap
      </td>

      <td>
        0
      </td>
    </tr>

    <tr>
      <th>
        2
      </th>

      <td>
        they presume their audience wo n't sit still for a sociology lesson ,
        however entertainingly presented , so they trot out the conventional
        science fiction elements of bug eyed monsters and futuristic women in
        skimpy clothes
      </td>

      <td>
        0
      </td>
    </tr>

    <tr>
      <th>
        3
      </th>

      <td>
        this is a visually stunning rumination on love , memory , history and
        the war between art and commerce
      </td>

      <td>
        1
      </td>
    </tr>

    <tr>
      <th>
        4
      </th>

      <td>
        jonathan parker 's bartleby should have been the be all end all of the
        modern office anomie films
      </td>

      <td>
        1
      </td>
    </tr>
  </tbody>
</table>

Since this is a toy example, we'll only use a small subset of the data: we sample 500 reviews, split them into training and test sets, and then keep at most the first 95 examples of each set. You'll want to increase these numbers to get better performance and a more reliable evaluation.

The `train_test_split` function splits arrays or matrices into random train and test subsets.

```python PYTHON
num_examples = 500
# Draw a random subset of the dataset
df_sample = df.sample(num_examples)

# Column 0 holds the review text, column 1 the label (0 = negative, 1 = positive)
sentences_train, sentences_test, labels_train, labels_test = train_test_split(
    list(df_sample[0]), list(df_sample[1]), test_size=0.25, random_state=0
)

# Keep at most 95 examples in each split
sentences_train = sentences_train[:95]
sentences_test = sentences_test[:95]

labels_train = labels_train[:95]
labels_test = labels_test[:95]
```
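To make the split sizes concrete, here is a minimal sketch that uses placeholder data in place of the reviews (the `texts` and `labels` lists are invented for illustration). With 500 examples and `test_size=0.25`, the split yields 375 training and 125 test examples before any truncation. The `stratify` argument is an optional addition, not part of the code above, that keeps the ratio of negative to positive labels roughly the same in both subsets.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 500 sampled reviews (hypothetical).
texts = [f"review {i}" for i in range(500)]
labels = [i % 2 for i in range(500)]  # alternating 0/1 labels

# Same call shape as above; stratify is an optional extra that preserves
# the 0/1 ratio in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

print(len(x_train), len(x_test))  # 375 125
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```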

## 2. Set up the Cohere client and get the embeddings of the reviews

We're now ready to retrieve the embeddings from the API. You'll need your API key for this next cell. [Sign up to Cohere](https://dashboard.cohere.com/) and get one if you haven't yet.

```python PYTHON
model_name = "embed-v4.0"
api_key = ""

input_type = "classification"

co = cohere.Client(api_key)
```
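If you'd rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `COHERE_API_KEY` are assumptions made for this example, not names defined by this page.

```python
import os


def load_api_key(var_name="COHERE_API_KEY", env=None):
    """Return the API key stored in an environment variable.

    `var_name` is an assumed name for this sketch; use whatever name
    you export in your own shell.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage (assumes the variable is set in your shell):
# api_key = load_api_key()
# co = cohere.Client(api_key)
```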

```python PYTHON
embeddings_train = co.embed(
    texts=sentences_train,
    model=model_name,
    input_type=input_type,
).embeddings

embeddings_test = co.embed(
    texts=sentences_test,
    model=model_name,
    input_type=input_type,
).embeddings
```

Note that the arguments are passed by keyword (`texts=`, `model=`, `input_type=`), so their order does not matter. Ordering only becomes significant if you pass them positionally, so it's safest to keep the keywords.

We now have two sets of embeddings: `embeddings_train` contains the embeddings of the training sentences, while `embeddings_test` contains the embeddings of the testing sentences.
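When you scale up beyond this toy example, a single request may no longer be able to carry every text, so the calls need to be batched. The sketch below shows one way to do that. The helper `embed_in_batches` is hypothetical, the default `batch_size` of 96 is an assumed per-request limit that you should confirm in the API reference for your model, and `fake_embed` is a stand-in so the example runs without an API key.

```python
def embed_in_batches(embed_fn, texts, batch_size=96):
    """Embed `texts` in chunks and return one flat list of vectors.

    embed_fn: any callable mapping a list of strings to a list of vectors.
    batch_size: assumed per-request limit; confirm it for your model.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        # Results are appended in order, so vectors[i] matches texts[i].
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors


# Stand-in embedder so the sketch runs without an API key.
def fake_embed(batch):
    return [[float(len(text)), 1.0] for text in batch]


demo_texts = [f"review number {i}" for i in range(250)]
demo_vectors = embed_in_batches(fake_embed, demo_texts)
print(len(demo_vectors))  # 250
```

With the real client, `embed_fn` could be a small wrapper such as `lambda batch: co.embed(texts=batch, model=model_name, input_type=input_type).embeddings`, mirroring the call used in this notebook.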

Curious what an embedding looks like? We can print it:

```python PYTHON
print(f"Review text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0][:10]}")
```

```
Review text: the script was reportedly rewritten a dozen times either 11 times too many or else too few
Embedding vector: [1.1531117, -0.8543223, -1.2496399, -0.28317127, -0.75870246, 0.5373464, 0.63233083, 0.5766576, 1.8336298, 0.44203663]
```
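Before training, it can be worth confirming that the embeddings have the shape the classifier expects: one fixed-length vector per review, with as many vectors as labels. The following is a minimal sketch of such a check. The helper name `check_embeddings` and the toy vectors are invented for illustration and are not part of the notebook.

```python
import numpy as np


def check_embeddings(embeddings, labels):
    """Return the embeddings as a 2-D float array after basic checks."""
    matrix = np.asarray(embeddings, dtype=float)
    if matrix.ndim != 2:
        raise ValueError("expected one fixed-length vector per text")
    if matrix.shape[0] != len(labels):
        raise ValueError("number of embeddings and labels differ")
    if not np.isfinite(matrix).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return matrix


# Toy vectors standing in for real embeddings (invented values).
toy = check_embeddings([[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]], [1, 0])
print(toy.shape)  # (2, 3)
```

In the notebook itself, the equivalent call would be `check_embeddings(embeddings_train, labels_train)`.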

## 3. Train a classifier using the training set

Now that we have the embeddings, we can train our classifier. We'll use a support vector machine (SVM) from scikit-learn.

```python PYTHON
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

svm_classifier.fit(embeddings_train, labels_train)

```

```
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(class_weight='balanced'))])
```
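To see how a fitted pipeline of this kind is used for prediction without calling the API, the sketch below trains the same `StandardScaler` plus `SVC` combination on synthetic vectors. The two clusters are made-up stand-ins for embeddings of negative and positive reviews, and the name `demo_classifier` is used only here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic "embeddings": two well-separated clusters (not real data).
negative = rng.normal(loc=-2.0, scale=0.5, size=(40, 8))
positive = rng.normal(loc=2.0, scale=0.5, size=(40, 8))
X = np.vstack([negative, positive])
y = np.array([0] * 40 + [1] * 40)

demo_classifier = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
demo_classifier.fit(X, y)

# New vectors are scaled and classified in a single call.
new_vectors = np.array([[-2.0] * 8, [2.0] * 8])
predictions = demo_classifier.predict(new_vectors)
print(predictions)
```

Because the scaler is part of the pipeline, new vectors are standardized with the statistics learned from the training data, so you never scale test data separately.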

## 4. Evaluate the performance of the classifier on the testing set

```python PYTHON
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Test accuracy is {100 * score:.1f}%!")
```

```
Test accuracy is 91.2%!
```

You may get a slightly different number when you run this code, because `df.sample` draws a different random subset of reviews on each run.
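Accuracy alone does not show which kind of mistakes the classifier makes. As a rough illustration, the sketch below computes a confusion matrix plus precision and recall for the positive class using scikit-learn's metrics. The `y_true` and `y_pred` lists are invented toy values, not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels and predictions (invented for illustration).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# Precision and recall for the positive class (label 1).
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the notebook, you would use `labels_test` as `y_true` and `svm_classifier.predict(embeddings_test)` as `y_pred`.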

This was a small-scale example, meant as a proof of concept to illustrate how you can quickly build a custom classifier from a small amount of labelled data and Cohere's embeddings. Increase the number of training examples to achieve better performance on this task.