Text Classification Using Embeddings

This notebook shows how to build a classifier using Cohere’s embeddings.

First we embed the text in the dataset, then we use those embeddings to train a classifier.

The example classification task here is sentiment analysis of film reviews. We’ll train a simple classifier to detect whether a film review is negative (class 0) or positive (class 1).

We’ll go through the following steps:

  1. Get the dataset
  2. Get the embeddings of the reviews (for both the training set and the test set)
  3. Train a classifier using the training set
  4. Evaluate the performance of the classifier on the testing set

If you’re running an older version of the SDK, you’ll want to upgrade it, like this:

PYTHON
#!pip install --upgrade cohere

1. Get the dataset

PYTHON
import cohere
from sklearn.model_selection import train_test_split

import pandas as pd
pd.set_option('display.max_colwidth', None)

# Load the SST2 training file: column 0 holds the review text, column 1 the label
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

PYTHON
df.head()
  | 0 | 1
0 | a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films | 1
1 | apparently reassembled from the cutting room floor of any given daytime soap | 0
2 | they presume their audience wo n’t sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science fiction elements of bug eyed monsters and futuristic women in skimpy clothes | 0
3 | this is a visually stunning rumination on love , memory , history and the war between art and commerce | 1
4 | jonathan parker ‘s bartleby should have been the be all end all of the modern office anomie films | 1

We’ll only use a subset of the dataset in this example. Since this is a toy example, we sample 500 reviews and split them into training and test sets. The code below then trims each split to its first 95 items. You’ll want to increase these numbers to get better performance and a more reliable evaluation.

The train_test_split function splits arrays or matrices into random train and test subsets.

PYTHON
num_examples = 500
df_sample = df.sample(num_examples)

# Split the sampled reviews (column 0) and labels (column 1) into train and test sets
sentences_train, sentences_test, labels_train, labels_test = train_test_split(
    list(df_sample[0]), list(df_sample[1]), test_size=0.25, random_state=0)

# Keep only the first 95 examples of each split
sentences_train = sentences_train[:95]
sentences_test = sentences_test[:95]

labels_train = labels_train[:95]
labels_test = labels_test[:95]
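To see what train_test_split does with test_size=0.25, here is a minimal sketch on a toy dataset. The names toy_sentences and toy_labels are made up for illustration and are not part of the notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 placeholder "reviews" with alternating labels
toy_sentences = [f"review {i}" for i in range(8)]
toy_labels = [i % 2 for i in range(8)]

# test_size=0.25 holds out a quarter of the items; random_state fixes the shuffle
s_train, s_test, l_train, l_test = train_test_split(
    toy_sentences, toy_labels, test_size=0.25, random_state=0
)

print(len(s_train), len(s_test))  # 6 training items, 2 test items
# Sentences and labels are shuffled together, so each pair stays aligned
print(list(zip(s_test, l_test)))
```

Because both lists are passed in the same call, each sentence keeps its own label after shuffling.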

2. Set up the Cohere client and get the embeddings of the reviews

We’re now ready to retrieve the embeddings from the API. You’ll need your API key for this next cell. Sign up to Cohere and get one if you haven’t yet.

PYTHON
model_name = "embed-v4.0"
api_key = ""  # paste your Cohere API key here

input_type = "classification"

co = cohere.Client(api_key)

PYTHON
# Embed the training and test sentences with the same model and input type
embeddings_train = co.embed(texts=sentences_train,
                            model=model_name,
                            input_type=input_type
                            ).embeddings

embeddings_test = co.embed(texts=sentences_test,
                           model=model_name,
                           input_type=input_type
                           ).embeddings

Note that the ordering of the arguments is important. If you put input_type in before model_name, you’ll get an error.

We now have two sets of embeddings: embeddings_train contains the embeddings of the training sentences, while embeddings_test contains the embeddings of the test sentences.
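The code above keeps each split to 95 sentences, so each set goes out in a single request. If you increase the number of examples, one option is to send the texts in batches and concatenate the results. The sketch below is one way to do that. The helper embed_in_batches, the placeholder fake_embed, and the default batch size of 96 are all assumptions made for illustration, so check the API documentation for the actual per-request limit.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 96,  # assumed value, not taken from the notebook
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of texts and join the results in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Placeholder embedder so the sketch runs without an API key."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches([f"review {i}" for i in range(250)], fake_embed)
print(len(vectors))  # one vector per input text
```

With the real client, embed_fn could wrap the same call used above, for example lambda batch: co.embed(texts=batch, model=model_name, input_type=input_type).embeddings.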

Curious what an embedding looks like? We can print it:

PYTHON
# Show the first review and the first 10 dimensions of its embedding
print(f"Review text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0][:10]}")
Review text: the script was reportedly rewritten a dozen times either 11 times too many or else too few
Embedding vector: [1.1531117, -0.8543223, -1.2496399, -0.28317127, -0.75870246, 0.5373464, 0.63233083, 0.5766576, 1.8336298, 0.44203663]

3. Train a classifier using the training set

Now that we have the embeddings, we can train our classifier. We’ll use an SVM from sklearn.

PYTHON
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each embedding dimension, then fit an SVM with balanced class weights
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

svm_classifier.fit(embeddings_train, labels_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('svc', SVC(class_weight='balanced'))])
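If you want to try the classifier stage without an API key, the same pipeline can be fitted on stand-in vectors. This is only a sketch: the "embeddings" below are synthetic Gaussian clusters with a made-up dimension of 16, not real Cohere embeddings, so the resulting accuracy says nothing about the sentiment task.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic placeholders: two well-separated clusters standing in for
# negative (class 0) and positive (class 1) review embeddings
dim = 16
negatives = rng.normal(loc=-1.0, scale=1.0, size=(50, dim))
positives = rng.normal(loc=1.0, scale=1.0, size=(50, dim))

fake_embeddings = np.vstack([negatives, positives])
fake_labels = [0] * 50 + [1] * 50

# Same pipeline as in the notebook
clf = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
clf.fit(fake_embeddings, fake_labels)

predictions = clf.predict(fake_embeddings[:3])  # the first rows are class 0
train_accuracy = clf.score(fake_embeddings, fake_labels)
print(predictions, train_accuracy)
```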

4. Evaluate the performance of the classifier on the testing set

PYTHON
# score() returns the fraction of test reviews classified correctly
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy is {100*score:.1f}%!")
Validation accuracy is 91.2%!

You may get a slightly different number when you run this code.
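The score reported by a scikit-learn classifier is its mean accuracy, that is, the fraction of examples whose predicted label matches the true label. As a small worked example, the sketch below computes that quantity by hand. The accuracy helper and the example labels are made up for illustration.

```python
from typing import Sequence


def accuracy(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: 4 of the 5 predictions match
example_true = [1, 0, 1, 1, 0]
example_pred = [1, 0, 0, 1, 0]
example_accuracy = accuracy(example_true, example_pred)
print(f"{100 * example_accuracy:.1f}%")  # prints 80.0%
```

With the notebook's variables, the same number could be obtained from accuracy(labels_test, list(svm_classifier.predict(embeddings_test))).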

This was a small-scale example, meant as a proof of concept and designed to illustrate how you can build a custom classifier quickly using a small amount of labelled data and Cohere’s embeddings. Increase the number of training examples to achieve better performance on this task.