Datasets
The Cohere platform allows you to upload and manage datasets that can be used for Fine-tuning models or in batch embedding with Embedding Jobs. Datasets can be managed in the Dashboard or programmatically using the Datasets API.
File Size Limits
There are certain limits to the files you can upload, specifically:
- A Dataset can be as large as 1.5GB
- Organizations have up to 10GB of storage across all their users
Retention
You should also be aware of how Cohere handles data retention. The key points are:
- Datasets get deleted 30 days after creation
- You can also manually delete a dataset in the Dashboard UI or using the Datasets API
Managing Datasets using the Python SDK
Getting Set Up
First, install the SDK, then import the dependencies and set up the Cohere client.
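A minimal setup sketch, assuming the SDK is installed from PyPI and your API key is exposed through an environment variable (the variable name is illustrative):

```python
# Install the SDK first: pip install cohere
import os

import cohere

# Instantiate the client with your API key
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])
```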
(All the rest of the examples on this page will be in Python, but you can find more detailed instructions for getting set up by checking out the GitHub repositories for Python, TypeScript, and Go.)
Dataset Creation
Datasets are created by uploading files, specifying both a `name` for the dataset and the `dataset_type`.

The file extension and file contents have to match the requirements for the selected `dataset_type`. See the table below to learn more about the supported dataset types.

The dataset `name` is useful when browsing the datasets you’ve uploaded. In addition to its name, each dataset will also be assigned a unique `id` when it’s created.

Here is an example code snippet illustrating the process of creating a dataset, with both the `name` and the `dataset_type` specified.
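A sketch of that call, assuming a local JSONL file prepared for the `embed-input` dataset type (the file name is illustrative) and that the creation response exposes the new dataset’s `id`:

```python
# Upload a local JSONL file as a new dataset
my_dataset = co.datasets.create(
    name="embed-dataset",              # human-readable name for browsing
    data=open("./embed.jsonl", "rb"),  # file contents must match the dataset_type
    type="embed-input",
)

# Each dataset is assigned a unique id on creation
print(my_dataset.id)
```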
Dataset Validation
Whenever a dataset is created, the data is validated asynchronously against the rules for the specified `dataset_type`. This validation is kicked off automatically on the backend, and must be completed before a dataset can be used with other endpoints.
Here’s a code snippet showing how to check the validation status of a dataset you’ve created.
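A sketch that assumes the dataset created above and that the `co.datasets.get` response nests the dataset details, including its `validation_status`:

```python
# Fetch the dataset and inspect its validation status
response = co.datasets.get(id=my_dataset.id)

# Shows whether asynchronous validation is still running, has passed, or has failed
print(response.dataset.validation_status)
```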
To help you interpret the results, here’s a table specifying all the possible API error messages and what they mean:
Dataset Metadata Preservation
The Dataset API will preserve metadata if specified at time of upload. During the create dataset step, you can specify either `keep_fields` or `optional_fields`, which are lists of strings corresponding to the metadata fields you’d like to preserve. `keep_fields` is more restrictive: if a field is missing from an entry, the dataset will fail validation, whereas `optional_fields` will skip empty fields and validation will still pass.
Sample Dataset Input Format
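Here is an illustrative sample (the values are made up, and the `text` field is assumed to hold the content to embed); `langs` appears only in the first line, while `wiki_id`, `url`, `views`, and `title` appear in both:

```json
{"wiki_id": 101, "url": "https://en.wikipedia.org/wiki?curid=101", "views": 5674.4, "langs": 38, "title": "Example Article A", "text": "First sample paragraph."}
{"wiki_id": 102, "url": "https://en.wikipedia.org/wiki?curid=102", "views": 5409.6, "title": "Example Article B", "text": "Second sample paragraph."}
```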
As seen in the above example, the following would be a valid `create_dataset` call, since `langs` is in the first entry but not in the second entry. The fields `wiki_id`, `url`, `views`, and `title` are present in both JSONs.
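A sketch of that call using the Python SDK’s `co.datasets.create` method, assuming the sample above was saved locally (the file name is illustrative):

```python
# Upload the dataset, preserving selected metadata fields
my_dataset = co.datasets.create(
    name="sample-dataset",
    data=open("./sample.jsonl", "rb"),
    type="embed-input",
    keep_fields=["wiki_id", "url", "views", "title"],  # must be present in every entry
    optional_fields=["langs"],                         # may be missing from some entries
)
```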
Using Datasets for Fine-tuning
Once the dataset passes validation, it can be used to fine-tune a model. To do this properly, you must include at least five training examples per label.
In the example below, we will create a new dataset and upload an evaluation set using the optional `eval_data` parameter. We will then kick off a fine-tuning job using `co.finetuning.create_finetuned_model`.
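A sketch under the assumption that the dataset uses the `chat-finetune-input` type and that the request models (`FinetunedModel`, `Settings`, `BaseModel`) are importable from `cohere.finetuning`; the names and file paths are illustrative:

```python
from cohere.finetuning import BaseModel, FinetunedModel, Settings

# Create a fine-tuning dataset with an optional evaluation split
chat_dataset = co.datasets.create(
    name="chat-dataset-with-eval",
    data=open("./train.jsonl", "rb"),
    eval_data=open("./eval.jsonl", "rb"),  # optional evaluation set
    type="chat-finetune-input",
)

# Once the dataset has passed validation, kick off a fine-tuning job
finetuned_model = co.finetuning.create_finetuned_model(
    request=FinetunedModel(
        name="customer-service-chat-model",
        settings=Settings(
            base_model=BaseModel(base_type="BASE_TYPE_CHAT"),
            dataset_id=chat_dataset.id,
        ),
    ),
)
```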
Dataset Types
When a dataset is created, the `dataset_type` field must be specified in order to indicate the type of tasks this dataset is meant for.

Datasets of type `chat-finetune-input`, for example, are expected to have a JSON with the field `messages` containing a list of messages:
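For example, a single entry might look like the following (the roles and contents are illustrative):

```json
{
  "messages": [
    { "role": "System", "content": "You are a chatbot trained to answer to my every question." },
    { "role": "User", "content": "Hello" },
    { "role": "Chatbot", "content": "Greetings! How can I help you?" }
  ]
}
```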
The following table describes the types of datasets supported by the Dataset API:
Downloading a dataset
Datasets can be fetched using their unique `id`. Note that the dataset `name` and `id` are different from each other; names can be duplicated, while `id`s cannot.
Here is an example code snippet showing how to fetch a dataset by its unique `id`.
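A sketch assuming `co.datasets.get` and a previously saved dataset `id` (the id string is a placeholder):

```python
# Fetch a dataset by its unique id
response = co.datasets.get(id="<YOUR_DATASET_ID>")

# The response carries the dataset details, such as its name and validation status
print(response.dataset)
```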
Deleting a dataset
Datasets are automatically deleted after 30 days, but they can also be deleted manually. Here’s a code snippet showing how to do that:
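A sketch assuming `co.datasets.delete` and a previously saved dataset `id` (the id string is a placeholder):

```python
# Manually delete a dataset by its unique id
co.datasets.delete(id="<YOUR_DATASET_ID>")
```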