Dataset
The Dataset API makes it easy to upload, download, and manage the large datasets that you need for fine-tuning models. You can find more information in the official Dataset API specs.
💡 The examples in this guide refer to the Cohere Python SDK, which comes with some convenience methods, but most of the functionality can be accessed in any language using the HTTP API.

Overview
In this section, we'll cover the Dataset API's file size limits and discuss Cohere's policy on data retention.
File Size Limits
There are certain limits to the files you can upload through the Dataset API, specifically:
- A Dataset can be as large as 1.5GB
- Organizations have up to 10GB of storage across all their users
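Because uploads that exceed these limits are rejected, it can save time to check a file's size locally before creating a dataset. The snippet below is a minimal sketch using only the Python standard library; `check_upload_size` is a hypothetical helper, not part of the SDK.

```python
import os

MAX_DATASET_BYTES = int(1.5 * 1024**3)  # 1.5GB per-dataset limit

def check_upload_size(path: str) -> None:
    """Raise if a file exceeds the 1.5GB per-dataset limit (hypothetical helper)."""
    size = os.path.getsize(path)
    if size > MAX_DATASET_BYTES:
        raise ValueError(
            f"{path} is {size / 1024**3:.2f}GB, which exceeds the 1.5GB dataset limit"
        )

check_upload_size("./shakespeare.csv")
```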
Retention
You should also be aware of how Cohere handles data retention:
- Datasets get deleted 30 days after creation
- You can also manually delete a dataset using `co.delete_dataset`
Dataset Creation
Datasets are created by uploading files, specifying both a `name` for the dataset and the `dataset_type`.
The file extension and file contents have to match the requirements for the selected `dataset_type`. See the table below to learn more about the supported dataset types.
The dataset `name` is useful when browsing the datasets you've uploaded. In addition to its name, each dataset will also be assigned a unique `id` when it's created.
Here is an example code snippet illustrating the process of creating a dataset, with both the `name` and the `dataset_type` specified.
```python
# assumes `co` is an initialized cohere.Client
my_dataset = co.create_dataset(
    name="shakespeare",
    data=open("./shakespeare.csv", "rb"),
    dataset_type="prompt-completion-finetune-input",
)

print(my_dataset.id)
```
Dataset Validation
Whenever a dataset is created, the data is validated asynchronously against the rules for the specified `dataset_type`. This validation is kicked off automatically on the backend, and must be completed before a dataset can be used with other endpoints.
Here's a code snippet showing how to check the validation status of a dataset you've created.
```python
my_dataset.await_validation()  # this will error if validation fails
print(my_dataset.validation_status)
```
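If you'd rather poll than block on `await_validation`, you can re-fetch the dataset by its `id` and inspect `validation_status` periodically. The sketch below assumes validation ends in a terminal status such as "validated" or "failed"; check the Dataset API specs for the exact status values.

```python
import time

dataset_id = my_dataset.id
while True:
    dataset = co.get_dataset(dataset_id)
    print(dataset.validation_status)
    # "validated" and "failed" are assumed terminal values; adjust to the real ones
    if dataset.validation_status in ("validated", "failed"):
        break
    time.sleep(10)
```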
Using Datasets for Fine-tuning
Once the dataset passes validation, it can be used to fine-tune a model. For classification fine-tuning, this means including at least five training examples per label.
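A quick local check of per-label counts can catch violations of this rule before you upload. The sketch below assumes a single-label classification JSONL file with `text` and `label` fields (the schema described in the table further down); the file name is made up.

```python
import json
from collections import Counter

# count labels in a hypothetical single-label classification JSONL file
label_counts = Counter()
with open("./art-classification.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        label_counts[json.loads(line)["label"]] += 1

too_few = {label: n for label, n in label_counts.items() if n < 5}
if too_few:
    print("Labels with fewer than five training examples:", too_few)
```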
In the example below, we will create a new dataset and upload an evaluation set using the optional `eval_data` parameter. We will then kick off a fine-tuning job using `co.create_custom_model`.
```python
# create a dataset
my_dataset = co.create_dataset(
    name="shakespeare",
    dataset_type="prompt-completion-finetune-input",
    data=open("./shakespeare.csv", "rb"),
    eval_data=open("./shakespeare-eval.csv", "rb"),
).await_validation()

# start training a custom model using the dataset
co.create_custom_model(
    name="shakespearean-model",
    model_type="GENERATIVE",
    dataset=my_dataset,
)
```
Dataset Types
When a dataset is created, the `dataset_type` field must be specified in order to indicate the type of tasks this dataset is meant for.
Datasets of type `prompt-completion-finetune-input`, for example, are expected to have entries with the fields `prompt` and `completion`, like so:
```json
{
  "prompt": "Say the word fish",
  "completion": "fish"
},
{
  "prompt": "Count to three",
  "completion": "1, 2, 3"
}
...
```
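To produce a file in this format, you can write one JSON object per line (JSONL). Here's a minimal sketch; the file name and the example pairs are made up, and remember that this dataset type requires at least 32 unique training examples (16 if you also upload evaluation data).

```python
import json

# made-up prompt/completion pairs -- replace with your own data
examples = [
    {"prompt": "Say the word fish", "completion": "fish"},
    {"prompt": "Count to three", "completion": "1, 2, 3"},
]

with open("./my-dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```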
The following table describes the types of datasets supported by the Dataset API:
Dataset Type | Description | Schema | Rules | Task Type | Status | File Types Supported | Are Metadata Fields Supported? | Sample File |
---|---|---|---|---|---|---|---|---|
prompt-completion-finetune-input | Command-style data with a prompt and completion. | prompt:string completion:string | You must include at least 32 unique, utf-8 encoded training examples. Only 16 are required if you've uploaded evaluation data. | Generative Fine-tuning | Supported | csv and jsonl | No | math_solver_total.json |
single-label-classification-finetune-input | A file containing text and a single label (class) for each text | text:string label:string | You must include 40 valid training examples, with five examples per label. A label cannot be present in all examples. There must be 24 valid evaluation examples. | Classification Fine-tuning | Supported | csv and jsonl | No | Art classification file |
multi-label-classification-finetune-input | A file containing text and an array of labels (classes) for each text | text:string label:list[string] | You must include 40 valid training examples, with five examples per label. A label cannot be present in all examples. There must be 24 valid evaluation examples. | Classification Fine-tuning | Supported | csv and jsonl | No | n/a |
reranker-finetune-input | A file containing queries and an array of passages relevant to the query. There must also be "hard negatives": passages that are semantically similar but ultimately not relevant. | query:string relevant_passages:list[string] hard_negatives:list[string] | There must be 256 training examples and at least 64 evaluation examples. There must be at least one relevant passage, with no overlap between relevant passages and hard negatives. | Rerank Fine-tuning | Supported | jsonl | No | train_valid.json |
chat-finetune-input | A file containing conversations | messages:list[Message], where each Message has role:string and content:string | There must be two valid training examples and one valid evaluation example. | Chat Fine-tuning | In progress/not supported | jsonl | No | train_celestial_fox.json |
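As a further illustration of the schemas above, here is what a single `reranker-finetune-input` record might look like when written as a JSONL line. The query and passages below are made up for the sketch, and the file name is hypothetical.

```python
import json

# a made-up reranker-finetune-input record, following the schema in the table above
record = {
    "query": "What is the capital of France?",
    "relevant_passages": ["Paris is the capital and largest city of France."],
    "hard_negatives": [
        "Lyon is a major city in France known for its cuisine.",
        "The capital of Italy is Rome.",
    ],
}

with open("./rerank-train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```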
Downloading a dataset
A dataset can be fetched using its unique `id`. Note that the dataset `name` and `id` are different from each other; names can be duplicated, while `id`s cannot.
Here is an example code snippet showing how to fetch a dataset by its unique `id`.
```python
# fetch the dataset by ID
my_dataset = co.get_dataset("<DATASET_ID>")

# print each entry in the dataset
for record in my_dataset.open():
    print(record)

# save the dataset as jsonl
my_dataset.save_jsonl('./path/to/new/file.jsonl')

# or save the dataset as csv
my_dataset.save_csv('./path/to/new/file.csv')
```
Deleting a dataset
Datasets are automatically deleted after 30 days, but they can also be deleted manually. Here's a code snippet showing how to do that:
```python
co.delete_dataset("<DATASET_ID>")
```