Batch Embedding Jobs
This guide uses the Embed Jobs API.
You can find the API reference for the endpoint here.
The Embed Jobs API is only compatible with our Embed v3.0 models.
In this guide, we show you how to use the Embed Jobs endpoint to asynchronously embed a large volume of texts. This guide uses a simple dataset of Wikipedia pages and their associated metadata to illustrate the endpoint's functionality. To see an end-to-end example of retrieval, check out this notebook.
How to use the Embed Jobs API
The Embed Jobs API was designed for users who want to leverage the power of retrieval over large corpuses of information. Encoding hundreds of thousands of documents (or chunks) via an API can be painful and slow, often resulting in millions of HTTP requests sent between your system and our servers. Because it validates, stages, and optimizes batching for the user, the Embed Jobs API is much better suited for encoding a large number (100K+) of documents. The Embed Jobs API also stores the results in a hosted Dataset, so there is no need to store your embeddings locally.
The Embed Jobs API works in conjunction with the Embed API: in production use cases, Embed Jobs is used to stage large periodic updates to your corpus, while Embed handles real-time queries and smaller real-time updates.
Constructing a Dataset for Embed Jobs
To create a dataset for Embed Jobs, you will need to specify the `embedding_types`, and you need to set `dataset_type` as `embed-input`. The schema of the file looks like: `text:string`.
The Embed Jobs and Dataset APIs respect metadata through two fields: `keep_fields` and `optional_fields`. During the create dataset step, you can specify either `keep_fields` or `optional_fields`, each of which is a list of strings corresponding to the metadata fields you'd like to preserve. `keep_fields` is more restrictive, since validation will fail if the field is missing from an entry, whereas `optional_fields` will skip empty fields and allow validation to pass.
Sample Dataset Input Format
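For illustration, an input along these lines would satisfy the schema; the field values below are placeholders:

```jsonl
{"wiki_id": 1, "url": "https://en.wikipedia.org/wiki?curid=1", "views": 1500.5, "langs": 38, "title": "Example Article A", "text": "Example Article A is the first page in this sample corpus."}
{"wiki_id": 2, "url": "https://en.wikipedia.org/wiki?curid=2", "views": 890.2, "title": "Example Article B", "text": "Example Article B is the second page in this sample corpus."}
```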
As seen in the example above, the following would be a valid `create_dataset` call, since `langs` is in the first entry but not in the second entry. The fields `wiki_id`, `url`, `views` and `title` are present in both JSONs.
Currently, the dataset endpoint will accept `.csv` and `.jsonl` files. In both cases, it is imperative to have either a field called `text` (for `.jsonl`) or a header called `text` (for `.csv`). You can see an example of a valid `.jsonl` file here and a valid `.csv` file here.
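For instance, a minimal `.csv` input only needs a `text` header; any additional metadata columns are optional (the rows below are placeholders):

```csv
text,title,views
"Example Article A is the first page in this sample corpus.","Example Article A",1500.5
"Example Article B is the second page in this sample corpus.","Example Article B",890.2
```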
1. Upload your Dataset
The Embed Jobs API takes dataset IDs as input. Uploading a local file to the Datasets API with `dataset_type="embed-input"` will validate the data for embedding. The input file types we currently support are `.csv` and `.jsonl`. Here's a code snippet of what this looks like:
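The sketch below assumes the Python SDK (method names can differ between SDK versions); the dataset name and file path are placeholders.

```python
import cohere

co = cohere.Client(api_key="<YOUR_API_KEY>")

# Upload a local file; dataset_type="embed-input" tells the Datasets API to
# validate the data for embedding.
input_dataset = co.create_dataset(
    name="embed-job-input",                # placeholder name
    data=open("./your_file.jsonl", "rb"),  # placeholder path; .csv is also accepted
    dataset_type="embed-input",
)

# Wait for validation to finish before kicking off the embed job
# (the exact helper for this may differ by SDK version).
input_dataset.await_validation()
```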
Upon uploading, the Datasets API will acknowledge the file and begin validation; once the dataset has been uploaded and validated, its status will show that it is ready to use.
If your dataset hits a validation error, please refer to the dataset validation errors section on the datasets page to debug the issue.
2. Kick off the Embed Job
Your dataset is now ready to be embedded. Here’s a code snippet illustrating what that looks like:
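A sketch, assuming the same Python SDK conventions as above; the dataset id is a placeholder.

```python
import cohere

co = cohere.Client(api_key="<YOUR_API_KEY>")

# Kick off an asynchronous embed job over the validated input dataset.
embed_job = co.create_embed_job(
    dataset_id="<YOUR_INPUT_DATASET_ID>",  # id returned when the dataset was uploaded
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"],
)

# The job runs asynchronously; poll its status (via the SDK or the dashboard)
# until it completes, then note the id of the output dataset it produces.
print(embed_job)
```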
Since we'd like to search over these embeddings and we can think of them as constituting our knowledge base, we set `input_type='search_document'`.
3. Save or View the Results of your Embed Job
The output of embed jobs is a dataset object which you can download or pipe directly to a database of your choice:
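The sketch below follows the same assumptions as above; the output dataset id is a placeholder, and recent SDKs also provide a built-in save helper.

```python
import json

import cohere

co = cohere.Client(api_key="<YOUR_API_KEY>")

# The embed job writes its results to a new hosted dataset; retrieve it by id.
output_dataset = co.get_dataset("<YOUR_OUTPUT_DATASET_ID>")

# Write the records to a local .jsonl file, one {"text": ..., "embedding": [...]}
# object per line.
with open("./embed_job_output.jsonl", "w") as f:
    for record in output_dataset:
        f.write(json.dumps(record) + "\n")
```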
Alternatively, if you would like to pass the dataset into a downstream function, you can do the following:
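For example, reusing `output_dataset` from the sketch above (`index_documents` is a hypothetical downstream function):

```python
# Collect the records and hand them to a downstream function,
# e.g. an upsert into your vector database.
records = []
for record in output_dataset:
    # Each record follows the output schema: a "text" field plus an "embedding" list,
    # along with any metadata fields you asked the dataset to keep.
    records.append(record)

index_documents(records)  # hypothetical downstream function
```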
Sample Output
The Embed Jobs API will respect the original order of your dataset, and the output will follow the `text: string`, `embedding: list of floats` schema. The length of the embedding list depends on the model you've chosen (e.g. `embed-english-light-v3.0` produces 384-dimensional embeddings, whereas `embed-english-v3.0` produces 1024-dimensional embeddings).
Below is a sample of what the output would look like if you downloaded the dataset as a `.jsonl` file.
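The values below are illustrative and the embedding is truncated for readability:

```jsonl
{"text": "Example Article A is the first page in this sample corpus.", "embedding": [0.018798828, -0.023095703, 0.013549805, ...]}
```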
If you have specified any metadata to be kept, either as `optional_fields` or `keep_fields`, when uploading the dataset, the output of embed jobs will look like this:
Next Steps
Check out our end-to-end notebook on retrieval with Pinecone's serverless offering.