Single Container on Private Clouds

This document explains how to pull Cohere’s container images using a license and provides steps for testing them with both Docker and Kubernetes.

Before starting, ensure you have a license and image tag provided by Cohere.

Pull Container Images with a License

Cohere provides access to container images through a registry authenticated with a license. Users can pull these images and replicate them in their environment, as needed, to avoid runtime network access from inside the cluster.

Images are served from the proxy.replicated.com registry. Pulling them requires firewall access to both proxy.replicated.com and proxy-auth.replicated.com. More information on these endpoints may be found here.
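
Before attempting a pull, it can help to confirm that both endpoints are reachable through the firewall. The following is a minimal sketch, assuming curl is installed. The helper name check_registry_endpoint is hypothetical, and any completed HTTPS exchange is treated as "reachable" regardless of the HTTP status code returned.

```shell
# Hypothetical helper: succeeds if an HTTPS request to the host completes.
# The HTTP status code is ignored; only network/TLS reachability is checked.
check_registry_endpoint() {
    host="$1"
    if curl --silent --show-error --max-time 10 --output /dev/null "https://${host}/"; then
        echo "reachable: ${host}"
    else
        echo "NOT reachable: ${host}" >&2
        return 1
    fi
}

# Example usage (uncomment to run against the real endpoints):
#   check_registry_endpoint proxy.replicated.com
#   check_registry_endpoint proxy-auth.replicated.com
```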

To test pulling images with a license, modify your docker CLI configuration to include authentication details for the registry. Note: docker login will not work.

The docker CLI is only an example; any tool that can pull images with credentials will work, with the license ID configured as both username and password. Skopeo, another popular tool for copying images between registries, also works with this flow.

The following commands will overwrite your existing docker CLI configuration with authentication details for Cohere’s registry. If preferred, you can manually add the authentication details to preserve your existing configuration.

LICENSE_ID="<YOUR LICENSE ID>"
cat <<EOF > ~/.docker/config.json 
{
    "auths": {
        "proxy.replicated.com": {
            "auth": "$(echo -n "${LICENSE_ID}:${LICENSE_ID}" | base64 | tr -d '\n')"
        }
    }
}
EOF
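
If you would rather not overwrite ~/.docker/config.json, one option is to write the credentials to a separate directory and point the docker CLI at it with the DOCKER_CONFIG environment variable. This is a sketch under stated assumptions: the helper name write_cohere_docker_config and the directory used in the usage comment are hypothetical choices, not something Cohere prescribes.

```shell
# Hypothetical helper: write registry credentials to a separate Docker config
# directory so the default ~/.docker/config.json is left untouched.
write_cohere_docker_config() {
    license_id="$1"
    config_dir="$2"
    mkdir -p "$config_dir"
    # The license ID is used as both username and password.
    auth="$(printf '%s:%s' "$license_id" "$license_id" | base64 | tr -d '\n')"
    cat > "$config_dir/config.json" <<EOF
{
    "auths": {
        "proxy.replicated.com": {
            "auth": "${auth}"
        }
    }
}
EOF
}

# Example usage (the directory name is an arbitrary choice):
#   write_cohere_docker_config "<YOUR LICENSE ID>" "$HOME/.docker-cohere"
#   DOCKER_CONFIG="$HOME/.docker-cohere" docker pull "$CUSTOMER_TAG"
```

The docker CLI reads its client configuration from the directory named by DOCKER_CONFIG, so only commands run with that variable set see these credentials.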

Validate that the authenticated image pull works correctly using the docker CLI:

CUSTOMER_TAG="image_tag_from_cohere" # provided by Cohere
docker pull "$CUSTOMER_TAG"

After pulling the image with your license, you can re-tag and replicate it within the scope of your license for air-gapped workflows.
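
Re-tagging and replicating could look like the sketch below. It assumes the image reference begins with a registry host followed by a slash. The helper name internal_image_ref and the registry registry.internal.example.com are placeholders for illustration only.

```shell
# Hypothetical helper: swap the registry host of an image reference for an
# internal registry, keeping the repository path and tag unchanged.
internal_image_ref() {
    source_ref="$1"        # full image reference, starting with the registry host
    internal_registry="$2" # placeholder, e.g. registry.internal.example.com
    echo "${internal_registry}/${source_ref#*/}"
}

# Example usage, assuming CUSTOMER_TAG holds the full image reference from Cohere:
#   TARGET="$(internal_image_ref "$CUSTOMER_TAG" registry.internal.example.com)"
#   docker tag "$CUSTOMER_TAG" "$TARGET"
#   docker push "$TARGET"
```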

Validate Workload Infrastructure

Once you can pull the image from the registry, run a test workload to validate the container’s functionality.

Docker/Containerd

To test the container image with Docker, you need a machine with the following:

NVIDIA drivers installed on the host (minimum version 525).
nvidia-container-toolkit, with the corresponding configuration for docker/containerd.

Example Usage

Different models have different inputs.

Embed models expect an array of texts and return the embeddings as output.
Rerank models expect a list of documents and a query, returning relevance scores for the top n results (the n parameter is configurable).
Command models expect a prompt and return the model response.

This section provides simple examples of using each primary Cohere model in a Docker container. Note that if you try these out and get an error like curl: (7) Failed to connect to localhost port 8080: Connection refused, the container has not yet fully started up. Wait a few more seconds and then try again.

Bash Commands for Running Cohere Models Through Docker

Here are the bash commands you can run to use the Embed v4, Embed Multilingual, Rerank English, Rerank Multilingual, and Command models through Docker.

# IMAGE_TAG is the image tag provided by Cohere
docker run -d --rm --name embed-v4 --gpus=1 --net=host "$IMAGE_TAG"
# wait 5-10 seconds for the container to start
# you can use `curl http://localhost:8080/ping` to check for readiness
curl --header "Content-Type: application/json" --request POST http://localhost:8080/embed --data-raw '{"input_type": "search_query", "texts":["Why are embeddings good"], "embedding_types": ["float"]}'
# {"id":"6d54d453-f2c8-44da-aab8-39e3c11d29d5","texts":["Why are embeddings good"],"embeddings":{"float":[[0.033935547,0.06347656,0.020263672,-0.020507812,0.014160156,0.0038757324,-0.07421875,-0.05859375,...
docker stop embed-v4
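
Rather than waiting a fixed number of seconds, the readiness check can be polled in a loop. The following is a minimal sketch, assuming curl is installed and that a successful HTTP response from /ping means the container is ready. The helper name wait_for_ready and its default values are hypothetical.

```shell
# Hypothetical helper: poll the readiness endpoint until it answers or the
# attempts run out. Assumes a successful HTTP response from /ping means "ready".
wait_for_ready() {
    url="${1:-http://localhost:8080/ping}"
    attempts="${2:-30}"   # maximum number of polls
    delay="${3:-2}"       # seconds to sleep between polls
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl --silent --fail --output /dev/null "$url"; then
            echo "ready after $i retries"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after ${attempts} attempts" >&2
    return 1
}

# Example usage: run after `docker run`, before sending inference requests.
#   wait_for_ready && echo "container is accepting requests"
```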

You’ll note that the final example includes documents that the Command model can use to ground its replies. This functionality falls under retrieval-augmented generation.

Kubernetes

Deploying to Kubernetes requires nodes with the following installed:

NVIDIA drivers (the latest tested version is currently 545).
nvidia-container-toolkit, with the corresponding configuration for docker/containerd.
nvidia-device-plugin to make GPUs available to Kubernetes.

To deploy the same image on Kubernetes, we must first convert the docker configuration into an image pull secret (see the Kubernetes documentation for more detail).

# Use $HOME rather than a quoted ~, which the shell would not expand
kubectl create secret generic cohere-pull-secret \
    --from-file=.dockerconfigjson="$HOME/.docker/config.json" \
    --type=kubernetes.io/dockerconfigjson

With that done, fill in the environment variables and generate the application manifest:

APP=cohere # or any other name you want to use
IMAGE="<IMAGE_TAG_FROM_COHERE>" # replace with the image tag Cohere provided
GPUS="<NUMBER_OF_GPUS>" # replace with the number of GPUs for the target model
cat <<EOF > cohere.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ${APP}
  name: ${APP}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ${APP}
  strategy: {}
  template:
    metadata:
      labels:
        app: ${APP}
    spec:
      imagePullSecrets:
        - name: cohere-pull-secret
      containers:
      - image: ${IMAGE}
        name: ${APP}
        resources:
          limits:
            nvidia.com/gpu: ${GPUS}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: ${APP}
  name: ${APP}
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: ${APP}
  type: ClusterIP
---
EOF

The manifest above does not account for air-gapped environments.

For an air-gapped deployment, change the image reference to point to the registry where you replicated the previously pulled image. Alternatively, to test in an internet-connected environment, create an image pull secret using the license ID as both username and password, as in the earlier docker CLI step. Keep in mind that this requires the firewall rules mentioned in the image pull steps to be open.
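
For the internet-connected case, the pull secret can also be created directly from the license ID with kubectl's docker-registry secret type, without going through a docker configuration file. This is a sketch only. The helper name create_cohere_pull_secret is hypothetical, and the secret name defaults to the cohere-pull-secret name that the manifest references.

```shell
# Hypothetical helper: create an image pull secret directly from the license ID,
# which serves as both username and password for the registry.
create_cohere_pull_secret() {
    license_id="$1"
    secret_name="${2:-cohere-pull-secret}"
    kubectl create secret docker-registry "$secret_name" \
        --docker-server=proxy.replicated.com \
        --docker-username="$license_id" \
        --docker-password="$license_id"
}

# Example usage:
#   create_cohere_pull_secret "<YOUR LICENSE ID>"
```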

Use the following to deploy the containers and run inference requests:

kubectl apply -f cohere.yaml

Be aware that this is a multi-gigabyte image, so it may take some time to download.
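
Because the download can be slow, it may be more convenient to block until the Deployment reports ready than to check the pod repeatedly. A minimal sketch follows. The helper name wait_for_rollout and the 30-minute default timeout are arbitrary assumptions, so adjust them to your environment.

```shell
# Hypothetical helper: block until the Deployment's pods are ready or the
# timeout expires. The deployment name matches the APP variable in the manifest.
wait_for_rollout() {
    app="${1:-cohere}"
    timeout="${2:-30m}"   # arbitrary default; large images may need longer
    kubectl rollout status "deployment/${app}" --timeout="$timeout"
}

# Example usage:
#   wait_for_rollout "$APP" 45m
```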

Once the pod is up and running, forward the service port to your machine. You should see output like the following:

# once the pod is running
kubectl port-forward svc/${APP} 8080:8080
# Forwarding from 127.0.0.1:8080 -> 8080
# Forwarding from [::1]:8080 -> 8080
# Handling connection for 8080

Leave that running in the background, and open a new terminal session to execute a test request. In the next few sections, we’ll include examples of appropriate requests for the major Cohere models.

Example Usage

Here are the bash commands you can run to use the Embed v4, Embed Multilingual, Rerank English, Rerank Multilingual, and Command models through Kubernetes.

curl --header "Content-Type: application/json" --request POST http://localhost:8080/embed --data-raw '{"texts": ["testing embeddings in english"], "input_type": "classification"}'
# {"id":"2ffe4bca-8664-4456-b858-1b3b15411f2c","embeddings":[[-0.5019531,-2.0917969,-1.6220703,-1.2919922,-0.80029297,1.3173828,1.4677734,-1.7763672,0.03869629,1.9033203...}

Remember that this is only an illustrative deployment. Feel free to modify it as needed to accommodate your environment.

A Note on Air-gapped Environments

All images in the proxy.replicated.com registry are available to pull and copy into an air-gapped environment. They can be pulled using the license ID provided by Cohere and the steps described earlier in this document.