Models with vision capabilities can understand and interpret image data, map relationships between text and visual inputs, and handle many other tasks where a mix of images and text is involved.
Cohere has models capable of interacting with images, and they excel in enterprise use cases such as:
For more detailed breakdowns of these and other applications, check out our cookbooks.
These models are designed to work through an interface and API structure that looks almost exactly like all of our other Command models, making it easy to get started with our image-processing functionality. Take this image, for example, which contains a graph of earnings for various waste management companies:

We can have Command A Vision analyze this image for us with the following:
And you should get something like this:

The rest of this document fleshes out Cohere’s models work with image inputs, including information on limitations, token calculations, and more.
Cohere’s Vision capabilities are not currently offered on the North platform.
The Chat API allows users to control the level of image “detail” sent to the model, which can be one of “low”, “high”, or “auto” (the default).
Lower detail helps reduce the overall token count (and therefore price and latency), but may result in poorer performance. We recommend trying both levels of detail to identify whether the performance is sufficient at "low".
The detail property is specified for each image, here’s what that look like:
When detail is set to “low”:
When detail is set to “high”:
When detail is unspecified or is set to “auto”:
high detail will be used, otherwise detail will be set to low.Here’s an example calculation of how an image is processed into tokens:
Cohere supports images in two formats, base64 data URLs and HTTP image URLs.
A base64 data URL (e.g., "data:image/png;base64,...") has the advantage of being usable in deployments that don’t have access to the internet. Here’s what that looks like:
An HTTP image URL (e.g., “https://cohere.com/favicon-32x32.png”) is faster, but requires you to upload your image somewhere and is not available in outside platforms (Azure, Bedrock, etc.) HTTP image URLs make the API easy to try out, as data URLs are long and difficult to deal with. Moreover, including long data URLs in the request increases the request size and the corresponding network latency.
Here’s what that looks like:
For use cases like chatbots, where the images accumulate in the chat history, we recommend you use HTTP/HTTPs image URLs, since the request size will be smaller, and, with server-side caching, will result in faster response times.
The Cohere API has the following limitations with respect to image counts:
These are the supported file types:
.png).jpeg and .jpg).webp).gif)Performance may vary when processing images containing text in non-Latin scripts, like Japanese or Korean characters.
To enhance accuracy, consider enlarging small text in images while ensuring no crucial visual information is lost. If you’re expecting small text in images, set detail='high'.
A good rule of thumb is: ‘if you have trouble reading image in a text, then the model will too.‘
Image inputs don’t change rate limit considerations; for more detail, check out our dedicated rate limit documentation.
To understand how to calculate costs for a model, consult the breakdown above about how tokens are determined by the model, then consult our dedicated pricing page to figure out what your ultimate spend will be.
Please refer to our usage policy.
Prompting for text-generation and models that can work with images is very similar. If you’re having success with a prompt in one of Cohere’s standard language models, it should work for our image models as well.
If you’re working with images that are larger than the model can handle, consider resizing them yourself, as this will have positive impacts on latency, cost, and performance.
Many use cases (such as OCR) work best with Cohere’s structured output capabilities. To learn more about this, consult the structured output guide.
Here are some techniques for optimizing model outputs: