
Chat

POST https://api.cohere.com/v2/chat
import cohere

co = cohere.ClientV2()

response = co.chat(
    model="command-a-plus-05-2026",
    messages=[{"role": "user", "content": "Tell me about LLMs"}],
)

print(response)
{
  "id": "c14c80c3-18eb-4519-9460-6c92edd8cfb4",
  "finish_reason": "COMPLETE",
  "message": {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "LLMs stand for Large Language Models, which are a type of neural network model specialized in processing and generating human language. They are designed to understand and respond to natural language input and have become increasingly popular and valuable in recent years.\n\nLLMs are trained on vast amounts of text data, enabling them to learn patterns, grammar, and semantic meanings present in the language. These models can then be used for various natural language processing tasks, such as text generation, summarization, question answering, machine translation, sentiment analysis, and even some aspects of natural language understanding.\n\nSome well-known examples of LLMs include:\n\n1. GPT-3 (Generative Pre-trained Transformer 3) - An open-source LLM developed by OpenAI, capable of generating human-like text and performing various language tasks.\n\n2. BERT (Bidirectional Encoder Representations from Transformers) - A Google-developed LLM that is particularly good at understanding contextual relationships in text, and is widely used for natural language understanding tasks like sentiment analysis and named entity recognition.\n\n3. T5 (Text-to-Text Transfer Transformer) - Also from Google, T5 is a flexible LLM that frames all language tasks as text-to-text problems, where the model learns to generate output text based on input text prompts.\n\n4. RoBERTa (Robustly Optimized BERT Approach) - A variant of BERT that uses additional training techniques to improve performance.\n\n5. DeBERTa (Decoding-enhanced BERT with disentangled attention) - Another variant of BERT that introduces a new attention mechanism.\n\nLLMs have become increasingly powerful and larger in scale, improving the accuracy and sophistication of language tasks. They are also being used as a foundation for developing various applications, including chatbots, content recommendation systems, language translation services, and more.\nThe future of LLMs holds the potential for even more sophisticated language technologies, with ongoing research and development focused on enhancing their capabilities, improving efficiency, and exploring their applications in various domains."
      }
    ]
  },
  "usage": {
    "billed_units": {
      "input_tokens": 5,
      "output_tokens": 418
    },
    "tokens": {
      "input_tokens": 71,
      "output_tokens": 418
    }
  }
}
Generates a text response to a user message and streams it down, token by token. To learn how to use the Chat API with streaming, follow our [Text Generation guides](https://docs.cohere.com/v2/docs/chat-api). Follow the [Migration Guide](https://docs.cohere.com/v2/docs/migrating-v1-to-v2) for instructions on moving from API v1 to API v2.

Authentication

Authorization: Bearer

Bearer authentication of the form `Bearer <token>`, where `<token>` is your auth token.

Headers

X-Client-Name (string, Optional)
The name of the project that is making the request.
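
The authentication and header rules above can be sketched with the Python standard library alone. This is a minimal sketch, not the SDK's implementation: `build_chat_request`, the placeholder token `YOUR_API_KEY`, the project name `my-project`, and the `Content-Type` header are assumptions added for illustration. The request is only constructed here, never sent.

```python
import json
import urllib.request

API_URL = "https://api.cohere.com/v2/chat"


def build_chat_request(token, model, messages, client_name=None):
    """Build (but do not send) a POST request for /v2/chat. Hypothetical helper."""
    headers = {
        # Bearer authentication of the form "Bearer <token>".
        "Authorization": f"Bearer {token}",
        # Assumption: the JSON body is declared with a standard Content-Type.
        "Content-Type": "application/json",
    }
    if client_name is not None:
        # Optional header naming the project that makes the request.
        headers["X-Client-Name"] = client_name
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        API_URL, data=body.encode("utf-8"), headers=headers, method="POST"
    )


request = build_chat_request(
    token="YOUR_API_KEY",  # placeholder, replace with a real key
    model="command-a-plus-05-2026",
    messages=[{"role": "user", "content": "Tell me about LLMs"}],
    client_name="my-project",  # hypothetical project name
)
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a key or network access.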

Request

This endpoint expects an object.
stream (false, Required)
Defaults to false. When true, the response will be an SSE stream of events. Streaming is beneficial for user interfaces that render the contents of the response piece by piece, as it gets generated.

model (string, Required)
The name of a compatible [Cohere model](https://docs.cohere.com/v2/docs/models).
messages (list of any, Required)
A list of chat messages in chronological order, representing a conversation between the user and the model. Messages can be from `User`, `Assistant`, `Tool` and `System` roles. Learn more about messages and roles in [the Chat API guide](https://docs.cohere.com/v2/docs/chat-api).
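
As an illustration of a chronologically ordered conversation, here is a minimal sketch of a multi-turn `messages` list with a small local check on roles. The lowercase `"user"` and `"assistant"` strings come from the example request and response on this page; the lowercase `"system"` and `"tool"` strings and the `validate_messages` helper are assumptions. Tool messages typically need extra fields that are not shown here.

```python
# Role strings: "user" and "assistant" appear in the examples on this page;
# "system" and "tool" are assumed to follow the same lowercase convention.
ALLOWED_ROLES = {"user", "assistant", "system", "tool"}


def validate_messages(messages):
    """Hypothetical local check: every message must carry a known role."""
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has unknown role {role!r}")
    return messages


# Oldest message first, newest last.
conversation = validate_messages(
    [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Tell me about LLMs"},
        {"role": "assistant", "content": "LLMs are large language models."},
        {"role": "user", "content": "Name one example."},
    ]
)
print(len(conversation), "messages, last role:", conversation[-1]["role"])
```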
tools (list of objects, Optional)
A list of tools (functions) available to the model. The model response may contain `tool_calls` to the specified tools. Learn more in the [Tool Use guide](https://docs.cohere.com/docs/tools).
documents (list of strings or objects, Optional)
A list of relevant documents that the model can cite to generate a more accurate reply. Each document is either a string or a document object with content and metadata.
citation_options (object, Optional)
Options for controlling citation generation.
response_format (object, Optional)
Configuration for forcing the model output to adhere to the specified format. Supported on [Command R](https://docs.cohere.com/v2/docs/command-r), [Command R+](https://docs.cohere.com/v2/docs/command-r-plus) and newer models.

The model can be forced into outputting JSON objects by setting `{ "type": "json_object" }`. A [JSON Schema](https://json-schema.org/) can optionally be provided to ensure a specific structure.

**Note**: When using `{ "type": "json_object" }`, your `message` should always explicitly instruct the model to generate JSON (e.g. _"Generate a JSON ..."_). Otherwise the model may get stuck generating an infinite stream of characters and eventually run out of context length.

**Note**: When `json_schema` is not specified, the generated object can have up to 5 layers of nesting.

**Limitation**: The parameter is not supported when used in combination with the `documents` or `tools` parameters.
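
The notes and limitation above can be turned into a small pre-flight check. This is a sketch under assumptions: `build_json_mode_payload` is a hypothetical helper, the example schema is invented, and placing `json_schema` as a key inside `response_format` is assumed rather than stated on this page.

```python
def build_json_mode_payload(model, prompt, json_schema=None, **extra):
    """Hypothetical helper that applies the JSON-mode rules before sending."""
    # Limitation: response_format cannot be combined with documents or tools.
    if "documents" in extra or "tools" in extra:
        raise ValueError("response_format cannot be combined with documents or tools")
    # Note: the message should explicitly ask the model for JSON.
    if "json" not in prompt.lower():
        raise ValueError("the prompt should explicitly ask for JSON output")

    response_format = {"type": "json_object"}
    if json_schema is not None:
        # Assumption: the schema travels inside response_format under this key.
        response_format["json_schema"] = json_schema

    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": response_format,
        **extra,
    }


# Invented schema, for illustration only.
book_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title"],
}

payload = build_json_mode_payload(
    model="command-a-plus-05-2026",
    prompt="Generate a JSON object describing a classic novel.",
    json_schema=book_schema,
    max_tokens=200,
)
print(payload["response_format"]["type"])
```

Setting `max_tokens` alongside JSON mode is a defensive choice: if the model does drift into an unbounded stream of characters, the cap limits the cost.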
safety_mode (enum, Optional)
Used to select the [safety instruction](https://docs.cohere.com/v2/docs/safety-modes) inserted into the prompt. Defaults to `CONTEXTUAL`. When `OFF` is specified, the safety instruction will be omitted. Safety modes are not yet configurable in combination with the `tools` and `documents` parameters.

**Note**: This parameter is only compatible with newer Cohere models, starting with [Command R 08-2024](https://docs.cohere.com/docs/command-r#august-2024-release) and [Command R+ 08-2024](https://docs.cohere.com/docs/command-r-plus#august-2024-release).

**Note**: `command-r7b-12-2024` and newer models only support `"CONTEXTUAL"` and `"STRICT"` modes.

Allowed values: `CONTEXTUAL`, `STRICT`, `OFF`
max_tokens (integer, Optional)
The maximum number of output tokens the model will generate in the response. If not set, `max_tokens` defaults to the model's maximum output token limit. You can find the maximum output token limits for each model in the [model documentation](https://docs.cohere.com/docs/models).

**Note**: Setting a low value may result in incomplete generations. In such cases, the `finish_reason` field in the response will be set to `"MAX_TOKENS"`.

**Note**: If `max_tokens` is set higher than the model's maximum output token limit, the generation will be capped at that model-specific maximum limit.
stop_sequences (list of strings, Optional)
A list of up to 5 strings that the model will use to stop generation. If the model generates a string that matches any of the strings in the list, it will stop generating tokens and return the generated text up to that point, not including the stop sequence.
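
The stop-sequence behavior described above can be mimicked locally on a finished string. This is only an illustration of the documented semantics (cut at the first match, exclude the match itself); `truncate_at_stop` is a hypothetical helper, and the real service applies the rule during generation rather than afterwards.

```python
def truncate_at_stop(text, stop_sequences):
    """Return text up to, and not including, the earliest stop sequence."""
    if len(stop_sequences) > 5:
        raise ValueError("at most 5 stop sequences are allowed")
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match everywhere
        position = text.find(stop)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]


generated = "Step 1: mix.\nStep 2: bake.\nEND"
print(repr(truncate_at_stop(generated, ["\nStep 2", "END"])))
```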
temperature (double, Optional)
Defaults to 0.3. A non-negative float that tunes the degree of randomness in generation. Lower temperatures mean less random generations, and higher temperatures mean more random generations. Randomness can be further increased by raising the value of the `p` parameter.

seed (integer, Optional, 0-18446744073709552000)
If specified, the backend will make a best effort to sample tokens deterministically, such that repeated requests with the same seed and parameters should return the same result. However, determinism cannot be totally guaranteed.
frequency_penalty (double, Optional)
Defaults to 0.0, min value of 0.0, max value of 1.0. Used to reduce repetitiveness of generated tokens. The higher the value, the stronger a penalty is applied to previously present tokens, proportional to how many times they have already appeared in the prompt or prior generation.

presence_penalty (double, Optional)
Defaults to 0.0, min value of 0.0, max value of 1.0. Used to reduce repetitiveness of generated tokens. Similar to `frequency_penalty`, except that this penalty is applied equally to all tokens that have already appeared, regardless of their exact frequencies.

k (integer, Optional, 0-500)
Defaults to 0, min value of 0, max value of 500. Ensures that only the top `k` most likely tokens are considered for generation at each step. When `k` is set to 0, k-sampling is disabled.

p (double, Optional)
Defaults to 0.75, min value of 0.01, max value of 0.99. Ensures that only the most likely tokens, with total probability mass of `p`, are considered for generation at each step. If both `k` and `p` are enabled, `p` acts after `k`.
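
To make the interaction of `k` and `p` concrete, here is a toy sketch of top-k filtering followed by nucleus (top-p) filtering over a made-up next-token distribution. The function name, the token probabilities, and the choice to renormalize the probability mass after the `k` cut are assumptions; the page only states that `p` acts after `k`, not how the service implements it.

```python
def filter_candidates(probs, k=0, p=0.75):
    """Toy top-k then top-p filter over a {token: probability} mapping."""
    if not 0 <= k <= 500:
        raise ValueError("k must be between 0 and 500")
    if not 0.01 <= p <= 0.99:
        raise ValueError("p must be between 0.01 and 0.99")

    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Step 1: top-k. A value of 0 disables this step.
    if k > 0:
        ranked = ranked[:k]

    # Step 2: top-p over what survived step 1.
    # Assumption: mass is renormalized over the surviving tokens.
    total = sum(prob for _, prob in ranked)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        mass += prob / total
        if mass >= p:
            break
    return kept


# Made-up distribution for illustration.
probs = {"the": 0.5, "a": 0.2, "an": 0.15, "this": 0.1, "that": 0.05}

print(filter_candidates(probs, k=0, p=0.75))  # ['the', 'a', 'an']
print(filter_candidates(probs, k=2, p=0.75))  # ['the', 'a']
```

With `k` disabled, three tokens are needed to reach 75% of the mass. With `k=2`, only two tokens survive the first step, so `p` can choose only among those.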

logprobs (boolean, Optional)
Defaults to false. When set to true, the log probabilities of the generated tokens will be included in the response.

tool_choice (enum, Optional)
Used to control whether or not the model will be forced to use a tool when answering. When `REQUIRED` is specified, the model will be forced to use at least one of the user-defined tools, and the `tools` parameter must be passed in the request. When `NONE` is specified, the model will be forced **not** to use one of the specified tools, and give a direct response. If `tool_choice` isn't specified, then the model is free to choose whether to use the specified tools or not.

**Note**: This parameter is only compatible with models [Command-r7b](https://docs.cohere.com/v2/docs/command-r7b) and newer.

Allowed values: `REQUIRED`, `NONE`
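
The rule that `REQUIRED` needs a `tools` list can be checked before a request is sent. In this sketch, `apply_tool_choice` is a hypothetical helper and the `get_weather` tool definition is invented; its shape (a `function` entry with `name`, `description` and `parameters`) is an assumption, since this page does not spell out the tool object format.

```python
def apply_tool_choice(payload, tool_choice=None):
    """Return a copy of payload with tool_choice set, enforcing the documented rule."""
    if tool_choice is None:
        # Unspecified: the model is free to decide whether to use a tool.
        return dict(payload)
    if tool_choice not in {"REQUIRED", "NONE"}:
        raise ValueError("tool_choice must be 'REQUIRED' or 'NONE'")
    if tool_choice == "REQUIRED" and not payload.get("tools"):
        raise ValueError("tool_choice='REQUIRED' needs a non-empty tools list")
    return {**payload, "tool_choice": tool_choice}


# Invented tool definition; the exact object shape is an assumption.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

base = {
    "model": "command-a-plus-05-2026",
    "messages": [{"role": "user", "content": "What is the weather in Toronto?"}],
    "tools": [weather_tool],
}

forced = apply_tool_choice(base, "REQUIRED")
print(forced["tool_choice"])
```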
thinking (object, Optional)
Configuration for [reasoning features](https://docs.cohere.com/docs/reasoning).
priority (integer, Optional, 0-999)
Defaults to 0. Controls how early the request is handled. Lower numbers indicate higher priority (default: 0, the highest). When the system is under load, higher-priority requests are processed first and are the least likely to be dropped.

strict_tools (boolean, Optional, Beta)
When set to `true`, tool calls in the Assistant message will be forced to follow the tool definition strictly. Learn more in the [Structured Outputs (Tools) guide](https://docs.cohere.com/docs/structured-outputs-json#structured-outputs-tools). **Note**: The first few requests with a new set of tools will take longer to process.

Response

id (string)
Unique identifier for the generated reply. Useful for submitting feedback.
finish_reason (enum)

The reason a chat request has finished.

  • COMPLETE: The model finished sending a complete message.
  • MAX_TOKENS: The number of generated tokens exceeded the model’s context length or the value specified via the max_tokens parameter.
  • STOP_SEQUENCE: One of the provided stop_sequences entries was reached in the model’s generation.
  • TOOL_CALL: The model generated a Tool Call and is expecting a Tool Message in return.
  • ERROR: The generation failed due to an internal error.
  • TIMEOUT: The generation was stopped because it exceeded the allowed time limit.
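
One way to act on the finish reasons listed above is a small dispatch table. This is a sketch: the action names and the `next_action` helper are hypothetical, and the value is upper-cased before lookup so that either spelling of the reason is accepted.

```python
# Hypothetical mapping from finish reason to what a caller might do next.
ACTIONS = {
    "COMPLETE": "use_response",
    "STOP_SEQUENCE": "use_response",
    "MAX_TOKENS": "warn_truncated",
    "TOOL_CALL": "run_tools",
    "ERROR": "retry_or_report",
    "TIMEOUT": "retry_or_report",
}


def next_action(response):
    """Pick a follow-up action from a response dict's finish_reason."""
    reason = str(response.get("finish_reason", "")).upper()
    if reason not in ACTIONS:
        raise ValueError(f"unexpected finish_reason: {reason!r}")
    return ACTIONS[reason]


print(next_action({"finish_reason": "COMPLETE"}))
print(next_action({"finish_reason": "MAX_TOKENS"}))
```

A `warn_truncated` result is the cue to raise `max_tokens` or shorten the prompt, and `run_tools` is the cue to execute the requested tool and send back a Tool Message.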
message (object)
A message from the assistant role can contain text and tool call information.
usage (object)
logprobs (list of objects)
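
The response fields above can be read from the parsed JSON body as plain dictionaries. The sketch below uses a shortened copy of the example response from the top of this page; `extract_text` and `summarize_usage` are hypothetical helpers, and the SDK's own response object may expose these fields differently.

```python
# Shortened copy of the example response shown earlier on this page.
sample_response = {
    "id": "c14c80c3-18eb-4519-9460-6c92edd8cfb4",
    "finish_reason": "COMPLETE",
    "message": {
        "role": "assistant",
        "content": [{"type": "text", "text": "LLMs stand for Large Language Models."}],
    },
    "usage": {
        "billed_units": {"input_tokens": 5, "output_tokens": 418},
        "tokens": {"input_tokens": 71, "output_tokens": 418},
    },
}


def extract_text(response):
    """Join every text part of the assistant message into one string."""
    parts = response["message"].get("content") or []
    return "".join(part["text"] for part in parts if part.get("type") == "text")


def summarize_usage(response):
    """Return billed and raw token counts side by side."""
    usage = response["usage"]
    return {
        "billed_input": usage["billed_units"]["input_tokens"],
        "billed_output": usage["billed_units"]["output_tokens"],
        "raw_input": usage["tokens"]["input_tokens"],
        "raw_output": usage["tokens"]["output_tokens"],
    }


print(extract_text(sample_response))
print(summarize_usage(sample_response))
```

In the example, the billed input count (5) is lower than the raw input count (71), so the two blocks are worth tracking separately.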

Errors

400 Bad Request Error
401 Unauthorized Error
403 Forbidden Error
404 Not Found Error
422 Unprocessable Entity Error
429 Too Many Requests Error
498 Invalid Token Error
499 Client Closed Request Error
500 Internal Server Error
501 Not Implemented Error
503 Service Unavailable Error
504 Gateway Timeout Error
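
A client usually treats some of these status codes as transient and retries them with a growing delay. The sketch below is one possible policy, not something this page prescribes: the set of retryable codes, the helper names, and the backoff parameters are all assumptions.

```python
# Assumption: rate limiting and server-side failures are worth retrying,
# while client-side errors such as 400, 401 or 422 are not.
RETRYABLE_STATUS = {429, 500, 503, 504}


def should_retry(status_code):
    """Return True when a failed request is worth retrying under this policy."""
    return status_code in RETRYABLE_STATUS


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(attempts)]


print(should_retry(429), should_retry(401))
print(backoff_delays(6))
```

Adding random jitter to each delay is a common refinement when many clients may retry at the same moment.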
