Cohere Text Generation Tutorial
Command is Cohere’s flagship LLM, able to generate a response based on a user message or prompt. It is trained to follow user commands and to be instantly useful in practical business applications, like summarization, copywriting, extraction, and question-answering.
Command R and Command R+ are the most recent models in the Command family. They strike the kind of balance between efficiency and high levels of accuracy that enable enterprises to move from proof of concept to production-grade AI applications.
This tutorial uses the Chat endpoint to build an onboarding assistant for new hires at Co1t, a fictional company, and covers:
- Basic text generation
- Prompt engineering
- Parameters for controlling output
- Structured output generation
- Streaming output
Setup
To get started, first we need to install the `cohere` library and create a Cohere client.
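A minimal sketch of that setup, assuming the Python SDK and a placeholder API key, might look like this:

```python
# Install the SDK first (assumed): pip install cohere
import cohere

# Create a client with your API key (placeholder shown)
co = cohere.Client(api_key="YOUR_API_KEY")
```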
Basic text generation
To get started, we just need to pass a single `message` parameter that represents (you guessed it) the user message, after which we use the client we just created to call the Chat endpoint.
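For example (the prompt wording here is illustrative):

```python
# Assumes the `co` client created in the Setup section

# Add the user message
message = "I'm joining a new startup called Co1t today. Could you help me write a short introduction message to my teammates?"

# Generate the response by calling the Chat endpoint
response = co.chat(message=message)
```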
The response we get back contains several objects, but for the sake of simplicity we’ll focus for the moment on the `text` object:
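```python
# The generated text is available on the response's `text` attribute
print(response.text)
```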
Here are some additional resources if you’d like to read further:
- Chat endpoint API reference
- Documentation on Chat fine-tuning
- Documentation on Command R+
- LLM University module on text generation
Prompt engineering
Prompting is at the heart of working with LLMs as it provides context for the text that we want the model to generate. Prompts can be anything from simple instructions to more complex pieces of text, and they are used to steer the model toward producing a specific type of output.
This section examines a couple of prompting techniques, the first of which is adding more specific instructions to the prompt (the more instructions you provide in the prompt, the closer you can get to the response you need).
The limit of how long a prompt can be is dependent on the maximum context length that a model can support (in the case of Command R and Command R+, it’s 128k tokens).
Below, we’ll add one additional instruction to the earlier prompt: the length we need the response to be.
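A sketch of the revised prompt (the exact wording is illustrative):

```python
# Assumes the `co` client created in the Setup section

# Add an instruction about the desired length of the response
message = "I'm joining a new startup called Co1t today. Could you help me write a one-sentence introduction message to my teammates?"

response = co.chat(message=message)
print(response.text)
```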
All our prompts so far use what is called zero-shot prompting, which means that we provide instructions without any examples. But in many cases, it is extremely helpful to provide examples to the model to guide its response. This is called few-shot prompting.
Few-shot prompting is especially useful when we want the model response to follow a particular style or format. Also, it is sometimes hard to explain what you want in an instruction, and easier to show examples.
Below, we want the response to be similar in style and length to the convention established in the examples we provide.
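Here’s a sketch of a few-shot prompt; the example introductions are made up for illustration:

```python
# A few-shot prompt: two example request/message pairs, then the actual request
message = """Write an introduction message in the style and length of the examples.

Request: A new employee in the design team.
Message: Hi everyone, I'm thrilled to join Co1t as a designer and can't wait to create great things with you!

Request: A new employee in the sales team.
Message: Hello team, I'm excited to start at Co1t in sales and look forward to meeting all of you!

Request: A new employee in the engineering team.
Message:"""

response = co.chat(message=message)
print(response.text)
```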
Further reading:
Parameters for controlling output
The Chat endpoint provides developers with an array of options and parameters.
For example, you can choose from several variations of the Command model. Different models offer different trade-offs in areas such as output quality and latency.
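You select a model with the `model` parameter; the model name shown below is one current Command variant and is used here for illustration:

```python
# Choose a specific model variation for this request
response = co.chat(message="Hello", model="command-r-plus")
```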
Often, you’ll need to control the level of randomness of the output. You can control this using a few parameters.
The most commonly used parameter is `temperature`, which is a number used to tune the degree of randomness. You can enter values between 0.0 and 1.0.
A lower temperature gives more predictable outputs, and a higher temperature gives more “creative” outputs.
Here’s an example of setting `temperature` to 0.
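Running the same request a few times makes the behavior easy to see (the prompt is illustrative):

```python
# Temperature 0: repeated calls return very similar (often identical) outputs
message = "Suggest a name for an onboarding assistant bot at Co1t. Respond with the name only."

for _ in range(3):
    response = co.chat(message=message, temperature=0)
    print(response.text)
```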
And here’s an example of setting `temperature` to 1.
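```python
# Temperature 1: repeated calls return noticeably more varied outputs
message = "Suggest a name for an onboarding assistant bot at Co1t. Respond with the name only."

for _ in range(3):
    response = co.chat(message=message, temperature=1)
    print(response.text)
```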
Further reading:
- Available models for the Chat endpoint
- Documentation on predictable outputs
- Documentation on advanced generation parameters
Structured output generation
By adding the `response_format` parameter, you can get the model to generate the output as a JSON object. This lets you structure and organize the model’s responses in a way that can be used in downstream applications.
The `response_format` parameter allows you to specify the schema the JSON object must follow. The call below takes the following parameters:
- `message`: The user message
- `response_format`: The schema of the JSON object
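Here’s a sketch; the schema fields are made up for illustration:

```python
import json

# Ask for a JSON object that follows a simple schema
response = co.chat(
    message="Generate an onboarding task for a new hire at Co1t.",
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "required": ["title", "due_in_days"],
            "properties": {
                "title": {"type": "string"},
                "due_in_days": {"type": "integer"},
            },
        },
    },
)

# The response text is a JSON string that can be parsed downstream
task = json.loads(response.text)
print(task)
```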
Further reading:
Streaming responses
All the previous examples generate responses in a non-streamed manner. This means that the endpoint returns a response object only after the model has generated the text in full.
The Chat endpoint also provides streaming support. In a streamed response, the endpoint returns a response object for each token as it is being generated. This means you can display the text incrementally without having to wait for the full completion.
To activate it, use `co.chat_stream()` instead of `co.chat()`.
In streaming mode, the endpoint will generate a series of objects. To get the actual text contents, we take objects whose `event_type` is `text-generation`.
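Putting it together, a streamed version of the earlier example might look like this:

```python
# Assumes the `co` client created in the Setup section
message = "I'm joining a new startup called Co1t today. Could you help me write a short introduction message to my teammates?"

# Stream the response and print the text chunks as they arrive
for event in co.chat_stream(message=message):
    if event.event_type == "text-generation":
        print(event.text, end="")
```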
Further reading:
Conclusion
In this tutorial, you learned about:
- How to get started with a basic text generation
- How to improve outputs with prompt engineering
- How to control outputs using parameter changes
- How to generate structured outputs
- How to stream text generation outputs
However, so far we have only used the endpoint for direct, single-turn text generation. As its name implies, the Chat endpoint can also support building chatbots, which requires handling multi-turn conversations and maintaining the conversation state.
In Part 3, you’ll learn how to build chatbots with the Chat endpoint.