RAG Streaming
Overview
To enable response streaming in RAG, use the chat_stream endpoint instead of chat.
This lets your application receive the response as a stream of tokens while the model is still generating it.
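The practical difference between the two modes can be sketched without network access. In the minimal sketch below, the hypothetical generator `fake_chat_stream` stands in for the chat_stream endpoint and yields plain text fragments. A real stream yields event objects rather than strings, so this only illustrates incremental consumption, not the actual API.

```python
from typing import Iterator


def fake_chat_stream(answer: str, chunk_size: int = 4) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint: yields the answer piece by piece."""
    for i in range(0, len(answer), chunk_size):
        yield answer[i:i + chunk_size]


def consume(stream: Iterator[str]) -> str:
    """Display each piece as soon as it arrives, then return the assembled text."""
    parts = []
    for piece in stream:
        print(piece, end="", flush=True)  # render incrementally instead of waiting for the end
        parts.append(piece)
    print()
    return "".join(parts)


full_text = consume(fake_chat_stream("Emperor penguins are the tallest penguins."))
```

With the non-streaming chat endpoint, the application would only see the equivalent of `full_text` once generation had finished.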
Event stream
In RAG, the events streamed by the endpoint follow the structure of a basic chat stream, with additional events for tool calling and response generation and their associated contents. This section describes the stream of events and their contents.
Event types
message-start
Same as in a basic chat stream event.
content-start
Same as in a basic chat stream event.
content-delta
Same as in a basic chat stream event.
citation-start
Emitted for every citation generated in the response. This event contains the details of a citation: the start and end indices of the text that cites one or more sources, the corresponding text, and the list of sources.
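The relationship between the start and end indices and the cited text can be checked directly. The sketch below assumes the indices are character offsets into the full response text with an exclusive end, so that slicing the response with them returns the citation's text. The `Citation` class, its field types, and the `doc_0` source ID are hypothetical illustrations, not the SDK's types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    """Hypothetical container for the fields carried by a citation-start event."""
    start: int  # assumed: character offset where the cited span begins
    end: int    # assumed: character offset just past the cited span (exclusive)
    text: str   # the cited span itself
    sources: List[str] = field(default_factory=list)  # hypothetical IDs of supporting documents


def span_matches(response_text: str, citation: Citation) -> bool:
    """Check that the citation's indices point at exactly its quoted text."""
    return response_text[citation.start:citation.end] == citation.text


response_text = "The tallest penguins are emperor penguins."
example = Citation(start=25, end=41, text="emperor penguins", sources=["doc_0"])
print(span_matches(response_text, example))
```

A check like this can be useful when highlighting cited spans in a UI, since a mismatch indicates the offsets are being interpreted differently than assumed.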
citation-end
Emitted to indicate the end of a citation. If multiple citations are generated, the events arrive as a sequence of citation-start and citation-end pairs.
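Because citations arrive as start/end pairs, a client can verify that the pairs are balanced before relying on them. The minimal sketch below works on a list of event type names only and assumes pairs are never nested or interleaved, which is one reading of "a sequence of citation-start and citation-end pairs"; the function name is hypothetical.

```python
from typing import Iterable


def citations_well_paired(event_types: Iterable[str]) -> bool:
    """Return True if every citation-start is closed by a citation-end before the next one opens."""
    open_citation = False
    for event_type in event_types:
        if event_type == "citation-start":
            if open_citation:
                return False  # a new citation began before the previous one ended
            open_citation = True
        elif event_type == "citation-end":
            if not open_citation:
                return False  # a citation-end with no matching citation-start
            open_citation = False
    return not open_citation  # no citation may be left open when the stream ends


stream = ["content-delta", "citation-start", "citation-end",
          "citation-start", "citation-end", "content-end"]
print(citations_well_paired(stream))
```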
content-end
Same as in a basic chat stream event.
message-end
Same as in a basic chat stream event.
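Since each event is identified by its type, a client typically routes events to handlers by type and ignores the ones it does not need. The sketch below shows that dispatch pattern over simulated events represented as plain dictionaries; the dictionary keys (`type`, `text`, `start`, `end`) and the handler names are hypothetical and do not reflect the SDK's object layout.

```python
from typing import Callable, Dict, Iterable, List

Event = dict  # hypothetical: simulated events are plain dictionaries


def make_dispatcher(log: List[str]) -> Dict[str, Callable[[Event], None]]:
    """Map the event types we care about to handlers that record what they saw."""
    return {
        "message-start": lambda e: log.append("start"),
        "content-delta": lambda e: log.append(e["text"]),
        "citation-start": lambda e: log.append(f"[cite {e['start']}:{e['end']}]"),
        "message-end": lambda e: log.append("end"),
    }


def handle(events: Iterable[Event], handlers: Dict[str, Callable[[Event], None]]) -> None:
    for event in events:
        handler = handlers.get(event["type"])
        if handler is not None:  # types without a handler (e.g. content-start) are skipped
            handler(event)


log: List[str] = []
simulated = [
    {"type": "message-start"},
    {"type": "content-start"},
    {"type": "content-delta", "text": "Hi"},
    {"type": "citation-start", "start": 0, "end": 2},
    {"type": "citation-end"},
    {"type": "content-end"},
    {"type": "message-end"},
]
handle(simulated, make_dispatcher(log))
print(log)
```

A dispatch table keeps the loop unchanged when new event types are handled later; only the mapping grows.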
Example stream
The following is an example stream with RAG.
Usage example
This section provides an example of handling streamed objects in the tool use response generation step.
Setup
First, import the Cohere library and create a client.
Define documents
Next, define the documents to be passed to the endpoint.
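A minimal sketch of the documents, assuming the v2 request shape in which each document wraps its fields under a `data` key; check the API reference for the SDK version you use. The titles and snippets are hypothetical, and the rule that every field value must be a string is an assumption made by the hypothetical `validate_documents` helper, not a documented constraint.

```python
from typing import Any, Dict, List

# Hypothetical documents; the {"data": {...}} wrapper follows the assumed v2 request shape.
documents: List[Dict[str, Any]] = [
    {"data": {"title": "Tall penguins", "snippet": "Emperor penguins are the tallest."}},
    {"data": {"title": "Penguin habitats", "snippet": "Emperor penguins only live in Antarctica."}},
]


def validate_documents(docs: List[Dict[str, Any]]) -> None:
    """Fail early on malformed documents before sending them to the endpoint."""
    for i, doc in enumerate(docs):
        data = doc.get("data")
        if not isinstance(data, dict) or not data:
            raise ValueError(f"document {i} has no data fields")
        for key, value in data.items():
            if not isinstance(value, str):  # assumption: string-valued fields only
                raise ValueError(f"document {i} field {key!r} is not a string")


validate_documents(documents)
```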
Streaming the response
We can now stream the response using the chat_stream endpoint.
The events are streamed as chunk objects. In the example below, we pick content-delta to display the text response and citation-start to display the citations.
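The handling logic can be tried offline with fake chunk objects. This sketch assumes the SDK exposes a content-delta's text at `chunk.delta.message.content.text` and a citation-start's citations at `chunk.delta.message.citations`, as in Cohere's v2 streaming examples; verify those attribute paths against your SDK version. The `make_chunk` helper and the `SimpleNamespace` stand-ins are hypothetical; with a real client you would iterate over the object returned by chat_stream instead of `fake_stream`.

```python
from types import SimpleNamespace as NS
from typing import Iterable, List, Tuple


def make_chunk(event_type: str, **message_fields) -> NS:
    """Hypothetical fake chunk shaped like chunk.delta.message.<field> (assumed SDK layout)."""
    return NS(type=event_type, delta=NS(message=NS(**message_fields)))


fake_stream = [
    make_chunk("message-start"),
    make_chunk("content-start"),
    make_chunk("content-delta", content=NS(text="Emperor penguins ")),
    make_chunk("content-delta", content=NS(text="are the tallest.")),
    make_chunk("citation-start",
               citations=NS(start=0, end=16, text="Emperor penguins", sources=["doc_0"])),
    make_chunk("citation-end"),
    make_chunk("content-end"),
    make_chunk("message-end"),
]


def render(stream: Iterable[NS]) -> Tuple[str, List[NS]]:
    """Print text as it streams in, collect citations, and return both."""
    text_parts, citations = [], []
    for chunk in stream:
        if chunk.type == "content-delta":
            piece = chunk.delta.message.content.text
            print(piece, end="", flush=True)
            text_parts.append(piece)
        elif chunk.type == "citation-start":
            citations.append(chunk.delta.message.citations)
    print()
    for citation in citations:  # show citations once the text is complete
        print(citation)
    return "".join(text_parts), citations


text, citations = render(fake_stream)
```

Collecting citations and printing them after the text mirrors the approach described above: text is shown as it arrives, while citations are gathered from citation-start events.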
Example response: