Retrieval Augmented Generation (RAG)

What is Retrieval Augmented Generation?

Through retrieval augmented generation (RAG), models are able to utilize additional information to generate more accurate replies.

RAG is a feature of our co.chat() endpoint, and can be used very simply by slightly tweaking the arguments you pass to the endpoint (code snippets below).

What Does RAG Make Possible?

Glossing over all the tricky technical details, what a large language model is doing when it generates an output mostly boils down to predicting the next-most-likely token, given the series of tokens that it has already seen.

Although the results of this process can often be astonishing, large language models are also well-known to hallucinate factually incorrect, nonsensical, or incomplete information in their replies, which can be problematic for certain use cases.

RAG substantially reduces this problem by giving the model source material to work with. Rather than simply generating an output based on the input prompt, the model can pull information out of this material and incorporate it into its reply.

This is comparable to the difference between casually asking a friend what they happen to know about penguins, versus giving them a stack of books on penguin science and (politely!) asking them to cite their sources.

As things stand, this source material can come from one of two places:

  • The user can directly provide context-rich documents to ground replies.
  • The user can specify the location of the documents (this mode operates via ‘connectors’, which we will cover in more detail below).

Importantly, if the model finds pertinent information with which to ground its output, it will include citations telling you where it got the information; if it finds nothing of relevance, no citations are added to the output. These citations give the user the opportunity to assess the veracity of responses for themselves – either by tracing information contained in the reply back to its source in a document, or by reading the source material directly to ensure that it actually does support the reply.

It’s worth underscoring this last part. RAG does not guarantee accuracy. It involves giving a model context which informs its replies, but if those documents are themselves out-of-date, inaccurate, or biased, whatever the model generates might be as well. What’s more, RAG doesn’t guarantee that a model won’t hallucinate. It reduces the problem, but doesn’t necessarily eliminate it altogether.

RAG can also be configured such that the model simply returns suggested search queries, rather than performing any actual searches itself.

How Can I Use RAG?

RAG exists as part of our co.chat() endpoint. The big difference between RAG and basic chat functionality is that we’ll be including a few special arguments in the JSON payload.
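For context, here’s a minimal sketch of what a plain chat call looks like with the Python SDK before any RAG arguments are added (the client setup here is an assumption on our part; substitute your own API key, and note that details may vary by SDK version):

import cohere

co = cohere.Client("YOUR_API_KEY")  # assumed setup; use your own API key

# A basic chat call with no RAG arguments -- the model answers from its training data alone.
response = co.chat(message="Where do the tallest penguins live?")
print(response.text)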

Conceptually, there are three modes that RAG can be in:

  • Document mode (in which the user’s message provides information for the model to use)
  • Query-generation mode (in which the user gets the model-generated query as output)
  • Connector mode (in which the user’s message points to external information for the model to use)

We’ll cover all three in the following sections.

Document Mode

The first mode we’ll discuss is “document mode”. Document mode involves users providing the model with their own documents directly in the message, which it can use to ground its replies.

A diagram of RAG's document mode

Here’s an example:

{
  "message": "Where do the tallest penguins live?",
  "documents": [
    {
      "title": "Tall penguins",
      "snippet": "Emperor penguins are the tallest."
    },
    {
      "title": "Penguin habitats",
      "snippet": "Emperor penguins only live in Antarctica."
    },
    {
      "title": "What are animals?",
      "snippet": "Animals are different from plants."
    }
  ],
  "prompt_truncation": "AUTO"
}

And here’s what the output looks like:

{  
    "response_id": "ea9eaeb0-073c-42f4-9251-9ecef5b189ef",  
    "text": "The tallest penguins, Emperor penguins, live in Antarctica.",  
    "generation_id": "1b5565da-733e-4c14-9ff5-88d18a26da96",  
    "token_count": {  
        "prompt_tokens": 445,  
        "response_tokens": 13,  
        "total_tokens": 458,  
        "billed_tokens": 20  
    },  
    "meta": {  
        "api_version": {  
            "version": "2022-12-06"  
        }  
    },  
    "citations": [  
        {  
            "start": 22,  
            "end": 38,  
            "text": "Emperor penguins",  
            "document_ids": [  
                "doc_0"  
            ]  
        },  
        {  
            "start": 48,  
            "end": 59,  
            "text": "Antarctica.",  
            "document_ids": [  
                "doc_1"  
            ]  
        }  
    ],  
    "documents": [  
        {  
            "id": "doc_0",  
            "title": "Tall penguins",  
            "snippet": "Emperor penguins are the tallest.",  
            "url": ""  
        },  
        {  
            "id": "doc_1",  
            "title": "Penguin habitats",  
            "snippet": "Emperor penguins only live in Antarctica.",  
            "url": ""  
        }  
    ],  
    "search_queries": []  
}

(NOTE: we include examples throughout this document for illustrative purposes, but you should be aware that our models are under active development, and your results may be different.)

If you’ve worked with the chat endpoint before, most of this will be familiar to you, but there are a few things worth noting.

To begin with, observe that the payload includes a list of documents with a “snippet” field containing the information we want the model to use. The recommended length for the snippet of each document is relatively short, 300 words or less. We recommend using field names similar to the ones we’ve included in this example (i.e. “title” and “snippet”), but RAG is quite flexible with respect to how you structure the documents. You can give the fields any names you want, and can pass in other fields as well, such as a “date” field. All field names and field values are passed to the model.

Next, we can clearly see that it has utilized the document. Our first document says that Emperor penguins are the tallest penguin species, and our second says that Emperor penguins can only be found in Antarctica. The model’s reply successfully synthesizes both of these facts: "The tallest penguins, Emperor penguins, live in Antarctica."

Finally, note that the output contains a citations object that tells us not only which documents the model relied upon (with the "text" and “document_ids" fields), but also the particular part of the claim supported by a particular document (with the “start” and “end” fields, which are spans that tell us the location of the supported claim inside the reply). This citation object is included because the model was able to use the documents provided, but if it hadn’t been able to do so, no citation object would be present.
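For convenience, here’s a minimal sketch of the same document-mode request made through the Python SDK (assuming a cohere.Client instance named co, as in the earlier snippet; the attributes on the response object mirror the JSON fields shown above, though exact names may vary by SDK version):

# Document mode: pass the grounding documents directly in the call.
response = co.chat(
    message="Where do the tallest penguins live?",
    documents=[
        {"title": "Tall penguins", "snippet": "Emperor penguins are the tallest."},
        {"title": "Penguin habitats", "snippet": "Emperor penguins only live in Antarctica."},
        {"title": "What are animals?", "snippet": "Animals are different from plants."},
    ],
    prompt_truncation="AUTO",
)

print(response.text)

# Each citation records which span of the reply is supported by which document(s).
for citation in response.citations or []:
    print(citation)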

You can experiment with RAG in the chat playground. Here’s a screenshot of what that looks like:

A basic screenshot of the RAG UI.

If you have any further questions, e.g. about which fields are required and which are optional, we refer you to the API specs.

Query-generation Mode

There might also be situations in which you’d rather get the search queries the model recommends for finding the information, instead of its actual reply.

A diagram of RAG's query-generation mode.

This is done by setting the search_queries_only parameter to true (it’s false by default).

Here’s an example:

{  
  "message": "What are the tallest penguins?",  
  "search_queries_only": true  
}

And here’s the output:

{  
    "response_id": "25c15f38-c563-44f1-88f3-638224ffb75d",  
    "meta": {  
        "api_version": {  
            "version": "2022-12-06"  
        }  
    },  
    "is_search_required": true,  
    "search_queries": [  
        {  
            "text": "tallest penguins",  
            "generation_id": "06265f3a-1a6b-47d6-99a4-1efcc1ad3d3b"  
        }  
    ]  
}

In query-generation mode, the model responds with a “search_queries” field that suggests a query that you can input into a search engine or another natural language system to find useful information. The model generates its own query in both query-generation and connector mode, but in query-generation mode, that query is the output.

We include this feature to increase the modularity of our offering. We have users, for example, who chain together a series of API queries. In other words, they might 1) use Cohere to generate search queries, and 2) either run the queries themselves, or run them through a different natural-language system. We also have users who make a call to co.chat() in query-generation mode, then make a second call to co.chat() to send in the documents obtained from an external search.

These kinds of workflows are easier to construct if there’s a way of just getting the search queries generated by our models, which is what query-generation mode is for.
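As a sketch of that kind of chained workflow (reusing the assumed co client from earlier, plus a hypothetical run_my_search() helper standing in for whatever external search system you use; it is not part of the SDK):

# Step 1: ask the model only for suggested search queries.
query_response = co.chat(
    message="What are the tallest penguins?",
    search_queries_only=True,
)

# Step 2: run each suggested query through your own search system (hypothetical helper),
# shaping the results into the {"title": ..., "snippet": ...} format the documents parameter expects.
documents = []
for query in query_response.search_queries:
    for result in run_my_search(query.text):  # run_my_search() is a placeholder, not part of the SDK
        documents.append({"title": result["title"], "snippet": result["text"]})

# Step 3: make a second co.chat() call, grounding the reply on the retrieved documents.
final_response = co.chat(
    message="What are the tallest penguins?",
    documents=documents,
    prompt_truncation="AUTO",
)
print(final_response.text)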

You can also use query-generation mode to generate multiple queries. If you send in a message like “what are the tallest penguins, and what is the tallest cat”, the model will generate two separate queries:

{  
    "response_id": "f0640f9d-f373-465d-8a33-c5a4492fa143",  
    "meta": {  
        "api_version": {  
            "version": "2022-12-06"  
        }  
    },  
    "is_search_required": true,  
    "search_queries": [  
        {  
            "text": "tallest penguins",  
            "generation_id": "fcf6a711-95f7-475d-8e8b-c2fa82e04aed"  
        },  
        {  
            "text": "tallest cat",  
            "generation_id": "fcf6a711-95f7-475d-8e8b-c2fa82e04aed"  
        }  
    ]  
}

The “search_queries” object now contains two queries, “tallest penguins” and “tallest cat”.

You can see query generations in the chat playground. Here’s a screenshot of what that looks like:

A screenshot of query generation in the RAG UI.

Connector Mode

Finally, if you want to point the model at the sources it should use, rather than providing the documents yourself, you can do that through connector mode.

A diagram of RAG's connector mode.

Here’s an example:

{  
  "message": "What are the tallest living penguins?",  
  "connectors": [{"id": "web-search"}],  
  "prompt_truncation":"AUTO"  
}

And here’s what the output looks like:

{  
    "response_id": "a29d7080-11e5-43f6-bbb6-9bc3c187eed7",  
    "text": "The tallest living penguin species is the emperor penguin, which can reach a height of 100 cm (39 in) and weigh between 22 and 45 kg (49 to 99 lb).",  
    "generation_id": "1c60cb38-f92f-4054-b37d-566601de7e2e",  
    "token_count": {  
        "prompt_tokens": 1257,  
        "response_tokens": 38,  
        "total_tokens": 1295,  
        "billed_tokens": 44  
    },  
    "meta": {  
        "api_version": {  
            "version": "2022-12-06"  
        }  
    },  
    "citations": [  
        {  
            "start": 42,  
            "end": 57,  
            "text": "emperor penguin",  
            "document_ids": [  
                "web-search_1",  
                "web-search_8"  
            ]  
        },  
        {  
            "start": 87,  
            "end": 101,  
            "text": "100 cm (39 in)",  
            "document_ids": [  
                "web-search_1"  
            ]  
        },  
        {  
            "start": 120,  
            "end": 146,  
            "text": "22 and 45 kg (49 to 99 lb)",  
            "document_ids": [  
                "web-search_1",  
                "web-search_8"  
            ]  
        }  
    ],  
    "documents": [  
        {  
            "id": "web-search_1",  
            "title": "Emperor penguin - Wikipedia",  
            "snippet": "The emperor penguin (Aptenodytes forsteri) is the tallest and heaviest of all living penguin species and is endemic to Antarctica. The male and female are similar in plumage and size, reaching 100 cm (39 in) in length and weighing from 22 to 45 kg (49 to 99 lb).",  
            "url": "https://en.wikipedia.org/wiki/Emperor_penguin"  
        },  
        {  
            "id": "web-search_8",  
            "title": "The largest penguin that ever lived",  
            "snippet": "They concluded that the largest flipper bones belong to a penguin that tipped the scales at an astounding 154 kg. In comparison, emperor penguins, the tallest and heaviest of all living penguins, typically weigh between 22 and 45 kg.",  
            "url": "https://www.cam.ac.uk/stories/giant-penguin"  
        },  
        {  
            "id": "web-search_1",  
            "title": "Emperor penguin - Wikipedia",  
            "snippet": "The emperor penguin (Aptenodytes forsteri) is the tallest and heaviest of all living penguin species and is endemic to Antarctica. The male and female are similar in plumage and size, reaching 100 cm (39 in) in length and weighing from 22 to 45 kg (49 to 99 lb).",  
            "url": "https://en.wikipedia.org/wiki/Emperor_penguin"  
        },  
        {  
            "id": "web-search_1",  
            "title": "Emperor penguin - Wikipedia",  
            "snippet": "The emperor penguin (Aptenodytes forsteri) is the tallest and heaviest of all living penguin species and is endemic to Antarctica. The male and female are similar in plumage and size, reaching 100 cm (39 in) in length and weighing from 22 to 45 kg (49 to 99 lb).",  
            "url": "https://en.wikipedia.org/wiki/Emperor_penguin"  
        },  
        {  
            "id": "web-search_8",  
            "title": "The largest penguin that ever lived",  
            "snippet": "They concluded that the largest flipper bones belong to a penguin that tipped the scales at an astounding 154 kg. In comparison, emperor penguins, the tallest and heaviest of all living penguins, typically weigh between 22 and 45 kg.",  
            "url": "https://www.cam.ac.uk/stories/giant-penguin"  
        }  
    ],  
    "search_results": [  
        {  
            "search_query": {  
                "text": "tallest living penguins",  
                "generation_id": "12eda337-f096-404f-9ba9-905076304934"  
            },  
            "document_ids": [  
                "web-search_0",  
                "web-search_1",  
                "web-search_2",  
                "web-search_3",  
                "web-search_4",  
                "web-search_5",  
                "web-search_6",  
                "web-search_7",  
                "web-search_8",  
                "web-search_9"  
            ],  
            "connector": {  
                "id": "web-search"  
            }  
        }  
    ],  
    "search_queries": [  
        {  
            "text": "tallest living penguins",  
            "generation_id": "12eda337-f096-404f-9ba9-905076304934"  
        }  
    ]  
}

(NOTE: In this example, we’ve modified the query slightly to say “living” penguins, because “What are the tallest penguins?” returns a great deal of information about a long-extinct penguin species that was nearly seven feet tall.)

In connector mode, we tell the model to use the “web-search” connector to find out which penguin species is tallest, rather than passing in source material through the documents parameter. If you’re wondering how this works under the hood, we have more information in the next section.
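For reference, here’s what that connector-mode request looks like as a minimal Python SDK sketch (again assuming a cohere.Client instance named co; depending on your SDK version, connectors may also be passed as typed objects rather than plain dictionaries):

# Connector mode: the model retrieves source material via the "web-search" connector
# instead of receiving documents in the request itself.
response = co.chat(
    message="What are the tallest living penguins?",
    connectors=[{"id": "web-search"}],
    prompt_truncation="AUTO",
)

print(response.text)
print(response.documents)  # the documents the connector retrieved
print(response.citations)  # which spans of the reply each document supports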

You can experiment with this feature in the chat playground. Here is a screenshot of what that looks like:

A screenshot of the RAG UI.

As with document mode, when the chat endpoint generates a response using a connector it will include a citations object in its output. If the model is unable to find anything suitable, however, no such citation object will appear in the output.

A Note on Connectors

We’re working on a top-to-bottom breakdown of our connector system, but it’s worth briefly making a few comments in this document.

Connectors allow Coral users to initiate a search of a third-party application containing textual data, such as the internet, a document database, etc. The application will send relevant information back to Coral, and Coral will use it to generate a grounded response. Right now, the only connector supported by Cohere is the “web-search” connector, and it runs searches against a browser in safe mode.

We’re working on adding additional connectors, and we’re also working on enabling users to either register their own data sources or spin up their own connectors. For the details of that process you’ll need to refer to the larger connectors document when it’s made available.

Best Practices for Using RAG

We often find that it helps to iteratively build up a query by first giving the prompt to the base model, then passing in documents, then utilizing internet search. It looks something like this:

  • Start with a base request (co.chat("What is the tallest penguin?"))
  • Then, pass in documents, with documents=[{"title": "...", "snippet": "..."}, ...].
  • Then, use internet search, with connectors=[{"id": "web-search"}].

Note that these steps are sequential, and you can’t specify a documents argument and a connectors argument at the same time; for now, it’s one or the other.
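Put together, that incremental progression might look something like the following sketch (reusing the assumed co client; note again that documents and connectors can’t be combined in a single call):

# 1. Base request -- no grounding, the model answers from its training data.
base = co.chat(message="What is the tallest penguin?")

# 2. Same question, grounded on documents you supply.
with_docs = co.chat(
    message="What is the tallest penguin?",
    documents=[{"title": "Tall penguins", "snippet": "Emperor penguins are the tallest."}],
    prompt_truncation="AUTO",
)

# 3. Same question again, this time grounded via the web-search connector.
with_search = co.chat(
    message="What is the tallest penguin?",
    connectors=[{"id": "web-search"}],
    prompt_truncation="AUTO",
)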

An Example RAG Use Case

A common use case for the Cohere platform is building an application that allows a company’s own employees to query its data. A global company, for example, may observe different holidays for different countries, and a connector to internal documentation would generate an accurate reply to "what is the next observed holiday for UK employees?"

A Warning About Model-Generated Text

Language models are trained on vast quantities of human-written data, from the internet and other sources, and while in beta may replicate certain problematic features found in this data. Examples include:

  • Graphic content that is violent or erotic in nature, or relates to depression or self-harm.
  • The impression of being human, being sentient, or of having human qualifications related to medicine, law, etc.
  • Discrimination, hate speech, offensive jokes, and content that otherwise contains subtle biases, such as harmful stereotypes or the erasure of minority identity groups.
  • Misinformation or conspiracy theories.
  • Instructions that contain explicit directions to carry out criminal or malicious activities.

There are further risks that do not stem from training data, but are instead properties of all language models today. Examples include:

  • Incorrect facts, from wrong dates to historical events and landmarks that are entirely fabricated. The less well known the thing you’re asking about is, the more likely the model is to be incorrect. This is particularly dangerous for medical or legal questions, which we do not recommend asking.
  • Incorrect citations, which can be either cited material that is fictional or cited material that does exist but does not match the facts generated by the language model.
  • Incorrect math and code.

API

Check out the API spec for more granular information.