Predictable Outputs

There are a handful of model parameters that impact how predictable the model's generated output will be. These include the temperature, top-p, top-k, frequency_penalty, and presence_penalty parameters.

Temperature

Sampling from generation models incorporates randomness, so the same prompt may yield different outputs each time you hit "generate". Temperature is a number used to tune the degree of randomness.

How to pick temperature when sampling

A lower temperature means less randomness; a temperature of 0 will always yield the same output. Lower temperatures (less than 1) are more appropriate when performing tasks that have a "correct" answer, like question answering or summarization. If the model starts repeating itself, this is a sign that the temperature may be too low.

A higher temperature means more randomness and less grounding. This can help the model give more creative outputs, but if you're using retrieval-augmented generation, it can also mean the model doesn't correctly use the context you provide. If the model starts going off topic, giving nonsensical outputs, or failing to ground properly, this is a sign that the temperature is too high.

Adjusting the temperature setting

Temperature can be tuned for different problems, but most people will find that a temperature of 1 is a good starting point.

As sequences get longer, the model naturally becomes more confident in its predictions, so you can raise the temperature much higher for long prompts without going off topic. In contrast, using high temperatures on short prompts can make outputs very unstable.
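
To build intuition for what the temperature setting does, the minimal sketch below divides some made-up logits by the temperature before converting them into probabilities and sampling. It illustrates the general idea only, not the code any particular API runs internally.

import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    # A temperature of 0 degenerates to always picking the most likely token.
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=float) / temperature
    # Softmax over the scaled logits (subtract the max for numerical stability).
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Made-up logits for a tiny four-token vocabulary.
logits = [4.0, 2.5, 1.0, 0.2]
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))  # much more varied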

Top-p and Top-k

The method you use to pick output tokens is an important part of successfully generating text with language models. There are several methods (also called decoding strategies) for picking the output token, with two of the leading ones being top-k sampling and top-p sampling.

Let’s look at an example where the input to the model is the prompt "The name of that country is the":

Example output of a generation language model.

The output token in this case, United, was picked in the last step of processing, after the language model processed the input and calculated a likelihood score for every token in its vocabulary. This score indicates how likely each token is to be the next token in the sentence (based on all the text the model was trained on).

The model calculates a likelihood for each token in its vocabulary. The decoding strategy then picks one as the output.

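To make this concrete, here is a toy version of that likelihood table. The tokens and scores are invented for illustration; a real model assigns a score to every one of the tens of thousands of tokens in its vocabulary.

# Hypothetical next-token likelihoods after "The name of that country is the".
next_token_likelihoods = {
    "United": 0.45,
    "Netherlands": 0.25,
    "Czech": 0.15,
    "U": 0.06,
    "Dominican": 0.05,
    "most": 0.04,
}
# A decoding strategy's job is to turn this table into a single output token.
print(sorted(next_token_likelihoods.items(), key=lambda kv: kv[1], reverse=True))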

1. Pick the top token: greedy decoding

You can see in this example that we picked the token with the highest likelihood, United.

Always picking the highest scoring token is called "Greedy Decoding". It's useful but has some drawbacks.

Greedy decoding is a reasonable strategy, but outputs can get stuck in repetitive loops. Think of the suggestions in your smartphone's auto-suggest: when you continually pick the highest suggested word, the text may devolve into repeated sentences.
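
As a sketch, greedy decoding over a toy likelihood table (values invented for illustration) is just an argmax:

# Greedy decoding: always pick the single highest-scoring token.
next_token_likelihoods = {"United": 0.45, "Netherlands": 0.25, "Czech": 0.15,
                          "U": 0.06, "Dominican": 0.05, "most": 0.04}
greedy_token = max(next_token_likelihoods, key=next_token_likelihoods.get)
print(greedy_token)  # "United", every single time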

2. Pick from amongst the top tokens: top-k

Another commonly used strategy is to sample from a shortlist of the top 3 tokens. This approach gives the other high-scoring tokens a chance of being picked. The randomness introduced by this sampling helps the quality of generation in a lot of scenarios.

Picking the top 3 tokens

Adding some randomness helps make output text more natural. In top-3 decoding, we first shortlist three tokens, then sample one of them by considering their likelihood scores.

More broadly, choosing the top three tokens means setting the top-k parameter to 3. Changing the top-k parameter sets the size of the shortlist the model samples from as it outputs each token. Setting top-k to 1 gives us greedy decoding.

Adjusting the top-k setting.

Note that when k is set to 0, the model disables k sampling and uses p instead.
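
The sketch below shows top-k sampling over the same kind of toy table (values invented for illustration). With k=1 it reduces to the greedy choice above; with k=3 it samples from the three-token shortlist.

import random

def top_k_sample(likelihoods, k):
    # Keep the k highest-scoring tokens, then sample among them
    # in proportion to their likelihood scores.
    shortlist = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*shortlist)
    return random.choices(tokens, weights=weights, k=1)[0]

likelihoods = {"United": 0.45, "Netherlands": 0.25, "Czech": 0.15,
               "U": 0.06, "Dominican": 0.05, "most": 0.04}
print(top_k_sample(likelihoods, k=3))  # one of "United", "Netherlands", "Czech"
print(top_k_sample(likelihoods, k=1))  # greedy decoding: always "United"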

3. Pick from amongst the top tokens whose probabilities add up to 15%: top-p

The difficulty of selecting the best top-k value opens the door for a popular decoding strategy that dynamically sets the size of the shortlist of tokens. This method, called Nucleus Sampling, creates the shortlist by selecting the top tokens whose sum of likelihoods does not exceed a certain value. A toy example with a top-p value of 0.15 could look like this:

In top-p, the size of the shortlist is dynamically selected based on the sum of likelihood scores reaching some threshold.

Top-p is usually set to a high value (like 0.75) to limit the long tail of low-probability tokens that may be sampled. We can use both top-k and top-p together; if both k and p are enabled, p acts after k.
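
Here is a sketch of nucleus sampling over the same toy table, using one common formulation: the shortlist is the smallest set of top tokens whose cumulative likelihood reaches p. The values are invented for illustration.

import random

def top_p_sample(likelihoods, p):
    # Rank tokens by likelihood, then grow the shortlist until the
    # cumulative likelihood reaches the threshold p.
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    shortlist, cumulative = [], 0.0
    for token, score in ranked:
        shortlist.append((token, score))
        cumulative += score
        if cumulative >= p:
            break
    tokens, weights = zip(*shortlist)
    return random.choices(tokens, weights=weights, k=1)[0]

likelihoods = {"United": 0.45, "Netherlands": 0.25, "Czech": 0.15,
               "U": 0.06, "Dominican": 0.05, "most": 0.04}
print(top_p_sample(likelihoods, p=0.5))   # shortlist: "United", "Netherlands"
print(top_p_sample(likelihoods, p=0.95))  # shortlist covers nearly the whole table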

Frequency and Presence Penalties

The final parameters worth discussing in this context are frequency_penalty and presence_penalty, both of which work on the log probabilities of tokens (i.e., the "logits") to influence how often a given token appears in the output.

The frequency penalty penalizes tokens that have already appeared in the preceding text (including the prompt), and scales based on how many times that token has appeared. So a token that has already appeared 10 times gets a higher penalty (which reduces its probability of appearing) than a token that has appeared only once.

The presence penalty, on the other hand, applies the penalty regardless of frequency. As long as the token has appeared once before, it will get penalized.

These settings are useful if you want to get rid of repetition in your outputs.
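
As a rough sketch of how the two penalties differ, the snippet below subtracts a count-scaled frequency penalty and a flat presence penalty from the logits of tokens that have already appeared. The coefficients and exact formula are illustrative; real implementations vary between APIs.

from collections import Counter

def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    # logits: dict mapping each candidate token to its raw score before sampling.
    counts = Counter(previous_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts.get(token, 0)
        # Frequency penalty grows with how often the token has already appeared;
        # presence penalty is a flat deduction once it has appeared at all.
        score -= frequency_penalty * count
        score -= presence_penalty * (1 if count > 0 else 0)
        adjusted[token] = score
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "sat": 1.25}
history = ["the", "cat", "sat", "on", "the", "mat", "the"]  # "the" has appeared 3 times
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.25))
# {'the': 0.25, 'cat': 0.75, 'sat': 0.5} -- repeated tokens become less likely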