Kevin's Sampling Laboratory
Preface
Editing your story for correctness and quality will improve the model's outputs, because the model copies your writing level and style. If the model sees any grammar mistakes, repetitiveness, or poor style in your context, it will copy them too. The model also occasionally makes these mistakes on its own, so you must check its output. For example, if the model repeats a sentence once and you don't remove it, then it will later repeat a second sentence, then gradually repeat more and more, and then you'll ask, "Why is the model constantly repeating itself?" But if you had removed the first repetition, none of the other repetitions would have appeared. Do not ignore errors or poor writing in the model's output; fix them or remove them. Otherwise the model will think you want those errors, so it will produce more of them.
That also works for everything else, like concise text, or descriptive text, or abstract metaphors, or concrete details. You can guide the model to write the way you want, by editing the existing story to follow your preferences. If you think the model follows [some writing style you don't like], then check if you are feeding the model a story of that style. The model is meant to replicate every type of author, so it's not fixed to any particular style. It will follow what you give it.
Basics
Randomness 1, without any other sampler active, is the neutral sampler. It doesn't change the model outputs. Any other sampler should be compared to this. The further a sampler deviates from neutral, the more exotic it is, and the more likely it will cause problems if not correctly designed.
However, in text quality benchmarks, Randomness alone is known to underperform Top-K, which is known to underperform Top-P (Nucleus). So it's worth deviating from the neutral sampler. The reason is that the lowest-probability tokens are low-quality, and the model doesn't know this. So you want a sampler that removes these tokens in the best way.
Samplers will underperform when their high and low probabilities are imbalanced:
- If the high-probability tokens are too emphasized, the output will be simplistic and repetitive.
- If the low-probability tokens are not sufficiently squashed, the output will contain bizarre text and mistakes in grammar and logic.
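The neutral setting and the two failure modes above can be seen with plain temperature scaling. This is a minimal sketch, assuming Randomness is the usual temperature (logits divided by the slider value before softmax); the function names and the logits are made up for illustration and are not NovelAI's code.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def randomness(logits, value):
    """Assumed behavior: temperature scaling, i.e. divide logits by `value`."""
    return softmax([x / value for x in logits])

logits = [2.0, 1.0, 0.0, -3.0]        # made-up model output for four tokens

neutral = randomness(logits, 1.0)     # identical to softmax(logits)
flatter = randomness(logits, 1.5)     # low-probability tokens gain mass
sharper = randomness(logits, 0.7)     # the top token gains mass

for name, probs in [("neutral", neutral), ("flatter", flatter), ("sharper", sharper)]:
    print(name, [round(p, 4) for p in probs])
```

With a value above 1 the weakest token becomes more likely (more mistakes), and with a value below 1 the top token dominates (more repetition), which is the imbalance described in the two bullets.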
Samplers are controlled by the Slider section, which is opened by the icon at the top right. We have documentation that briefly describes each sampler.
Clicking Change Samplers will bring up the list of samplers, where they can be enabled, disabled, and reordered. When you click Send to generate text, the model assigns a probability to each token in the vocabulary. The first sampler in the list is applied to these probabilities, then the second sampler is applied to the changed probabilities, etc. In this way, the samplers are applied one by one from top to bottom. That means the identity, order, and values of the samplers all matter, leading to a paralysis of choices that nobody can understand.
But after removing the bad options, the remaining choices are actually simple.
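To make the top-to-bottom ordering concrete, here is a small sketch showing that the same two samplers give different results in a different order. It assumes textbook definitions of Randomness (temperature) and Top-P; the helper names and the probabilities are hypothetical.

```python
def normalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def randomness(value):
    """Temperature applied to probabilities: p ** (1 / value), renormalized."""
    def sampler(probs):
        return normalize([p ** (1.0 / value) for p in probs])
    return sampler

def top_p(threshold):
    """Keep the most likely tokens until their total reaches `threshold`."""
    def sampler(probs):
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept, running = set(), 0.0
        for i in order:
            kept.add(i)
            running += probs[i]
            if running >= threshold:
                break
        return normalize([p if i in kept else 0.0 for i, p in enumerate(probs)])
    return sampler

def apply_chain(probs, chain):
    """Apply each sampler to the output of the previous one, top to bottom."""
    for sampler in chain:
        probs = sampler(probs)
    return probs

before = [0.5, 0.3, 0.15, 0.05]   # made-up Before probabilities

a = apply_chain(before, [top_p(0.9), randomness(2.0)])
b = apply_chain(before, [randomness(2.0), top_p(0.9)])
print([round(p, 4) for p in a])   # last token removed before flattening
print([round(p, 4) for p in b])   # flattening first lets the last token survive
```

In the first order, Top-P removes the weakest token before Randomness flattens the rest. In the second, Randomness flattens first, so the weakest token is now large enough to pass the same Top-P setting.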
Rule 1: Don't build a large chain of samplers.
The complex samplers (Tail Free Sampling, Typical-P, Mirostat) don't work correctly in a chain. When they receive altered inputs, their supporting theories become invalid, and they do some other thing that doesn't make sense.
The other samplers do not break, but it's rare for the samplers in a mixture to complement each other's strengths and weaknesses. More typically, if one sampler is correct, the other is wrong.
Stick to 1 sampler, or 2 at most if you feel confident that you understand the math. In particular, don't build a chain of samplers step by step, adding one at a time and adjusting it. A stack of samplers is too tall if you cannot qualitatively predict how transferring strength between two sliders would impact mistakes, repetition, and simplicity.
In practice, presets with more samplers are worse, because they aren't tuned properly. This remains true for the preset designers with the most experience.
Rule 2: If you don't understand the math, use a closer-to-neutral sampler.
Our docs page summarizes the samplers in plain English. However, our summaries are not deep enough to understand their tradeoffs. These descriptions are also too optimistic, describing each sampler's hopes and dreams, rather than its deficiencies. It is hard to make decisions with this information.
If you are tuning a sampler, you should read and understand its algorithm, then view its effects using the Token Probabilities viewer (brain icon in the lower left). Check how it transforms the Before probabilities into the After probabilities, and compare that behavior against the properties you would like a sampler to have, described below.
Goals of a sampler
1. A sampler should cull the lowest-probability tokens. Tokens with very low Before probabilities can be catastrophically bad, and removing them improves quality. Allowing more tokens lets through more mistakes, but removing more tokens creates repetition and simplistic language. The sampler balances these issues by estimating the quality distribution and squashing any low-quality tokens whose output likelihood is too high. However, this quality distribution varies for different words and different stories, because it depends on the number of good token choices and the model's accuracy at identifying those good tokens. Differences between samplers correspond to different guesses of the quality distribution's behavior.
2. Some users, but not others, want to flatten input probabilities above 20% to similar output probabilities. This sacrifices faithfulness of the output distribution in exchange for increased word complexity. For example, "a" and "an" frequently appear together as options, and flattening the distribution will improperly raise the less-frequent "an", which is not desired. On the other hand, common tokens are on average simpler than less-frequent tokens, so flattening increases word complexity.
3. Similar Before probabilities should have similar After outputs. For example, if all Before probabilities below 5% have After probability 0, then the Before probability 6% should have an After probability close to 0.
4. Similar input distributions should produce similar output distributions. If the Before probabilities are perturbed slightly, then the After probabilities should also be perturbed only slightly.
5. A sampler should not change the order of tokens. In the Token Probabilities viewer, if A > B in the Before section, then you want A > B in the After section. (This rule only applies to the samplers in the Samplers list. Other transformations like Phrase Repetition Penalty, Repetition Penalty, or Phrase Bias are allowed to change token order, since they have different properties.)
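The last goal (no reordering) is the easiest one to check mechanically. Below is a minimal sketch of such a check, assuming you have copied the Before and After probabilities out of the viewer by hand; `preserves_order` is a hypothetical helper, and the numbers are made up. Tokens that are both cut to 0 tie with each other, which the check accepts.

```python
def preserves_order(before, after):
    """True if no pair of tokens swaps rank between Before and After.

    Ties in After (for example several tokens cut to 0) are allowed.
    """
    n = len(before)
    return all(
        after[i] >= after[j]
        for i in range(n)
        for j in range(n)
        if before[i] > before[j]
    )

before = [0.50, 0.30, 0.15, 0.05]         # made-up Before column

# A cutoff-style sampler: the weakest token is removed, the rest rescaled.
after_cutoff = [0.5 / 0.95, 0.3 / 0.95, 0.15 / 0.95, 0.0]

# A sampler that deletes the top token: the old winner now ranks last.
after_top_removed = [0.0, 0.6, 0.3, 0.1]

print(preserves_order(before, after_cutoff))       # order kept
print(preserves_order(before, after_top_removed))  # order broken
```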
Numbers to consider
Look at the Token Probabilities viewer, and compare the probabilities Before and After. The Token Probabilities viewer shows 10 tokens, which can be expanded to 30 using a secret NovelAI option. You should observe the above behaviors and these properties:
- Large probabilities: if there are two probabilities >25% in the Before column, the ratio of their After probabilities should either be equal to the ratio of their Before probabilities, or closer to 1. The After probabilities might be larger or smaller than the Before probabilities; either is fine.
- Middle probabilities (between 5% and 30%): After should be equal to or larger than Before.
- Small probabilities: Below some small percentage, After should be smaller than Before. Many samplers zero all probabilities below a threshold. The goal is to reduce probabilities when the quality of tokens starts to degrade.
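The large and middle rules of thumb above can be turned into a rough checker. This is only a sketch under stated assumptions: `check_bands` is a hypothetical helper, the thresholds are the ones quoted in the list, the small-probability rule is left out because its threshold depends on the sampler, and the probabilities are made up.

```python
def normalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def check_bands(before, after, tol=1e-9):
    """Return descriptions of any violated large/middle rules of thumb."""
    problems = []
    # Large probabilities (> 25%): the After ratio should equal the Before
    # ratio or move toward 1, never grow.
    large = [i for i, b in enumerate(before) if b > 0.25]
    for i in large:
        for j in large:
            if before[i] > before[j] and after[j] > 0:
                ratio_before = before[i] / before[j]
                ratio_after = after[i] / after[j]
                if not (1 - tol <= ratio_after <= ratio_before + tol):
                    problems.append(f"large: ratio {ratio_before:.2f} -> {ratio_after:.2f}")
    # Middle probabilities (5% to 30%): After should not be below Before.
    for b, a in zip(before, after):
        if 0.05 < b <= 0.30 and a < b - tol:
            problems.append(f"middle: {b:.2f} -> {a:.2f}")
    return problems

before = [0.5, 0.3, 0.16, 0.04]                  # made-up Before column
cutoff = normalize([0.5, 0.3, 0.16, 0.0])        # weakest token removed
sharpened = normalize([p ** 2 for p in before])  # same as temperature 0.5

print(check_bands(before, cutoff))     # no violations
print(check_bands(before, sharpened))  # top ratio grows, middle tokens shrink
```

The sharpened distribution fails both rules, which matches the earlier warning that over-emphasizing high-probability tokens makes the output simplistic and repetitive.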
If you ever see incoherent text with a Before probability above 0.5%, like Unicode symbols or nonsensical word completions like door[horse], then your context has a severe issue that must be fixed. You should make a copy of your story, then remove parts of your story until you figure out what is causing it. This incoherent token probability means all generated text is lower quality than it should be, even if the incoherent token wasn't chosen. You really should fix it instead of ignoring it. Incoherent text is a symptom of the model being "out of distribution", which means it has no idea how to handle its inputs.
Advice for specific samplers
The rest of this document will be somewhat unfair, since the research that created Unified will be used to evaluate the other samplers, so it will be biased toward Unified.
Unified: Disable Randomness, because it is redundant with Linear. Set Linear between 0 and 1 according to how unusual you want tokens to be. Lower numbers will produce more unusual/creative outputs, but you will have to reroll or edit more. As a starting point, set Quad = 1/3 - Linear * 4 / 15, and Conf = -Quad / 2.
For people who want to spend a lot of effort tinkering: Unified's formula is raw value = (input log-probability) * (Linear + entropy * Conf) - (input log-probability)^2 * Quad. Output probabilities are calculated by exponentiating the raw value, and normalizing the sum to 1 (called "softmax"). Negative Linear is suboptimal but allowed; you can type a negative number. It causes a negative probability slope: a token's output probability decreases if its input probability increases. If Linear or Conf is negative, then Quad must be positive or the output will explode with junk.
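The formula and the starting-point rule can be sketched in a few lines. This is an illustration written from the formula as stated in the text, not the actual implementation; it assumes entropy is Shannon entropy in nats computed from the input distribution, and the function names and logits are made up.

```python
import math

def log_softmax(values):
    """Log-probabilities from raw scores, computed stably."""
    m = max(values)
    log_total = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_total for v in values]

def entropy(logprobs):
    """Shannon entropy in nats of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def starting_point(linear):
    """Suggested Quad and Conf for a chosen Linear, per the rule in the text."""
    quad = 1 / 3 - linear * 4 / 15
    conf = -quad / 2
    return quad, conf

def unified(logits, linear, quad, conf):
    """raw = logp * (Linear + entropy * Conf) - logp^2 * Quad, then softmax."""
    logp = log_softmax(logits)
    h = entropy(logp)
    raw = [lp * (linear + h * conf) - lp * lp * quad for lp in logp]
    return [math.exp(v) for v in log_softmax(raw)]

logits = [2.0, 1.0, 0.0, -3.0]                 # made-up logits
before = [math.exp(lp) for lp in log_softmax(logits)]

# Linear 1 with Quad 0 and Conf 0 leaves the distribution unchanged.
neutral = unified(logits, linear=1.0, quad=0.0, conf=0.0)

quad, conf = starting_point(0.5)               # about 0.2 and -0.1
after = unified(logits, 0.5, quad, conf)

print([round(p, 4) for p in before])
print([round(p, 4) for p in after])
```

With these made-up logits, the weakest token is squashed well below its Before probability while the second token gains, and the token order is unchanged.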
Quad can't reduce a token's probability to exactly 0, but it can make it low enough to no longer matter. If you want a true cutoff, then pick either a very high Top-P (Nucleus 0.999) or a very low Min-P (0.0001), and order it after Unified.
Entropy measures how diverse a distribution is: it's approximately the natural logarithm of the number of choices. Thus, in Unified's formula, Conf takes effect when a token has more choices. Conf's theoretical range is between -2 * Quad and 0. Imagine a token B where the model outputs only one high-probability choice, and a token C where the model outputs many low-probability choices. You would pick Conf = -2 * Quad if you believe that each of token C's many choices is as likely to be good as token B's single choice. You would pick Conf = 0 if you believe that C has many choices because the model isn't able to pick a good choice, rather than because there are many good choices. In practice, Min-P is the limit of what makes sense, and it corresponds to Conf = -0.8 * Quad. So keep -0.8 * Quad < Conf < 0.
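A short sketch of the entropy claim, using made-up distributions that stand in for token B and token C. For a uniform distribution over n choices the entropy is exactly ln(n); for a distribution with one dominant choice it is close to 0. The range check at the end is a hypothetical helper that only restates the inequality from the text.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

one_choice = [0.97, 0.01, 0.01, 0.01]   # like token B: one dominant option
many_choices = [0.125] * 8              # like token C: eight equal options

print(entropy(one_choice))                  # small, near a single effective choice
print(entropy(many_choices), math.log(8))   # these two values agree

def conf_in_recommended_range(quad, conf):
    """Check the practical range from the text: -0.8 * Quad < Conf < 0."""
    return -0.8 * quad < conf < 0

print(conf_in_recommended_range(0.2, -0.1))   # the Conf = -Quad / 2 starting point
```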
To set Conf more accurately, you would look at the Token Probabilities viewer, then mark down each token's entropy and the log-probability at which its choices start to degrade in quality. (Log-probabilities are shown by clicking Aa in this viewer.) Then, you'd do linear regression so that Conf * entropy + constant ≈ log-probability you wrote. Unfortunately, the bad tokens are often beyond the 30 shown tokens, so this process fails for high-entropy tokens. To measure entropy, check the log-probabilities of two tokens with a pure Conf 1.0 sampler: entropy = (difference of After log-probs) / (difference of Before log-probs).
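The entropy measurement trick follows directly from Unified's formula. As a sketch, assume "pure Conf 1.0" means Linear = 0, Quad = 0, and Conf = 1, so the raw value is (log-probability) * entropy; the normalization constant then cancels when two After log-probabilities are subtracted. The function names and logits below are made up.

```python
import math

def log_softmax(values):
    m = max(values)
    log_total = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_total for v in values]

def entropy(logprobs):
    """Shannon entropy in nats of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def pure_conf(logprobs, conf=1.0):
    """Unified with Linear = 0 and Quad = 0: raw = logprob * entropy * conf."""
    h = entropy(logprobs)
    return log_softmax([lp * h * conf for lp in logprobs])

def measured_entropy(before_a, before_b, after_a, after_b):
    """Entropy recovered from the log-probs of two tokens, as in the text."""
    return (after_a - after_b) / (before_a - before_b)

before = log_softmax([2.0, 1.0, 0.0, -3.0])   # made-up Before log-probs
after = pure_conf(before)                     # what the viewer would show as After

estimate = measured_entropy(before[0], before[1], after[0], after[1])
print(estimate, entropy(before))              # the two values agree
```

Any two tokens with different Before log-probabilities give the same estimate, so the two most readable entries in the viewer are enough.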
The goal of Unified is to map all logical sampler behavior to an understandable space. I left out some sliders for ease of use, but Unified should still outperform full sampler chains. It is built from statistics and optimization theory.
- Mirostat's goal of perplexity control is questionable. Its method of Top-K plus Zipf is subpar. Its history tracker does not break repetitions because it does not change the slope of the top tokens. Its perplexity estimate will be distorted by high-confidence token sequences, such as idioms and character names. Do not use Mirostat.
- Don't use Top-A. Its cutoff scales too much with the top logit. Replace it with Min-P and multiply the slider value by 1/2 or 1/3; this should be a strict improvement.
- Tail-Free Sampling (TFS) calculates a tail of probabilities, but it is based on poor mathematics. It's also sensitive to logit quantization: insignificant changes in low probabilities cause large changes in the second derivative. If TFS does something useful, Top-P would do it better with Linear X -> Top-P -> Randomness X, for some number X > 1.
- Typical deletes both high- and low-probability tokens, which changes the order of tokens, violating Goal 5. Typical isn't given sufficient information for deleting high-probability tokens to make sense. If you for some reason really want to suppress the top logits, negative Linear is a better solution.
- Top-K is the oldest sampler. It was created when people didn't have experience working with LLMs. It's generally obsolete except for Top-K 1, which can be used for debugging.
- Top-P (Nucleus) was an advance over Top-K. It preserves diversity of the distribution, which is one of the two things you care about. (The other is quality.) Its good performance has been widely benchmarked, and it is the standard all samplers are compared to.
- Min-P is a recent sampler. Although the probability of the top token is an imperfect metric of the token's certainty, Min-P works fine, and it is easy to understand.
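For reference, the cutoff samplers discussed above can be sketched as follows. These are the commonly described definitions (Top-A as a cutoff at value * top probability squared, Min-P as a cutoff at value * top probability), written here as an assumption rather than taken from any particular implementation; the function names and probabilities are illustrative.

```python
def normalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def keep(probs, mask):
    """Zero out tokens whose mask entry is False, then renormalize."""
    return normalize([p if m else 0.0 for p, m in zip(probs, mask)])

def top_k(probs, k):
    cutoff = sorted(probs, reverse=True)[k - 1]
    return keep(probs, [p >= cutoff for p in probs])   # ties at the cutoff stay

def top_p(probs, threshold):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = set(), 0.0
    for i in order:
        kept.add(i)
        running += probs[i]
        if running >= threshold:
            break
    return keep(probs, [i in kept for i in range(len(probs))])

def min_p(probs, value):
    return keep(probs, [p >= value * max(probs) for p in probs])

def top_a(probs, value):
    return keep(probs, [p >= value * max(probs) ** 2 for p in probs])

before = [0.5, 0.3, 0.15, 0.05]          # made-up Before probabilities
print(top_k(before, 2))
print(top_p(before, 0.9))
print(min_p(before, 0.2))

# How the cutoff moves between a confident and an uncertain distribution.
confident = [0.90, 0.06, 0.03, 0.01]
uncertain = [0.30, 0.30, 0.25, 0.15]
print([0.2 * max(d) ** 2 for d in (confident, uncertain)])  # Top-A 0.2 cutoffs
print([0.1 * max(d) for d in (confident, uncertain)])       # Min-P 0.1 cutoffs
```

Between these two distributions the Top-A cutoff changes by a factor of 9, while the Min-P cutoff changes by a factor of 3, which is one way to read the remark that Top-A's cutoff scales too much with the top token.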