Kevin's Sampling Laboratory

Randomness 1, without any other sampler active, is the neutral sampler: it doesn't change the model's outputs. Any other sampler should be compared to this baseline. The further a sampler deviates from neutral, the more exotic it is, and the more likely it is to cause problems if mistuned.

However, in text quality benchmarks, Randomness alone is known to underperform Top K, which is known to underperform Top P (Nucleus). So it's worth deviating from the neutral sampler. The reason is that the lowest-probability tokens are low-quality, and the model doesn't know this. So you want a sampler that removes these tokens in the best way.
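For concreteness, the Randomness slider corresponds to classic temperature sampling: logits are divided by the temperature before softmax. A minimal sketch, assuming Randomness maps directly to temperature (which matches its neutral behavior at 1):

```python
import math

def apply_randomness(logits, temperature):
    """Divide logits by temperature, then softmax back to probabilities.

    Assumes the Randomness slider acts as classic temperature;
    temperature 1.0 leaves the distribution unchanged (the neutral sampler).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs_neutral = apply_randomness([2.0, 1.0, 0.1], 1.0)
probs_flat = apply_randomness([2.0, 1.0, 0.1], 2.0)  # flatter: boosts unusual tokens
```

Higher temperatures flatten the distribution, which is exactly the failure mode described above: the lowest-probability tokens get relatively more weight, and Randomness alone has no way to remove them.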

Samplers will underperform when their balance of high and low probabilities is off:

  1. If the high-probability tokens are too emphasized, the output will fall into repetitive or too-simple text.
  2. If the low-probability tokens are not sufficiently squashed, the output will contain bizarre text and mistakes in grammar and logic.

It’s helpful to edit the AI’s generation for correctness and quality before adding to it, because the model tries to generate text similar to the existing story. That includes any grammar mistakes or repetition: if it sees a mistake, it will produce more mistakes of the same type. If enough mistakes accumulate within a context, the output becomes incoherent. You can prevent this by removing mistakes when they appear. Spelling mistakes won’t occur, so don’t worry about those.

Samplers are controlled by the Slider section, which is opened by the icon at the top right. We have documentation that briefly describes each sampler.

Clicking Change Samplers will bring up the list of samplers, where they can be enabled, disabled, and reordered. When you click Send to generate text, the model assigns a probability to each token in the vocabulary. The first sampler in the list is applied to these probabilities, then the second sampler is applied to the changed probabilities, and so on: the samplers are applied one by one from top to bottom. That means the identity, order, and values of the samplers all matter, leading to a paralysis of choice: far more configurations than anyone can evaluate.
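As a mental model, each sampler is a function from probabilities to probabilities, and the chain is plain function composition. A sketch with two illustrative samplers (simplified versions of Top K and Top P, not NovelAI's exact code):

```python
def top_k(k):
    def sampler(probs):
        # keep the k highest probabilities, zero the rest, renormalize
        cutoff = sorted(probs, reverse=True)[k - 1]
        kept = [p if p >= cutoff else 0.0 for p in probs]
        total = sum(kept)
        return [p / total for p in kept]
    return sampler

def top_p(threshold):
    def sampler(probs):
        # keep the smallest set of top tokens whose probabilities reach the threshold
        order = sorted(range(len(probs)), key=lambda i: -probs[i])
        kept, running = set(), 0.0
        for i in order:
            kept.add(i)
            running += probs[i]
            if running >= threshold:
                break
        out = [p if i in kept else 0.0 for i, p in enumerate(probs)]
        total = sum(out)
        return [p / total for p in out]
    return sampler

def run_chain(probs, samplers):
    # applied top to bottom: each sampler sees the previous sampler's output
    for sampler in samplers:
        probs = sampler(probs)
    return probs

chained = run_chain([0.5, 0.3, 0.15, 0.05], [top_p(0.9), top_k(2)])
```

Note that the second sampler never sees the original probabilities, only the first sampler's output. This is why order matters, and why samplers whose theory assumes an untouched model distribution misbehave in a chain.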

But after removing the bad options, the remaining choices are actually simple.

Rule 1: Don't build a large chain of samplers.

The complex samplers (Tail Free Sampling, Typical P, Mirostat) don't work correctly in a chain. When they receive altered inputs, the theories supporting them become invalid, and they end up doing something else that no longer makes sense.

The other samplers do not break, but it's rare for mixed samplers to complement each other's strengths and weaknesses. More typically, where one sampler is correct, the other is wrong. The exceptions are Top K at the end of some chains, and a very light thresholding sampler after Unified.

Stick to 1 sampler, or 2 at most if you feel confident that you understand the math. In particular, don't build a chain of samplers step by step, adding one at a time and adjusting it.

In practice, presets with more samplers are worse, because they aren't tuned properly. This holds even for the most experienced preset designers.

Rule 2: If you don't understand the math, use a closer-to-neutral sampler.

Our docs page summarizes the samplers in plain English. However, our summaries are not deep enough to convey their tradeoffs. The descriptions are also too optimistic, describing each sampler's hopes and dreams rather than its deficiencies. It is hard to make decisions with this information.

If you are stacking samplers, it's especially important to fully understand each one.

If tuning a sampler, you should read and understand its algorithm, then view its effects using the Token Probabilities viewer (brain icon in the lower left). Compare how it transforms the Before and After probabilities to the properties you would like a sampler to have, described below.

Goals of a sampler

  1. A sampler should cull the lowest-probability tokens. Tokens with very low Before probabilities are often bad; this is an inherent problem with LLMs. Removing them improves quality. However, the Token Probabilities viewer only shows 10 tokens, so it's hard to tell where tokens stop making sense. Worse, this threshold differs between words and between stories, because it depends on how in-distribution the model is. It's hard to pick a good culling point, but you can try. There's a secret NovelAI option to show 30 tokens.

  2. Some users, but not others, want to flatten the high probabilities. This means different input probabilities above 20% are squished to similar output probabilities. This elicits more unusual tokens. This is for users who want the AI to be explorative, and who are willing to retry bad rolls more often.

  3. A sampler should have no abrupt jumps. For example, if all Before probabilities below 0.05 are set to 0, then the Before probability 0.06 should have an After probability that is also close to 0. There should be a smooth ramp to 0, rather than a large jump.

  4. A sampler should not change the order of tokens. In the Token Probabilities viewer, if A > B in the Before section, then you want A > B in the After section. (This rule only applies to the samplers in the Samplers list. Other transformations like Phrase Repetition Penalty, Repetition Penalty, or Phrase Bias are allowed to change token order, since they have different properties.)
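Goal 3 can be made concrete by comparing a hard cutoff to one with a ramp. A sketch (the 0.05 floor and the ramp width are illustrative values, not recommendations):

```python
def hard_cull(probs, floor=0.05):
    # abrupt jump: a 0.049 token is deleted, but a 0.051 token keeps full weight
    kept = [p if p >= floor else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def smooth_cull(probs, floor=0.05, ramp=0.02):
    # weight rises linearly from 0 at `floor` to 1 at `floor + ramp`,
    # so probabilities just above the floor get a small but nonzero weight
    def weight(p):
        if p >= floor + ramp:
            return 1.0
        if p <= floor:
            return 0.0
        return (p - floor) / ramp
    kept = [p * weight(p) for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

hard = hard_cull([0.55, 0.30, 0.06, 0.051, 0.039])
smooth = smooth_cull([0.55, 0.30, 0.06, 0.051, 0.039])
```

In the hard version, the 0.051 token survives at nearly full strength while the 0.039 token vanishes entirely; the smooth version instead ramps the 0.051 token down to almost nothing, satisfying Goal 3 while still preserving token order (Goal 4).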

Numbers to consider

Look at the Token Probabilities viewer, and compare the probabilities Before and After. In addition to the above behaviors, you should observe these properties:

  • Large probabilities: if there are two probabilities >25% in the Before column, the ratio of their After probabilities should either be equal to the ratio of their Before probabilities, or closer to 1. The After probabilities might be larger or smaller than the Before probabilities; either is fine.
  • Middle probabilities (between 5% and 30%): After should be equal to or larger than Before.
  • Small probabilities: Below some small percentage, After should be smaller than Before. Many samplers use a threshold to set After probabilities to 0. The goal is to reduce probabilities when the quality of tokens starts to degrade.
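These checks can be scripted against a Before/After pair read off the viewer. A rough sketch covering the middle, small, and order properties (the large-probability ratio rule is omitted for brevity, and the 2% "small" cutoff is illustrative, since the real quality threshold varies by word and by story):

```python
def check_sampler(before, after, small=0.02):
    """Flag violations of the Before/After properties described above."""
    problems = []
    for b, a in zip(before, after):
        # middle probabilities (5%-30%) should be equal or larger after
        if 0.05 <= b <= 0.30 and a < b - 1e-9:
            problems.append(f"middle prob {b:.3f} shrank to {a:.3f}")
        # small probabilities should shrink, often to 0
        if b < small and a > b + 1e-9:
            problems.append(f"small prob {b:.3f} grew to {a:.3f}")
    # token order must be preserved (Goal 4)
    for i in range(len(before)):
        for j in range(len(before)):
            if before[i] > before[j] and after[i] < after[j] - 1e-9:
                problems.append("token order changed")
    return problems

ok = check_sampler([0.55, 0.25, 0.12, 0.05, 0.02, 0.01],
                   [0.57, 0.26, 0.12, 0.05, 0.00, 0.00])
bad = check_sampler([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```

The first pair passes: the tail is culled, the middle is untouched or boosted, and order is preserved. The second pair reverses the token order, which no sampler in the Samplers list should do.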

Advice for specific samplers

The rest of this document will be somewhat unfair: the research that created Unified is used to evaluate the other samplers, so the evaluation is biased toward Unified.

Unified: Disable Randomness, because it is redundant with Linear. Set Linear between 0 and 1 according to how unusual you want tokens to be: 0 means maximum exploration and 1 means normal output. Lower numbers will produce more unusual/creative outputs, but you will have to reroll or edit more. Set Quad = 0.25 - Linear / 5. Leave Conf alone.

For people who want to spend a lot of effort tinkering: Unified’s formula is output logit = (input log-probability) * (Linear + entropy * Conf) - (input log-probability)^2 * Quad. Output probabilities are calculated by exponentiating the output logits, and normalizing the sum to 1 (called “softmax”). Negative Linear is suboptimal but valid; you can type a negative number in the Linear field. The effect is a negative probability slope: a token’s output probability decreases if its input probability increases. If Linear or Conf is negative, then Quad must be positive or the output will explode with junk.
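The formula translates directly into code. A sketch of the transform described above (the slider values are illustrative, following the rule of thumb Quad = 0.25 - Linear / 5, so Linear 0.7 gives Quad 0.11):

```python
import math

def unified(probs, linear=0.7, quad=0.11, conf=0.0):
    """output logit = logp * (linear + entropy * conf) - logp**2 * quad,
    followed by softmax, per the formula above."""
    logp = [math.log(p) for p in probs]
    # entropy of the input distribution; only matters when conf != 0
    entropy = -sum(p * lp for p, lp in zip(probs, logp))
    logits = [lp * (linear + entropy * conf) - lp * lp * quad for lp in logp]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

after_unified = unified([0.6, 0.3, 0.08, 0.02])
# Linear = 1, Quad = 0, Conf = 0 reproduces the input: the neutral sampler
neutral = unified([0.6, 0.3, 0.08, 0.02], linear=1.0, quad=0.0)
```

With Linear 0.7 and Quad 0.11, the high probabilities are roughly preserved while the 2% tail is squashed, which matches the goals listed earlier: the quadratic term punishes large negative log-probabilities much harder than moderate ones.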

Quad can't make logits 0, but it can make them low enough to no longer matter. If you want a true cutoff, then pick either a very high Top P (Nucleus 0.999) or a very low Min P (0.0001), and order it after Unified.

Conf varies Randomness by the current token's uncertainty, which is measured by entropy. Theory predicts that a tiny positive Conf like 0.01 might be helpful, but this was only observed once, in a story with many grammar mistakes.

Unified’s goals are to be simple, and to outperform full presets. The hope is to replace complex sampler chains with Unified alone. It has fewer sliders and fewer pitfalls, so users can pick sensible options. Its math arose from theory rather than from experimentation.

Since Conf showed minimal benefits in our testing, the logical move would be to remove it and simplify Unified. But to replace a preset, Unified must be able to replicate any good behavior the preset might have. The other samplers have behavior that is neither promoted nor prohibited by theory, and Conf reproduces this unusual behavior. For example, Conf = -1.5 * Quad is a smooth Min P, and high Conf is a smooth Top K. So Conf’s purpose is to cover existing presets.

  • Mirostat's algorithm is odd. Dynamic sampling is neat, and could be useful to break out of repetition. But Mirostat's goal of perplexity control is questionable, and its method of Top K plus Zipf is very subpar. Any repetition-breaking effects would be minor at best. If you want dynamic sampling, you can instead use positive Conf and a lower Linear, although it will not be history-aware.

  • Don't use Top A. It is bad. Replace it with Min P and multiply the slider value by 1/2 or 1/3; this should be a strict improvement.

  • Tail-Free Sampling (TFS) calculates a tail of probabilities, but it is based on poor mathematics. It's also sensitive to logit quantization: insignificant changes in low probabilities cause large changes in second derivative. If TFS does something useful, Top P would do it better with Linear X -> Top P -> Randomness X, for some number X > 1.

  • Typical P deletes both high- and low-probability tokens, which changes the order of tokens, violating Goal 4. Typical P isn't given sufficient information for deleting high-probability tokens to make sense. If for some reason you really want to suppress the top logits, negative Linear is a better solution.

  • Top K is the oldest sampler. It was created when people didn't have experience working with LLMs. It's generally obsolete except for Top K 1, which can be used for debugging.

  • Top P (Nucleus) was an advance over Top K. It preserves diversity of the distribution, which is one of the two things you care about. (The other is quality.) Its good performance has been widely benchmarked, and it is the standard all samplers are compared to. Values near 1 make sense (like 0.9).

  • Min P is a recent sampler. Its threshold guarantees that at least one token exists, without needing special logic like Top P does. Although the probability of the top token is an imperfect metric of the token's certainty, Min P works fine, and it is easy to understand. Values near 0 make sense (like 0.04).
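To illustrate why Min P needs no special-case logic, here is a sketch using the example value of 0.04 mentioned above (a simplified version, not NovelAI's exact code):

```python
def min_p(probs, threshold=0.04):
    """Drop tokens below threshold * max(probs), then renormalize.

    The top token always passes its own cutoff, so at least one
    token survives without any special-case logic.
    """
    cutoff = threshold * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

kept_probs = min_p([0.70, 0.20, 0.06, 0.03, 0.01])
```

Because the cutoff scales with the top token's probability, Min P automatically culls more aggressively when the model is confident and less aggressively when the distribution is flat, which is the behavior you want from a thresholding sampler.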