In #1880 and #1881 we discussed two methods of test-time compute to increase the quality of LLM output. The introduction of (Snell et al. 2024) presents a slew of others, including Best-of-N (generate N samples, keep the “best”), rather elaborate ones like (Lightman et al. 2023), which verify step by step, and various strategies of fine-tuning at test time to refine the proposal.
Our board member @RoboTeddy from https://metr.org/ suggested an even simpler approach, which amounts to first-success accept/reject sampling for conditional distributions: reject proposals until the sanity checks pass. In more detail, assuming we have the automated sanity checks and evals described in #1879, every time our LLM generates something, we can reject the output if those sanity checks aren’t met, or if those eval scores are too low.
This is essentially a “first-success” variant of Best-of-N.
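To make the loop concrete, here is a minimal sketch in Python. The `generate` and `passes_checks` helpers are hypothetical toy stand-ins for the LLM call and the sanity checks / evals of #1879; only the loop structure is the point.

```python
import random

# Toy stand-ins, for illustration only: in the real pipeline `generate` would be
# the LLM call and `passes_checks` the automated sanity checks / evals of #1879.
def generate(prompt: str) -> str:
    return random.choice(["economy", "housing", "not-a-known-topic"])

def passes_checks(output: str) -> bool:
    return output in {"economy", "housing"}  # e.g. "is the label in the allowed topic list?"

def first_success(prompt: str, max_attempts: int = 20) -> str | None:
    """Resample until the first output that passes the checks, or give up."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if passes_checks(candidate):
            return candidate  # first success: accept and stop
    return None  # persistent failure: the caller decides what to do

print(first_success("Categorize this comment into a topic."))
```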
There are a few subtleties to it, which are all connected:
- How many times?
- Finite time guarantee?
- Reject all or part?
- Speeding it up?
- Combining with multi-sample?
- Finally: do we need it, or does the Bitter lesson apply?
How many attempts

TLDR: on average, it will take a number of steps equal to the inverse of the probability of success.
Indeed, unlike Best-of-N, which has a fixed number of samples, the number of samples until the first success is stochastic. Luckily, from the accept/reject literature, we know the distribution of the number of attempts until first success: a geometric distribution.
If there is a probability $\alpha$ of the LLM meeting the criteria of interest (i.e. the sample being accepted), the number of samples to first success follows a Geometric distribution of parameter $\alpha$, whose expected value is $1/\alpha$ and variance $(1-\alpha)/\alpha^2$. Intuitively this makes sense: the harder the test is to pass, the more attempts are needed.
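As a quick numerical illustration of that formula (the $\alpha = 0.25$ value is made up): with a 25% chance of a single sample passing the checks, we should expect $1/0.25 = 4$ attempts on average, with variance $0.75/0.0625 = 12$, i.e. a standard deviation of about 3.5. A tiny simulation confirms it:

```python
import random
import statistics

alpha = 0.25  # illustrative probability that a single sample passes the checks

def attempts_until_success() -> int:
    n = 1
    while random.random() >= alpha:  # rejected with probability 1 - alpha
        n += 1
    return n

trials = [attempts_until_success() for _ in range(100_000)]
print(statistics.mean(trials))      # ~4.0  == 1 / alpha
print(statistics.variance(trials))  # ~12   == (1 - alpha) / alpha**2
print(max(trials))                  # occasionally much larger: the tail is long
```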
Finite time guarantee

TLDR: yes if the LLM _can_ pass the test, but finite doesn’t mean soon…
In one corner case, an infinite loop can occur: when the tests cannot be passed. In technical terms: when there is no intersection between the support of the sampling distribution and the support of the test function. In that case, the probability of success is 0; try as you might, you’ll never make it. It’s the asymptotic limit of the distribution of the number of attempts as $\alpha$ tends to $0$.
But the converse is true: if the LLM can generate samples that pass the test, then it will pass in finite time. Unfortunately, finite does not necessarily mean “within a human lifespan”. We would typically mitigate this by capping the number of attempts, and/or by increasing the temperature of the LLM (i.e. the variance of the proposal) to explore more broadly, faster. The latter still does not provide a guarantee of support intersection, so it’s better to combine the two.
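A sketch of that mitigation, reusing the hypothetical `generate` / `passes_checks` stubs from the first snippet but now assuming `generate` accepts a `temperature` parameter (that signature and the schedule values are my own assumptions): a hard cap keeps the loop finite, and the temperature schedule widens the proposal as rejections pile up.

```python
def first_success_with_escalation(
    prompt: str,
    temperatures=(0.2, 0.7, 1.0, 1.3),     # illustrative schedule, not tuned
    attempts_per_temperature: int = 5,
) -> str | None:
    """Bounded retry loop that widens the proposal distribution as failures accumulate."""
    for temperature in temperatures:
        for _ in range(attempts_per_temperature):
            # Hypothetical signature: assumes the LLM wrapper exposes a temperature knob.
            candidate = generate(prompt, temperature=temperature)
            if passes_checks(candidate):
                return candidate
    return None  # hard stop: finite wall time even if the test turns out to be unpassable
```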
Reject all or part

TLDR: excellent question, can of worms :)
If I understand @RoboTeddy’s idea as described by @colinmegill, it is simply to reject the whole output of the LLM until it passes the tests. That works perfectly for short outputs, e.g. topic categorization.
However, as we increase the size of the output and, e.g., the number of tests that we apply, the probability of success keeps decreasing. For example, for multiple independent comments being categorized into topics all at once, the probability of success (<1) for all of them is the product of the probabilities for each. Of course, in that trivial topic-categorization example, it would be nonsensical to reject all the comments just because one comment’s categorization did not satisfy a sanity check. We would reject just the categorization of that one comment -- and this is, for example, what Jigsaw’s Sensemaking tool is rightly doing!
So the naive approach of “reject everything” could hurt us quickly as we grow bigger.
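A sketch of that per-item variant, again with hypothetical toy stand-ins (`categorize_batch`, `valid_categorization`, and the `TOPICS` list are all made up for illustration): only the comments whose categorization failed the check are resubmitted, instead of throwing the whole batch away.

```python
import random

TOPICS = ["housing", "transit", "economy"]  # illustrative allowed-topic list

def categorize_batch(comments: list[str]) -> list[str]:
    # Toy stand-in for the LLM call: one topic label per comment, sometimes invalid.
    return [random.choice(TOPICS + ["not-a-known-topic"]) for _ in comments]

def valid_categorization(comment: str, topic: str) -> bool:
    # Toy stand-in for a per-comment sanity check.
    return topic in TOPICS

def categorize_with_partial_rejection(comments: list[str], max_rounds: int = 10):
    labels: dict[int, str] = {}
    pending = list(range(len(comments)))              # indices still to (re)categorize
    for _ in range(max_rounds):
        if not pending:
            break
        proposals = categorize_batch([comments[i] for i in pending])
        still_failing = []
        for i, topic in zip(pending, proposals):
            if valid_categorization(comments[i], topic):
                labels[i] = topic                     # accept just this comment's label
            else:
                still_failing.append(i)               # reject just this comment's label
        pending = still_failing
    return labels, pending                            # anything left in `pending` is unresolved

print(categorize_with_partial_rejection(["comment A", "comment B", "comment C"]))
```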
For topic models such as Sensemaker, which split the task into topic discovery vs. topic categorization, we might also sometimes need to revisit the discovery if the categorization fails too systematically (closing the loop categorization -> discovery).
This becomes even more important when we look at the summary, which has multiple clauses. We need to find ways to update some clauses in place, in a way that is coherent with the other clauses (think of it as Gibbs sampling from the conditional marginal -- this is the same need as in the extension of semantic entropy mentioned at the bottom of #1881). We can do that, we just need to be smart about it :)
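A structural sketch of that in-place update, with hypothetical `clause_passes` / `regenerate_clause` stand-ins (the real versions would be the per-clause checks and an LLM call conditioned on the other clauses): each sweep revisits only the failing clauses, keeping the rest of the summary fixed.

```python
def clause_passes(clause: str, summary: list[str]) -> bool:
    # Toy stand-in: the real check would test grounding / consistency with the other clauses.
    return "[FIXME]" not in clause

def regenerate_clause(summary: list[str], index: int) -> str:
    # Toy stand-in: the real call would ask the LLM to rewrite clause `index`
    # given the other clauses as context (the "conditional" in the Gibbs analogy).
    return summary[index].replace("[FIXME]", "a revised claim")

def repair_summary(summary: list[str], max_sweeps: int = 3) -> list[str]:
    """Gibbs-flavoured sweeps: resample only the clauses that fail, conditioned on the rest."""
    summary = list(summary)  # don't mutate the caller's list
    for _ in range(max_sweeps):
        for i, clause in enumerate(summary):
            if not clause_passes(clause, summary):
                summary[i] = regenerate_clause(summary, i)
    return summary

print(repair_summary(["Most participants agree on X.", "[FIXME] about Y."]))
```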
Speeding it up
We can speed up the wall-clock time to first success by an arbitrary factor N if we run N calls at the same time and repeat until one of them succeeds. The downside is that we increase the minimum cost (*but not the average cost!*) by a factor N: we in effect round up the number of calls to the smallest multiple of N above what was strictly needed. It might or might not be a problem depending on the cost of a call and the value of N.
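A sketch using Python’s standard `concurrent.futures` (threads are fine here since LLM calls are I/O-bound), reusing the hypothetical `generate` / `passes_checks` stubs from the first snippet; the values of N and the number of rounds are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def first_success_parallel(prompt: str, n: int = 4, max_rounds: int = 5) -> str | None:
    """Fire n calls per round; stop at the first round that contains a success."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        for _ in range(max_rounds):
            # Cost of this round is n calls, even if the first one would have sufficed:
            # that's the "rounding up to a multiple of N" mentioned above.
            candidates = list(pool.map(generate, [prompt] * n))
            for candidate in candidates:
                if passes_checks(candidate):
                    return candidate
    return None
```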
Combining with multi-sample
Issues #1880 and #1881 mention using multi-sample and entropy to remove hallucinations. This issue is also about multi-sample, here to reach a first success. We can combine the two, and output the first successful non-hallucinated output, or the first non-hallucinated successful output (depending on which order you nest the two loops). Again, it can be a bit costly, so it might or might not be needed.
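One possible nesting, again reusing the earlier hypothetical stubs and adding a `not_hallucinated` placeholder for the multi-sample entropy check of #1880 / #1881 (its real implementation would itself make several calls, which is where the extra cost comes from):

```python
def not_hallucinated(candidate: str) -> bool:
    # Placeholder only: the real check would draw several samples and compare them
    # (semantic entropy), accepting only low-entropy, consistent answers.
    return True

def first_valid_output(prompt: str, max_attempts: int = 20) -> str | None:
    """Outer loop: first success; inner filters: sanity checks, then hallucination check."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        # Swapping the order of the two checks gives the other variant mentioned above.
        if passes_checks(candidate) and not_hallucinated(candidate):
            return candidate
    return None
```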
Do we really need it, or will the bitter lesson apply?
Even if we assume that we really do have a problem of quality/hallucinations, any remediation such as this one might vanish with the next open-source model coming out, as they keep getting better and better :-) That’s the “bitter lesson”: manual rules and elaborate algorithms get eaten as the data gets bigger / the model gets more powerful.
That’s why I would prioritize first having proper evals and sanity checks anyway. Not only do they allow us to diagnose the problems, and to improve on them with techniques outlined here, but they also stay relevant regardless of model growth. Once the new models which already incorporate a lot of automatic test-time compute become easily accessible (e.g. ChatGPT o1, o3, or even a locally run DeepSeek R1 (DeepSeek-AI 2025), which dropped two days ago and seems very promising), we can validate them with our evals -- and we might not even need to get around to implementing these test-time compute ideas!
References:
DeepSeek-AI. 2025. ‘DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning’. DeepSeek-AI.
Lightman, Hunter, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. ‘Let’s Verify Step by Step’. arXiv. https://doi.org/10.48550/arXiv.2305.20050.
Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. ‘Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters’. arXiv. https://doi.org/10.48550/arXiv.2408.03314.