Test-time sampling, which draws multiple reasoning paths for a given input during inference, is a widely adopted technique for improving the reasoning performance of large language models (LLMs). Despite its practical success, however, its theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing test-time sampling strategies, grounded in the perspective of confidence estimation. Based on this framework, we analyze two dominant paradigms, self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error, while perplexity exhibits substantial model error and possible degradation of the estimation-error convergence. To address these limitations, we introduce Rpc, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of the estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results on seven benchmark datasets demonstrate that Rpc has strong potential for reducing reasoning error. Notably, Rpc achieves reasoning performance comparable to self-consistency while enhancing confidence reliability and reducing sampling costs by 50%.
The LLM reasoning process is illustrated in Figure 1, where the LLM samples several reasoning paths \(\hat{t}_1, \ldots, \hat{t}_n\) for a given question \(x\) with the ground-truth answer \(y\). A parsing function \(g(\cdot)\) then converts these reasoning paths into the corresponding candidate answers \(\hat{y}_1, \ldots, \hat{y}_n\). Test-time sampling methods, such as perplexity and self-consistency, are used to estimate the confidence of each candidate answer \(\hat{y}_i\), which is denoted as \(\hat{p} \left (\hat{y}_i \mid x \right )\). The ground-truth confidence is denoted by \(p \left (\hat{y}_i \mid x \right ) \).
The reasoning error of the LLM for a sampled reasoning path \(\hat{t}\) or candidate answer \(\hat{y}\) is defined as follows: $$\begin{aligned} \mathcal{E}_{\hat{p}}(\hat{t}) &= \mathbb{E} \left[ \big( \hat{p}(\hat{t} \,|\, x) - \mathbb{I}[g(\hat{t}) = y] \big)^2 \right], \\ \mathcal{E}_{\hat{p}}(\hat{y}) &= \mathbb{E} \left[ \big( \hat{p}(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \big)^2 \right]. \end{aligned}$$
First, we decompose the reasoning error of the LLM into two parts: Estimation error and Model error, as follows: $$ \mathcal{E}_{\hat{p}}(\hat{y}) = \underbrace{\mathbb{E} \left [\big ( \hat{p}(\hat{y} \,|\, x) - p(\hat{y} \,|\, x) \big )^2 \right ]}_{\text{Estimation Error}} + \underbrace{\big ( p(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \big )^2}_{\text{Model Error}}. $$
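For intuition, this decomposition follows from a standard expansion of the squared error (a sketch, not reproduced from the paper): $$ \mathbb{E}\left[\big(\hat{p}(\hat{y}\,|\,x) - \mathbb{I}[\hat{y}=y]\big)^2\right] = \mathbb{E}\left[\big(\hat{p}(\hat{y}\,|\,x) - p(\hat{y}\,|\,x)\big)^2\right] + \big(p(\hat{y}\,|\,x) - \mathbb{I}[\hat{y}=y]\big)^2 + 2\big(p(\hat{y}\,|\,x) - \mathbb{I}[\hat{y}=y]\big)\,\mathbb{E}\big[\hat{p}(\hat{y}\,|\,x) - p(\hat{y}\,|\,x)\big], $$ where the cross term vanishes when \(\hat{p}\) is an unbiased estimate of \(p\) (as with the Monte Carlo estimate used by SC below); for biased estimators it is absorbed into the estimation-error term, which is why the PPL and PC expressions below take a different form.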
Next, we analyze the widely used self-consistency (SC) method, which estimates the confidence of each candidate answer by Monte Carlo estimation over the \(n\) sampled paths: $$ \mathcal{E}_{\hat{p}^{(SC)}}(\hat{y}) = \underbrace{\frac{1}{n} p(\hat{y} \,|\, x) (1- p(\hat{y}\,|\, x))}_{\text{Estimation Error}} + \underbrace{\big (p(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \big )^2}_{\text{Model Error}}. $$
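For illustration, a minimal SC estimator simply counts how often each parsed answer appears among the \(n\) sampled paths; the helper names below (`paths`, `parse_answer`) are hypothetical, not the authors' implementation:

```python
from collections import Counter

def self_consistency_confidence(paths, parse_answer):
    """Self-consistency (SC) confidence: the empirical frequency of each
    parsed answer among the n sampled reasoning paths (Monte Carlo estimate)."""
    answers = [parse_answer(t) for t in paths]   # apply g(.) to every sampled path
    n = len(answers)
    return {ans: count / n for ans, count in Counter(answers).items()}

# e.g. 5 paths parsing to ["42", "42", "41", "42", "41"] -> {"42": 0.6, "41": 0.4}
```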
Similarly, for the perplexity-based method (PPL), which takes the LLM's internal probability of a reasoning path as its confidence, the reasoning error decomposes as $$ \mathcal{E}_{\hat{p}^{(PPL)}}(\hat{t}) = \underbrace{(1 - p(\hat{t} \,|\, x))^n \, {p}(\hat{t} \,|\, x) \big( 2 \mathbb{I}[g(\hat{t}) = y] - p(\hat{t} \,|\, x) \big) }_{\text{Estimation Error}} + \underbrace{\big ( p(\hat{t} \,|\, x) - \mathbb{I}[g(\hat{t}) = y] \big )^2}_{\text{Model Error}}. $$ Here the estimation error is governed by the factor \((1 - p(\hat{t} \,|\, x))^n\), which decays exponentially in \(n\) but only slowly when \(p(\hat{t} \,|\, x)\) is small, explaining the possible degradation of convergence; the model error is measured at the level of individual reasoning paths rather than answers.
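For illustration, the internal probability of a reasoning path can be computed from the token log-probabilities returned at generation time; this is a sketch, and the exact normalization used by PPL in the paper may differ:

```python
import math

def internal_path_probability(token_logprobs):
    """LLM-internal probability of a reasoning path: p(t | x) = exp(sum of token log-probs)."""
    return math.exp(sum(token_logprobs))

def path_perplexity(token_logprobs):
    """Perplexity of a path: exp(-mean token log-prob);
    lower perplexity corresponds to a higher internal probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```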
Motivated by our theoretical analysis, we propose the Rpc method, which combines self-consistency with the internal probability of LLMs to achieve both the low model error of SC and the fast convergence rate of PPL. Rpc consists of two key components, Perplexity Consistency (PC) and Reasoning Pruning (RP), as illustrated in Figure 2.
We first show that PC increases the convergence rate of the estimation error from linear to exponential, provided \(\alpha\) is bounded away from 1 (i.e., \(p(\hat{y} \,|\, x)\) is not too small), while maintaining the same model error as SC: $$ \mathcal{E}_{\hat{p}^{(PC)}}(\hat{y}) = \underbrace{ \alpha^n p(\hat{y} \,|\, x) \big(2 \mathbb{I}[\hat{y}=y] - (1 + \alpha^n) p(\hat{y} \,|\, x) \big) }_{\text{Estimation Error}} + \underbrace{\left ( p(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \right )^2}_{\text{Model Error}}, $$ where \(k = |\{\tilde{t} \mid g(\tilde{t}) = \hat{y}\}|\) is the number of reasoning paths that parse to \(\hat{y}\) and \(\alpha := 1 - \frac{1}{k} p(\hat{y} \,|\, x)\).
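To make the rate comparison concrete, the toy computation below contrasts the SC estimation-error term \(\frac{1}{n} p(1-p)\) with the factor \(\alpha^n\) that governs the PC estimation error; the values of \(p(\hat{y} \,|\, x)\) and \(k\) are made up for illustration. When \(p\) is reasonably large, \(\alpha^n\) shrinks far faster than \(1/n\), but when \(p\) is small, \(\alpha\) approaches 1 and the exponential advantage disappears.

```python
# Toy comparison of the rates in the two decompositions (illustrative values only).
def sc_estimation_error(p, n):
    return p * (1 - p) / n              # SC estimation error: p(1 - p) / n

def pc_alpha_factor(p, k, n):
    alpha = 1 - p / k                   # alpha = 1 - p(y|x) / k
    return alpha ** n                   # factor governing the PC estimation error

for p, k in [(0.9, 1), (0.6, 3), (0.1, 5)]:
    for n in (5, 20):
        print(f"p={p}, k={k}, n={n}: SC term={sc_estimation_error(p, n):.3g}, "
              f"PC alpha^n={pc_alpha_factor(p, k, n):.3g}")
```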
Our Theorem 7 then proves that RP removes candidate answers with low probabilities by directly pruning their reasoning paths, thereby avoiding the regime where \(\alpha \rightarrow 1\) and the convergence of the estimation error degrades.
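As a minimal sketch of how the two components might fit together, the function below applies a simple relative-probability threshold as a stand-in for the paper's RP criterion and then aggregates internal path probabilities per candidate answer for PC; the helper names, the pruning rule, and the aggregation are assumptions for illustration, not the authors' implementation:

```python
from collections import defaultdict

def rpc_confidence(paths, parse_answer, path_probability, prune_ratio=0.05):
    """Sketch of Rpc: Reasoning Pruning (RP) followed by Perplexity Consistency (PC)."""
    # Internal probability p(t | x) of each distinct sampled path.
    probs = {t: path_probability(t) for t in set(paths)}

    # RP (stand-in rule): drop paths whose probability is far below the best one.
    threshold = prune_ratio * max(probs.values())
    kept = {t: q for t, q in probs.items() if q >= threshold}

    # PC: aggregate the internal probabilities of the kept paths per candidate answer.
    confidence = defaultdict(float)
    for t, q in kept.items():
        confidence[parse_answer(t)] += q
    return dict(confidence)   # estimated confidence for each candidate answer
```

The answer with the highest aggregated confidence is returned, and the confidence itself can be reported for calibration.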
As shown in Table 1, Rpc achieves a 50% reduction in sampling costs compared to SC while maintaining the same reasoning performance. Rpc therefore offers an excellent computational trade-off: the minimal overhead of RP is exchanged for significant time savings by reducing the number of required LLM inferences.
We evaluate the performance of PC and Rpc across various sampling budgets in Figure 3. The results show that Rpc consistently outperforms both PPL and SC. The performance gap between Rpc and PC indicates the effectiveness of the RP module, while the gap between PC and SC indicates the effectiveness of the PC module.
As shown in Table 2, we report the expected calibration error (ECE) of each method; Rpc achieves the lowest average ECE across all datasets. This indicates that Rpc not only improves LLM reasoning accuracy but also enhances the reliability of the estimated confidence for each candidate answer.
@inproceedings{zhou24theoretical,
author = {Zhou, Zhi and Tan, Yuhao and Li, Zenan and Yao, Yuan and Guo, Lan-Zhe and Li, Yu-Feng and Ma, Xiaoxing},
title = {A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
}