Test-time sampling, which draws multiple reasoning paths for a given input during inference, is a widely adopted technique for improving the reasoning performance of large language models (LLMs). Despite its practical success, however, its theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing test-time sampling strategies, grounded in the perspective of confidence estimation. Based on this framework, we analyze two dominant paradigms, self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error, while perplexity exhibits substantial model error and possible degradation of the estimation-error convergence. To address these limitations, we introduce Rpc, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of the estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results on seven benchmark datasets demonstrate that Rpc has strong potential for reducing reasoning error. Notably, Rpc achieves reasoning performance comparable to self-consistency while enhancing confidence reliability and reducing sampling costs by 50%.
The LLM reasoning process is illustrated in Figure 1, where the LLM samples several reasoning paths \(\hat{t}_1, \ldots, \hat{t}_n\) for a given question \(x\) with the ground-truth answer \(y\). A parsing function \(g(\cdot)\) then converts these reasoning paths into the corresponding candidate answers \(\hat{y}_1, \ldots, \hat{y}_n\). Test-time sampling methods, such as perplexity and self-consistency, are used to estimate the confidence of each candidate answer \(\hat{y}_i\), which is denoted as \(\hat{p} \left (\hat{y}_i \mid x \right )\). The ground-truth confidence is denoted by \(p \left (\hat{y}_i \mid x \right ) \).
The reasoning error of the LLM for a sampled reasoning path \(\hat{t}\) or candidate answer \(\hat{y}\) is defined as follows: $$\begin{aligned} \mathcal{E}_{\hat{p}}(\hat{t}) &= \mathbb{E} \left[ \big( \hat{p}(\hat{t} \,|\, x) - \mathbb{I}[g(\hat{t}) = y] \big)^2 \right], \\ \mathcal{E}_{\hat{p}}(\hat{y}) &= \mathbb{E} \left[ \big( \hat{p}(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \big)^2 \right]. \end{aligned}$$
First, we decompose the reasoning error of the LLM into two parts: Estimation error and Model error, as follows: $$ \mathcal{E}_{\hat{p}}(\hat{y}) = \underbrace{\mathbb{E} \left [\big ( \hat{p}(\hat{y} \,|\, x) - p(\hat{y} \,|\, x) \big )^2 \right ]}_{\text{Estimation Error}} + \underbrace{\big ( p(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \big )^2}_{\text{Model Error}}. $$
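For intuition, this decomposition follows from a standard expansion of the squared error (a sketch, not reproduced from the paper): $$ \mathbb{E}\left[\big(\hat{p}(\hat{y}\,|\,x) - \mathbb{I}[\hat{y}=y]\big)^2\right] = \mathbb{E}\left[\big(\hat{p}(\hat{y}\,|\,x) - p(\hat{y}\,|\,x)\big)^2\right] + \big(p(\hat{y}\,|\,x) - \mathbb{I}[\hat{y}=y]\big)^2 + 2\big(p(\hat{y}\,|\,x) - \mathbb{I}[\hat{y}=y]\big)\,\mathbb{E}\big[\hat{p}(\hat{y}\,|\,x) - p(\hat{y}\,|\,x)\big], $$ where the cross term vanishes when \(\hat{p}\) is an unbiased estimate of \(p\) (as with the Monte Carlo estimate used by SC below); for biased estimators it is absorbed into the estimation-error term, which is why the PPL and PC expressions below take a different form.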
Next, we analyze the widely used self-consistency (SC) method, which estimates the confidence of each candidate answer by Monte Carlo estimation over the \(n\) sampled paths: $$ \mathcal{E}_{\hat{p}^{(SC)}}(\hat{y}) = \underbrace{\frac{1}{n} p(\hat{y} \,|\, x) (1- p(\hat{y}\,|\, x))}_{\text{Estimation Error}} + \underbrace{\big (p(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \big )^2}_{\text{Model Error}}. $$
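For illustration, a minimal SC estimator simply counts how often each parsed answer appears among the \(n\) sampled paths; the helper names below (`paths`, `parse_answer`) are hypothetical, not the authors' implementation:

```python
from collections import Counter

def self_consistency_confidence(paths, parse_answer):
    """Self-consistency (SC) confidence: the empirical frequency of each
    parsed answer among the n sampled reasoning paths (Monte Carlo estimate)."""
    answers = [parse_answer(t) for t in paths]   # apply g(.) to every sampled path
    n = len(answers)
    return {ans: count / n for ans, count in Counter(answers).items()}

# e.g. 5 paths parsing to ["42", "42", "41", "42", "41"] -> {"42": 0.6, "41": 0.4}
```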
Similarly, for the perplexity-based method (PPL), which takes the LLM's internal probability of a reasoning path as its confidence, the reasoning error decomposes as $$ \mathcal{E}_{\hat{p}^{(PPL)}}(\hat{t}) = \underbrace{(1 - p(\hat{t} \,|\, x))^n \, {p}(\hat{t} \,|\, x) \big( 2 \mathbb{I}[g(\hat{t}) = y] - p(\hat{t} \,|\, x) \big) }_{\text{Estimation Error}} + \underbrace{\big ( p(\hat{t} \,|\, x) - \mathbb{I}[g(\hat{t}) = y] \big )^2}_{\text{Model Error}}. $$ Here the estimation error is governed by the factor \((1 - p(\hat{t} \,|\, x))^n\), which decays exponentially in \(n\) but only slowly when \(p(\hat{t} \,|\, x)\) is small, explaining the possible degradation of convergence; the model error is measured at the level of individual reasoning paths rather than answers.
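For illustration, the internal probability of a reasoning path can be computed from the token log-probabilities returned at generation time; this is a sketch, and the exact normalization used by PPL in the paper may differ:

```python
import math

def internal_path_probability(token_logprobs):
    """LLM-internal probability of a reasoning path: p(t | x) = exp(sum of token log-probs)."""
    return math.exp(sum(token_logprobs))

def path_perplexity(token_logprobs):
    """Perplexity of a path: exp(-mean token log-prob);
    lower perplexity corresponds to a higher internal probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```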
Motivated by our theoretical analysis, we propose the Rpc method, which combines self-consistency with the internal probability of LLMs to achieve both the low model error of SC and the fast convergence rate of PPL. Rpc consists of two key components, Perplexity Consistency (PC) and Reasoning Pruning (RP), as illustrated in Figure 2.
We first show that PC increases the convergence rate of the estimation error from linear to exponential, provided \(\alpha\) is bounded away from 1 (i.e., \(p(\hat{y} \,|\, x)\) is not too small), while maintaining the same model error as SC: $$ \mathcal{E}_{\hat{p}^{(PC)}}(\hat{y}) = \underbrace{ \alpha^n p(\hat{y} \,|\, x) \big(2 \mathbb{I}[\hat{y}=y] - (1 + \alpha^n) p(\hat{y} \,|\, x) \big) }_{\text{Estimation Error}} + \underbrace{\left ( p(\hat{y} \,|\, x) - \mathbb{I}[\hat{y} = y] \right )^2}_{\text{Model Error}}, $$ where \(k = |\{\tilde{t} \mid g(\tilde{t}) = \hat{y}\}|\) is the number of reasoning paths that parse to \(\hat{y}\) and \(\alpha := 1 - \frac{1}{k} p(\hat{y} \,|\, x)\).
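To make the rate comparison concrete, the toy computation below contrasts the SC estimation-error term \(\frac{1}{n} p(1-p)\) with the factor \(\alpha^n\) that governs the PC estimation error; the values of \(p(\hat{y} \,|\, x)\) and \(k\) are made up for illustration. When \(p\) is reasonably large, \(\alpha^n\) shrinks far faster than \(1/n\), but when \(p\) is small, \(\alpha\) approaches 1 and the exponential advantage disappears.

```python
# Toy comparison of the rates in the two decompositions (illustrative values only).
def sc_estimation_error(p, n):
    return p * (1 - p) / n              # SC estimation error: p(1 - p) / n

def pc_alpha_factor(p, k, n):
    alpha = 1 - p / k                   # alpha = 1 - p(y|x) / k
    return alpha ** n                   # factor governing the PC estimation error

for p, k in [(0.9, 1), (0.6, 3), (0.1, 5)]:
    for n in (5, 20):
        print(f"p={p}, k={k}, n={n}: SC term={sc_estimation_error(p, n):.3g}, "
              f"PC alpha^n={pc_alpha_factor(p, k, n):.3g}")
```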
Our Theorem 7 then proves that RP removes candidate answers with low probabilities by directly pruning their reasoning paths, thereby avoiding the regime where \(\alpha \rightarrow 1\) and the convergence of the estimation error degrades.
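As a minimal sketch of how the two components might fit together, the function below applies a simple relative-probability threshold as a stand-in for the paper's RP criterion and then aggregates internal path probabilities per candidate answer for PC; the helper names, the pruning rule, and the aggregation are assumptions for illustration, not the authors' implementation:

```python
from collections import defaultdict

def rpc_confidence(paths, parse_answer, path_probability, prune_ratio=0.05):
    """Sketch of Rpc: Reasoning Pruning (RP) followed by Perplexity Consistency (PC)."""
    # Internal probability p(t | x) of each distinct sampled path.
    probs = {t: path_probability(t) for t in set(paths)}

    # RP (stand-in rule): drop paths whose probability is far below the best one.
    threshold = prune_ratio * max(probs.values())
    kept = {t: q for t, q in probs.items() if q >= threshold}

    # PC: aggregate the internal probabilities of the kept paths per candidate answer.
    confidence = defaultdict(float)
    for t, q in kept.items():
        confidence[parse_answer(t)] += q
    return dict(confidence)   # estimated confidence for each candidate answer
```

The answer with the highest aggregated confidence is returned, and the confidence itself can be reported for calibration.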
As shown in Table 1, Rpc achieves a 50% reduction in sampling costs compared to SC while maintaining the same reasoning performance. Rpc therefore offers an excellent computational trade-off: the minimal overhead of RP is exchanged for significant time savings by reducing the number of required LLM inferences.
We evaluate the performance of PC and Rpc across various sampling budgets in Figure 3. The results show that Rpc consistently outperforms both PPL and SC. The performance gap between Rpc and PC indicates the effectiveness of the RP module, while the gap between PC and SC indicates the effectiveness of the PC module.
As shown in Table 2, we report the expected calibration error (ECE) of each method; Rpc achieves the lowest average ECE across all datasets. This indicates that Rpc not only improves LLM reasoning accuracy but also enhances the reliability of the estimated confidence for each candidate answer.
@inproceedings{zhou24theoretical,
author = {Zhou, Zhi and Tan, Yuhao and Li, Zenan and Yao, Yuan and Guo, Lan-Zhe and Li, Yu-Feng and Ma, Xiaoxing},
title = {A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
}