|

On the Learnability of Test-Time Adaptation:
A Recovery Complexity Perspective

Zhi Zhou, Ming Yang, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li
State Key Laboratory of Novel Software Technology, Nanjing University
Corresponding Author
zhouz@lamda.nju.edu.cn, liyf@nju.edu.cn
To appear in ICML 2026
TL;DR The first learnability theory for TTA on non-stationary streams. We define (ε, δ)-Recovery Complexity and (ε, ρ)-TTA Learnability, prove order-wise matching minimax lower / algorithmic upper bounds, and expose an intrinsic adaptivity–information trade-off.

Abstract

Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under non-stationary streams remains unexplored. A key challenge is the lack of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for studying the learnability of TTA and introduce (ε, δ)-Recovery Complexity and (ε, ρ)-TTA Learnability. Recovery complexity measures the post-shift time needed to maintain excess risk below a target level with high probability, and is further extended to TTA learnability, which measures the long-term reliability of TTA. Within this framework, we introduce a novel discrete surrogate for non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt shifts. We derive order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic adaptivity–information trade-off. These results provide unified learnability guarantees for TTA that complement regret-based analyses.

Framework Overview

Theoretical framework overview (Figure 1 of the paper)
Figure 1: Overview of the theoretical modeling and analysis. A unified model of non-stationary test streams underpins a local analysis via (ε, δ)-Recovery Complexity, which is lifted to a global (ε, ρ)-TTA Learnability.

Motivation

Test-time adaptation adapts a source model to distribution shifts at test time using only unlabeled data through a proxy loss (e.g., entropy), and has demonstrated empirical success across vision, tabular and language domains. Yet recent studies show that TTA can silently fail under complex, non-stationary streams — motivating a theoretical understanding of when TTA can remain reliable.

Existing theory falls short on two fronts: (i) online-learning regret controls only average performance, not the per-step, post-shift reliability TTA demands; and (ii) prior TTA analyses rely on assumptions about specific algorithms or label availability, none of which jointly handles unlabeled supervision, non-stationarity, and temporal correlation. This work asks: when is TTA even learnable?

Modeling the Non-Stationary Stream

We model the test stream $\mathcal{S}=\{D_t\}_{t=1}^T$ along two complementary axes — a global distribution-shift axis and a local temporal-correlation axis — captured by two analytically tractable assumptions.

Global · Δ_W-Quantized Distribution Shift

We adopt the Wasserstein-1 distance on $\mathcal{X}\times\mathcal{Y}$ to discretize the continuous trajectory $\{\mathcal P_t\}$ into a bounded piecewise-constant surrogate at resolution $\Delta_W$, which unifies gradual and abrupt shifts. The number of shifts is bounded by the path variation $V_T$:

$$ V_T = \sum_{t=1}^{T-1} W_1(\mathcal P_t, \mathcal P_{t+1}), \qquad \tilde K_S(T) \;\le\; \left\lceil \frac{2 V_T}{\Delta_W} \right\rceil. $$

Local · φ-Mixing Temporal Dependence

We measure the strength of temporal dependence by a $\phi$-mixing coefficient with geometric decay $\phi(i)\le\varrho^{i}$. Under this condition, temporal correlation effectively shrinks the batch size:

$$ B_{\text{eff}} = \frac{B}{C_\phi} \le B, \qquad C_\phi = 1 + \frac{4\sqrt{\varrho}}{1-\sqrt{\varrho}}. $$

A Well-Defined Target: Proxy-Optimal Competitor

Perfect adaptation is unattainable when the proxy loss $\psi$ does not coincide with the task loss $\ell$. We therefore compete against the best proxy-optimal model within a local neighborhood $\mathcal N_r(\theta_1)$ of the source parameters, and take the resulting excess risk $\mathcal E_t$ as the central quantity throughout:

$$ \theta_t^{\star} \in \arg\min_{\theta \in \mathcal N_r(\theta_1)} \psi_t(\theta), \qquad R_t := \ell_t(\theta_t^{\star}), \qquad \mathcal E_t := \ell_t(\theta_t) - R_t. $$

Our analysis hinges on a local (α, ζ)-alignment between the proxy and task gradients — α > 0 is the alignment strength and ζ ≥ 0 an irreducible misalignment. This guarantees the proxy gradient carries a constructive descent signal for the task loss, which is the very foundation for TTA to be feasible:

$$ \bigl\langle \nabla \psi_t(\theta),\, \nabla \ell_t(\theta) \bigr\rangle \;\ge\; \alpha \,\|\nabla \ell_t(\theta)\|^2 - \zeta. $$

(ε, δ)-Recovery Complexity

We introduce (ε, δ)-Recovery Complexity to measure the minimum number of time steps required for a TTA algorithm to reduce — and keep — its excess risk below ε with probability at least 1−δ after a distribution shift. Unlike regret, this controls the length of unsatisfactory periods.

$$ \tau(\epsilon, \delta) \;:=\; \inf\Bigl\{\, t \in \mathbb N \;:\; \sup_{u \ge t}\, \mathbb P\!\left( \mathcal E_u > \epsilon \right) \le \delta \,\Bigr\}, \quad \mathcal E_t := \ell_t(\theta_t) - R_t. $$

We establish order-wise matching minimax lower and upper bounds:

$$ \underbrace{\tau \;\ge\; \Omega\!\left( \frac{C_\phi}{B} \cdot \frac{1}{\alpha\bigl(\sqrt{\zeta + 2\alpha\epsilon} + \sqrt{\zeta}\bigr)^2} \right)}_{\text{any algorithm}} \qquad \underbrace{\tau \;\le\; \mathcal O\!\left( \frac{C_\phi}{B\,\alpha^2 \epsilon} \log\frac{\Delta_W + \epsilon}{\epsilon} \right)}_{\text{simple SGD baseline}} $$

The matching bounds pin down what fundamentally governs recovery:

  • Error floor. ζ > 0 leaves a non-vanishing term as ε → 0 — misalignment caps achievable accuracy.
  • Speed. τ ∝ 1/α² under a perfect proxy, and τ ∝ Cφ/B through batch size and temporal correlation.
  • Near-optimal. The bounds match up to a log factor, so the simple SGD baseline is already near-minimax-optimal — leaving little room for algorithmic improvement without stronger feedback.

Together, these expose an intrinsic adaptivity–information trade-off at the heart of TTA.

(ε, ρ)-TTA Learnability

A problem is (ε, ρ)-learnable if some algorithm keeps the fraction of time steps violating the target ε below ρ, formalizing long-term reliability over the entire stream:

$$ \frac{1}{T} \sum_{t=1}^{T} \mathbb P\!\left( \ell_t(\theta_t) - R_t > \epsilon \right) \;\le\; \rho. $$

We bridge local recovery to global learnability: faster recovery or fewer shifts translate directly into a smaller violation rate ρ — a strong transfer principle.

$$ \rho \;\le\; \delta \;+\; \frac{(\tilde K_S(T)+1)\,\tau(\epsilon',\delta)}{T}, \qquad \epsilon' = \epsilon - \Lambda \Delta_W. $$

Finally, (ε, ρ)-learnability implies a bound on the expected cumulative dynamic regret, $\mathrm{Reg}(T) \le T(\epsilon + M\rho)$. This clarifies both the connection and the two intrinsic differences from online learning: persistent shifts ($\Delta_W=\Omega(1)$) force linear regret, and unlabeled supervision injects an alignment bias no recovery time can remove — while in the benign regime ($\Delta_W\to 0$) we recover the classical sublinear $o(T)$ regret.

Conclusion

We introduced the first learnability framework for test-time adaptation under unlabeled non-stationary streams, modeling global shifts via a quantized Wasserstein approximation and local dependence via a $\phi$-mixing process. Within this framework, we defined (ε, δ)-Recovery Complexity and derived nearly matching minimax lower and upper bounds, revealing how proxy alignment, target accuracy, and batch size govern recovery. Building on this, we introduced (ε, ρ)-TTA Learnability and connected it to dynamic regret, showing that persistent shifts force linear regret and that unlabeled supervision incurs an intrinsic cumulative cost. We hope this framework provides a principled foundation for future theoretical and algorithmic developments in test-time adaptation.

Takeaway. Recovery complexity is the building block of TTA learnability — it pins down when adaptation is possible and what fundamentally limits it.

Acknowledgements

This research was supported by the Jiangsu Science Foundation (BK20243012, BG2024036, BK20232003), the National Natural Science Foundation of China (Grant No. 624B2068, 62576162), the Fundamental Research Funds for the Central Universities (022114380023), and the "111 Center" (No. B26023).

BibTeX

@inproceedings{zhou26learnability,
  author       = {Zhi Zhou and Ming Yang and Shi-Yu Tian and Kun-Yang Yu and Lan-Zhe Guo and Yu-Feng Li},
  title        = {On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective},
  booktitle    = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year         = {2026}
}