NSCORE: Nonparametric Sequential Comparison for Rigorous Evaluation
TL;DR: We introduce NSCORE, a finite-sample-valid nonparametric sequential testing procedure for determining statistically significant performance improvements between robotic policies. NSCORE requires a near-minimal number of evaluation trials in expectation while providing tunable, rigorously controlled Type-1 error. Despite requiring only minimal assumptions on the distribution of the evaluation metric, it can rapidly adapt to parametric structure without sacrificing validity.
The Policy Comparison Problem
We motivate NSCORE and related evaluation procedures by making concrete the specific research questions NSCORE answers and the specific error risks it allows the researcher to control.
Assume that a robotics researcher proposes a novel methodology: a new training dataset, a new architecture, a new surrogate loss or regularization, etc. Using this new method, they will synthesize a robot policy, termed either π1 or πB. For a given (set of) task(s), there will be a natural baseline -- the previous state-of-the-art policy, termed π0 or πA. To assess the result of complex or abstract changes in methodology, the researcher must demonstrate empirical improvement of π1 over π0. There is a natural set of objectives in making this comparison:
- Type-1 Error: The risk that the researcher concludes π1 is better than π0, when it is not better (FALSE POSITIVE).
- Type-2 Error: The risk that the researcher concludes π1 is not better than π0, when it is better (FALSE NEGATIVE).
- Expected Sample Complexity: The number of evaluation trials needed (on average) to make a decision at a given level of Type-1 and Type-2 Error.
In short, a good comparison procedure is:
- Not too optimistic (controls Type-1 Error).
- Not too pessimistic (minimizes Type-2 Error).
- Not too slow (minimizes expected sample complexity).
NSCORE formulation
We consider the problem of policy comparison for two robot policies using any bounded evaluation metric. We consider the quantity of interest to be the mean performance of the policy; this yields a canonical statistical testing problem of mean comparison, corresponding to null (red) and alternative (blue) hypotheses as shown below:
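In symbols (a standard formulation of one-sided mean comparison; the notation μ0, μ1 for the mean performance of π0 and π1 is ours, since the original figure is not reproduced here):

```latex
H_0:\ \mu_1 \le \mu_0 \qquad \text{vs.} \qquad H_1:\ \mu_1 > \mu_0
```

Rejecting H0 at level α is the statistically rigorous sense in which the researcher may conclude π1 improves on π0.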
Approach overview
NSCORE draws on intuition from safe, anytime-valid inference (SAVI) to construct a dynamical system driven by the evaluation outcomes. The system is designed to be stable (in expectation) under the null hypothesis and unstable under the alternative. Ville's inequality then provides a rigorous way to translate the system's stability or instability into interpretable evidence for the null or the alternative, respectively. Efficiency comes down to constructing a `good' dynamical system.
Intuitively: NSCORE creates an adaptive dynamical system that grows maximally quickly when the novel policy's mean is greater (i.e., in the blue region), at a rate dependent on how close the means are, and decays maximally quickly when the novel policy's mean is lower (i.e., in the red region), again at a rate dependent on how close the means are. This is accomplished via a betting mechanism that allows the system to adapt online:
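The betting intuition can be sketched in a few lines. This is a minimal illustration of a SAVI-style betting test on paired score differences in [-1, 1], not the exact adaptive scheme used by NSCORE; the function name `betting_test` and the simple plug-in bet-sizing rule are our own placeholders:

```python
def betting_test(diffs, alpha=0.05, lam_max=0.5):
    """Sequential test of H0: E[diff] <= 0, for paired differences in [-1, 1].

    Wealth W_t = prod_t (1 + lam_t * d_t) is a nonnegative supermartingale
    under H0 whenever each bet lam_t in [0, lam_max] depends only on past
    outcomes. Ville's inequality then gives P(sup_t W_t >= 1/alpha) <= alpha
    under H0, so crossing 1/alpha is a valid (anytime) rejection rule.
    """
    wealth = 1.0
    running_mean, running_var, n = 0.0, 0.25, 0
    for t, d in enumerate(diffs, start=1):
        # Bet sized from *past* data only (predictable), clipped for safety;
        # a crude mean/variance plug-in stands in for the adaptive mechanism.
        lam = max(0.0, min(lam_max, running_mean / (running_var + 1e-2)))
        wealth *= 1.0 + lam * d
        if wealth >= 1.0 / alpha:
            return "reject H0 (new policy better)", t
        # Update running statistics with the newly observed difference.
        n += 1
        delta = d - running_mean
        running_mean += delta / n
        running_var += (delta * (d - running_mean) - running_var) / n
    return "no decision", None
```

Note the asymmetry the text describes: when the observed differences favor the new policy, the bet (and hence the growth rate) ramps up; when they favor the baseline, the bet is driven to zero and the wealth never accumulates.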
NSCORE and Baselines
Statistically rigorous policy comparison procedures can be organized according to the degree of structural assumptions required for statistical validity (i.e., Type-1 Error Control). We will show that, excepting the specific case of binary success/failure metrics (considered in prior work), NSCORE matches or exceeds existing SOTA evaluation methods.
First, we demonstrate that NSCORE2 closely approximates the structured Θ-SAVI method when structure is present. We illustrate this first on synthetic Bernoulli data, where we confirm that STEP is SOTA, but show that NSCORE∞ (which requires no structural assumptions) approximates its performance in terms of expected time-to-decision. NSCORE∞ denotes the case with a minimal structural prior, which incurs roughly a 4-5% penalty in time-to-decision. NSCORE2 denotes the case where NSCORE is given the same parametric information as Θ-SAVI, and we observe that performance is exactly recovered (the times agree up to statistical noise). Note, however, that NSCORE2 remains valid if the data is not binary (it is simply less efficient), whereas Θ-SAVI does not remain valid in that setting.
We next compare NSCORE2 and Θ-SAVI on the TRI LBM hardware data from Barreiros et al., 2025, using the structured binary success-failure metrics (right-hand side). We observe exact overlap in time-to-decision for each method on each task, illustrating the near-equivalence of the methods when the data is known to be Bernoulli a priori.
Next, we compare fully nonparametric methods on general progress metrics by comparing NSCORE∞ with the WSR procedure. On simulated data from progress metrics of policies trained on the MuJoCo RL baselines (using the CleanRL implementation), we observe the hypothesized Pareto dominance: when the comparisons are easy, NSCORE∞ and WSR are similar; when the comparisons are hard, NSCORE∞ significantly outperforms WSR.
Finally, we compare NSCORE∞ and WSR on the RoboArena hardware data from Atreya et al., 2025, simultaneously comparing four policies to effect a complete, statistically valid ranking. Each policy has 641 rollouts in the dataset; data for each policy is added until every pairwise comparison including that policy has terminated. Note how quickly PG-Bin is separated from every other method (18 trials or fewer per comparison); sequential comparison allows for active decisions over which data to collect, and the benefits are amplified in multi-hypothesis settings (i.e., with more than 2 policies). The total number of requisite evaluations is obtained by summing the numbers in each figure. Of the 2560 total evals in the dataset, NSCORE∞ requires only 1440 for a full ranking; WSR requires 1880, and fails to separate PG-Diff from π0.
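The per-pair early stopping that drives these savings can be sketched as follows. This is a toy illustration with a fixed bet size rather than the NSCORE∞ betting scheme; `rank_policies` and the score-generating callables are hypothetical stand-ins for evaluation rollouts:

```python
import itertools
import random

def rank_policies(sample_fns, alpha=0.05, max_trials=2000, lam=0.25):
    """Sketch of multi-policy sequential comparison with per-pair stopping.

    `sample_fns` maps policy name -> zero-arg callable returning a score
    in [0, 1]. Each unordered pair maintains two one-sided betting wealth
    processes on the paired score difference; a pair is resolved the moment
    either wealth crosses 1/alpha (valid by Ville's inequality), after which
    no further rollouts are spent on it -- the 'active' data collection
    described in the text.
    """
    wealth = {p: [1.0, 1.0] for p in itertools.combinations(sorted(sample_fns), 2)}
    resolved = {}
    trials = {name: 0 for name in sample_fns}
    for _ in range(max_trials):
        open_pairs = [p for p in wealth if p not in resolved]
        if not open_pairs:
            break                              # full ranking obtained
        pair = random.choice(open_pairs)       # spend data only where needed
        a, b = pair
        d = sample_fns[a]() - sample_fns[b]()  # paired difference in [-1, 1]
        trials[a] += 1
        trials[b] += 1
        wealth[pair][0] *= 1.0 + lam * d       # evidence that a beats b
        wealth[pair][1] *= 1.0 - lam * d       # evidence that b beats a
        if wealth[pair][0] >= 1.0 / alpha:
            resolved[pair] = a
        elif wealth[pair][1] >= 1.0 / alpha:
            resolved[pair] = b
    return resolved, trials
```

Because each pair stops independently, clearly separated pairs (like PG-Bin above) terminate after a handful of trials, and the freed budget flows to the hard comparisons.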
Sensitivity to α
The dynamical system intuition allows us to formalize the logarithmic dependence of the expected time-to-decision on the Type-1 Error level, α. For a given evaluation problem (or distribution of evaluation problems), there exists an aggregate expected growth rate, so the expected time-to-decision scales logarithmically in 1/α. We demonstrate this on synthetic nonparametric data, regressing NSCORE∞ time-to-decision against α. For this distribution of evaluation problems we estimate an aggregate expected growth rate of approximately 1.103, corresponding to an expected increase of 23.4 additional evals per 10x reduction in α.
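One consistent reading of these numbers (assuming, as our interpretation, that the growth rate is the expected per-trial multiplicative wealth factor g): a decision requires the wealth to reach 1/α, so

```latex
\mathbb{E}\,[T(\alpha)] \;\approx\; \frac{\log(1/\alpha)}{\log g},
\qquad
\Delta T_{10\times} \;\approx\; \frac{\ln 10}{\ln 1.103} \;\approx\; 23.5,
```

which matches the reported 23.4 additional evals per 10x reduction in α up to rounding and regression error.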