NSCORE: Nonparametric Sequential Comparison for Rigorous Evaluation
TL;DR: We introduce NSCORE, a finite-sample-valid nonparametric sequential testing procedure for determining statistically significant performance improvements between robotic policies. NSCORE requires a near-minimal number of evaluation trials in expectation while providing tunable, rigorously controlled Type-1 error. Despite requiring only minimal assumptions on the distribution of the evaluation metric, it can rapidly adapt to parametric structure without sacrificing validity.
The Policy Comparison Problem
We motivate NSCORE and related evaluation procedures by making concrete the specific research questions NSCORE answers and the specific error risks it allows the researcher to control.
Assume that a robotics researcher proposes a novel methodology: a new training dataset, a new architecture, a new surrogate loss or regularization, etc. Using this new method, they will synthesize a robot policy, termed either π1 or πB. For a given (set of) task(s), there will be a natural baseline -- the previous state-of-the-art policy, termed π0 or πA. To assess the result of complex or abstract changes in methodology, the researcher must demonstrate empirical improvement of π1 over π0. There is a natural set of objectives in making this comparison:
- Type-1 Error: The risk that the researcher concludes π1 is better than π0, when it is not better (FALSE POSITIVE).
- Type-2 Error: The risk that the researcher concludes π1 is not better than π0, when it is better (FALSE NEGATIVE).
- Expected Sample Complexity: The number of evaluation trials needed (on average) to make a decision at a given level of Type-1 and Type-2 Error.
In short, a good comparison procedure is:
- Not too optimistic (controls Type-1 Error).
- Not too pessimistic (minimizes Type-2 Error).
- Not too slow (minimizes expected sample complexity).
NSCORE formulation
We consider the problem of policy comparison for two robot policies using any bounded evaluation metric. We consider the quantity of interest to be the mean performance of the policy; this yields a canonical statistical testing problem of mean comparison, corresponding to null (red) and alternative (blue) hypotheses as shown below:
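In symbols (a standard formulation of one-sided mean comparison; the notation μ0, μ1 for the mean performance of π0 and π1 is ours, since the original figure is not reproduced here):

```latex
H_0:\ \mu_1 \le \mu_0 \qquad \text{vs.} \qquad H_1:\ \mu_1 > \mu_0
```

Rejecting H0 at level α is the statistically rigorous sense in which the researcher may conclude π1 improves on π0.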
Approach overview
NSCORE draws on intuition from safe, anytime-valid inference (SAVI) to construct a dynamical system driven by the evaluation outcomes. The system is designed to be stable (in expectation) under the null hypothesis and unstable under the alternative. Ville's inequality then provides a rigorous way to translate the system's stability or instability into interpretable evidence for the null or the alternative, respectively. Efficiency comes down to constructing a `good' dynamical system.
Intuitively: NSCORE creates an adaptive dynamical system that grows maximally quickly when the novel policy's mean is greater (i.e., in the blue region), at a rate dependent on how close the means are, and decays maximally quickly when the novel policy's mean is lower (i.e., in the red region), again at a rate dependent on how close the means are. This is accomplished via a betting mechanism that allows the system to adapt online:
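The betting intuition can be sketched in a few lines. This is a minimal illustration of a SAVI-style betting test on paired score differences in [-1, 1], not the exact adaptive scheme used by NSCORE; the function name `betting_test` and the simple plug-in bet-sizing rule are our own placeholders:

```python
def betting_test(diffs, alpha=0.05, lam_max=0.5):
    """Sequential test of H0: E[diff] <= 0, for paired differences in [-1, 1].

    Wealth W_t = prod_t (1 + lam_t * d_t) is a nonnegative supermartingale
    under H0 whenever each bet lam_t in [0, lam_max] depends only on past
    outcomes. Ville's inequality then gives P(sup_t W_t >= 1/alpha) <= alpha
    under H0, so crossing 1/alpha is a valid (anytime) rejection rule.
    """
    wealth = 1.0
    running_mean, running_var, n = 0.0, 0.25, 0
    for t, d in enumerate(diffs, start=1):
        # Bet sized from *past* data only (predictable), clipped for safety;
        # a crude mean/variance plug-in stands in for the adaptive mechanism.
        lam = max(0.0, min(lam_max, running_mean / (running_var + 1e-2)))
        wealth *= 1.0 + lam * d
        if wealth >= 1.0 / alpha:
            return "reject H0 (new policy better)", t
        # Update running statistics with the newly observed difference.
        n += 1
        delta = d - running_mean
        running_mean += delta / n
        running_var += (delta * (d - running_mean) - running_var) / n
    return "no decision", None
```

Note the asymmetry the text describes: when the observed differences favor the new policy, the bet (and hence the growth rate) ramps up; when they favor the baseline, the bet is driven to zero and the wealth never accumulates.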
NSCORE and Baselines
Statistically rigorous policy comparison procedures can be organized according to the degree of structural assumptions required for statistical validity (i.e., Type-1 Error Control). We will show that, excepting the specific case of binary success/failure metrics (considered in prior work), NSCORE matches or exceeds existing SOTA evaluation methods.
First, we demonstrate that NSCORE2 closely approximates the structured Θ-SAVI method when structure is present. We illustrate this first on synthetic Bernoulli data, where we confirm that STEP is SOTA, but show that NSCORE∞ (which requires no structural assumptions) approximates its performance in terms of expected time-to-decision. NSCORE∞ denotes the case with a minimal structural prior, which incurs roughly a 4-5% penalty in time-to-decision. NSCORE2 denotes the case where NSCORE is given the same parametric information as Θ-SAVI, and we observe that performance is exactly recovered (the times agree up to statistical noise). Note, however, that NSCORE2 remains valid if the data is not binary (it is simply less efficient), whereas Θ-SAVI does not remain valid in that setting.
We next compare NSCORE2 and Θ-SAVI on the TRI LBM hardware data from Barreiros et al., 2025, using the structured binary success-failure metrics (right-hand side). We observe exact overlap in time-to-decision for each method on each task, illustrating the near-equivalence of the methods when the data is known to be Bernoulli a priori.
Next, we compare fully nonparametric methods on general progress metrics by comparing NSCORE∞ with the WSR procedure. On simulated data from progress metrics of policies trained on the MuJoCo RL baselines (using the CleanRL implementation), we observe the hypothesized Pareto dominance: when the comparisons are easy, NSCORE∞ and WSR are similar; when the comparisons are hard, NSCORE∞ significantly outperforms WSR.
Finally, we compare NSCORE∞ and WSR on the RoboArena hardware data from Atreya et al., 2025, simultaneously comparing four policies to effect a complete, statistically valid ranking. Each policy has 641 rollouts in the dataset; data for each policy is added until every pairwise comparison including that policy has terminated. Note how quickly PG-Bin is separated from every other method (18 trials or fewer per comparison); sequential comparison allows for active decisions over which data to collect, and the benefits are amplified in multi-hypothesis settings (i.e., with more than 2 policies). The total number of requisite evaluations is obtained by summing the numbers in each figure. Of the 2560 total evals in the dataset, NSCORE∞ requires only 1440 for a full ranking; WSR requires 1880, and fails to separate PG-Diff from π0.
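The per-pair early stopping that drives these savings can be sketched as follows. This is a toy illustration with a fixed bet size rather than the NSCORE∞ betting scheme; `rank_policies` and the score-generating callables are hypothetical stand-ins for evaluation rollouts:

```python
import itertools
import random

def rank_policies(sample_fns, alpha=0.05, max_trials=2000, lam=0.25):
    """Sketch of multi-policy sequential comparison with per-pair stopping.

    `sample_fns` maps policy name -> zero-arg callable returning a score
    in [0, 1]. Each unordered pair maintains two one-sided betting wealth
    processes on the paired score difference; a pair is resolved the moment
    either wealth crosses 1/alpha (valid by Ville's inequality), after which
    no further rollouts are spent on it -- the 'active' data collection
    described in the text.
    """
    wealth = {p: [1.0, 1.0] for p in itertools.combinations(sorted(sample_fns), 2)}
    resolved = {}
    trials = {name: 0 for name in sample_fns}
    for _ in range(max_trials):
        open_pairs = [p for p in wealth if p not in resolved]
        if not open_pairs:
            break                              # full ranking obtained
        pair = random.choice(open_pairs)       # spend data only where needed
        a, b = pair
        d = sample_fns[a]() - sample_fns[b]()  # paired difference in [-1, 1]
        trials[a] += 1
        trials[b] += 1
        wealth[pair][0] *= 1.0 + lam * d       # evidence that a beats b
        wealth[pair][1] *= 1.0 - lam * d       # evidence that b beats a
        if wealth[pair][0] >= 1.0 / alpha:
            resolved[pair] = a
        elif wealth[pair][1] >= 1.0 / alpha:
            resolved[pair] = b
    return resolved, trials
```

Because each pair stops independently, clearly separated pairs (like PG-Bin above) terminate after a handful of trials, and the freed budget flows to the hard comparisons.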
Sensitivity to α
The dynamical system intuition allows us to formalize the logarithmic dependence of the expected time-to-decision on the Type-1 Error level, α. For a given evaluation problem (or distribution of evaluation problems), there exists an aggregate expected growth rate, so the expected time-to-decision scales logarithmically in 1/α. We demonstrate this on synthetic nonparametric data, regressing NSCORE∞ time-to-decision against α. For this distribution of evaluation problems we estimate an aggregate expected growth rate of approximately 1.103, corresponding to an expected increase of 23.4 additional evals per 10x reduction in α.
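One consistent reading of these numbers (assuming, as our interpretation, that the growth rate is the expected per-trial multiplicative wealth factor g): a decision requires the wealth to reach 1/α, so

```latex
\mathbb{E}\,[T(\alpha)] \;\approx\; \frac{\log(1/\alpha)}{\log g},
\qquad
\Delta T_{10\times} \;\approx\; \frac{\ln 10}{\ln 1.103} \;\approx\; 23.5,
```

which matches the reported 23.4 additional evals per 10x reduction in α up to rounding and regression error.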