Dimitriadis et al. (2021) propose tests based on comparing the actual predictions with recalibrated predictions, i.e. the predictions one would obtain if the probabilities were correctly calibrated. This yields several possible tests of correct calibration (that the observed proportion of successes matches the predicted probability).

brier_resampling_test(x, y, alpha = 0.05, B = 10000)

brier_resampling_p(x, y, B = 10000)

binary_miscalibration(x, y)

miscalibration_resampling_p(x, y, B = 10000)

miscalibration_resampling_test(x, y, alpha = 0.05, B = 10000)

Arguments

x

the predicted success probabilities

y

the observed binary outcomes (0 or 1)

alpha

the type I error rate for the test

B

the number of bootstrap samples used to approximate the null distribution

Value

brier_resampling_test and miscalibration_resampling_test return an object of class "htest". brier_resampling_p and miscalibration_resampling_p return only the p-value (for easier use in automated workflows). binary_miscalibration returns just the miscalibration component, computed with the PAV (pool adjacent violators) algorithm.
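As a rough illustration of what such a miscalibration component can look like, the sketch below assumes a CORP-style definition (the mean Brier score of the original predictions minus that of the PAV-recalibrated predictions) and ignores subtleties such as tied predicted probabilities; the function name binary_miscalibration_sketch is hypothetical and this is not necessarily the package's exact implementation.

# Hypothetical sketch, not the package's code: miscalibration as the reduction
# in Brier score achieved by PAV recalibration (assumed CORP-style definition).
binary_miscalibration_sketch <- function(x, y) {
  ord <- order(x)                          # sort by predicted probability
  recal_sorted <- stats::isoreg(y[ord])$yf # PAV: isotonic fit of outcomes vs. sorted predictions
  recal <- numeric(length(x))
  recal[ord] <- recal_sorted               # recalibrated probabilities in the original order
  mean((x - y)^2) - mean((recal - y)^2)    # Brier score of x minus Brier score of recalibrated x
}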

Details

The brier_ functions test calibration using the Brier score as the test statistic, while the miscalibration_ functions use the miscalibration component. In both cases the null distribution is approximated by bootstrap resampling.
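For illustration, the following sketch simulates predictions and outcomes and applies the documented functions; the data-generating choices are arbitrary and B is reduced only to keep the example fast.

# Calibrated case: outcomes drawn from the predicted probabilities
set.seed(1)
x <- runif(200)                    # predicted success probabilities
y <- rbinom(200, 1, x)             # outcomes generated from those probabilities
brier_resampling_test(x, y, alpha = 0.05, B = 2000)
miscalibration_resampling_p(x, y, B = 2000)

# Miscalibrated case: outcomes drawn from distorted probabilities
y_bad <- rbinom(200, 1, x^2)
brier_resampling_p(x, y_bad, B = 2000)   # p-value should tend to be small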

References

T. Dimitriadis, T. Gneiting, & A. I. Jordan, Stable reliability diagrams for probabilistic classifiers, Proc. Natl. Acad. Sci. U.S.A. 118 (8) e2016191118, https://doi.org/10.1073/pnas.2016191118 (2021).