In some cases, one may want to exclude extreme datasets from SBC (e.g. because those datasets cause divergences). It is best to use prior predictive checks to examine your priors and change them so that extreme datasets are avoided. Sometimes this is however impractical or impossible to achieve via prior choice alone - one example is regression coefficients: once we have many predictors, any independent prior that is not very strict will lead to unrealistic predictions. A joint prior is needed in such cases, but joint priors are currently neither well understood nor easy to use. See Paul Bürkner’s talk at the SBC StanConnect meeting for more context.

An alternative is to use rejection sampling: we repeatedly simulate a dataset and only accept it when it passes a condition we impose (e.g. that no observed count is larger than \(10^8\)) - the sketch below shows what such a generation loop can look like.
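
A minimal sketch of such a generation loop, using a hypothetical count-data model (the lognormal prior, the sample size and the cap of \(10^4\) - lower than the \(10^8\) above, simply so that rejections actually occur under this made-up prior - are all invented for illustration); the examples later in this vignette wrap the same pattern in SBC_generator_function:

# A hypothetical count-data generator: resample until the simulated
# dataset passes a check that depends only on the observed data y
generate_one_dataset <- function(N = 20) {
  repeat {
    lambda <- rlnorm(1, meanlog = 3, sdlog = 3) # wide prior on a rate
    y <- rpois(N, lambda)
    if(max(y) <= 1e4) { # the condition uses only y, never lambda itself
      break
    }
  }
  list(
    parameters = list(lambda = lambda),
    generated = list(N = N, y = y)
  )
}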

But does rejection sampling when generating datasets affect the validity of SBC? Thanks to forum user Niko Huurre, who derived the necessary math in the Stan Discourse discussion of the topic, we know exactly when it is OK. Briefly: for algorithms that only need to know the posterior density up to a constant (which includes Stan and many others), rejection sampling is fine as long as the rejection criterion uses only the observed data and not the unobserved parameters.

We’ll first walk through the math and then show examples of both OK and problematic rejection sampling.

The math

Let \(f\left(y\right)\) be the probability that the simulated dataset \(y\) is accepted (usually a 0-1 function if you have a clear idea what a “bad” dataset looks like, but it could be probabilistic if you’re relying on finicky diagnostics). The important quantities are the acceptance probability for parameter \(\theta\)

\[ L\left(\theta\right)=\int f\left(y\right)\pi\left(y|\theta\right)\mathrm{d}y \]

and the overall acceptance rate under the prior

\[ R=\iint f\left(y\right)\pi\left(y|\theta\right)\pi\left(\theta\right)\mathrm{d}y\mathrm{d}\theta=\int L\left(\theta\right)\pi\left(\theta\right)\mathrm{d}\theta \]

Rejecting the parameter draw when it generates a “bad” dataset effectively distorts the prior

\[ \pi\left(\theta\right)\to\frac{L\left(\theta\right)}{R}\pi\left(\theta\right) \]

and of course rejections change the generating distribution

\[ \pi\left(y|\theta\right)\to\frac{f\left(y\right)}{L\left(\theta\right)}\pi\left(y|\theta\right) \]

but crucially these changes cancel out when computing the posterior. Before rejections we have:

\[ \pi(\theta | y) \propto \pi(y | \theta) \pi(\theta) \]

After rejections we have

\[ \pi(\theta | y) \propto \frac{f(y)}{L(\theta)} \pi(y | \theta) \, \frac{L(\theta)}{R} \pi(\theta) = \frac{f(y)}{R} \pi(y | \theta) \pi(\theta) \]

Since \(\frac{f(y)}{R}\) is a constant for any given dataset (and hence for any given fit), the posterior that Stan (and most other MCMC algorithms) targets is unchanged, because Stan only needs the posterior density up to a constant. So whether or not we take the rejections into account, the model will match the generating process. However, if \(f\) also depended on \(\theta\), it would no longer contribute just a constant and we would get a mismatch between the generator and the model.
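
As a quick numerical sanity check of this cancellation, here is a small sketch (base R only) using the same simple normal model we will fit below - \(\mu \sim N(0, 2)\), \(y_i \sim N(\mu, 1)\) - and the data-based acceptance rule “accept only if the sample mean of \(y\) exceeds 5”, for which \(L(\mu)\) has a closed form:

# Model: mu ~ normal(0, 2), y_i ~ normal(mu, 1); accept iff mean(y) > 5
N <- 10
threshold <- 5

# L(mu) = P(mean(y) > threshold | mu), since mean(y) | mu ~ normal(mu, 1/sqrt(N))
L_fun <- function(mu) pnorm(threshold, mean = mu, sd = 1 / sqrt(N), lower.tail = FALSE)

# R = overall acceptance rate = integral of L(mu) * prior(mu)
R <- integrate(function(mu) L_fun(mu) * dnorm(mu, 0, 2), -Inf, Inf)$value

# Simulate one accepted dataset
set.seed(1234)
repeat {
  mu_true <- rnorm(1, 0, 2)
  y <- rnorm(N, mu_true, 1)
  if(mean(y) > threshold) break
}

log_lik <- function(mu) sapply(mu, function(m) sum(dnorm(y, m, 1, log = TRUE)))

mu_grid <- seq(3, 8, length.out = 50)
# Unnormalized posterior density, ignoring the rejections
post_plain <- dnorm(mu_grid, 0, 2) * exp(log_lik(mu_grid))
# Unnormalized posterior density built from the distorted prior and likelihood
post_reject <- (L_fun(mu_grid) / R) * dnorm(mu_grid, 0, 2) *
  (1 / L_fun(mu_grid)) * exp(log_lik(mu_grid))

# The ratio is constant (equal to 1 / R), so both define the same posterior
range(post_reject / post_plain)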

Practical examples

Let’s see whether this also holds in practice. First, let’s set up our environment:

library(SBC)

use_cmdstanr <- TRUE # Set to FALSE to use rstan instead

if(use_cmdstanr) {
  library(cmdstanr)
} else {
  library(rstan)
}

library(bayesplot)
library(posterior)

library(future)
plan(multisession) 

options(SBC.min_chunk_size = 10)

# Setup caching of results
cache_dir <- "./rejection_sampling_SBC_cache"
if(!dir.exists(cache_dir)) {
  dir.create(cache_dir)
}

We’ll use a very simple model throughout this vignette:

cat(readLines("stan/rejection_sampling.stan"), sep = "\n")
data {
   int<lower=0> N;
   real y[N];
}

parameters {
   real mu;
}

model {
   mu ~ normal(0, 2);
   y ~ normal(mu, 1);
}
if(use_cmdstanr) {
  backend <- SBC_backend_cmdstan_sample(cmdstan_model("stan/rejection_sampling.stan"), iter_warmup = 800, iter_sampling = 800)
} else {
  backend <- SBC_backend_rstan_sample(stan_model("stan/rejection_sampling.stan"), iter = 1600, warmup = 800)
}

No rejections

First, we’ll use a generator that matches the model exactly.

N <- 10
generator <- SBC_generator_function(function() {
   mu <- rnorm(1, 0, 2)
   list(
     parameters = list(mu = mu),
     generated = list(N = N, y = rnorm(N, mu, 1))
   )
})

So we expect the SBC to pass even with a large number of fits.

set.seed(2323455)
datasets <- generate_datasets(generator, 1000)
results <- compute_results(datasets, backend, keep_fits = FALSE, 
                    cache_mode = "results", 
                    cache_location = file.path(cache_dir, "no_rejections"))
## Results loaded from cache file 'no_rejections'
##  - 1 (0%) fits had at least one Rhat > 1.01. Largest Rhat was 1.011.
## Not all diagnostics are OK.
## You can learn more by inspecting $default_diagnostics, $backend_diagnostics 
## and/or investigating $outputs/$messages/$warnings for detailed output from the backend.
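
We can also inspect the rank plots, using the same plotting functions as in the later examples:

plot_ecdf_diff(results)

plot_rank_hist(results)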

Indeed, all looks good.

Rejection based on parameter values

Now let us modify the generator to reject based on parameter values.

generator_reject_param <- SBC_generator_function(function() {
   repeat {
    mu <- rnorm(1, 0, 2)
    # Accept the draw only if mu > 3, i.e. the rejection depends
    # on the unobserved parameter value
    if(mu > 3) {
      break
    }
   }
   list(
     parameters = list(mu = mu),
     generated = list(N = N, y = rnorm(N, mu, 1))
   )
})

We don’t even need to run very many fits to see the problem.

set.seed(21455)
datasets_reject_param <- generate_datasets(generator_reject_param, 200)
results_reject_param <- compute_results(datasets_reject_param, backend, keep_fits = FALSE, 
                    cache_mode = "results", 
                    cache_location = file.path(cache_dir, "reject_param"))
## Results loaded from cache file 'reject_param'
plot_ecdf_diff(results_reject_param)

plot_rank_hist(results_reject_param)

Indeed, we see a clear failure.

Rejection based on data

But what if we reject based on the values of the data? In theory this should result only in a constant change to the posterior density and thus not affect SBC. (SBC will, however, then check only the non-rejected part of the data space.) We will use a very aggressive rejection scheme: a dataset is accepted only when the sample mean of \(y\) exceeds 5, which rejects the vast majority (well over 99%) of simulated datasets.
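
The marginal acceptance probability is available in closed form here, since marginally the sample mean of \(y\) is normal with mean 0 and variance \(2^2 + 1/N\) (prior variance plus the variance of the sample mean), so we can check how aggressive the scheme really is:

# Probability that a simulated dataset is accepted (mean(y) > 5)
pnorm(5, mean = 0, sd = sqrt(2^2 + 1/10), lower.tail = FALSE)

The corresponding generator: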

generator_reject_y <- SBC_generator_function(function() {
   repeat {
    mu <- rnorm(1, 0, 2)
    y <- rnorm(N, mu, 1)
    # Accept the dataset only if mean(y) > 5 - the condition depends
    # solely on the observed data, not directly on mu
    if(mean(y) > 5) {
      break
    }
   }
   list(
     parameters = list(mu = mu),
     generated = list(N = N, y = y)
   )
})
set.seed(369654)
datasets_reject_y <- generate_datasets(generator_reject_y, 1000)
results_reject_y <- compute_results(datasets_reject_y, backend, keep_fits = FALSE, 
                    cache_mode = "results", 
                    cache_location = file.path(cache_dir, "reject_y"))
## Results loaded from cache file 'reject_y'
##  - 1 (0%) fits had at least one Rhat > 1.01. Largest Rhat was 1.01.
## Not all diagnostics are OK.
## You can learn more by inspecting $default_diagnostics, $backend_diagnostics 
## and/or investigating $outputs/$messages/$warnings for detailed output from the backend.
plot_rank_hist(results_reject_y)

plot_ecdf_diff(results_reject_y)

We see that even with quite heavy rejection based on \(y\), SBC passes, even at the relatively high resolution afforded by 1000 simulations.

Take home message

If our priors can sometimes generate unrealistic datasets, but we are unable to specify a better prior directly (e.g. because we would need some sort of joint prior), we can use rejection sampling to prune the unrealistic datasets, as long as we filter only by the observed data and do not directly use any unobserved parameter values. Notably, filtering out fits with divergences or other fitting issues is also just a (stochastic) function of the data and is thus permissible. The resulting SBC will, however, provide guarantees only for datasets that would not be rejected by the same criteria.
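
As an illustration of the last point, one could drop problematic fits after the fact and inspect only the remaining ranks. The sketch below is based on two assumptions to check against your installed SBC version: that the Stan backends report an n_divergent column in $backend_diagnostics, and that the plotting functions accept a plain data frame of per-simulation statistics.

# Simulations whose fit had at least one divergent transition
# (assumes the n_divergent column reported by the Stan backends)
divergent_sims <- with(results_reject_y$backend_diagnostics, sim_id[n_divergent > 0])

# Keep only the statistics of the clean fits; whether a fit diverges is a
# (stochastic) function of the data alone, so this acts like a data-based
# rejection rule
clean_stats <- subset(results_reject_y$stats, !(sim_id %in% divergent_sims))

# The plots should still look uniform, but now only certify the
# non-divergent part of the data space
plot_ecdf_diff(clean_stats)
plot_rank_hist(clean_stats)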