Error Bars as Degrees of Belief

Blurb published in November 2025

This piece explores how error bars can represent degrees of belief rather than just measurement uncertainty. Bayesian statistics embodies a broader philosophy about reasoning under uncertainty: it models how rational agents should update their beliefs when encountering new evidence. We start with prior beliefs, observe data, and arrive at posterior beliefs through a principled updating process. This framework makes assumptions explicit in a way that standard error bars often do not.

Specifically, we seek to understand this concept by interpreting this figure from the Anti-Scheming Training paper:

Error Bars as Probability Statements

The bottom panels show error bars around each measurement. These bars represent something different from what we typically encounter. Standard error bars often implicitly assume normally distributed sampling errors, with formulas that rely on asymptotic approximations. These assumptions can break down with small samples or extreme probabilities, and they remain hidden unless you deliberately examine them.

The error bars in the figure above take a Bayesian approach. To understand them, we start with Bayes' theorem: $$P(\theta | \text{data}) = \frac{P(\text{data} | \theta) \cdot P(\theta)}{P(\text{data})}$$

Here, \(\theta\) represents the parameter we want to estimate (in this case, the true covert action rate), and the equation describes how we update our beliefs after observing data. The left side, \(P(\theta | \text{data})\), is the posterior distribution. Asking for this posterior distribution is roughly equivalent to asking "given what I observed, what values of \(\theta\) are plausible?" This perspective differs fundamentally from frequentist statistics, which treats \(\theta\) as a fixed unknown value rather than a random variable with a distribution.

The Beta Distribution and Belief Updates

The paper uses Beta distributions to model uncertainty about covert action rates. A Beta distribution is a continuous probability distribution over the interval [0, 1]. For a rate parameter \(\theta\), the distribution is defined by: $$\text{Beta}(\theta | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1}$$ It's helpful to understand the fraction in front ($\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}$) as a normalizing constant that ensures the distribution integrates to 1.

The key part is \(\theta^{\alpha-1} (1-\theta)^{\beta-1}\), which determines the shape. We can think of \(\alpha\) as a count of observed successes and \(\beta\) as a count of observed failures. The mean of the distribution is \(\frac{\alpha}{\alpha + \beta}\), the proportion of successes. As \(\alpha + \beta\) grows, the distribution concentrates more tightly around this mean.
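A quick sanity check of this concentration behavior, sketched with scipy (the parameter values here are arbitrary illustrations, not from the paper):

```python
from scipy.stats import beta

# Hold the mean alpha/(alpha+beta) fixed at 0.2 while growing the total
# count alpha+beta; the standard deviation shrinks as counts accumulate.
for a, b in [(2, 8), (20, 80), (200, 800)]:
    dist = beta(a, b)
    print(f"Beta({a},{b}): mean {dist.mean():.2f}, std {dist.std():.3f}")
```

Each row prints the same mean of 0.20, but the spread narrows with each tenfold increase in counts.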

The Beta distribution has a special property that makes it particularly useful here. It is a conjugate prior for binomial data, which simply means that if we start with a Beta distribution as our prior belief and then observe binomial data (successes and failures), the posterior belief is also a Beta distribution. This means that it's quite simple to update our beliefs based on observations. If we start with a Beta prior and observe \(k\) successes out of \(n\) trials, the posterior is: $$\text{Beta}(\alpha_0 + k, \beta_0 + n - k)$$ We just add the observed successes to \(\alpha_0\) and the failures to \(\beta_0\).
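The conjugate update above reduces to simple addition. A minimal sketch (the counts are the running example from this post, not figures from the paper):

```python
def update_beta(alpha0, beta0, k, n):
    """Conjugate update: Beta(alpha0, beta0) prior plus k successes
    out of n binomial trials yields a Beta posterior."""
    return alpha0 + k, beta0 + (n - k)

# Uniform Beta(1, 1) prior, then 8 successes in 100 trials.
alpha, beta_param = update_beta(1, 1, k=8, n=100)
print(alpha, beta_param)  # -> 9 93
```

No numerical integration is needed; conjugacy turns Bayes' theorem into bookkeeping.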

In the paper, the authors use Beta(1, 1) as their prior, representing a uniform distribution where all rates from 0% to 100% are equally plausible before seeing data. This choice is transparent and can be scrutinized. The uniform prior adds minimal information (equivalent to observing one success and one failure), letting the data dominate. After observing 8 covert actions in 100 trials for example, the posterior becomes Beta(1+8, 1+92) = Beta(9, 93).

The 95% credible interval contains 95% of the probability mass of this distribution. For Beta(9, 93), this interval spans roughly 4% to 15%, meaning there is a 95% probability the true rate lies in this range given the observed data. This is a fundamentally different interpretation from a frequentist confidence interval. A frequentist 95% confidence interval means "if I repeated this experiment many times and calculated an interval each time, 95% of those intervals would contain the true parameter." A Bayesian 95% credible interval means "there is a 95% probability that the parameter lies in this interval, given the data I observed." The Bayesian statement is a direct probability claim about the parameter itself.
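Computing an equal-tailed credible interval is a pair of quantile lookups. A sketch using scipy (one of several reasonable choices; the paper does not specify its tooling):

```python
from scipy.stats import beta

# Posterior from the running example: Beta(9, 93) after observing
# 8 covert actions in 100 trials under a Beta(1, 1) prior.
posterior = beta(9, 93)

# Equal-tailed 95% credible interval: cut 2.5% of the probability
# mass from each tail of the posterior.
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% credible interval: [{lo:.1%}, {hi:.1%}]")
```

The result lands in the roughly-4%-to-15% range quoted above.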

Using Monte Carlo Sampling to Aggregate Uncertainty

The top panels in the paper's figure also follow this Bayesian intuition. They represent the distribution of average covert action rates across multiple test environments. The paper evaluates the model on many different test scenarios, each producing its own Beta posterior distribution. To understand the overall performance, we need to aggregate these distributions while properly accounting for uncertainty in each one.

This is where Monte Carlo sampling becomes useful. For each test environment, we have a Beta distribution representing our belief about that environment's true rate. We repeatedly sample one plausible rate from each environment's distribution, average these samples together, and record the result. After many iterations, the collection of these averages forms a distribution that represents our uncertainty about the overall average rate.

More precisely, the process works as follows. On each iteration, draw one sample from each environment's Beta posterior, compute the mean of those samples, and record it. Repeating this for many iterations yields a collection of recorded averages that approximates the distribution of the overall average rate. Plotting these values produces the density curves shown in the top panels.
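The procedure above can be sketched in a few lines of numpy. The per-environment counts here are invented for illustration; the paper's actual environments and counts are not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (successes, trials) counts per test environment.
counts = [(8, 100), (0, 10), (3, 50), (12, 200)]

# Beta(1, 1) prior -> posterior Beta(1 + k, 1 + n - k) per environment.
posteriors = [(1 + k, 1 + n - k) for k, n in counts]

n_draws = 100_000
# Each column: one plausible rate per environment; average the columns.
samples = np.stack([rng.beta(a, b, size=n_draws) for a, b in posteriors])
avg_rates = samples.mean(axis=0)  # distribution of the overall average

print(f"mean {avg_rates.mean():.1%}, 95% interval "
      f"[{np.percentile(avg_rates, 2.5):.1%}, "
      f"{np.percentile(avg_rates, 97.5):.1%}]")
```

A kernel density estimate or histogram of `avg_rates` would reproduce the shape of the density curves in the top panels.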

Benefits of This Approach

The error bars in the paper reflect this Bayesian perspective. Each bar shows a range of plausible values for the true covert action rate, with the width of the bar representing our degree of uncertainty. Wider bars indicate more uncertainty, narrower bars indicate more certainty.

The Bayesian approach handles small sample sizes gracefully. When observing 0 covert actions out of 10 trials, a frequentist estimate gives 0% with a standard error of 0, suggesting perfect certainty. The Bayesian approach with a Beta(1,1) prior yields Beta(1, 11), which has mean 8.3% and substantial uncertainty. This reflects the reality that 10 trials provide limited evidence.
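The zero-count case makes the contrast concrete. A sketch (again assuming scipy; the naive Wald standard error shown here is one common frequentist formula, not the only one):

```python
from scipy.stats import beta

k, n = 0, 10  # zero covert actions observed in ten trials

# Frequentist point estimate and naive (Wald) standard error both
# collapse to zero, suggesting spurious certainty.
p_hat = k / n
se = (p_hat * (1 - p_hat) / n) ** 0.5
print(p_hat, se)  # both 0.0

# Bayesian posterior under a uniform Beta(1, 1) prior: Beta(1, 11).
posterior = beta(1 + k, 1 + n - k)
print(f"posterior mean {posterior.mean():.1%}")          # about 8.3%
print(f"95% credible upper bound {posterior.ppf(0.975):.1%}")
```

The posterior's upper credible bound remains well above zero, honestly reflecting how little ten trials can rule out.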

The conjugacy of the Beta distribution with binomial data makes updates computationally trivial, which matters when aggregating results across many test environments. Monte Carlo sampling provides a flexible way to propagate uncertainty through simple aggregations, and it scales naturally to more complex scenarios with widely varying degrees of uncertainty.