
Importance Sampling (and why the name is misleading)

We call it ‘importance,’ but the algorithm never picks ‘important’ points; it re-weights ordinary ones. A better name would have been something like ‘occurrence adjustment,’ grounded in the relationship between samples/sampling and likelihood, and in why we are doing what we are doing.

When we are averaging stuff (say $f(X)$) in real life, we are actually estimating an expectation of a function, $f(\cdot)$, with respect to a random variable, $X$, that is distributed by $P(X)$. In equal but different words, (if you can collect/create observations) you are sampling $X$ from $P(X)$. That literally means you will have more observations at some points than at others, at rates governed by $P(X)$.

But what if you can’t sample from $P(X)$ directly? If you sample from some other distribution $Q(X)$ instead, you’re getting the “wrong” occurrence rates. Values that should appear frequently under $P$ might appear rarely under $Q$, and vice versa.

The fix is simple: adjust each sample by the ratio $\frac{P(X)}{Q(X)}$.

This ratio tells you how much more (or less) likely that value is under $P$ compared to $Q$. If $\frac{P(X)}{Q(X)} = 3$, that sample should “count 3 times more” to correct for it being underrepresented in your $Q$-samples. This is literally occurrence adjustment: you’re fixing the mismatch between how often values actually appeared (under $Q$) versus how often they should have appeared (under $P$).
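As a toy illustration (my own numbers, not from the post), here is a minimal Python sketch of that re-weighting, assuming a standard normal target $P$ and a wider normal proposal $Q$:

```python
# Minimal sketch of occurrence adjustment (importance weighting).
# Assumed setup, purely illustrative: P = N(0, 1) is the target, Q = N(0, 2^2) the proposal.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

P = norm(loc=0.0, scale=1.0)   # distribution we actually care about
Q = norm(loc=0.0, scale=2.0)   # distribution we can sample from

x = Q.rvs(size=5, random_state=rng)   # samples arrive at Q's occurrence rates
w = P.pdf(x) / Q.pdf(x)               # occurrence adjustment: P(x) / Q(x)

for xi, wi in zip(x, w):
    print(f"x = {xi:+.2f}  weight = {wi:.2f}")
```

Samples near the origin come out with weights above 1 (they are underrepresented under the wide $Q$), while samples far in the tails get weights below 1.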

Hidden Requirements

This method implicitly assumes two critical things that are often glossed over:

Matching support: If $Q$ assigns zero probability somewhere $P$ doesn’t, the ratio $\frac{P(X)}{Q(X)}$ is undefined there and, worse, that region is never sampled at all, so the estimator silently drops part of the expectation; $Q$’s support must cover $P$’s (see the sketch after this list).

Computable likelihoods: You need to evaluate both $P(X)$ and $Q(X)$ for every sample. This seems obvious, but it’s a huge constraint; we often resort to sampling methods such as MCMC precisely because we can’t compute these densities easily.
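To make the support requirement concrete, here is a hypothetical sketch (my own example, not from the post) where $Q$ is uniform on $[-1, 1]$ while $P$ is a standard normal, so $Q$ never produces samples in $P$’s tails:

```python
# Hypothetical sketch of the support-mismatch failure mode.
# P = N(0, 1); Q = Uniform(-1, 1) has zero density outside [-1, 1],
# so P's tails are never sampled and the estimate of E_P[X^2] is biased low.
import numpy as np
from scipy.stats import norm, uniform

rng = np.random.default_rng(0)
P = norm(0.0, 1.0)
Q = uniform(loc=-1.0, scale=2.0)   # support is only [-1, 1]

x = Q.rvs(size=100_000, random_state=rng)
w = P.pdf(x) / Q.pdf(x)
print(np.mean(w * x**2))   # ~0.20, far from the true E_P[X^2] = 1
```

No weight ever blows up in this run, which is exactly what makes the failure dangerous: the estimator just quietly converges to the wrong number.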

There’s a beautiful duality here: if you can compute $P(X)$ but can’t sample from it, importance sampling provides a backdoor. Just find ANY distribution $Q$ that you can both sample from AND evaluate, and you’ve converted your “likelihood evaluation ability” into “sampling ability.” In theory, this creates an equivalence between these two capabilities.

Life is not so easy, though. While you keep all samples, samples with tiny importance weights $\frac{P(X)}{Q(X)} \approx 0$ contribute essentially nothing. You’ve spent computational effort generating them, evaluating densities, computing $f(X)$, only to multiply by $\approx 0$. These “zombie samples” exist but don’t matter, making your effective sample size much smaller than your actual sample count:

$$\text{ESS} = \frac{\bigl(\sum_i w_i\bigr)^2}{\sum_i w_i^2} \quad\text{with } w_i = \frac{P(X_i)}{Q(X_i)}.$$
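In code, the ESS is a one-liner on the weight vector; a sketch with made-up weights:

```python
# Minimal sketch: effective sample size (ESS) from importance weights.
import numpy as np

def effective_sample_size(weights) -> float:
    """ESS = (sum w)^2 / sum(w^2); equals n when all weights are equal."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))     # 4.0: every sample counts
print(effective_sample_size([1.0, 1e-6, 1e-6, 1e-6]))  # ~1.0: three zombie samples
```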

In policy-gradient RL, $P$ is the new policy and $Q$ is the behavior policy, so importance sampling lets us reuse old roll-outs.
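As a hedged sketch of that idea, here is a single-step (bandit-style) version with made-up action probabilities; real policy-gradient methods apply the same ratio per action or per trajectory:

```python
# Minimal sketch: off-policy estimate of a new policy's expected reward.
# Assumptions (illustrative only): one state, three discrete actions, one-step episodes.
import numpy as np

pi_new      = np.array([0.7, 0.2, 0.1])   # P: policy we want to evaluate
pi_behavior = np.array([0.3, 0.3, 0.4])   # Q: policy that generated the roll-outs

rng = np.random.default_rng(0)
actions = rng.choice(3, size=10_000, p=pi_behavior)   # old roll-outs from Q
rewards = np.where(actions == 0, 1.0, 0.0)            # toy reward: action 0 pays 1

weights = pi_new[actions] / pi_behavior[actions]      # P(a) / Q(a) per sample
print(np.mean(weights * rewards))                     # ~0.7, the new policy's value
```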

The name “importance sampling” obscures this elegant mechanism. We’re not sampling “important” regions; we’re correcting occurrence rates. “Likelihood-weighted sampling” or “distribution correction sampling” would immediately convey what’s actually happening: you sampled from the wrong distribution, so you reweight by $\frac{P(X)}{Q(X)}$ to fix it.

The Math (or: How this magic actually works)

So we want $\mathbb{E}_P[f(X)]$ but can only sample from $Q$. Here’s the trick in all its glory:

Start with what you want:

$$\mathbb{E}_P[f(X)] = \int f(X)\,P(X)\,dX$$

Now multiply by 1, that is, by $Q(X)/Q(X)$:

$$\mathbb{E}_P[f(X)] = \int f(X)\,P(X) \cdot \frac{Q(X)}{Q(X)}\,dX = \int f(X) \cdot \frac{P(X)}{Q(X)} \cdot Q(X)\,dX$$

The integral is just an expectation with respect to $Q$:

$$\mathbb{E}_P[f(X)] = \mathbb{E}_Q\left[ f(X) \cdot \frac{P(X)}{Q(X)} \right]$$

In practice, with samples $X_1, X_2, \ldots, X_n$ from $Q$:

$$\hat{\mathbb{E}}_P[f(X)] \approx \frac{1}{n} \sum_{i=1}^n f(X_i) \cdot \frac{P(X_i)}{Q(X_i)}$$
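Putting the whole recipe into one runnable sketch (assumed choices, not from the post: $P = \mathcal{N}(0,1)$, $Q = \mathcal{N}(0,2^2)$, $f(x) = x^2$, so the true answer is $1$):

```python
# End-to-end sketch of the estimator above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
P = norm(0.0, 1.0)   # target: pretend we cannot sample from it
Q = norm(0.0, 2.0)   # proposal: we can sample from it and evaluate it

n = 100_000
x = Q.rvs(size=n, random_state=rng)   # X_1, ..., X_n ~ Q
w = P.pdf(x) / Q.pdf(x)               # importance weights P(X_i) / Q(X_i)

estimate = np.mean(w * x**2)          # (1/n) * sum_i f(X_i) * w_i
print(estimate)                       # close to the true value 1.0
```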

