
Importance Sampling (and why the name is misleading)

We call it ‘importance,’ but the algorithm never picks ‘important’ points; it re-weights ordinary ones. A better name would have been something like ‘occurrence adjustment,’ grounded in the relationship between samples/sampling and likelihood, and in why we are doing what we are doing.

When we are averaging stuff (say $f(X)$) in real life, we are actually estimating an expectation of a function, $f(\cdot)$, with respect to a random variable, $X$, that is distributed by $P(X)$. In equal but different words, (if you can collect/create observations) you are sampling $X$ from $P(X)$. That literally means you will have more observations at some points than at others, at rates governed by $P(X)$.

But what if you can’t sample from $P(X)$ directly? If you sample from some other distribution $Q(X)$ instead, you’re getting the “wrong” occurrence rates. Values that should appear frequently under $P$ might appear rarely under $Q$, and vice versa.

The fix is simple: adjust each sample by the ratio $\frac{P(X)}{Q(X)}$.

This ratio tells you how much more (or less) likely that value is under $P$ compared to $Q$. If $\frac{P(X)}{Q(X)} = 3$, that sample should “count 3 times more” to correct for it being underrepresented in your $Q$-samples. This is literally occurrence adjustment: you’re fixing the mismatch between how often values actually appeared (under $Q$) versus how often they should have appeared (under $P$).
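As a toy illustration (my own numbers, not from the post), here is a minimal Python sketch of that re-weighting, assuming a standard normal target $P$ and a wider normal proposal $Q$:

```python
# Minimal sketch of occurrence adjustment (importance weighting).
# Assumed setup, purely illustrative: P = N(0, 1) is the target, Q = N(0, 2^2) the proposal.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

P = norm(loc=0.0, scale=1.0)   # distribution we actually care about
Q = norm(loc=0.0, scale=2.0)   # distribution we can sample from

x = Q.rvs(size=5, random_state=rng)   # samples arrive at Q's occurrence rates
w = P.pdf(x) / Q.pdf(x)               # occurrence adjustment: P(x) / Q(x)

for xi, wi in zip(x, w):
    print(f"x = {xi:+.2f}  weight = {wi:.2f}")
```

Samples near the origin come out with weights above 1 (they are underrepresented under the wide $Q$), while samples far in the tails get weights below 1.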

Hidden Requirements

This method implicitly assumes two critical things that are often glossed over:

Matching support: If $Q$ assigns zero probability somewhere $P$ doesn’t, the ratio $\frac{P(X)}{Q(X)}$ is undefined there and, worse, that region is never sampled at all, so the estimator silently drops part of the expectation; $Q$’s support must cover $P$’s (see the sketch after this list).

Computable likelihoods: You need to evaluate both $P(X)$ and $Q(X)$ for every sample. This seems obvious, but it’s a huge constraint; we often resort to sampling methods such as MCMC precisely because we can’t compute these densities easily.
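To make the support requirement concrete, here is a hypothetical sketch (my own example, not from the post) where $Q$ is uniform on $[-1, 1]$ while $P$ is a standard normal, so $Q$ never produces samples in $P$’s tails:

```python
# Hypothetical sketch of the support-mismatch failure mode.
# P = N(0, 1); Q = Uniform(-1, 1) has zero density outside [-1, 1],
# so P's tails are never sampled and the estimate of E_P[X^2] is biased low.
import numpy as np
from scipy.stats import norm, uniform

rng = np.random.default_rng(0)
P = norm(0.0, 1.0)
Q = uniform(loc=-1.0, scale=2.0)   # support is only [-1, 1]

x = Q.rvs(size=100_000, random_state=rng)
w = P.pdf(x) / Q.pdf(x)
print(np.mean(w * x**2))   # ~0.20, far from the true E_P[X^2] = 1
```

No weight ever blows up in this run, which is exactly what makes the failure dangerous: the estimator just quietly converges to the wrong number.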

There’s a beautiful duality here: if you can compute $P(X)$ but can’t sample from it, importance sampling provides a backdoor. Just find ANY distribution $Q$ that you can both sample from AND evaluate, and you’ve converted your “likelihood evaluation ability” into “sampling ability.” In theory, this creates an equivalence between these two capabilities.

Life is not so easy, though. While you keep all samples, samples with tiny importance weights $\frac{P(X)}{Q(X)} \approx 0$ contribute essentially nothing. You’ve spent computational effort generating them, evaluating densities, computing $f(X)$, only to multiply by $\approx 0$. These “zombie samples” exist but don’t matter, making your effective sample size much smaller than your actual sample count:

$$\text{ESS} = \frac{\bigl(\sum_i w_i\bigr)^2}{\sum_i w_i^2} \quad\text{with } w_i = \frac{P(X_i)}{Q(X_i)}.$$
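In code, the ESS is a one-liner on the weight vector; a sketch with made-up weights:

```python
# Minimal sketch: effective sample size (ESS) from importance weights.
import numpy as np

def effective_sample_size(weights) -> float:
    """ESS = (sum w)^2 / sum(w^2); equals n when all weights are equal."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))     # 4.0: every sample counts
print(effective_sample_size([1.0, 1e-6, 1e-6, 1e-6]))  # ~1.0: three zombie samples
```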

In policy-gradient RL, $P$ is the new policy and $Q$ is the behavior policy, so importance sampling lets us reuse old roll-outs.
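As a hedged sketch of that idea, here is a single-step (bandit-style) version with made-up action probabilities; real policy-gradient methods apply the same ratio per action or per trajectory:

```python
# Minimal sketch: off-policy estimate of a new policy's expected reward.
# Assumptions (illustrative only): one state, three discrete actions, one-step episodes.
import numpy as np

pi_new      = np.array([0.7, 0.2, 0.1])   # P: policy we want to evaluate
pi_behavior = np.array([0.3, 0.3, 0.4])   # Q: policy that generated the roll-outs

rng = np.random.default_rng(0)
actions = rng.choice(3, size=10_000, p=pi_behavior)   # old roll-outs from Q
rewards = np.where(actions == 0, 1.0, 0.0)            # toy reward: action 0 pays 1

weights = pi_new[actions] / pi_behavior[actions]      # P(a) / Q(a) per sample
print(np.mean(weights * rewards))                     # ~0.7, the new policy's value
```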

The name “importance sampling” obscures this elegant mechanism. We’re not sampling “important” regions; we’re correcting occurrence rates. “Likelihood-weighted sampling” or “distribution correction sampling” would immediately convey what’s actually happening: you sampled from the wrong distribution, so you reweight by $\frac{P(X)}{Q(X)}$ to fix it.

The Math (or: How this magic actually works)

So we want $\mathbb{E}_P[f(X)]$ but can only sample from $Q$. Here’s the trick in all its glory:

Start with what you want:

$$\mathbb{E}_P[f(X)] = \int f(X)\,P(X)\,dX$$

Now multiply by 1, that is, by $Q(X)/Q(X)$:

$$\mathbb{E}_P[f(X)] = \int f(X)\,P(X) \cdot \frac{Q(X)}{Q(X)}\,dX = \int f(X) \cdot \frac{P(X)}{Q(X)} \cdot Q(X)\,dX$$

The integral is just an expectation with respect to $Q$:

$$\mathbb{E}_P[f(X)] = \mathbb{E}_Q\left[ f(X) \cdot \frac{P(X)}{Q(X)} \right]$$

In practice, with samples $X_1, X_2, \ldots, X_n$ from $Q$:

$$\hat{\mathbb{E}}_P[f(X)] \approx \frac{1}{n} \sum_{i=1}^n f(X_i) \cdot \frac{P(X_i)}{Q(X_i)}$$
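Putting the whole recipe into one runnable sketch (assumed choices, not from the post: $P = \mathcal{N}(0,1)$, $Q = \mathcal{N}(0,2^2)$, $f(x) = x^2$, so the true answer is $1$):

```python
# End-to-end sketch of the estimator above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
P = norm(0.0, 1.0)   # target: pretend we cannot sample from it
Q = norm(0.0, 2.0)   # proposal: we can sample from it and evaluate it

n = 100_000
x = Q.rvs(size=n, random_state=rng)   # X_1, ..., X_n ~ Q
w = P.pdf(x) / Q.pdf(x)               # importance weights P(X_i) / Q(X_i)

estimate = np.mean(w * x**2)          # (1/n) * sum_i f(X_i) * w_i
print(estimate)                       # close to the true value 1.0
```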

