We idealize it as generating independent uniform random variable U_1, U_2, \cdots on the interval [0,1]. In practice we have only finite precision and the random number generator is pseudo-random.
How do we use random numbers to generate other random variables? This problem usually goes under the name of simulation of random variables.
There are a few general principles and lots of clever tricks to generate efficiently specific distributions.
We give some examples and general in this section and we will add more along the way.
1.2 Generating discrete distributions
Theorem 1.1 (Generating discrete distributions) Suppose X is a discrete random variable taking values x_1, x_2, x_3, \cdots with p.d.f
p(j) = P(X=x_j)\,.
To simulate X
Generate a random number U.
Set
X\,=\, \left\{ \begin{array} {ll}
x_1 & \textrm{ if } U < p(1) \\
x_2 & \textrm{ if } p(1) \le U < p(1) + p(2) \\
\vdots & \vdots \\
x_n & \textrm{ if } p(1) + \cdots + p(n-1) \le U < p(1) + \cdots p(n) \\
\vdots & \vdots \\
\end{array}
\right.
Then X has the desired distribution.
Proof. This follows simply from P(a \le U \le b) = b-a if 0\le a \le b \le 1.
1.3 Examples
Roll a dice
# Generating the roll of n diceimport numpy as npn=10np.ceil(6*np.random.rand(n))
array([3., 3., 4., 2., 1., 3., 5., 4., 4., 2.])
Geometric RV: If X geometric random variable X with parameter p
p(n) =p (1-p)^{n-1} \quad \quad P(X \le n) = 1 - (1-p)^n \quad n=1,2,\cdots
Then
1- (1-p)^{n-1} \le U \le 1- (1-p)^{n-1} \iff (1-p)^n \le (1-U) \le (1-p)^{n-1}
or equivalently since (1-U) has the same distribution as U
\textrm{Set } X=n \quad \textrm{ if } \quad(1-p)^n \le U \le (1-p)^{n-1}
In general it might be not easy to compute the CDF P(X \le n) so for specific distributions (say Poisson distributions) we will use special tricks.
1.4 Inversion method
Theorem 1.2 (Inversion method)
Suppose X is a continuous random variable with CDF F_X(t) and F_X(t) is invertible. Then X and F_X^{-1}(U) have the same distribution.
To simulate X generate a random number U and set X=F_X^{-1}(U).
Proof. Since P(U\le a)=a for 0\le a \le 1
P(F_X^{-1}(U)<t) = P(U \le F_X(t)) = F_X(t)
and thus F_X^{-1}(U) has distribution F_X(t).
Example: If T has an exponential distribution with paramter \lambda then the CDF is F(t)=P(T \le t) = 1 -e^{-\lambda t} and
F(t) = 1-e^{-\lambda t}=u \iff t = -\frac{1}{\lambda} \ln(1-u) = F^{-1}(U)
So if U is uniform on [0,1] then T=-\frac{1}{\lambda} \ln(1-U) has an exponential dsitribution with parameter \lambda. Since U and 1-U have the same distribution then T=-\frac{1}{\lambda} \ln(U) has also an exponential distribution with parameter \lambda.
Downside: in general F and F^{-1} are not computable (solve them numerically).
1.5 Rejection method
The next method permits to simulate X provided we can simulate Y (X and Y should have the same support).
Theorem 1.3 (Rejection Method) Suppose that X has pdf f(x) and Y has pdf g(x), and that there exists a constant C such that
\frac{f(y)}{g(y)} \le C \quad \quad \textrm{ for all } y
To generate X do
Step 1: Generate the random variable Y.
Step 2: Generate a random number U.
Step 3: If U \le \frac{f(Y)}{g(Y)C} set X=Y otherwise reject and go back to Step 1.
The number of times the algorithm runs until a value for X is accepted is a geometric random variable with paramter \frac{1}{C}.
Proof. To obtain one value of X we need iterate the algorithm a random number of times, let us call it N, until the value is accepted. That is we generate random variables Y_1, \cdots, Y_N until Y_N is accepted and then set X=Y_N.
Let us compute the CDF of Y_N.
We have, by conditioning on Y (i.e. P(A)=\int_y P(A|Y=y)g(y) dy for some event A).
\begin{aligned}
P \{ Y_N \le x \} = P \left\{ Y \le x \,|\, U \le \frac{f(Y)}{C g(Y)}\right\}
&= \frac{ P \left\{ Y \le x \,, U\le\frac{f(Y)}{C g(Y)}\right\} }
{P\left\{U \le \frac{f(Y)}{C g(Y)}\right\}} \\
&= \frac{ \int_{-\infty}^{\infty} P \left\{ Y \le x \,,
U \le \frac{f(Y)}{C g(Y)}
\,|\, Y=y \right\} g(y) \, dy } {P \left\{ U \le \frac{f(Y)}{C g(Y)}\right\}}
\\
&= \frac{ \int_{-\infty}^{x} P\left(U\le \frac{f(y)}{C g(y)}\right)g(y)\,dy}
{P \left( U \le \frac{f(Y)}{C g(Y)}\right)} \\
&= \frac{ \int_{-\infty}^{x} \frac{f(y)}{C g(y)} g(y) \, dy }
{P \left( U \le \frac{f(Y)}{C g(Y)}\right)}
= \frac{ \int_{-\infty}^{x} f(y) \, dy }
{ C P\left(U \le \frac{f(Y)}{C g(Y)}
\right)} \,.
\end{aligned}
If we let x \to \infty we obtain that C P \left( U \le \frac{f(Y)}{C g(Y)}\right)=1 and thus, as desired
P(Y_N \le x ) \,=\, \int_{-\infty}^x f(x) \,dx \,.
The above argument shows that at each iteration, a value for X is accepted with probability P\left( U \le \frac{f(Y)}{C g(Y)} \right) = \frac{1}{C} independently of the other iterations. Therefore the number of iterations needed is a geometric random with mean C.
1.6 Example
Suppose X has pdf f(x)=20x(1-x)^3 for 0 \le x\le 1. Since X is supported on [0,1] we pick Y uniform on [0,1] with g(y)=1. Then
C = \max \frac{f(x)}{g(x)} = \max_{x\in [0,1]} 20x(1-x)^3 = \frac{135}{64}
So generate two random numbers U_1 and U_2 and if U_1 \le \frac{256}{27}U_2 (1-U_2)^3 set X=U_2. The average proportion of accepted values is \frac{64}{135}=.4747..
import numpy as npimport matplotlib.pyplot as pltdef accept_reject(N): # Generate N samples n_accept=0 x_list = [] while n_accept < N: a=np.random.rand(2)if a[0] < (256/27)*a[1]*(1-a[1])**3 : n_accept +=1 x_list.append(a[1])return x_listplt.figure(figsize=(5,3)) plt.hist(accept_reject(100000), bins=100, density='true')t = np.arange(0., 1., 0.02)plt.plot(t, 20*t*(1-t)**3, 'r--' )plt.xlabel('x')plt.ylabel('f(x)')plt.title('Histogram vs the exact pdf')plt.show()
1.7 Normal RV with the rejection method
We use the fact that we know how to generate an exponential random variable (we will take \lambda=1)
If X is a standard normal then \sigma X + \mu has mean \mu and variance \sigma^2 so it is enough to consider \mu=0 and \sigma^2=1.
If X is standard normal then, by symmetry, X and -X have the same distribution and the random variables Z=|X| has the pdf
f_Z(x) = \frac{2}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \quad \textrm{ for } x\ge 0 \,.
Conversely we can write X = V Z where V is a Rademacher random variable, that is V the RV with pdf P(V= 1) = P(V= -1) = \frac{1}{2}. For example we can write V = 2 \lfloor 2U \rfloor -1 in terms of a random number U.
Algorithm: generate 3 random numbers U_1, U_2, U_3. If U_1 \le e^{-\frac{1}{2}(-\ln(U_2)-1)^2} set X= (2 \lfloor 2 U_3 \rfloor -1)(-\ln(U_2)), otherwise reject.
1.8 Box-Muller algorithm
Generate 2 independent standard normal using 2 random numbers.
Use the change of variable formula. Suppose X and Y are IID standard normal. Then for any (bounded) h
\begin{aligned}
E[ h(X,Y)] &= \int_{-\infty}^\infty \int_{-\infty}^\infty h(x,y) \frac{1}{2\pi} e^{-\frac{x^2 + y^2}{2}} \, dx \, dy \\
&= \int_0^{2\pi} \int_0^\infty h(r \cos(\theta), r \sin(\theta)) r e^{-\frac{r^2}{2}} dr \frac{1}{2\pi} d\theta \quad \quad x=r \cos(\theta), y=r \sin(\theta) \\
&= \int_0^{2\pi} \int_0^\infty h(\sqrt{s} \cos(\theta), \sqrt{s} \sin(\theta)) \frac{1}{2} e^{-\frac{s}{2}} ds \frac{1}{2\pi} d\theta \quad \quad s=r^2
\end{aligned}
So if S is exponential with paramter \frac{1}{2} and \Theta is uniform on [0,2\pi] then
X = \sqrt{S}\cos(\Theta), \quad \quad Y= \sqrt{S} \sin(\Theta)
are 2 independent standard normal random variables.
We have then a (rejection-free) algorithm: if U_1 and U_2 are two random numbers then
X = \sqrt{-2 \ln(U_1)}\cos( 2 \pi U_2), \quad \quad Y= \sqrt{-2 \ln(U_1)}\sin( 2 \pi U_2)
are 2 independent standard normal random variables.
1.9 Generating a random vector on the unit ball
Rejection method: the unit ball B_d=\{ x_1^2 + \cdots x_d^2 \le 1\} is contained in the cube C_d =\{ \max_{i} |x_i|\le 1 \}
Generate \textbf{Y}=(Y_1, Y_2, \cdots, Y_d) uniform on C_d (using d random numbers)
If \textbf{Y} \in B_d, set \textbf{X}=\textbf{Y}, otherwise reject and repeat.
We have
\textrm{ Acceptance probability } = \frac{\textrm{volume of the sphere}}{\textrm{volume of the cube}}= \frac{\frac{ \pi^{d/2}}{d\Gamma(d/2)}}{2^d} \underbrace{=}_{d=2l} \frac{\frac{ \pi^{l}}{2l (l-1)!}}{2^{2l}} = \frac{1}{2 l!} \left(\frac{\pi}{4}\right)^l
This is awful, for d=6 we have an acceptance probability of \frac{1}{12} \left(\frac{\pi}{4}\right)^3=0.0403 so 96\% of vectors are rejected!
A rejection free method:
Generate a uniform distribution on the sphere S_{d-1}=\{ x_1^2 + \cdots x_d^2 \le 1\}: Take \textbf{ Y}=(Y_1, \cdots, Y_d) to be IID standard normal. The joint PDF, \propto e^{-\frac{1}{2}(x_1^2 + \cdots + x_d^2)}, is rotation invariant and so \textbf{ Z}=\frac{\textbf{ Y}}{\|\textbf{ Y}\|} is uniformly distributed on S_{d-1}.
If \textbf{ X} is uniformly distributed on the unit ball, then using spherical coordinates we have for, r\le 1, P(\|\textbf{ X}\|\le r)= \int_0^1 r^{d-1} dr \int_{S_{d-1}} d \sigma = r^d.
Then we claim that if U is a random number then \textbf{ X}= U^{1/d}Z is uniformmyl distributed on the unit ball, since the distribution of \textbf{ X} is rotation invariant and for r\le 1
P(\|\textbf{ X}\|\le r) =P( U^{\frac1d} \|\textbf{ Z}\| \le r) =P(U \le r^d) =r^d\,.
1.10 Generating random permutations
Our goal is to generate a random permtuation of n elements (1,2, \cdots, n). We write P=(p(1), p(2), \cdots, p(n)) for such a permutation. There are n! such permutation which makes enumeration a stupid idea.
The following algorithm does the following: after k steps it creates a random permutation of k elements, so you can use it for permutations of any size.
Step 1: Set k=1
Step 2: Set p(1)=1
Step 3: If k=n stop. Otherwise set k=k+1
Step 4: Generate a random number U and set
p(k) = p( \lceil kU \rceil) \quad \textrm{ and } \quad p(\lceil kU \rceil) = k
then go to step 3.
Since \lceil kU \rceil takes values in \{1,2,\cdots, k\} with probability \frac{1}{k} the algorithm assign the value k randomly to one the k position and assigns the values it replaces to the k^{th} position.
By induction, let us assume that after k-1 iteration we have obtained a random permutation of k-1 elements. Then by construction we have
P \left( i_1, \cdots, i_{j-1}, k, i_{j+1}, \cdots, i_{k-1}, i\right)= P\left(i_1, \cdots, i_{j-1}, i , i_{j+1}, \cdots, i_{k-1}\right)\frac{1}{k} = \frac{1}{k!}
So we need, for example, 52 steps to create a perfectly shuffled set of cards.
2 Monte-Carlo methods
2.1 Law of large numbers
We start by recalling some basic limit theorems from probability theory.
X_1, X_2, \cdotsindependent and identically distributed random variables (IID). Think of repeating a random experiment n times and recording the results. We denote by S_n = X_1 + X_2 + \cdots + X_n the sum.
The Law of Large numbers shows that if X_i have a finite mean \mu then the empirical average \frac{S_n}{n} converges to \mu.
Theorem 2.1 (Law of Large numbers) Suppose that X_1, X_2, \cdots are IID random variables with E[X_i]=\mu.
Weak LLN: The average \frac{S_n}{n} converges to \mu in probability:
\textrm{For any } \varepsilon>0 \quad \lim_{n\to \infty} P\left( \left| \frac{S_n}{n}- \mu\right| \ge \varepsilon \right) =0\,.
Strong LLLN: The average \frac{S_n}{n} converges to \mu almost surely.
\lim_{n\to \infty}\frac{S_n}{n} = \mu \quad \textrm{ with probability } 1
Although the strong law of large numbers implies the weak law of large numbers (almost sure convergence implies convergence in probability), they both have their different practical use as we shall see.
2.2 The Monte-Carlo method
The (simple) Monte-Carlo method is a probabilistic algorithm using sums of independent random variables the law of large numbers to estimate a (deterministic) quantity \mu \in \mathbb{R} (or \mathbb{R}^d).
The basic idea is to express \mu as the expectation of some random variable \mu = E [ h(X) ] and then use the law of large numbers to build up an estimator for \mu.
Find a random variable h(X) such that \mu= E[h(X)].
I_N = \frac{1}{N} \sum_{k=1}^N h(X_k) is an unbiased estimator for \mu, that is we have
For all N we have E[I_N] = \mu (unbiased).
With probability 1, \lim_{N \to \infty} I_N = \mu.
An interesting part is that there are, in general, many ways to find the random variables h(X) and we will play with this, e.g. for importance sampling.
Conversely in many problems the random variable h(X) is given but the expection is too diffcult to compute, so we rely on the LLN to compute \mu.
2.3 Buffons needles
This seems to be the first example of a rejection sampling used to solve a mathematical problem, by Le Comte de Buffon (see Bio in Wikipedia).
A needle of length l is thrown at random on floor made on floorboards of width d and we assume l \le d. We want to compute the probability that the needle does intersect two floor boards.
Denote by X the distance to the nearest intersection (this is uniformly distributed on [0,\frac{d}{2}]) and by \Theta the acute angle between the needle and an horizontal line( this is uniformly distributed on [0,\frac{\pi}{2}].
For the needle to intersect we must have x \le \frac{l}{2} \sin(\theta) and thus
P\left(X \le \frac{l}{2} \sin(\Theta)\right) = \int_0^{\frac{\pi}{2}} \int_0^{\frac{l}{2}\sin(\theta)} \frac{2}{d} dx \, \frac{2}{\pi} d\theta = \frac{2l}{d\pi}
So in order estimate \pi you shall throw N needles on the floors at random and
\pi \approx \frac{2lN}{d} \frac{1}{ \# \textrm { of needles intersecting two floor boards}}
No random number generator needed.
2.4 Computing \pi
Use the rejection method: Enclose a disk of radius 1 in a square of side length 2 and consider the following Bernoulli random variable X.
Generate 2 independent vectors V_1,V_2 uniformly distributed on [-1,1].
If V_1^2 + V_2^2 \le 1, set X=1, otherwise set X=0.
X is Bernoulli with p=\frac{\textrm{area of the disk}}{\textrm{ area of the square}}=\frac{\pi}{4} so \frac{4}{N}\sum_{k=1}^N X_i \to \pi with probability 1
import numpy as npN =1000000dart_positions =2* np.random.rand(N, 2) -1# generates numbers in [-1, 1]Ncircle = [0] # start the count with 0 to make our loop-logic easierfor x,y in dart_positions:if x**2+ y**2<1: Ncircle.append(Ncircle[-1] +1) # another dart has fallen in the circleelse: Ncircle.append(Ncircle[-1]) # the dart fell outside of the circle - Ncircle is unchangedrunning_estimate = []for n_total, n_circle inenumerate(Ncircle[1:]): # skip the inital 0-count# n_total starts at 0, so we need to add 1 running_estimate.append(4* n_circle / (n_total +1))np.transpose(running_estimate[-20:])
Goal is to computes integral: \displaystyle I\,=\, \int_{S} h( {\bf x}) d {\bf x} for some function h:S\subset \mathbb{R}^d \to R.
To solve an integral by quadrature with a grid with side length \varepsilon you need O\left(\frac{1}{\varepsilon^d}\right) grid points (“curse of dimensionality”).
Monte Carlo: Rewrite the integral as an expectation of some random variables you now how to simulate, simulate N of those, and then use the estimator
\displaystyle I\,=\, \int_{S} h( {\bf x}) d {\bf x} = E[g(Y)] \approx I_N=\frac{1}{N}\sum_{k=1}^N g(Y_k)
If \displaystyle I_1 \,=\, \int_0^1 \frac{ e^{\sqrt x}-e^{cos(x^3)}}{3 + \cos(x)} \, dx = E[h(U)] where U is a random number and h(x)=\frac{ e^{\sqrt x}-e^{cos(x^3)}}{3 + \cos(x)}
In general pick a random variable with pdf f(x) which is non-zero on S and extend h to \mathbb{R}^d by setting it to zero ousisde S. Then
I\,=\, \int_{S} h( {\bf x}) d {\bf x}\,=\, \int_{\mathbb{R}^d} \frac{h( {\bf x})}{f(\mathbf{x})}f(\mathbf{x}) d {\bf x} = E\left[ \frac{h( {\bf X})} {f( {\bf X})} \right]
\quad X \textrm{ with pdf } f(x)
2.6 Goals
Performance gurantees: How many sample should we generate to obtain a given precision for the computation of \mu = E[h(X)]
Use central limit theorems type results to obtain asymptotic confidence intervals.
Simple and useful but somewhat unsatisfying.
Use concentration inequalities to obtain non-asymptotic confidence intervals.
True performance guarantess but harder to get and often too pessimistic.
How to we design better MC methods since there are many ways….
2.7 Central limit theorem
IID random variables X_1, X_2, \cdots *independent with \mu=E[X_i] and \sigma^2 = \textrm{Var}(X_i).
The central limit theorem tells us about the form fluctuation of the empirical average \frac{S_n}{n} around \mu as n \to \infty. Informally
\frac{S_n}{n} \approx \mu + \frac{\sigma}{\sqrt{n}} Z \quad \textrm{ as } n\to \infty \quad \textrm{ with } Z \textrm { a standard normal RV.}
The meaning of \approx is convergence in distribution: . . .
Theorem 2.3 (Central limit theorem) Let X_1, X_2, \cdots be a sequence of independent identically distributed random variables with mean E[X_1]=\mu and variance \textrm{Var}(X_i)=\sigma^2. Then for any - \infty \le a \le b \le \infty we have
\lim_{N \to \infty} P \left( a \le \frac{ S_N - N \mu}{ \sqrt{N} \sigma} \le b \right)
= \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}} \, dx \,.
\tag{2.1}
The CLT is very general (it holds for any distribution) and very easy to use.
But is also asymptotic, true only as N\to \infty. Notot clear how large N we should pick for the asymptotic beahvior to be accurate!
We will build later non-asymptotic confidence interval
2.8 Confidence intervals (version 1)
We build build confidence interval for a Monte-Carlo estimator I_N=\frac{1}{N}\sum_{k=1}^N h(X_k), since by the Central limit theorem \frac{\sqrt{N}}{\sigma} ( I_N - \mu) is asymptotically normal.
To build a \alpha-confidence interval we let z_\alpha the number defined by
The issue with that formula is that we are trying to compute \mu so there is no reason to believe that the variance \sigma^2 should be known! To remedy this issue we built an estimator for \sigma^2 from our samples X_1, X_2, \cdots.
2.9 Central limit theorem with Slutsky theorem
Building an estimator for the variance
Theorem 2.4 (Estimator for the variance) Given IID random variables X_1, \cdots, X_N with variance \textrm{Var}(X_i)=\sigma^2
V_N = \frac{1}{N-1} \sum_{k=1}^N \left(X_k -\frac{S_N}{N}\right)^2
\tag{2.2} is an unbiased estimator for \mu, i.e. E[V_{N}]=\sigma^2 and \lim_{N \to \infty} V_N = \sigma^2 with probability 1.
Proof. Details in homework. It relies on the LLN and the following continuity properties: \frac{S_n}{n}\to \mu with prob. 1 and f continuous implies that f(\frac{S_n}{n}) \to f(\mu).
A theorem on weak convergence (Slutsky’s Theorem) implies the following useful version of CLT
Theorem 2.5 (CLT + Slutsky’s Theorem) Suppose X_1, X_2, \cdots are IID random variables with E[X_i]=\mu and \textrm{Var}(X_i)=\sigma^2. hen for any - \infty \le a \le b \le \infty we have
\lim_{N \to \infty} P \left( a \le \frac{ S_N - N \mu}{ \sqrt{N V_N}} \le b \right)
= \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}} \, dx \,.
\tag{2.3}
2.10 Confidence intervals (version 2)
We can get an asymptotic confidence interval using Theorem 2.5
Theory: build the Monte-Carlo estimator with the smallest possible variance long\rightarrow Variance Reduction
For a precision \varepsilon ne needs N \sim \frac{1}{\varepsilon^2} samples.
Practice: With your code always compute both I_N and V_N!
2.11 Examples
Compute \pi with accept/reject. Estimator I_N=\frac{4}{N} \sum_{k=1}^N {\bf 1}_{\{V_1^2 + V_2^2 \le 1\}} where V_i, i=1,2 is uniform on [-1,1] so we get
\textrm{Var}(I_N) \,=\, \frac{16}{N}\frac{\pi}{4}\left(1-\frac{\pi}{4}\right)= \pi(4-\pi) = \frac{1}{N} 2.6967
To get \pi with a precision of \varepsilon=10^{-3} and 99% confidence we need at least N = 2.576 \times 2.6967 \times 10^6= 6.946,699 samples.
Let h(x)= \sqrt{1 - x^2} for x \in [0,1]. Then \int_0^1 h(x) dx = \frac{\pi}{4} and a estimator for \pi is I_N=\frac{16}{N} \sum_{k=1}^N h(U_k) with variance
\textrm{Var}(I_N) \,=\, \frac{16}{N} \left( E[h(U)^2] -E[h(U)]^2 \right)= \frac{1}{N}\left(\frac{2}{3}-\left(\frac{\pi}{2}\right)^2\right)= \frac{1}{N} 0.797
Better, roughly 4 times less sample (plus using only half as many random numbers).
3 Variance reduction techniques
We discuss two examples on how to reduce the variance of an algorithm. This is a vast subject and there are many other technique to perform variance reduction e.g. conditional MC, control variates, stratified sampling, etc….. Some of them will appear in the exercises. Recommended readings are (Ross 2019), (Ross 2013), (Rubinstein and Kroese 2017) where you will find much more information.
3.1 Antithetic random variables and common random numbers
The idea is to use dependent random variables or dependent random numbers to decrease the variance. Recall the formula for variance: for random variables X and Y we have
\textrm{Var}(X+Y) = \textrm{Var}(X)+ \textrm{Var}(Y)+ 2 \textrm{Cov}(X,Y) \,.
If X and Y are negatively correlated, i.e. \textrm{Cov}(X,Y) < 0 then we can possibly reduce the variance.
To build up our intuition consider the random variables U and V=1-U, they are identically distributed and negatively correlated because if U \le \frac{1}{2} then V \ge \frac{1}{2} and vice versa:
\left(U - \frac{1}{2}\right)\left((1-U)-\frac{1}{2}\right) = U(1-U) - \frac{1}{4} \le 0 \quad \textrm{ since } \max_{x} x(1-x) = \frac{1}{4}
The key idea for the general case is contained in the following theorem
Theorem 3.1 Suppose X_1, \cdots, X_n are independent random variables and f and g are increasing function of n variables. Then
\textrm{Cov}( f(X_1, \cdots X_n) , g(X_1, \cdots X_n)) \ge 0 \,.
%= E[f(X_1, \cdots X_n) g(X_1, \cdots X_n)] - E[f(X_1, \cdots X_n)] E[ g(X_1, \cdots X_n)] \ge 0
\tag{3.1}
Proof. By induction. For n=1, since f and g are increasing, for any x,y we have 0 \le (f(x)-f(y))(g(x)-g(y)) and thus for any random variable X, Y
0 \le E [ f(X)-f(Y))(g(X)-g(Y)) ] = E [f(X)g(X)]+ E[ f(Y)g(Y) ] - E [f(X)g(Y)] + E [f(Y)g(X)]
If, in addition, X and Y are IID we get
0 \le 2 E[f(X) g(X)] - 2 E[f(X)] E[g(X)] = 2 \textrm{Cov}(f(X),g(X))
which proves Equation 3.1 for the case for n=1.
Assuming now that Equation 3.1 is true for n-1 we condition on the value of X_n and
Compute \mu = E\left[ e^U \right] use MC (of course the answer is e-1…).
Since h(x)=e^x is increasing and g(x)=e^{1-x} is decreasing we can use antithetic RV.
We have the covariance
\textrm{Cov}\left(e^U, e^{1-U}\right)=E\left[e^U e^{1-U}\right]-E\left[e^U\right] E\left[e^{1-U}\right]=e -(e-1)^2=-0.2342
We also have
\textrm{Var}\left(e^U\right)= E\left[e^{2U}\right]- E\left[e^U\right]^2 = \frac{e^{2}-1}{1}-(e-1)^2=0.2420
To compare:
Independent random numbers U_1,U_2 give the variance
\textrm{Var}\left( \frac{e^{U_1}+e^{U_2}}{2} \right) = \frac{ \textrm{Var}\left(e^{U}\right)}{2}= 0.1210
Antithetic random variables U, 1-U give the variance
\textrm{Var}\left( \frac{e^{U_1}+e^{U_2}}{2} \right) = \frac{ \textrm{Var}\left(e^{U}\right)}{2} + \frac{\textrm{Cov}\left(e^U, e^{1-U}\right) }{2}=.0039
A variance reduction of 96.7 percent (not bad…)
3.3 Reliability function
A system consists of n components, each of which is either works or fails. We set
x_i= \left\{ \begin{array}{cl} 1 & \textrm{ if component } i \textrm { works} \\
0 & \textrm{ if component } i \textrm { fails}
\end{array}
\right. \,.
We call \mathbf{x}=(x_1, \cdots, x_n) the state vector describing the state of the system.
A structure functionh(x_1, \cdots, x_n) describe the state of the system
h(\mathbf{x})= \left\{ \begin{array}{cl} 1 & \textrm{ if the system is functioning} \\
0 & \textrm{ if the system fails}
\end{array}
\right. \,,
It is natural to assume that h is an increasing function of x_i (turning a system on cannot make the system fail)
Probabilistic description: assume that the components are independent and are described a Bernoulli random variable X_i with
p_i = P( X_i=1 ) = P( \textrm{ component } i \textrm{ works}) = 1- P( \textrm{ component } i \textrm{ fails})
The RV h(\mathbf{X})=h(X_1, \cdots, X_n) is also Benoulli and we have
E[h(\mathbf{X})] = P(\textrm{ system works })
and we may want to use Monte-Carlo
Examples of systems
The series structure: h(x_1,\cdots,x_n)=\min_{i} x_i (all components must work)
The parallel structure: h(x_1,\cdots,x_n)=\max_{i} x_i (at least one component must work)
The k-out-of-n structure: h(x_1,\cdots,x_n)= \left\{ \begin{array}{cl} 1 &\textrm{if }\sum_{i=1}^n x_i\ge k\\0 & \textrm{otherwise} \end{array} \right. (at least k components must work)
General principled way to compute the structure function
A pathA is a subset A \subset \{1,\cdots, n\} such that h(\mathbf{x})=1 if x_i=1 for all i\in A, that is it a subset of components which will ensure that the system works.
A minimal pathA is a path A such that no subset of A is itself a path, that is it a minimal subset of components which will ensure that the system works.
Theorem 3.2 Using the notation x_A = \prod_{i\in A} x_i for any subset A \subset \{1,\cdots,n\}, the the structure function for an arbitrary system is given by
Proof. For the system to work we need to have all components working for at least one minimal path A and so X_A=1. If this is the case \max_{A \,\,\textrm{minimal path}} X_A=1 and \prod_{A \textrm{ minimal path}} (1 - X_A)=0. \quad \blacksquare
The theorem is of limited usability since it requires enumeration of all minimal paths. If the system is moderately big, this is very difficult and/or tedious.
3.5 Monte-Carlo for structure functions
If we assume a probabilistic model, which is pretty reasonable. Think for example, of a routing network, where each node have a certain probability of failure.
MC for system reliability: Consider a system with stucture function h(\mathbf{x}) and assume the components X_k, k=1,\cdots, n are independent Bernoulli RB with with P(X_k=1)=p_k. To compute E[h(\mathbf{X})]=P(\textrm{system works})
1.Generate nN independent random numbers U_{ik} with 1\le i \le N and 1\le k \le n and set X_{ik}=\mathbf{1}_{U_{ik}\le p_k}.
Evaluate H(\mathbf{X_i}) where \mathbf{X_i}=(X_{i1}, \cdots, X_{in})
Set I_N \,=\, \frac{1}{N}\sum_{k=1}^N H(\bf X_i) is an estimator for P(\textrm{system works}).
It is important to note that often it easier to evaluate H( {\bf X}) (that is check if the system functions) rather than evaluate the structure function (well-known algorithm can be used).
The reliability function g(U_1, \cdots, U_n)= h\left( \mathbf{1}_{\{U_1\le p_1\}}, \cdots, \mathbf{1}_{\{U_n\le p_n\}} \right) is an increasing function so
\textrm{Cov}(g(U_1, \cdots, U_n),g(1-U_1, \cdots, 1-U_n) ) <0
and we can reuse each random number and reduce the variance.
3.6 Importance sampling
Consider continuous RVs but the same ideas works for discrete RVs. Suppose X has pdf f(x) and we want to compute \mu= E[h(X)] = \int h(x) f(x) dx. One can always use the simple MC algorithm I_N =\frac{1}{N} \sum_{k=1}^N h(X_k).
The idea behind importance sampling is to sample from another random variable Y with density g and write
\mu= E[h(X)] = \int h(x) f(x) dx = \int h(x) \frac{f(x)}{g(x)} g(x) dx = E\left[ \frac{h(Y) f(Y)}{g(Y)}\right] \,.
For the integral to make sense we need to pick g such that g(x) = 0 is allowed only if f(x)=0 or h(x)f(x) =0.
Often in practice if X belongs to some family, say an exponential RV, we will pick Y from the same family, say an exponential or a gamma random variable but with other paramters which we should pick carefully.
Importance sampling Monte-Carlo estimator: To estimate \mu = E\left[ h(X) \right] the importance sampling algorithm use the random variable Y with pdf g and has the estimator
J_N = \frac{1}{N} \sum_{k=1}^{N} \frac{h(Y_k) f(Y_k)}{g(Y_k)}
Notice the similarity (in spirit) with rejection method.
How to pick g?
The variance of the simple MC estimator I_N is \textrm{Var}(I_N) = \frac{1}{N} ( E[ h(X)^2] - E[h(X)]^2).
For the importance sampling estimator J_N we have
\begin{aligned}
\textrm{Var}(J_N) =\frac{1}{N} \textrm{Var}\left( \frac{ h(Y) f(Y)}{g(Y)} \right)
&=\frac{1}{N} \left[
\int \frac{ h(x)^2 f(x)^2}{g^2(x)} g(x) \, dx - \left( \int h(x) \frac{f(x)}{g(x)} g(x) dx \right)^2 \right]
\\
&= E \left[ h(X)^2 \frac{f(X)}{g(X)}\right] - E[h(X)]^2
\end{aligned}
We need
E\left[ h(X)^2 \frac{f(X)}{g(X)}\right]\le E \left[ h(X)^2\right] \iff \frac{f(x)}{g(x)} \textrm{ "small" when } h(x) \textrm{ is "big" },.
that is we want to sample more (with Y and pdf g than with X and pdf f) in the region where h is big. Hence the name importance sampling.
The next result shows what is the best choice of g, even if it is practically not very useful since it requires the knowledge of what we are trying to compute in the first place (either \mu=E[h(X)] or E[|h(X)|].)
Theorem 3.3 (Optimal importance sampling)
If h(x)\ge 0 then the optimal importance sampling distribution is \displaystyle g_*(x) \,=\, \frac{h(x) f(x)}{\mu} = \frac{h(x) f(x)}{E[h(X)]}\,, and the corresponding importance sampling estimator J_N has 0 variance.
For general h the optimal importance sampling distribution is \displaystyle g_*(x) \,=\, \frac{|h(x)| f(x)}{E[ |h|(X)] }.
Proof.
If h(x) \ge 0 then with g_*(x) \,=\, \frac{h(x) f(x)}{\mu} we have \int g(x) dx =1 and
N \textrm{Var} (J_N) = E \left[ h(X)^2 \frac{f(X)}{g_*(X)}\right] - E[h(X)]^2 = E \left[ \frac{h(X)^2 f(X) \mu}{ h(X) f(X) }\right]- E[h(X)]^2 \,=\, 0 \,.
and thus it is optimal.
For general h the optimal variance does not vanish but using Cauchy-Schwartz inequality in the form E[Z]^2 \le E[Z^2] we obtain for any choice of g,
\begin{aligned}
E \left[ h(X)^2 \frac{f(X)}{g_*(X)}\right] & = E \left[ \frac{ h(X)^2 f(X)}{|h|(X) f(X)} \right] E[ |h|(X)] = E \left[ |h|(X) \right]^2 = E \left[ |h|(Y) \frac{ f(Y) }{g(Y)} \right]^2 \\ & \le E \left[ h(Y)^2 \frac{f(Y)^2}{g(Y)^2}\right]
= E \left[ h(X)^2 \frac{f(X)}{g(X)}\right]
\end{aligned}
and thus g_* gives the optimal variance.
3.7 Examples
Suppose we want to sample the mean of standard normal RV X using an importance sampling estimator (it may look like a trivial example but wait…).
So we take h(x)=x and let us pick Y to be a normal RV with \mu=0 and variance \sigma^2
E\left[h(x)^2 \frac{f(x)}{g(x)}\right] = \int_{-\infty}^\infty x^2 \frac{ (e^{-x^2/2}/\sqrt{2 \pi})^2}{e^{-x^2/2\sigma}/\sqrt{2 \pi} \sigma } \, dx
= \sigma \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} x^2e^{-x^2( 2 - \sigma^{-2})/2} \, dx
= \left\{ \begin{array}{cl} \frac{\sigma}{(2-\sigma^{-2})^{3/2}} & \textrm{ if } \sigma^2 > 1/2 \\
+\infty & \textrm{otherwise} \end{array} \right.
An elementary computation shows that the optimal choice is \sigma^2 = 5/2.
Suppose X is exponential with paramter \lambda and h = {\bf I}_{[a,\infty)} so that we want to estimate
\mu= E[ h(X)] = P( X \ge a) = e^{-\lambda a}\,.
Note e^{-\lambda a} maybe very small, e.g. 10^{-5}. The variance is e^{-\lambda a}(1- e^{-\lambda a}) \approx e^{-\lambda a} and so we need the precision to be at least \varepsilon=10^{-6} to have a meaningful estimate. We need then N = O\left(\frac{\sigma^2}{\epsilon^2}\right)= O(10^7)….
The optimal importance sampling is
g_*(x) = \frac{h(x) f(x)}{ e^{-\lambda a}} = 1_{[a,\infty)} \lambda e^{-\lambda(x-a)}
which is a shifted exponential random variable. Note that if sample from X we will mostly obtain h(X)=0 and most samples are “useless” while if we sample from the importance sampling distribution every single samples contribute a non-zero term to the estimator.
3.8 Tilted distribution
Tilted distributions are a natural tool for importance sampling. If f(x) is the pdf of X consider the random variable X^{(t)} with the new distribution
f_t(x) = \frac{e^{tx}}{M(t)}f(x) \quad \quad \textrm{ where } M(t)=E[e^{tX}]=\int e^{tx} f(x) dx
that is M(t) is the moment generating function of the random variable X. We have M'(t)=E[Xe^{tX}], M''(t)=E[X^2e^{tX}], \cdots
Example: If f(x)=e^{-\lambda x} (exponential distribution) then M(t)=\frac{1}{\lambda -t}, for t < \lambda and thus
f_t(x)=(\lambda-t) e^{-(\lambda-t)x}\,
X^{(t)} an exponential distribution with parameter \lambda -t.
Example: If p(k)=p^k(1-p)^{1-k}, k=0,1 is a Bernoulli distribution then M(t)=pe^{t} + (1-p) and thus
p_t(k)= \frac{e^{tk}}{M(t)}p^k(1-p)^{1-k} = \left(\frac{pe^{t}}{pe^{t} + (1-p)}\right)^k \left(\frac{1-p}{pe^{t} + (1-p)}\right)^{1-k}.
X^{(t)} is again a Bernoulli distribution with success probability p_t \equiv \frac{pe^{t}}{pe^{t} + (1-p)}
and so on
3.9 Rare event simulation
Suppose X_1, \cdots X_n are IID random variables with joint densities f(x_1)f(x_2)\cdots f(x_n) and S_n=X_1+\cdots+ X_n the sum. We want to estimate the (high-dimensional integral)
P\left( \frac{S_n}{n} \ge a\right) = E\left[1_{\{S_n \ge n a\}}\right] = \int 1_{\{S_n \ge na\}} f(x_1)f(x_2)\cdots f(x_n) dx_1 \cdots dx_n
If a > E[X_1] this probability is very small (exponentially small in n, see Chapter 4) and thus the simple Monte-Carlo will fail miserably.
Tilting the measure to do importance sampling we now have to sample the random variable
1_{\{S_n \ge na\}} \prod_{i=1}^n \frac{f(X^{(t)}_i)}{f_t(X^{(t)}_i)} =
1_{\{S_n \ge na\}} \prod_{i=1}^n \frac{M(t)}{e^{t X^{(t)}_i}} =
1_{\{S_n \ge na\}} e^{-tS_n} M(t)^n \le
\left( M(t)e^{-ta}\right)^n
It makes sense to make this random variable as small as possible thus minimize the bound M(t)e^{-ta}. Differentiating we find M'(t)e^{-ta} - a M(t) e^{-ta}=0 or
a = \int x \frac{e^{tx}}{M(t)} f(x) dx = E[X_t], \quad X_t \textrm{ with pdf } f_t(x)
\tag{3.2} That is we tilt the probability f so that a is the mean of the tilted pdf. So when sampling X_t we should expect to see often values around a making the estimation much easier!
Numerical example:
X_i is binomial with p=0.4, n=20 and a=0.8.
Since E[X]=p for a binomial the equation for the optimal t, Equation 3.2, gives
\frac{pe^{t}}{pe^{t} + (1-p)} = a \quad \iff e^{t}= \frac{a}{1-a}\frac{1-p}{p}
We find e^t=6 and M(t)=(0.4 e^t +0.6 )= 3 and thus
1_{\{S_{20} \ge 16\}} \prod_{i=1}^{20} \frac{p(X^{(t)}_i)}{p_t(X^{(t)}_i)} =1_{\{S_{20} \ge 16\}} \left(\frac{1}{6}\right)^{S_{20}} 3^{20} \le \left(\frac{1}{6}\right)^{16} 3^{20}= 0.001236
We can compute here explicitly that P(S_{20} \ge 16)=0.000317 so we have
\textrm{ Simple MC: } \textrm{Var}\left( 1_{\{S_{20} \ge 16\}}\right)= 0.000317(1-0.000317)= 3.169 \times 10^{-4}
On the other for the importance sampling the new random variable we sample takes values between 0 and 0.001236 and so
\textrm{ IS MC: } \textrm{Var}\left((1_{\{S_{20} \ge 16\}} \left(\frac{1}{6}\right)^{S_{20}} 3^{20}\right) \le \frac{(0.001236)^2}{4}= 3.81 \times 10^{-7}
(We have used that 0\le Y\le a implies that \textrm{Var}(Y) \le \frac{a^2}{4}, see Chapter 4 for a proof. )
3.10 A reliability example
Each edge fails with probability q=10^{-2} and the edges are independent. \mathbf{X}=(X_1, \cdots, X_{22}) has pdf (product of Bernoulli)
p(\mathbf{x}) \,\equiv\, P( {\bf X}= {\bf x}) \,=\, \prod_{i=1}^{22} q^{x_i} (1-q)^{1 - x_i} \,=\, q^{|{\bf x}|} (1-q)^{22 - |{\bf x}|}
with |x|=\sum_{i=1}^{22}x_i. h(\mathbf{x})=1 if there is no path from 1 to 22 along working edges and we want \mu=E[h(X)] (failure probability)
Figure 3.2: Connecting node 1 to node 22
At least three edges must fail for the system to fail so we have ther bound
\mu \le P(|X| \ge 3) = 1 - \sum_{j=0}^{2} { 22 \choose j} q^j (1-q)^{22-j} = 0.00136 \,.
To get a lower bound for \mu by noting that \mu \ge P( e_1, e_2, e_3 {\rm ~fail}) = q^3 \,=\, 10^{-6}.
So the failure probability os between 10^{-6} and 10^{-3}.
Use importance sampling with Y with q(\mathbf{y}) \,\equiv\, P( {\bf Y}= {\bf y}) \,=\, \prod_{i=1}^{22} \theta^{y_i} (1-\theta)^{1 - y_i} \,=\, \theta^{|{\bf y}|} (1-q)^{22 - |{\bf y}|}.
How to pick \theta?
Since we need at least 3 non-working edges we may want to pick \theta such that the average nnumber of non-working edges is 3 for \mathbf{Y}, that is we choose \theta=\frac{3}{22} (since |Y| is a binomial random variable, E[|Y|]= 22 \theta).
IS estimator J_N = \frac{1}{N} \sum_{i=1}^{N} \frac{ k(\mathbf{Y}_i) p(\mathbf{Y}_i)}{q(\mathbf{Y}_i)} with variance
\textrm{Var}(J_N) = \frac{1}{N} \left( \sum_{\mathbf{y}} \frac{ k(\mathbf{y} )^2 p(\mathbf{y})^2}{q(\mathbf{y})^2} q(\mathbf{y}) -\mu^2 \right) =
\frac{1}{N} \left( \sum_{ \mathbf{y}: h({\bf y})=1} \frac{ p(\mathbf{y})}{q(\mathbf{y})} p(\mathbf{y}) -\mu^2 \right)
\tag{3.3} Note that
\frac{ p(\mathbf{y}) }{q(\mathbf{y})} \,=\, \frac{q^{| \mathbf{y}|}(1-q)^{22- |\mathbf{y}|}}{\theta^{|\mathbf{y}|}(1-\theta)^{22- |\mathbf{y}|}} =
\left(
\frac{ 1-q}{1-\theta}\right)^{22} \left(\frac{ q (1-\theta) }{\theta(1-q)} \right)^{|\mathbf{y}|}= 20.2 \times (0.064)^{|{\bf y}|}
In Equation 3.3 all terms with k({\bf y})=1 have |{\bf y}|\ge 3 and for those {\bf y} we have \frac{ p({\bf y})}{\phi({\bf y})}\,\le\, 20.2 \times (0.064)^{3} \le 0.0053. So we get
\textrm{Var}(J_N) \,\le\, \frac{1}{N} \sum_{{\bf y}\,:\, k({\bf y})=1} 0.0053 \,\,p({\bf y}) = \frac{ 0.0053\,\mu}{N}
Reduced the variance by a factor approximately of 200 so weed need \sqrt{200} \cong 14 times less samples.
4 Concentration inequalities and non-asymptotic confidence intervals
We discuss in this section a few basic concentration inequalties, Chernov bounds, Hoeffding bounds and McDiarmid bounds and give some applications. Concentration inequalties are a very important tools to provide performance guarantees for Monte-Carlo methods, but also in a variety of other contexts in Uncertainty Quantification and Machine Learning. For further reading we recommend the excellent book (Boucheron, Lugosi, and Massart 2013).
4.1 Chernov bounds
We start with a basic but fundamental bound
Theorem 4.1 (Markov and Chernov inequalities)
Markov inequality: Suppose Y\ge 0 is a non-negative random variable and a>0 a positive number. Then
P(Y \ge a) \le \frac{E[Y]}{a}\,.
\tag{4.1}
Chernov inequality. Suppose X is a random variable and a \in \mathbb{R} then
\begin{aligned}
& P(X \ge a) \le \inf_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}} \\
& P(X \le a) \le \inf_{t \le 0} \frac{E[e^{tX}]}{e^{ta}}
\end{aligned}
\tag{4.2}
Proof. If Y\ge 0 then \displaystyle Y=Y 1_{\{Y\ge a\}} + Y1_{\{Y\le a\}} \ge Y1_{\{Y\ge a\}} \ge a 1_{\{Y\ge a\}} and thus \displaystyle E[Y] \ge a P(Y\ge a) which is Equation 4.1.
Using Markov inequality we have for t \ge 0
P(X \ge a) = P\left( e^{tX} \ge e^{ta} \right) \le \frac{E[e^{tX}]}{e^{ta}} \,.
Optimizing over t gives the first inequality in Equation 4.2. The second inequality is obtained by replacing X by -X.
4.2 Examples of Chernov bounds
If X is normal with mean \mu and varaince \sigma^2, then M(t)=e^{\mu t + \frac{\sigma^2 t^2}{2}} and, for a = \mu + \varepsilon, Chernov gives
P(X \ge \mu + \varepsilon) \le \inf_{t \ge 0} \frac{e^{\mu t + \frac{\sigma^2 t^2}{2}} }{e^{t(\mu + \varepsilon)}}
= e^{ -\sup_{t\ge 0}\left\{ t\varepsilon - \frac{\sigma^2 t^2}{2}\right\} } = e^{-\frac{\varepsilon^2}{2\sigma^2}}
This is sharp: one can show (see exercises) that
\left(\frac{1}{\varepsilon}-\frac{1}{\varepsilon^3\sigma^2} \right)\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{\varepsilon^2}{2\sigma^2}} \le P(X \ge \mu + \varepsilon) \le \frac{1}{\varepsilon}\frac{1} {\sqrt{2\pi}\sigma} e^{-\frac{\varepsilon^2}{2\sigma^2}}
Consider X_n is binomial with parameters (n,p) then M(t)=\left(pe^{t}+(1-p)\right)^n. We will use that
\sup_{t} \left\{ tz - \ln(pe^t + (1-p)) \right\} =
\left\{
\begin{array}{cl}
+ \infty & \textrm{ if } z<0 \textrm{ or } z>1 \\
z \ln \frac{z}{p} + (1-z) \ln \frac{1-z}{1-p} := H(z||p) & \textrm{ if } 0 \le z \le 1
\end{array}
\right.
H(z||p)=z \ln \frac{z}{p} + (1-z) \ln \frac{1-z}{1-p} is the KL-divergence or (relative entropy) between Bernoulli distributions with paramters z and p respectively.
Chernov bounds are very sharp but requires knowledge of the mgf M_X(t).
One of the main idea behind concentration inequalties: given X bound M_X(t) \le M(t) by the mfg of a random variable Y which you know explcitly. Mostly here we take Y a Gaussian but one also uses other, Bernoulli, Poisson, Gamma, etc…
The following elementary bound will be used repeatedly.
Lemma 4.1 (Hoeffding’s bound) Suppose a \le X \le b with probability 1. Then for any \varepsilon > 0
Bound on the variance \displaystyle \textrm{Var}(X) \le \frac{(b-a)^2}{4}
Bound on the mgf \displaystyle E\left[ e^{tX} \right] \le e^{t E[X]} e^{\frac{t^2 (b-a)^2}{8}}
Proof. For the bound on the variance a \le X \le b implies that \displaystyle -\frac{a-b}{2} \le X - \frac{a+b}{2} \le \frac{a-b}{2} and therefore
\textrm{Var}(X) = \textrm{Var}\left(X - \frac{a+b}{2}\right) \le E\left[\left(X - \frac{a+b}{2}\right)^2\right] \le \frac{(b-a)^2}{4}\,.
Since X is bounded the moment generating function M(t) =\frac{e^{tX}}{E[e^{tX}]} exists for any t \in \mathbb{R}. To bound the M(t) let us consider instead its logarithmx u(t)=\ln M(t). We have
\begin{aligned}
u'(t) &= \frac{M'(t)}{M(t)} &=& E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right] \\
u''(t) &= \frac{M''(t)}{M(t)}- \left(\frac{M'(t)}{M(t)}\right)^2 &=& E\left[ X^2 \frac{e^{tX}}{E[e^{tX}]} \right] - E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right]^{2}
\end{aligned}
We recognize u''(t) as the variance under the tilted measure with tilted density \frac{e^{tX}}{E[e^{tX}]} and thus by part 1. (applied to the tilted measure) we have u''(t) \le \frac{(b-a)^2}{4}.
Using the Taylor expansion with remainder we have, for some \xi between 0 and t
\ln M(t)= u(t) = u(0) + u'(0)t + u''(\xi) \frac{t^2}{2} \le t E[X] + \frac{t^2(b-a)^2}{8}\,.
This concludes the proof.
Remark: The bound on the variance in 1. is optimal. Indeed taking without loss of generality a=0 and b=1 then the variance is bounded by 1/4 and this realized by taking X to be a Bernoulli with p=\frac12. This bound says that the RV with the largest variance is the one where the mass is distributed at the end point.
The bound in 2. is optimal only in the sense that it is the best quadratic bound on u(t). For example for a Bernoulli with a=0 and b=1 we have M(t)=\ln (\frac{1}{2}e^{t} + \frac{1}{2})= \frac{1}{2}t + \ln \cosh\left(\frac{t}{2}\right) which is much smaller (for large t). There is room sometimes for better bounds (see e.g. (Boucheron, Lugosi, and Massart 2013)) but ising Gaussian is computationally convenient.
4.4 Hoeffding’s theorem
Theorem 4.2 (Hoeffding’s Theorem) Suppose X_1, \cdots, X_N are independent random variables such that a_i \le X \le b_i (almost surely). Then
Proof. Using independence the Hoeffding’s bound we have
e^{t (X_1 + \cdots + X_n - E[X_1 + \cdots + X_n])} =\prod_{i=1}^n e^{t(X_i-E[X_i])} \le \prod_{i=1}^n e^{\frac{t^2 (b_i-a_i)^2}{8}} = e^{ \frac{t^2 \sum_{i}(b_i-a_i)^2}{8}}
and using Chernov bound (for a Gaussian RV with variance \displaystyle \sum_{i}\frac{(b_i-a_i)^2}{4}) gives the first bound in Equation 4.3.
Corollary 4.1 (non-asymptotic confidence interval) Suppose X_1, \cdots, X_N are independent random variables such that a \le X \le b (almost surely) and \mu=E[X_i].
Comparison to asymptotic confidence interval, take X_i to be Bernoulliwith unknown \mu=p so we bound the variance by p(1-p) \le \sigma_{max}=\frac{1}{4} (this is what we did in Hoeffding with a=0,b=1. So we have
A very useful extension of Hoeffding’s applies to suitable function h(X_1, X_2, \cdots, X_n) other than just the sum.
We say that h(x_1, x_2, \cdots, x_n) satisifies the bounded difference property if there exist constants c_k such that for all x_k, x_k'
|h(x_1,\cdots, x_k, \cdots, x_n) - h(x_1,\cdots, x_k', \cdots, x_n)| \le c_k
Theorem 4.3 (McDiarmid Theorem) Suppose X_1, \cdots X_n are independent RVs and f(x_1, \cdots, x_n) satisfies the bounded difference property (almost surely). Then we have
\begin{aligned}
P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \ge \varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}} \\
P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \le -\varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}}
\end{aligned}
Proof. The proof use the basic properties of conditional expectation.
“Martingale trick”: Let us define random variable Y_k by Y_0=E[f(X_1, \cdots, X_n)] and, for 1\le k \le n,
Y_k =E[h(X_1,\cdots, X_n)|X_1, \cdots, X_k]
Note that Y_n=h(X_1,\cdots, X_n) and also E[Y_k| X_1, \cdots, X_{k-1}]= Y_{k-1} (the martingale property). If we define further Z_k=Y_k-Y_{k-1} then we have the telescoping sum h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] = \sum_{k=1}^n Z_k and E[Z_k|X_1,\cdots,X_{k-1}]=0.
Duplication trick: Let \widehat{X}_k be an independent copy of the random variable X_k. Then by the bounded difference property we have, almost surely,
\left| E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k] - E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]\right| \le c_k
\tag{4.4} Furthermore we have
\begin{aligned}
E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]\\
&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_K]
\end{aligned}
\tag{4.5} The first equality holds because X_k and \widehat{X}_{k} are identically distributed and left-hand side is a function of X_1, \cdots, X_{k-1}. The second equality holds because X_k and \widehat{X}_{k} are idependent. Combining Equation 4.4 and Equation 4.5 shows that |Z_k| \le c_k almost surely.
Conditioning and Hoeffding: We have, using conditioning and Hoeffding’s bound from Lemma 4.1
\begin{aligned}
E\left[ e^{t\sum_{k=1}^n Z_k} \right] &= E \left[ E \left[ e^{t\sum_{k=1}^{n-1} Z_k} e^{tZ_n} | X_1, \cdots X_{n-1}\right] \right]
= E \left[ e^{\sum_{k=1}^{n-1} Z_k} E \left[ e^{Z_n} | X_1, \cdots X_{n-1}\right] \right] \le e^{\frac{t^2 c_n^2}{2}} E \left[ e^{t \sum_{k=1}^{n-1} Z_k}\right]
\end{aligned}
where we used that E[Z_n|X_1, \cdots, X_{n-1}]=0 and \textrm{ Var}(Z_n|X_1, \cdots, X_{n-1}) \le c_n^2. By induction we find
E\left[ e^{t h(X_1, \cdots, X_n) -E[h(X_1, \cdots, X_n)]} \right] \le e^{\frac{t^2 \sum_{k=1}^n c_k^2}{2}}
and the rest follows as usual.
4.6 Confidence interval for variance estimator
We show how to build confidence interval for the unbiased variance estimator
V_N(X_1, \cdots, X_N)=\frac{1}{N-1} \sum_{k=1}^N\left(X_i - \frac{S_N}{N}\right)^2 = \frac{1}{N-1}\sum_{i=1}^N X_i^2 - \frac{S_N^2}{(N-1)N}
If we change X_1 into \widehat{X}_1 then S_N(X_1, \cdots, X_N ) - S_N(\widehat{X}_1, \cdots, X_N ) = X_1 - \widehat{X}_1 and so
V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N) = \frac{X_1^2 - \widehat{X}_1^2}{N-1} - \frac{X_1 - \widehat{X}_1}{N-1} \left(\frac{S_N(X_1, \cdots, X_N )}{N} + \frac{S_N(\widehat{X}_1, \cdots, X_N )}{N} \right)
So if we assume |X_i|\le c we have the bound
|V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N)| \le \frac{5c^2}{N-1}
and thus by McDiarmid we get
P( V_N - \sigma^2 \ge \varepsilon ) \, \le \, e^{ - \frac{(N-1)^2}{N}\frac{\varepsilon^2}{50c^4}} \,\le\, e^{ -(N-2)\frac{\varepsilon^2}{50c^4}}
and this decay exponentially in N again and can be used for a confidence interval for the variance which is very similar to the one for the mean.
4.7 Bernstein bound
In Hoeffding’s bound we use, in an essential way, a bound on the variance. If the variance is small then one should expect the bound to be poor. Bernstein bound, which can be used if we have some a-priori knowledge about the variance, improves.
Theorem 4.4 (Bernstein Bound) Suppose X is a random variable such that |X-E[X]| \le c and \textrm{var}(X) \le \sigma^2. Then
E[e^{t X}]\le e^{t E[X]+ \frac{\sigma^2 t^2}{2(1 -c|t|/3)}} \,.
Proof. We expand the exponential and use that for k\ge 2,
E\left[(X-\mu)^k\right] \le E\left[(X-\mu)^2|X-\mu|^{k-2}\right]\le E[(X-\mu)^2] c^{k-2} \le \sigma^2 c^{k-2}
and get
\begin{aligned}
E\left[e^{t(X-\mu)}\right]=1+ \sum_{k=2}^\infty \frac{t^k}{k!} E[(X-\mu)^k] &\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}\frac{2}{k!}(|t|c)^{k-2} \\
&\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty \left(\frac{|t|c}{3}\right)^{k-2} \quad \textrm{ since } \frac{k}{2!}\ge 3^{k-2}\\
&\le 1 + \frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})} \le e^{\frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})}} \quad \textrm{ since } 1+ x \le e^x
\end{aligned}
and this concludes the proof.
4.8 Bernstein confidence interval
To obtain a confidence interval we need to solve an optimization problem: after some straightforward computation we find
\sup_{t \ge 0} \left\{ \varepsilon t - \frac{a t^2}{2(1-bt)} \right\} = \frac{a}{b^2}h\left( \frac{b\epsilon}{a}\right) \quad \textrm{ where } h(u) = 1 +u - \sqrt{1 + 2u}
and we note that the h^{-1}(z)= z + \sqrt{2z}. We obtain the exact same bound for the left tail (by symmetry) and thus
to obtain a confidence interval we have
\delta = 2 e^{-\frac{a}{b^2}h\left( \frac{b\varepsilon}{a}\right)} \iff \varepsilon = b \ln\left( \frac{2}{\delta}\right) + \sqrt{2a \ln\left( \frac{2}{\delta}\right)}
Theorem 4.5 (Bernstein for sum of IID) If X_1, \cdots, X_N are IID random variables with |X_i| \le c and \textrm{Var}(X_i)\le \sigma^2 then
You should note that this bound can be substantially better than Hoeffding’s bound is \sigma \le \sigma_{max}
4.9 Comparing the bounds for Bernoulli
Hoeffdings amount to bounding the variance by \frac14. Bernstein which incorporporates knowledge of (or a bound on) the variance is much better than Hoeffdings as soon as the variance is not too big (0.1 in the epicture below.)
4.10 References
Boucheron, S., G. Lugosi, and P. Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Ross, Sheldon M. 2013. Simulation. Academic Press.
———. 2019. Introduction to Probability Models. Twelfth edition. London, United Kingdom: Academic Press.
Rubinstein, Reuven Y., and Dirk P. Kroese. 2017. Simulation and the Monte Carlo Method. Third edition. Hoboken New Jersey: John Wiley & Sons.