Math 606, Spring 2023
University of Massachusetts Amherst
2023-06-25
We introduce the basic definitions necessary to describe Markov chains and provide a first series of examples. For further reading we recommend the books Lawler (2006) and Levin et al. (2017).
A time homogeneous Markov chain is specified by
Initial distribution \mu(i) for i \in S where \mu(i)=P(X_0=i). We have \mu(i)\ge 0 \quad \textrm{ and } \sum_{i \in S} \mu(i) =1 and \mu is called a probability vector.
Transition matrix P(i,j) for i,j \in S where P(i,j)=P(X_n=j|X_{n-1}=i). We have P(i,j) \ge 0 \quad \textrm{ and } \sum_{j \in S} P(i,j)=1 and the P(i,j) are called the transition probabilities.
All quantities of interest for the Markov chain can be computed using these two objects. For example we have P\left\{ X_0 =i_0, X_1=i_1, X_2=i_2\right\} \,=\, \mu(i_0) P(i_0,i_1) P(i_1, i_2) \,. or P\left\{ X_2=i \right\} \,=\, \sum_{i_0\in S} \sum_{i_1 \in S} \mu(i_0) P(i_0,i_1) P(i_1, i) and so on.
We will write \mu = ( \mu(1), \mu(2), \cdots, \mu(N)), that is, \mu is a row vector whose entries are the initial distribution.
Also we will write P for the N \times N matrix whose entries are P(i,j)
P \,=\, \left( \begin{array}{cccc} P(1,1) & P(1,2) & \cdots & P(1,N) \\ P(2,1) & P(2,2) & \cdots & P(2,N) \\ \vdots & \vdots & & \vdots \\ P(N,1) & P(N,2) & \cdots & P(N,N) \end{array} \right)
Proposition 1.1 (Kolmogorov equations) For a Markov process with initial distribution \mu and transition probabilities P:
1. The n-step transition probabilities are given by P \left\{ X_{n}=j \vert X_0={i} \right\} \,=\, P^n(i,j) where P^n is the matrix product \underbrace{P\cdots P}_{\rm n~times}.
2. If \mu(i)=P\{X_0=i\} is the distribution of X_0 then P\{ X_n=i\} \,=\, \mu P^n (i) is the distribution of X_n.
3. If f = (f(1), \cdots, f(N))^T is a column vector (think of f as a function f: S \to \mathbb{R}) then we have
P^n f(i) \,=\, E\left[ f(X_n) | X_0=i\right].
Proof. For 1. we use induction and assume the formula is true for n-1. We condition on the state at time n-1 and use the formula P( AB\vert C) = P(A \vert BC) P(B\vert C) together with the Markov property to find
\begin{aligned} P \left( X_{n}=j \vert X_0={i}\right) &= \sum_{k \in S} P \left( X_{n}=j \,, X_{n-1}=k \vert X_0={i} \right) \\ &= \sum_{k \in S} P \left( X_{n}=j \vert X_{n-1}=k \,, X_0={i} \right) P \left( X_{n-1}=k \vert X_0={i} \right) \\ &= \sum_{k \in S} P \left\{ X_{n}=j \vert X_{n-1}=k \right\} P \left\{ X_{n-1}=k \vert X_0={i} \right\} \\ &= \sum_{k \in S} P^{n-1}(i, k) P(k,j) \,=\, P^n(i,j) \,. \end{aligned}
For 2. note that \mu P is a probability vector since \sum_{i} \mu P(i) \,=\, \sum_{i} \sum_{j} \mu(j ) P(j,i) \,=\, \sum_{j} \mu(j) \sum_i P(j,i) \,=\, \sum_j \mu(j) \,.
Furthermore, by the law of total probability and part 1., P\{ X_n=j\} \,=\, \sum_{k \in S} P\{ X_n=j \vert X_0 =k\} P\{ X_0=k\} \,=\, \sum_{k} \mu(k) P^n(k,j) \,=\, \mu P^n (j)\,.
For 3. we have
P^nf (i) \,=\, \sum_{k} P^n(i,k) f(k) \,=\, \sum_{k} f(k) P \left\{ X_{n}=k \vert X_0={i} \right\} \,=\, E[f(X_{n})\,\vert \, X_0=i] \,.
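The three statements in Proposition 1.1 are easy to check numerically. Here is a minimal sketch in Python (numpy); the 3-state transition matrix, the initial distribution and the function f are made up for illustration.

```python
# Numerical check of Proposition 1.1 on a made-up 3-state chain.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],    # transition matrix, each row sums to 1
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
mu = np.array([1.0, 0.0, 0.0])    # initial distribution: start in state 1
f = np.array([1.0, 2.0, 3.0])     # a function f: S -> R, viewed as a column vector

n = 4
Pn = np.linalg.matrix_power(P, n)
print(Pn[0, 2])    # P{X_n = 3 | X_0 = 1} = P^n(1,3)
print(mu @ Pn)     # distribution of X_n, the row vector mu P^n
print(Pn @ f)      # the vector of conditional expectations E[f(X_n) | X_0 = i]
```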
Limiting distributions are always stationary distributions: Suppose \pi is a limiting distribution and \lim_{n\to \infty} \mu P^n = \pi. Then \pi P \,=\, (\lim_{n \to \infty} \mu P^n ) P \,=\, \lim_{n \to \infty} \mu P^{n+1} \,=\, \lim_{n \to \infty} \mu P^{n} \,=\, \pi \,. and thus \pi is stationary.
In the next section we will derive conditions under which stationary distributions are unique and are limiting distributions.
Example 1.1 (2-state Markov chain) Suppose S=\{1,2\} with transition matrix \displaystyle P \,=\, \left( \begin{array}{cc} 1-p & p \\ q & 1-q \end{array} \right) \,.
The equation for the stationary distribution, \pi P = \pi gives \pi(1) (1-p) + \pi(2) q \,=\, \pi(1) \quad \quad \pi(1) p + \pi(2) (1-q) \,=\, \pi(2) that is p \pi(1) = q \pi(2). Normalizing to a probability vector gives \displaystyle \pi = \left( \frac{q}{p+q},\frac{p}{p+q} \right).
We show \pi is a limiting distribution. Set \mu_n \equiv \mu P^n and consider the difference \mu_n -\pi. Using \mu_n(2)=1-\mu_n(1) we get the equation \begin{aligned} \mu_n(1) - \pi(1) &= \mu_{n-1} P(1) - \pi(1) = \mu_{n-1}(1) (1-p) + (1- \mu_{n-1}(1)) q - \frac{q}{p+q} \\ & = \mu_{n-1} (1)( 1-p-q) - \frac{q}{p+q}(1 -p -q) \,=\, (1-p-q) ( \mu_{n-1}(1) - \pi(1) ) \end{aligned} By induction \mu_n(1) - \pi(1) = (1-p-q)^n (\mu_0(1) - \pi(1)). If either p>0 or q>0 then \lim_{n \to \infty} \mu_n(1) = \pi(1) and this implies that \lim_{n\to \infty} \mu_n(2) = \pi(2) as well.
If either p or q does not vanish then \mu_n = \mu P^n converges to a stationary distribution for an arbitrary choice of the initial distribution \mu.
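As a quick numerical sanity check (a sketch, with the arbitrary values p=0.3 and q=0.1), one can verify both the stationarity of \pi and the exact geometric decay of the error:

```python
# Sketch for Example 1.1: stationary distribution and convergence rate of the
# 2-state chain; p = 0.3 and q = 0.1 are arbitrary illustrative values.
import numpy as np

p, q = 0.3, 0.1
P = np.array([[1 - p, p], [q, 1 - q]])
pi = np.array([q / (p + q), p / (p + q)])   # claimed stationary distribution
print(np.allclose(pi @ P, pi))              # True: pi P = pi

mu = np.array([1.0, 0.0])                    # arbitrary initial distribution
for n in range(1, 6):
    mu = mu @ P
    # the error mu_n(1) - pi(1) shrinks exactly by a factor (1-p-q) each step
    print(n, mu[0] - pi[0], (1 - p - q) ** n * (1.0 - pi[0]))
```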
Example 1.2 (Coupon collecting) A company offers toys in breakfast cereal boxes. There are N different toys available and each toy is equally likely to be found in any cereal box.
Let X_n be the number of distinct toys that you collect after buying n boxes; it is natural to set X_0=0. Then X_n is a Markov chain with state space \{0,1,2,\cdots,N\} and it has a simple structure since X_{n} either stays the same or increases by 1.
The transition probabilities are \begin{aligned} P(j, j+1) \,&=\, P\{ {\rm ~new~toy~} \vert {\rm ~already~j~toys}\} \,=\, \frac{N-j}{N} \\ P(j, j) \,& =\, P\{ {\rm ~no~new~toy~} \vert {\rm ~already~j~toys}\} \,=\, \frac{j}{N} \end{aligned} The Markov chain X_n will eventually reach the state N and stay there forever (N is called an absorbing state). Let us denote by \tau the (random) finite time it takes to reach the state N. To compute its expectation, E[\tau], let us write \tau = T_1 + \cdots +T_N \,, where T_i is the time needed to get a new toy after you have gotten your (i-1)^{th} toy. The T_i’s are independent and T_i has a geometric distribution with success probability p_i = \frac{N-i+1}{N}. Thus E[ \tau] \,=\, \sum_{i=1}^N E[ T_i] \,=\, \sum_{i=1}^N \frac{N}{N-i+1} \,=\, N \sum_{i=1}^N \frac{1}{i} \approx N \ln (N) \,.
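A short simulation confirms the formula E[\tau] \approx N \ln N; the sketch below uses N=50 and 2000 trials, both arbitrary choices.

```python
# Sketch for Example 1.2: simulate the coupon collector time tau and compare
# its average with the exact value N * (1 + 1/2 + ... + 1/N).
import random

def collect(N):
    """Number of boxes bought until all N toys have been collected."""
    seen, boxes = set(), 0
    while len(seen) < N:
        seen.add(random.randrange(N))   # each toy equally likely in each box
        boxes += 1
    return boxes

N, trials = 50, 2000
avg = sum(collect(N) for _ in range(trials)) / trials
exact = N * sum(1.0 / i for i in range(1, N + 1))
print(avg, exact)   # the two numbers should be close
```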
Example 1.3 (Random walk on a graph) Consider an undirected graph G consisting of a vertex set V and an edge set E (each edge e=\{v,w\} is an (unordered) pair of vertices). We say that the vertex v is a neighbor of the vertex w, and write v \sim w, if \{v,w\} is an edge. The degree of a vertex v, denoted \textrm{deg}(v), is the number of neighbors of v.
Given such a graph G=(V,E) the simple random walk on G is the Markov chain with state space V and transition matrix P(v,w) \,=\, \left\{ \begin{array}{cl} \frac{1}{{\rm deg} (v)} & {\rm if~} w \sim v \\ 0 & {\rm otherwise} \end{array} \right. \,.
The invariant distribution for the random walk on a graph is given by \pi(v) \,=\, \frac{ {\rm deg} (v)} { 2 |E|} where |E| is the number of edges. First note that \sum_{v} \pi(v)=1 since each edge connects two vertices. To show invariance note that \pi P(v) \,=\, \sum_{w} \pi(w)P(w,v) \,=\, \sum_{w ; w \sim v} \frac{{\rm deg} (w)}{2|E|} \frac{1}{ {\rm deg}(w)} \,=\, \frac{1}{2|E|} \sum_{w ; w \sim v}1 \,=\, \pi(v) \,.
For example, for a graph on 5 vertices with 8 edges, in which vertex 2 has degree 4 and the other vertices have degree 3, we get P\,=\, \left( \begin{array}{ccccc} 0 & \frac13 & \frac13 & \frac 13 & 0 \\ \frac14 & 0 & \frac14 & \frac 14 & \frac14 \\ \frac13 & \frac13 & 0 & 0& \frac 13 \\ \frac13 & \frac13 & 0 & 0 & \frac13 \\ 0 & \frac13 & \frac13 & \frac13 & 0 \\ \end{array} \right) \quad \quad \quad \pi = \left( \frac{3}{16}, \frac{4}{16}, \frac{3}{16}, \frac{3}{16}, \frac{3}{16}\right)
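A minimal sketch (Python/numpy) reconstructing this example from the adjacency matrix of the graph:

```python
# Sketch for Example 1.3: build the transition matrix of the simple random walk
# on the 5-vertex graph above from its adjacency matrix and check pi = deg/2|E|.
import numpy as np

A = np.array([[0, 1, 1, 1, 0],    # adjacency matrix of the graph
              [1, 0, 1, 1, 1],
              [1, 1, 0, 0, 1],
              [1, 1, 0, 0, 1],
              [0, 1, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)
P = A / deg[:, None]               # P(v,w) = 1/deg(v) if w ~ v, 0 otherwise
pi = deg / deg.sum()               # deg(v) / 2|E|
print(pi)                          # [3/16, 4/16, 3/16, 3/16, 3/16]
print(np.allclose(pi @ P, pi))     # True: pi is stationary
```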
Example 1.4 (Random walk on the hypercube) The d-dimensional hypercube graph Q_d has for vertices the binary d-tuples \mathbf{x}=(x_1, \cdots, x_d) with x_k \in \{0,1\} and two vertices are connected by an edge when they differ in exactly one coordinate (flipping a 0 into 1 or vice-versa).
The simple random walk on Q_d moves from one vertex \mathbf{x}=(x_1, \cdots, x_d) by choosing a coordinate j \in \{ 1,2,\cdots, d\} uniformly at random and setting x_j \to (1-x_j).
The degree of each vertex is d, the number of vertices is 2^d and the number of edges is 2^d \frac{d}{2}, so \pi(\mathbf{x}) \,=\, \frac{1}{2^d}, the uniform distribution on Q_d.
Example 1.5 (Asymmetric random walks on \{0,1,\cdots,N\}) State space S=\{0,1,\cdots,N\} and the Markov chain goes up by 1 with probability p and down by 1 with probability 1-p. P(j, j+1) =p,\, P(j, j-1) = 1-p \,, \quad \textrm {for } j=1, \cdots, N-1 We can pick different boundary conditions (BC) at 0 and N:
Absorbing BC: P(0,0)=1, P(N,N)=1
Reflecting BC: P(0,1)=1, P(N,N-1)=1
Partially reflecting BC: P(0,0)=(1-p)\,, P(0,1)=p\,, \quad P(N,N-1)=(1-p)\,, P(N,N)=p
Periodic BC: P(0,1)=p\,, P(0, N)=(1-p)\,, \quad P(N, 0) =p\,, P(N, N-1)= (1-p)
Example 1.6 (Ehrenfest urn model) Suppose d balls are distributed among two urns, urn A and urn B. At each move one ball is selected uniformly at random among the d balls and is transferred from its current urn to the other urn. If X_n is the number of balls in urn A then the state space is S=\{0, 1, \cdots, d\} and the transition probabilities are P(j,j+1) \,=\, \frac{d-j}{d}\,, \quad P(j,j-1) \,=\, \frac{j}{d} \,. We show that the invariant distribution is binomial with parameters (d,\frac{1}{2}), that is \pi(j) \,=\, {d \choose j} \frac{1}{2^d}.
\begin{aligned} \pi P (j) &= \sum_{k} \pi(k) P(k,j) = \pi(j-1) P(j-1,j) + \pi(j+1) P(j+1, j) \\ &= \frac{1}{2^d}\left[ {d \choose j-1} \frac{d - (j-1)}{d} + {d \choose j+1} \frac{j+1}{d} \right] = {d \choose j} \frac{1}{2^d} \,. \end{aligned}
This Markov chain is closely related to the simple random walk Y_n on the hypercube Q_d. Indeed selecting randomly one of the d balls and moving it to the other urn is equivalent to selecting a random coordinate y_k of \mathbf{y}=(y_1, \cdots, y_d) and changing it to 1-y_k. If we denote by j= |\mathbf{y}|=y_1+ \cdots + y_d the number of 1s in \mathbf{y}, then P(X_{n+1}=j+1|X_n=j) = P(\textrm{ choose } k \textrm{ such that } y_k=0) = \frac{d-j}{d}
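Here is a small numerical check that the binomial distribution is indeed stationary for the Ehrenfest chain (the choice d=10 is arbitrary):

```python
# Sketch for Example 1.6: check that Binomial(d, 1/2) is stationary for the
# Ehrenfest chain; d = 10 is an arbitrary choice.
import numpy as np
from math import comb

d = 10
P = np.zeros((d + 1, d + 1))
for j in range(d + 1):
    if j < d:
        P[j, j + 1] = (d - j) / d    # a ball from urn B is moved to urn A
    if j > 0:
        P[j, j - 1] = j / d          # a ball from urn A is moved to urn B
pi = np.array([comb(d, j) / 2**d for j in range(d + 1)])
print(np.allclose(pi @ P, pi))       # True
```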
We present the basic convergence theory for finite state Markov chains and introduce the concepts of irreducibility and period of a Markov chain. This is very classical stuff, see e.g. Lawler (2006) and Levin et al. (2017).
We first show that stationary distributions always exist for finite state Markov chains. This need not be the case if the state space is countably infinite.
Theorem 2.1 Let X_n be a Markov chain on a finite state space S. Then X_n has at least one stationary distribution.
Proof. The Bolzano-Weierstrass theorem asserts that any bounded sequence \{\mathbf{x}_k\} in \mathbb{R}^d has a convergent subsequence.
Choose \mu to be an arbitrary initial distribution and let \mu_n = \mu P^n. Applying Bolzano-Weierstrass to the sequence \mu_n directly will not work, but consider instead the time averages \nu_n \,=\, \frac{\mu + \mu P + \cdots + \mu P^{n-1}}{n}\,. The sequence \nu_n is bounded since the \mu_n and \nu_n are probability vectors. Therefore there exists a convergent subsequence \nu_{n_k} with \lim_{k \to \infty}\nu_{n_k}=\pi, and \pi is a probability vector as well. We show that \pi is a stationary distribution. Note \nu_n P - \nu_n \,=\, \frac{ \mu P + \cdots + \mu P^{n} - \mu - \cdots -\mu P^{n-1}}{n} \,=\, \frac{\mu P^n - \mu}{n} \quad \textrm{ and thus } \quad \left\vert \nu_n P(j) - \nu_n(j) \right\vert \,\le \, \frac{1}{n}\,. To conclude we note that | \pi P(j) - \pi(j)| \,=\, \lim_{k \to \infty} | \nu_{n_k}P (j) - \nu_{n_k}(j) | \,\le\, \lim_{k \to \infty} \frac{1}{n_k} \,=\, 0 \,, and thus \pi P = \pi, that is \pi is a stationary distribution.
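The averaging trick in the proof is easy to see numerically. In the sketch below we use a 3-state cyclic chain (chosen so that \mu P^n itself does not converge) and check that the Cesàro averages \nu_n are nearly stationary, with error at most 1/n:

```python
# Sketch of the averaging argument in Theorem 2.1 on a made-up cyclic chain.
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])        # periodic chain: mu P^n does not converge
mu = np.array([1.0, 0.0, 0.0])

for n in [10, 100, 1000]:
    powers = [mu @ np.linalg.matrix_power(P, k) for k in range(n)]
    nu_n = sum(powers) / n              # (mu + mu P + ... + mu P^{n-1}) / n
    print(n, np.max(np.abs(nu_n @ P - nu_n)))   # at most 1/n, as in the proof
```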
P \,=\, \left( \begin{array}{cccc} \frac{1}{2} & \frac{1}{2} & 0 & 0 \\ \frac13 & \frac23 & 0 & 0 \\ 0 & 0 & \frac16 & \frac56 \\ 0 & 0 & \frac45 & \frac15 \end{array} \right) \quad \quad \pi_1=\left( \frac25 , \frac 35 ,0,0\right) \textrm{ and } \pi_2= \left( 0, 0 ,\frac{24}{49}, \frac{25}{49}\right) \quad \textrm{ are stationary} If the Markov chain starts in state 1 or 2 it will only visit states 1 and 2 in the future and so we can build a stationary distribution (by Theorem 2.1, using an initial \mu which vanishes on 3 and 4) which vanishes on states 3 and 4.
Notations:
We say that j is accessible from i, symbolically i \rightsquigarrow j if there exists n\ge 0 such that P^n(i,j)>0.
We say that i and j communicate , symbolically i \leftrightsquigarrow j if i \rightsquigarrow j and j \rightsquigarrow i.
A Markov chain X_n is irreducible if every state i\in S communicates with every other state j \in S, that is, for any pair of states i,j in S there exists n such that P^n(i,j) >0.
In an irreducible Markov chain starting from any state you will eventually visit any other state.
Theorem 2.2 (Uniqueness of stationary distributions) Suppose X_n is an irreducible Markov chain with finite state space S.
If \pi is a stationary distribution then \pi(j)>0 for all j\in S,
If h is a column vector such that Ph=h, then h=c(1,1,\cdots,1)^T is a constant vector.
The Markov chain X_n has a unique stationary distribution \pi.
Proof. For 1., if \pi is a stationary distribution then \pi(i)>0 for at least one i. If j is such that i \rightsquigarrow j then there exists a time r such that P^r(i,j) >0 and thus \pi(j) =\pi P^r(j)\,=\, \sum_{k} \pi(k) P^r(k,j) \,\ge \, \pi(i) P^r(i,j) >0 \,. Since X_n is irreducible, \pi(j)>0 for any j \in S.
For 2., suppose that Ph=h and let i_0 be such that h(i_0)= \max_{i\in S} h(i)\equiv M. If h is not a constant vector there exists j with i_0 \rightsquigarrow j but h(j) < M, and there exists r such that P^r(i_0,j)>0. Since P^rh=h, M= h(i_0) \,=\, P^r h(i_0) \,=\, P^r( i_0, j) \underbrace{h(j)}_{< M} + \sum_{l \not= j} P^r(i_0,l) \underbrace{h(l)}_{\le M} < M\sum_{l} P^r(i_0,l)=M\,, and this is a contradiction.
For 3., part 2. shows that the geometric multiplicity of the eigenvalue 1 of the matrix P is equal to 1, and thus so is the geometric multiplicity of the eigenvalue 1 for the transpose P^T; therefore P has at most one stationary distribution \pi (a left eigenvector with eigenvalue 1).
Question: If we have a unique stationary distribution is it correct that \lim_{n \to \infty} \mu P^n = \pi?
Without further assumptions the answer is NO, because of possible periodic behavior. Consider for example the random walk on \{1, \cdots, 2N\} with periodic boundary conditions. The stationary distribution is uniform, \pi= \left(\frac{1}{2N}, \cdots, \frac{1}{2N}\right). But if X_0 is even then X_1 is odd, and the chain keeps alternating periodically between even and odd positions. For example for
P \,=\, \left( \begin{array}{cccc} 0 & 1/4 & 0 & 3/4 \\ 3/4 & 0 & 1/4 & 0 \\ 0 & 3/4 & 0 & 1/4 \\ 1/4 & 0 & 3/4 & 0 \end{array} \right) \quad \quad \pi = \left( \frac14,\frac14 ,\frac14,\frac14 \right) \textrm{ is stationary} and
P^{10}=\begin{pmatrix} 0.4995 & 0 & 0.5004 & 0 \\ 0 & 0.4995 & 0 & 0.5004 \\ 0.5004 & 0 & 0.4995 & 0 \\ 0 & 0.5004 & 0 & 0.4995 \\ \end{pmatrix} \quad \quad P^{11}= \begin{pmatrix} 0 & 0.5002 & 0 & 0.4997 \\ 0.4997 & 0 & 0.5002 & 0 \\ 0 & 0.4997 & 0 & 0.5002 \\ 0.5002 & 0 & 0.4997 & 0 \\ \end{pmatrix} and P^n(i,j) does not converge but oscillates, asymptotically, between 0 and 1/2.
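These powers are easy to reproduce numerically (a sketch):

```python
# Sketch for the periodic example above: P^n oscillates between the two patterns
# shown for P^10 and P^11 and never converges entrywise.
import numpy as np

P = np.array([[0,    0.25, 0,    0.75],
              [0.75, 0,    0.25, 0   ],
              [0,    0.75, 0,    0.25],
              [0.25, 0,    0.75, 0   ]])
print(np.round(np.linalg.matrix_power(P, 10), 4))   # nonzero entries near 1/2 in one pattern
print(np.round(np.linalg.matrix_power(P, 11), 4))   # and in the complementary pattern one step later
```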
Theorem 2.3 (Doeblin’s Theorem) Suppose X_n is an irreducible and aperiodic Markov chain on the finite state space S with stationary distribution \pi. There exist a constant C>0 and a number \alpha with 0\le \alpha< 1 such that for any initial distribution \mu we have |\mu P^n(j) - \pi(j)| \,\le\, C \alpha^n \,, \tag{2.1} i.e., the distribution of X_n converges, exponentially fast, to \pi.
Proof. Since the Markov chain is irreducible and aperiodic we can find an integer r such that P^r has strictly positive entries. Let \Pi be the stochastic matrix \Pi \,=\, \left( \begin{array}{cccc} \pi(1)& \pi(2)& \cdots& \pi(N) \\ \pi(1)& \pi(2)& \cdots& \pi(N) \\ \vdots& \vdots & &\vdots \\ \pi(1) &\pi(2)& \cdots& \pi(N) \end{array} \right) where every row is the stationary distribution \pi. Note that this corresponds to independent sampling from the stationary distribution.
Since all elements of P^r(i,j) are strictly positive we can pick \delta >0 sufficiently small such that for all i,j P^r(i,j) \,\ge \, \delta\,\Pi(i,j)=\delta \pi(j) \,. \tag{2.2}
Let us set \theta = 1-\delta and by Equation 2.2 we define a stochastic matrix Q through the equation P^r = (1-\theta) \Pi + \theta Q \,. Note the following facts:
Since \pi P = \pi we have \Pi P^n = \Pi for any n \ge 1.
For any stochastic matrix M we have M \Pi = \Pi since all rows of \Pi are identical.
Using these two properties we show, by induction, that any integer k\ge 1, P^{kr} \,=\, (1 - \theta^k) \Pi + \theta^k Q^k. This is true for k=1 and so let us assume this to be true for k. We have \begin{aligned} P^{r(k+1)} = P^{rk} P^r &= \left[ (1 - \theta^k) \Pi + \theta^k Q^k \right] P^r = (1 - \theta^k) \Pi P^r + \theta^k Q^k [(1-\theta) \Pi + \theta Q] \\ &= (1 - \theta^k) \Pi + \theta^k(1-\theta) \Pi + \theta^{k+1} Q^{k+1} = (1 - \theta^{k+1}) \Pi + \theta^{k+1} Q^{k+1} \end{aligned} and this concludes the induction step.
From this we conclude that P^{rk} \to \Pi as k \to \infty. We write any integer n as n=k r +l where 0 \le l < r. We have then P^n = P^{kr} P^l = [(1 - \theta^k) \Pi + \theta^k Q^k] P^l = \Pi + \theta^{k} \left[ Q^k P^l - \Pi \right] and thus |P^n(i,j) - \pi(j)| \,=\, \theta^k |Q^k P^l(i,j) - \Pi(i,j)| \le \theta^k \le \frac{1}{\theta} (\theta^{1/r})^n \equiv C\alpha^n \,. The same bound holds for any initial distribution \mu, by multiplying by \mu(i) and summing over i.
As we have seen before, Markov chains can exhibit periodic behavior and cycle around various parts of the state space.
For example, for random walks with periodic boundary conditions the period is 2 if the number of sites is even and 1 if the number of sites is odd.
For a state j let {\mathcal T}(j) \,=\, \{ n \ge 1 \,:\, P^n(j,j)>0 \} be the set of possible return times to j, and let \textrm{per}(j) = \gcd \,{\mathcal T}(j) be the period of the state j. The set {\mathcal T}(j) is closed under addition: n,m \in {\mathcal T}(j) \implies m+n \in {\mathcal T}(j), which holds true since P^{n+m}(j,j) \ge P^{n}(j,j)P^{m}(j,j).
By definition if \textrm{per}(j)=d then {\mathcal T}(j) \subset \{ 0,d,2d, \cdots\} Since it is closed under addition a fact from number theory (see exercise) implies that there exists k such that \{ kd, (k+1)d, (k+2)d \cdots \} \subset {\mathcal T}(j) \quad \textrm { or equivalently } \quad P^{nd}(j,j)>0 \textrm{ for all } n\ge k
Lemma 2.1 If i \leftrightsquigarrow j then \textrm{per}(i)=\textrm{per}(j).
Proof. Suppose \textrm{per}(j)=d and i \leftrightsquigarrow j. There exist integers m and n such that P^m(i,j)>0 and P^n(j,i)>0 and thus P^{m+n}(i,i)>0 \textrm{ and }P^{m+n}(j,j)>0 which means that m+n \in {\mathcal T}(i) \cap {\mathcal T}(j) and so m+n=kd for some integer k.
Suppose l \in {\mathcal T}(i) then P^{n+m+l}(j,j) \ge P^n(j,i) P^l(i,i) P^m(i,j) > 0 and n+m+l \in {\mathcal T}(j). Therefore d divides l since it divides both n+m+l and n+m and this shows that \textrm{per}(j) \le \textrm{per}(i). Reversing the roles of i and j we find \textrm{per}(j) = \textrm{per}(i). \quad \blacksquare
Consequences:
If X_n is irreducible then all states have the same period and we can define the period of the irreducible Markov chain X_n .
If X_n is irreducible and has period 1 (also called aperiodic) and S is finite then P^{n}(i,j) >0 for all sufficiently large n (compare with our earlier definition of irreducible and aperiodic).
Decomposing the state space S = G_1 \cup G_2 \cup \cdots G_d:
Start with a state i\in S and let i \in G_1.
For all j such that P(i,j)>0 let j \in G_2.
For all j\in G_2 and all k such that P(j,k)>0 let k\in G_3.
After assigning states to G_d, assign the next states to G_1.
Repeat until all states have been assigned. If |S| is finite this takes at most |S| steps.
By construction, since the period is d, we get a partition of S into d disjoint subsets G_1, \cdots, G_d and P(i,j)> 0 \implies i \in G_k \textrm{ and } j \in G_{k+1 (\textrm{mod} d)} For example
P= \begin{matrix} 1\\2\\3\\4\\5\\6\\7 \end{matrix} \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & .2 & 0 & 0 & .8 & 0 \\ .3 & .7 & 0 & 0 & 0 & 0 & 0 \\ .2 & .8 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & .7 & .3 & 0 \\ \end{pmatrix}
Iteratively we find
1 \rightsquigarrow 4 \rightsquigarrow 3,6 \rightsquigarrow 1,2 \rightsquigarrow 4,7 \rightsquigarrow 3,5,6 \rightsquigarrow 1,2 \rightsquigarrow 4,7
and thus the Markov chain is irreducible with period 3 and so we have
S=\{1,2\} \cup \{4,7\} \cup \{3,5,6\}
Relabeling the states, the transition matrix P takes the following block matrix form P\,=\, \begin{matrix} G_1 \\ G_2 \\ \vdots\\ G_{d-1}\\ G_d \end{matrix} \begin{pmatrix} 0 & P_{G_1 G_2} & 0 & \cdots &0 \\ 0 & 0 &P_{G_2 G_3} & \cdots &0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 0 &P_{G_{d-1} G_d} \\ P_{G_dG_1}& 0 & \cdots & 0 & 0 \end{pmatrix}
where P_{G_l G_{l+1}} describe the transition between states in G_l and G_{l+1}.
Putting the matrix P to the power d we find the block diagonal form P^d\,=\, \begin{matrix} G_1 \\ G_2 \\ \vdots\\ G_d \end{matrix} \begin{pmatrix} Q_{G_1} & 0 & 0 & \cdots &0 \\ 0 & Q_{G_2} & 0 & \cdots & 0 \\ \vdots & \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 & Q_{G_d} \end{pmatrix}
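For the 7-state example above one can check the period and the block structure numerically; the sketch below computes the period of state 1 as a gcd of return times and displays P^3 with the states reordered according to G_1, G_2, G_3.

```python
# Sketch for the 7-state example: period of state 1 and block structure of P^3.
import numpy as np
from math import gcd
from functools import reduce

P = np.array([[0,  0,  0,  1,  0,  0,  0 ],
              [0,  0,  0,  0,  0,  0,  1 ],
              [1,  0,  0,  0,  0,  0,  0 ],
              [0,  0, .2,  0,  0, .8,  0 ],
              [.3, .7, 0,  0,  0,  0,  0 ],
              [.2, .8, 0,  0,  0,  0,  0 ],
              [0,  0,  0,  0, .7, .3,  0 ]])

return_times = [n for n in range(1, 25) if np.linalg.matrix_power(P, n)[0, 0] > 0]
print(reduce(gcd, return_times))                 # 3, the period of the chain

P3 = np.linalg.matrix_power(P, 3)
order = [0, 1, 3, 6, 2, 4, 5]                    # G1={1,2}, G2={4,7}, G3={3,5,6}, 0-indexed
print(np.round(P3[np.ix_(order, order)], 3))     # block diagonal, one block per G_l
```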
Theorem 2.4 If X_n is an irreducible Markov chain on a finite state space with stationary distribution \pi then we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P^k(i,j) \,=\, \pi(j) \,.
Proof. The matrix Q_{G_l}, restricted to the state space G_l, governs the evolution of an irreducible aperiodic Markov chain and thus, by Theorem 2.3, we have \lim_{n \to \infty} P^{nd}(i,j) \,=\,\lim_{n \to \infty} Q^n_{G_l}(i,j) = \pi_{G_l}(j)\,\quad \textrm{ for } i \in G_l, j \in G_l, where \pi_{G_l} is the stationary distribution for Q_{G_l}. Moreover, for i,j \in G_l, P^{m}(i,j)=0 if m\not=nd for some n.
If i \in G_{l-1} and j \in G_{l} then \lim_{n \to \infty} P^{nd+1}(i,j) \,=\, \lim_{n\to \infty} \sum_{k \in G_{l}} P(i,k) P^{nd}(k,j) \,=\, \sum_{k \in G_{l}} P(i,k) \pi_{G_l}(j) \,=\, \pi_{G_l}(j)\,, and P^{m}(i,j)=0 if m\not=nd+1 for some n. Similarly if i \in G_{l-r} and j \in G_{l} we have \lim_{n \to \infty} P^{nd+r}(i,j) \,=\, \pi_{G_l}(j)\,, and P^{m}(i,j)=0 if m\not=nd+r for some n.
The sequence P^n(i,j) is asymptotically periodic: d-1 successive zeros alternate with a number converging to \pi_{G_l}(j). Thus if we define \pi \equiv \frac{1}{d} ( \pi_{G_1}, \cdots, \pi_{G_d}) then \pi is normalized, stationary, and \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n P^k(i,j) \,=\, \pi(j) since the time spent in state j is asymptotically equal to \frac{1}{d} \pi_{G_l}(j).
We now have a good understanding of the asymptotic behavior of P^n(i,j) or \mu P^n as n \to \infty. When simulating or observing a Markov chain we usually observe a single sequence X_0, X_1, X_2, \cdots and our next goal is to derive a law of large numbers for this sequence of random variables. To do this it will be useful to track how often the Markov chain visits some state.
Notations: For a state j \in S we define the hitting time \sigma(j) = \min\{ n \ge 0 \,:\, X_n = j\} and the return time \tau(j) = \min\{ n \ge 1 \,:\, X_n = j\}.
Hitting times and return times are closely related and are distinct only in the way they treat what happens at time 0 (both are useful later on). They are examples of stopping times and have the crucial property that the event \{\tau(j)=k\} depends only on X_0, \cdots, X_k. In particular we have the strong Markov property P( X_{\tau(j)+1}=l| \tau(j)=k, X_0=i_0, \cdots, X_k=i_k) = P( X_{k+1}=l| X_k=j) = P(j,l) which means that the Markov chain starts afresh after the random return time.
Irreducibility means that starting from i the Markov chain will reach j with positive probability, and thus for any i we have P\{\tau(j) <\infty |X_0=i\} >0. If the state space is finite then we have something much stronger. Not only do we have P\{\tau(j) <\infty |X_0=i\} =1, the expectation of \tau(j) is actually finite.
Lemma 2.2 For an irreducible Markov chain on the finite state space S the expected hitting time is finite: E[\sigma(j)|X_0=i] < \infty for all i,j \in S.
Proof. By irreducibility the Markov chain can reach the state j from any state i in a finite time. Since S is finite this time is uniformly bounded: there exist a time m and \varepsilon>0 such that P^l(i,j)\ge \varepsilon for some l \le m, uniformly in i \in S. Using the strong Markov property this implies that P\{\sigma(j) > k m|X_0=i\} \le (1-\varepsilon) P \{ \sigma(j) > (k-1) m |X_0=i\}\le \cdots \le (1-\varepsilon)^k
Therefore, using that P\{\sigma(j) > n|X_0=i\} is a decreasing function of n E[\sigma(j)|X_0=i] = \sum_{n \ge 0} P\{\sigma(j) > n|X_0=i\} \le m \sum_{k \ge 0} P\{\sigma(j) > km|X_0=i\}\le m \sum_{k \ge 0} (1-\varepsilon)^k < \infty \quad \blacksquare
We will also need the random variable Y_n(j) which counts the number of visits to state j up to time n, that is Y_n(j) \equiv \sum_{k=0}^{n-1} \mathbf{1}_{\{X_k=j\}} \quad = \textrm{ the number of visits to the state } j \textrm{ up to time } n
Since E \left[ \mathbf{1}_{\{X_k=j\}} \vert X_0=i \right] =P^k(i,j) we have \frac{1}{n}E[Y_n(j)\vert X_0=i]= \frac{1}{n}\sum_{k=0}^{n-1} P^k(i,j) and thus by Theorem 2.3 or Theorem 2.4 \pi(j) \,=\, \lim_{n\to \infty} \frac{1}{n}E\left[ \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \right] \quad = \textrm{ the proportion of time spent in state } j \textrm{ in the long run. } This is an important intuition:
\pi(j) = \textrm{expected fraction of time that the Markov chain spends in state } j \textrm{ in the long run}. The next results strengthen that interpretation.
Theorem 2.5 (Ergodic Theorem for Markov chains) Let X_n be an irreducible aperiodic Markov chain with arbitrary initial condition \mu. Then, with probability 1, we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \,=\, \pi(j) \,; this is the ergodic theorem for Markov chains.
Moreover if \tau(j) is the first return time to j we have Kac’s formula \pi(j) \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,.
Proof. For a Markov chain starting in some state i consider the successive returns to the state j. By the strong Markov property, once the Markov chain has reached j it starts afresh. So the time when the Markov chain returns to state j for the k^{th} time is T_k(j)\,=\, \tau_1(j) + \cdots+ \tau_k(j) \,, where the \tau_l(j) are independent copies of the return time \tau(j). For l\ge 2, \tau_l(j) is conditioned on starting at j while for l=1 it depends on the initial condition. Note that by the strong LLN for IID random variables we have \lim_{k\to \infty}\frac{T_k(j)}{k}\,=\, \lim_{k \to \infty} \frac{1}{k} \left( \tau_1(j) + \cdots+ \tau_k(j) \right) \,=\, E[\tau(j)|X_0=j] \,.
Note further that if the total number of visits to state j up to time n is k, it means that we have returned exactly k times and so \left\{ Y_n(j)=k \right\} =\left\{ T_{k}(j) \le n < T_{k+1}(j) \right\} So we have \frac{ T_{Y_n(j)}(j) } {Y_n(j)} < \frac{n}{Y_n(j)} \le \frac{T_{Y_n(j)+1}(j)}{Y_n(j)+1} \frac{Y_n(j)+1}{ Y_n(j)} As n \to \infty, Y_n(j) \to \infty too, and taking n \to \infty both extremes of the inequality converge to E[\tau(j) \vert X_0=j] with probability 1; thus we conclude that, with probability 1, \lim_{n \to \infty} \frac{Y_n(j)}{n} \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,. On the other hand we know that \lim_{n \to \infty} \frac{E[Y_n(j)]}{n} \,=\, \pi(j) \,, and thus \pi(j) \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,. Putting all the pieces together gives the theorem. \quad \blacksquare.
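The ergodic theorem is easy to illustrate by simulation: the empirical occupation frequencies of a single long trajectory approach \pi. The sketch below uses an arbitrary 3-state chain and compares the frequencies with the stationary distribution obtained as a left eigenvector.

```python
# Sketch of Theorem 2.5: long-run occupation frequencies versus pi.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
n_steps = 200_000
visits = np.zeros(3)
state = 0
for _ in range(n_steps):
    visits[state] += 1
    state = rng.choice(3, p=P[state])    # one step of the chain
print(visits / n_steps)                  # empirical occupation frequencies

# stationary distribution as the left eigenvector for the eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
print(pi / pi.sum())
```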
We drop the assumption of irreducibility and develop a number of tools to study the transient behavior of Markov chains. Recall that communication i \leftrightsquigarrow j is an equivalence relation:
It is reflexive : i \leftrightsquigarrow i.
It is symmetric: i \leftrightsquigarrow j implies j \leftrightsquigarrow i.
It is transitive: i \leftrightsquigarrow j and j \leftrightsquigarrow l implies i \leftrightsquigarrow l.
There are two kinds of communication classes, transient classes and closed classes:
A class C is called transient if there exists i\in C and j \in S \setminus C with i \rightsquigarrow j.
This means it is possible to exit the class C and never come back: if the chain could come back from j it would mean j \rightsquigarrow k for some k \in C and thus j \in C, a contradiction.
A class C is called closed if it is not transient, that is, for any pair i\in C and j \in S \setminus C we have i \not\rightsquigarrow j.
Clearly it is impossible to exit a closed class.
The next result states that if you start in a finite transient class you will eventually exit it (this may no longer be true if S is infinite).
Lemma 3.1 Suppose S is finite and X_0 \in C where C is a transient class. Then X_n exits C after a finite time with probability 1. As a consequence for i, j \in C we have \lim_{n\to \infty} P^n(i,j) =0 \,.
Proof. Since C is a transient communication class, the Markov chain can exit C, starting from any i\in C, after a finite time. Since C is finite, there exist k and \theta < 1 such that P \{X _k \in C | X_0=i \} \le \theta \,\, \textrm{ for all } i \in C \,. Using the (strong) Markov property this implies that P \{X _{nk} \in C | X_0=i \} \le \theta^n and so the probability to stay in the transient class goes to 0 as time goes by. If i and j both belong to the transient class C this can be re-expressed as \lim_{n\to \infty} P^n(i, j) \,=\, 0 \,. \quad \blacksquare
To figure out the class structure it is convenient to build a directed graph where the vertices are the states and each directed edge corresponds to a pair (i,j) with P(i,j)>0.
For example
P= \begin{matrix} 1 \\ 2 \\ 3\\4\\5\\6\\7\\8 \end{matrix} \begin{pmatrix} 0 & .4 & .6 & 0 & 0 & 0 & 0 &0 \\ .3 & .7 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & .5 & 0 & 0 & 0 & .5\\ 0 & 0 & .2 & 0 & .8 & 0 & 0 &0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 &0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 &0 \\ 0 & 0 & 0 & 0 & 0 & .3 & 0 &.7 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 &0 \end{pmatrix}
There are 4 classes: T_1=\{1,2\}, T_2=\{3,4\}, T_3=\{5\} are all transient and C=\{6,7,8\} which is closed.
Consider a Markov chain with closed (recurrent) classes R_1, \cdots, R_L and transient classes denoted by T_1, \cdots, T_K (set T= T_1 \cup \cdots \cup T_K). After reordering the states the transition matrix has the block form P\,=\, \begin{matrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_L \\ T \end{matrix} \begin{pmatrix} P_1 & & & & &\\ & P_2 & & 0 & &\\ & & P_3 & & & \\ & 0 & & \ddots & &\\ & & & &P_L &\\ & & S & & & Q \end{pmatrix} \tag{3.1}
where P_l gives the transition probabilities within the class R_l, Q the transition within the transient classes and S\not= 0 the transition from the transient classes into the recurrent classes.
It is easy to see that P^n has the form P^n\,=\, \begin{matrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_L \\ T \end{matrix} \begin{pmatrix} P_1^n & & & & &\\ & P_2^n & & 0 & &\\ & & P_3^n & & & \\ & 0 & & \ddots & &\\ & & & &P_L^n &\\ & & S_n & & & Q^n \end{pmatrix} for some matrix S_n.
Example: Consider the random walk with absorbing boundary conditions (see Example 1.5). There are three classes, two closed ones \{0\} and \{N\}, and one transient class \{1, \cdots, N-1\}. For example with N=5 we have
P\,=\, \begin{matrix} 0 \\ 5 \\ 1 \\ 2 \\ 3 \\ 4 \end{matrix} \begin{pmatrix} 1 & 0 & 0 & 0 &0 & 0\\ 0 & 1 & 0 & 0 & 0& 0\\ 1/2 & 0 & 0 & 1/2 & 0 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1/2 & 0 & 1/2\\ 0 & 1/2 & 0 & 0 & 1/2 & 0 \end{pmatrix} To get a feel of what’s going on we find (rounded to 4 digit precision)
P^{20} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0.7942 & 0.1953 & 0.004 & 0 & 0.0065 & 0 \\ 0.5924 & 0.3907 & 0 & 0.0104 & 0 & 0.0065 \\ 0.3907 & 0.5924 & 0.0065 & 0 & 0.0104 & 0 \\ 0.1953 & 0.7942 & 0 & 0.0065 & 0 & 0.004 \\ \end{pmatrix}\,, \quad P^{50}=\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0.8 & 0.2 & 0 & 0 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 & 0 & 0 \\ 0.4 & 0.6 & 0 & 0 & 0 & 0 \\ 0.2 & 0.8 & 0 & 0 & 0 & 0 \\ \end{pmatrix}
As expected, starting in the transient class the Markov chain eventually exits it and reaches one of the two closed classes. From P^{50} one can read that, starting in, say, state 2, with probability .6 the Markov chain will end up at 0 and with probability .4 it will end up at 5. We will learn how to compute these probabilities later on, but we study first how long it takes to exit the transient class.
Starting in a recurrent class R_l the Markov chain stays in R_l forever and its behavior is dictated entirely by the irreducible matrix P_l with state space R_l.
Starting from a transient class the Markov chain will eventually exit the class T and is absorbed into some recurrent class. One basic question is: What is the expected time until absorption?
We define the absorption time \tau_{\textrm{abs}} = \min\{ n \ge 0 \,;\, X_n \notin T \}
Theorem 3.1 (Expected time until absorption) Let j be a transient state and let \tau_{\textrm{abs}} be the time until the Markov chain reaches some closed class. Then we have E[\tau_{abs} |X_0=j] \,=\, \sum_{i\in T} M(j,i) \,, where M=(I-Q)^{-1} = I + Q + Q^2 + \cdots and Q is from the decomposition Equation 3.1.
Proof. From Lemma 3.1 we know that Q^n(i,j) \to 0 and so all eigenvalues of Q have absolute values strictly less than 1. Therefore I-Q is an invertible matrix and we can define
M = (I-Q)^{-1} = I +Q +Q^2 + Q^3 + \cdots
The second equality is the geometric series for matrices which follows from the identity
(I+ Q + \cdots + Q^n)(I-Q) = I -Q^{n+1} \,.
To give a probabilistic interpretation to the matrix M we introduce the random variable
Y(i) \,=\, \sum_{n=0}^\infty 1_{\{X_n=i\}} \quad = \textrm{ total number of visits to state } i
If i is transient by Lemma 3.1 Y(i) < \infty with probability 1. If j is another transient state we have
E[ Y(i) \,\vert\, X_0=j]= \,E\left[ \sum_{n=0}^\infty {\bf 1}_{\{X_n=i\}} \,\vert\, X_0=j\right] \,=\,\sum_{n=0}^\infty P\left\{ X_n=i \,\vert\, X_0=j\right\} \,=\,\sum_{n=0}^\infty Q^n(j,i) = M(j,i)
That is M(j,i) is simply the expected number of visits to i if X_0=j and thus, summing over all the transient states we obtain
E[\tau_{abs} |X_0=j] \,=\, \sum_{i\in T} M(j,i) \,. \quad \blacksquare
Suppose X_n is irreducible and we are interested in computing the expected time to reach one state from another state that is E[ \tau(i)| X_0=j] \textrm{ for } j \not= i.
If i=j then it follows from Theorem 2.5 that E[ \tau(i)| X_0=i]=\pi(i)^{-1} where \pi is the stationary distribution.
If i \not= j we use the following idea. Relabel the states such that the first state is i and modify P so that i becomes an absorbing state
P = \begin{pmatrix} P(i,i) & R \\ S & Q \end{pmatrix} \quad \quad \longrightarrow \quad \quad \widetilde{P} = \begin{pmatrix} 1 & 0 \\ S & Q \end{pmatrix} \tag{3.2}
The fact that P is irreducible implies that S\setminus \{i\} is a transient class for the modified transition matrix \widetilde{P}. The return time \tau(i), starting from j\not= i, for the Markov chain with transition matrix P is exactly the same as the absorption time for the Markov chain with transition matrix \widetilde{P}. Indeed the hitting time does not depend on the submatrix R. Therefore from Theorem 3.1 we obtain immediately
Theorem 3.2 (Expected hitting time in irreducible Markov chain) Let X_n be an irreducible Markov chain. For i \not= j we have E[\tau(i) |X_0=j] \,=\, \sum_{l\in T} M(j,l) \,, where M=(I-Q)^{-1} and Q is given in Equation 3.2 and is obtained by deleting the i^{th} row and i^{th} column from P.
Example(continued): Random walk with absorbing boundary conditions on \{0,1,\cdots,5\}. We get
Q\,=\, \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2\\ 0 & 0 & 1/2 & 0 \end{pmatrix} \quad \implies \quad M=(I-Q)^{-1} \,=\, \begin{pmatrix} 1.6 & 1.2 & 0.8 & 0.4 \\ 1.2 & 2.4 & 1.6 & 0.8 \\ 0.8 & 1.6 & 2.4 & 1.2\\ 0.4 & 0.8 & 1.2 & 1.6 \end{pmatrix}
and thus the expected times until absorption are 4 for states 1 and 4 and 6 for states 2 and 3.
Example: Random walk with reflecting boundary conditions (see Example 1.5) with N=5; we compute E[ \tau(0) | X_0=i]. The stationary distribution is \pi = \left(\frac{1}{10}, \frac{2}{10}, \frac{2}{10}, \frac{2}{10}, \frac{2}{10}, \frac{1}{10}\right) so E[ \tau(0) | X_0=0]=10. For i\not=0 we delete the first row and column from P and have Q\,=\, \left( \begin{array}{ccccc} 0 & 1/2 & 0 & 0 &0 \\ 1/2 & 0 & 1/2 & 0 &0 \\ 0 & 1/2 & 0 & 1/2 &0 \\ 0 & 0 & 1/2 & 0 &1/2 \\ 0 & 0 & 0 & 1 &0 \end{array} \right) \,, \quad M=(I-Q)^{-1} \,=\, \left( \begin{array}{ccccc} 2 & 2 & 2 & 2 & 1\\ 2 & 4 & 4 & 4 &2 \\ 2 & 4 & 6 & 6& 3\\ 2 & 4 & 6 & 8 & 4 \\ 2 & 4 & 6 & 8 & 5 \end{array} \right) and so the expected times to reach 0, starting from 0, 1, 2, 3, 4, 5, are 10, 9, 16, 21, 24, 25 respectively.
If X_0 = i \in T belongs to some transient class and if there is more than one closed class, say R_1, R_2, \cdots, R_L, then the Markov chain may be absorbed into different closed classes and so we wish to compute the probabilities P\{ X_n \,\,\textrm{ reaches class } R_l \,|\, X_0=i \}
Without loss of generality we can assume that each closed class is an absorbing state r_1, \cdots r_L (we can always collapse a closed class into an absorbing state since it does not matter which state in the closed class we first visit). We denote the transient states by t_1, \cdots, t_M and upon reordering the states the transition matrix has the form P\,=\, \begin{matrix} r_1 \\ \vdots \\ r_l \\ t_1 \\ \vdots \\ t_M \end{matrix} \begin{pmatrix} & & & & & \\ & I & & &0& \\ & & & & & \\ & & & & & \\ & S & & &Q & \\ & & & & & \\ \end{pmatrix} \tag{3.3}
We wish to compute A( t_i, r_l) = P\{ X_n \, \textrm{ reaches } r_l \,|\, X_0= t_i\}\,, and it will be convenient to set A(r_l, r_l)=1 and A(r_k, r_l)=0 if k\not= l.
Theorem 3.3 (Absorption probabilities in closed classes) For a Markov chain with transition matrix Equation 3.3 where the states t_1, \cdots t_M are transient we have P\{ X_n \, \textrm{ reaches } r_l \,|\, X_0= t_i\} = A(t_i, r_l) \quad \textrm{ with } \quad A = (I -Q)^{-1} S
Proof. We condition on the first step of the Markov chain to obtain \begin{aligned} A( t_i, r_l) &= P \left\{ X_n = r_l {\rm ~eventually~} \vert X_0= t_i\right\} \\ &= \sum_{k \in S} P \left\{ X_n = r_l \textrm{ eventually}, X_1=k \vert X_0= t_i\right\} \\ &= \sum_{ k \in S} P \left\{ X_1 = k \vert X_0= t_i\right\} P \left\{ X_n = r_l {\rm ~eventually~} \vert X_1= k\right\} \\ &= \sum_{ k \in S} P(t_i, k) A( k, r_l) \,=\, P(t_i, r_l) + \sum_{ t_k } P(t_i, t_k) A( t_k, r_l) \,. \end{aligned} If A is the M \times L matrix with entries A(t_i, r_l), then this can be written in matrix form as A \,=\, S + Q A or A \,=\, (I-Q)^{-1} S \,=\, MS \,.
Continuing with the Random walk with absorbing boundary conditions, we get A \,=\, MS \,=\, \left( \begin{array}{cccc} 1.6 & 1.2 & 0.8 & 0.4 \\ 1.2 & 2.4 & 1.6 & 0.8 \\ 0.8 & 1.6 & 2.4 & 1.2\\ 0.4 & .8 & 1.2 & 1.6 \end{array} \right) \left( \begin{array}{cc} 1/2 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1/2 \end{array}\right) \,=\, \left( \begin{array}{cc} .8 & .2 \\ .6 & .4 \\ .4 & .6 \\ .2 & .8 \end{array}\right) For example from state 2 the probability to be absorbed in 0 is .6, and so on….
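Both computations for the absorbing random walk on \{0,\cdots,5\} (expected absorption times, Theorem 3.1, and absorption probabilities, Theorem 3.3) fit in a few lines (a sketch):

```python
# Sketch for the absorbing random walk on {0,...,5}: M = (I-Q)^{-1}, expected
# absorption times and absorption probabilities A = M S.
import numpy as np

Q = np.array([[0,   0.5, 0,   0  ],   # transitions among the transient states 1,2,3,4
              [0.5, 0,   0.5, 0  ],
              [0,   0.5, 0,   0.5],
              [0,   0,   0.5, 0  ]])
S = np.array([[0.5, 0  ],             # transitions from 1,2,3,4 into the absorbing states 0 and 5
              [0,   0  ],
              [0,   0  ],
              [0,   0.5]])
M = np.linalg.inv(np.eye(4) - Q)
print(M.sum(axis=1))    # expected absorption times [4, 6, 6, 4]
print(M @ S)            # absorption probabilities: rows (.8,.2), (.6,.4), (.4,.6), (.2,.8)
```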
The Markov chain “moves up” until it reaches one of the 5 states on the top (which we relabel 0,1,2,3,4); the chain then stays on those 5 states forever and performs a random walk with absorbing boundary conditions.
We want the probability that A wins, starting of course from the initial score 0-0, and we compute this probability by successive conditioning.
Consider the events D_i=\{ X_n \textrm{ reaches the top row in state } i \}
We compute P(D_i) simply by enumerating all the paths leading up to state i starting from the initial state.
P(D_0)=p^4 + 4 qp^4\,, \quad P(D_1)=4p^3q^2\,, \quad P(D_2)= 6p^2q^2\,, \quad P(D_3)= 4p^2q^3\,, \quad P(D_4)=q^4 + 4 pq^4 \quad
Conditioning on the D_i’s: \displaystyle P(A) =\sum_{i=0}^4 P(A|D_i) P(D_i)
Obviously we have P(A|D_0) =1 and P(A|D_4) =0. By conditioning on the first step we find the system of equations \begin{aligned} P(A|D_1) &= p + q P(A|D_2) & & P(A|D_1) = \frac{p (1-pq)}{1-2pq} \\ P(A |D_2) &= pP(A |D_1) + q P(A |D_3) & \implies \quad\quad & P(A|D_2) = \frac{p^2}{1-2pq}\\ P(A |D_3) &= p P(A|D_2) & & P(A|D_3) = \frac{p^3}{1-2pq} \end{aligned}
Putting everything together
P(A) = \frac{p^4 (1+2q)(1+4q^2)}{1-2pq}
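A quick numerical check of this computation (a sketch, with the arbitrary value p=0.55) comparing the conditioning formula with the closed form:

```python
# Sketch: compute P(A) by conditioning on the D_i's and compare with the
# closed formula p^4 (1+2q)(1+4q^2)/(1-2pq), for an arbitrary value of p.
p = 0.55
q = 1 - p
PD = [p**4 + 4*q*p**4, 4*p**3*q**2, 6*p**2*q**2, 4*p**2*q**3, q**4 + 4*p*q**4]
PA_given_D = [1,
              p*(1 - p*q)/(1 - 2*p*q),
              p**2/(1 - 2*p*q),
              p**3/(1 - 2*p*q),
              0]
PA = sum(a*b for a, b in zip(PD, PA_given_D))
print(PA, p**4*(1 + 2*q)*(1 + 4*q**2)/(1 - 2*p*q))   # the two values agree
```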
A gambler is coming to the casino to play the game of dice called craps. At this game the standard bet (pass line bet) has even money odds and the probability to win is \frac{244}{495}=0.49292929..., one of the casino games with best odds.
Suppose that the player starts with a fortune of j and bets 1 in every game. Their ultimate goal is to reach a fortune of N and, if they do, they stop. Of course they could lose all their money before reaching their goal.
The game is described by a random walk on \{0,1, \cdots, N\} with absorbing boundary conditions and we want to compute \alpha_j = P(X_n \textrm{ reaches } N \textrm{ before reaching } 0| X_0=j)
Conditioning on the first step gives the second order homogeneous linear difference equation \alpha_j = p \alpha_{j+1} + (1-p) \alpha_{j-1} \quad \textrm{ with boundary conditions } \alpha_0=0\,, \alpha_N=1
To solve this second order homogeneous linear difference equation we look for solutions of the form \alpha_j=x^j which leads to the equation px^2 - x + (1-p) =0 \quad \implies x_1=1\,, x_2= \frac{1-p}{p} \tag{3.4} This leads to the system of equations (for p\not= \frac12) \alpha_j= C_1 + C_2 \left(\frac{1-p}{p} \right)^j \quad \textrm{ with } \quad \left\{ \begin{array}{r} C_1+C_2=0 \\ C_1 + C_2\left(\frac{1-p}{p} \right)^N=1 \end{array} \right.
Solving gives the Gambler’s ruin probabilities
\alpha_j = \frac{ 1 - \left( \frac{1-p}{p} \right)^j }{1 -\left( \frac{1-p}{p} \right)^N} \quad \quad \textrm{ Gambler's ruin for } q\not= p
If p=\frac{1}{2} the quadratic equation Equation 3.4 has a double root 1 but we note that \alpha_j= j is a second linearly independent solution of \frac{1}{2}\alpha_{j+1} - \alpha_j + \frac{1}{2}\alpha_{j-1}=0. Solving the equation \alpha_j= C_1 + C_2j with the boundary conditions gives \alpha(j) \,=\, \frac{j}{N} \quad \quad \textrm{ Gambler's ruin for } p=\frac{1}{2}
How bad does it get? For example if p=\frac{244}{495} and you start with a fortune of 50 and want to double it, the probability to succeed is \frac{ 1 - \left( \frac{251}{244}\right)^{50}}{1 -\left(\frac{251}{244}\right)^{100}}=.1955, not so great. You could also be riskier and bet an amount of 10, in which case your probability to succeed is a much better \frac{ 1 - \left( \frac{251}{244}\right)^{5}}{1 -\left(\frac{251}{244}\right)^{10}}=.4647. Even better, bet everything and your probability to win is p=.4929. (In a casino boldness pays, or at least loses less.)
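These numbers are reproduced by the following sketch of the gambler's ruin formula:

```python
# Sketch: Gambler's ruin probabilities for the craps bet p = 244/495, reproducing
# the numbers quoted above for the three betting strategies.
def ruin(j, N, p):
    """Probability to reach N before 0 starting from j, for p != 1/2."""
    r = (1 - p) / p
    return (1 - r**j) / (1 - r**N)

p = 244 / 495
print(ruin(50, 100, p))   # bet 1 at a time:  ~0.1955
print(ruin(5, 10, p))     # bet 10 at a time: ~0.4647
print(p)                  # bet everything:    0.4929...
```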
Limiting cases To get a better handle on the gambler’s ruin formula we slightly rephrase the problem:
We start at 0.
We stop whenever we reach W (W is the desired gain) or when we reach -L (L is the acceptable loss).
The states are now j \in \left\{ -L, -L+1, \cdots, W-1, W\right\}
We get P(-L, W)\,\equiv\, P\left( {\rm Reach~}W{\rm~before~reaching~} - L {\rm ~starting~from~}0 \right) \,=\, \frac{1 - \left( \frac{1-p}{p} \right)^L}{1 - \left( \frac{1-p}{p} \right)^{L+W}} \,.
The limit L \to \infty describes a player with infinite wealth:
P(-\infty,W)= \left\{\begin{array}{cl}1 & {\rm~if~} p> \frac12 \\ \left(\frac{p}{1-p}\right)^W & {\rm~if~} p< \frac12 \end{array} \right.
Even with infinite wealth it is exponentially hard to win W!
The limit W \to \infty describes a player with fortune L who does not stop unless they lose. P(-L,\infty) \,=\, \left\{\begin{array}{cl} 1 - \left(\frac{1-p}{p}\right)^L & {\rm~if~} p > \frac{1}{2} \\ 0 & {\rm~if~} p < \frac{1}{2} \end{array} \right. The probability to play forever is, unsurprisingly, 0.
We introduce some basic examples of Markov chains on a countable state space
Example 4.1 (Random walk on the non-negative integers) Let us consider a random walk on the set of nonnegative integers with partially reflecting boundary conditions at 0. The transition probabilities are given by \begin{matrix} 0 \\ 1 \\ 2 \\ \vdots \end{matrix} \begin{pmatrix} q & p & 0 &0 & 0 &\ldots \\ q & 0 & p &0 & 0 &\ldots \\ 0 & q & 0 & p & 0& \ldots \\ \vdots & \vdots & \ddots & \ddots & \ddots& \\ \end{pmatrix} \quad \quad \textrm{ with } q=1-p
Example 4.2 (Discrete-time queueing model) At a service station (think of a cash register), during each time period there is a probability p that an additional customer comes in the queue. The first person in the queue is being served and during each time period there is a probability q that this person exits the queue.
We denote by X_n the number of people in the queue (either in being served or waiting in line). The state space is S = \{ 0,1,2,3, \cdots \} and the transition probabilities are given by P\,=\, \begin{matrix} 0\\1\\2\\ \vdots \end{matrix} \begin{pmatrix} 1-p & p & 0 &0 & \ldots \\ q(1-p) & qp + (1-p)(1-q) & p(1-q) &0 & \ldots \\ 0 & q(1-p) & qp + (1-p)(1-q) & p(1-q) & \\ \vdots& \vdots & \ddots & \ddots & \ddots & \\ \end{pmatrix}
Example 4.3 (Repair shop) A repair shop is able to repair one item on any single day. On day n, Z_n items break down and are brought for repair to the repair shop, and we assume that the Z_n are IID random variables with distribution P\{ Z_n = k\} = a_k for k=0,1,2, \cdots. If X_n denotes the number of items in the shop waiting to be repaired we have X_{n+1} = \max \{ (X_n -1), 0 \} + Z_n The state space is S = \{ 0,1,2,3, \cdots \} and the transition probabilities are P\,=\, \begin{matrix} 0\\1\\2 \\ \vdots \end{matrix} \begin{pmatrix} a_0 & a_1 & a_2 & a_3 & \cdots \\ a_0 & a_1 & a_2 & a_3 & \cdots \\ 0 & a_0 & a_1 & a_2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \end{pmatrix} \quad \quad \textrm{ with } \sum_{k=0}^\infty a_k =1
Example 4.4 (Success run chain) Imagine a player taking a series of bets labelled 0,1,2, \cdots. The probability to win bet j is p_j. If the player wins bet j they move up to bet j+1 but if they lose they move back to bet 0. If X_n denotes the number of successive winning bets then X_n has state space S = \{ 0,1,2,3, \cdots \} and transition probabilities P\,=\, \begin{matrix} 0\\1\\2 \\ \vdots \end{matrix} \begin{pmatrix} q_0 & p_0 & 0 &0 & \ldots \\ q_1 & 0 & p_1 &0 & \ldots \\ q_2 & 0 & 0 & p_2 & \\ \vdots & \vdots & & \ddots & \ddots \\ \end{pmatrix} \quad \quad \textrm{ with } q_i=1-p_i This Markov chain allows for nice analytical computations.
Example 4.5 (Simple d-dimensional random walk) The state space S of the Markov chain is the d-dimensional lattice \mathbb{Z}^d. We denote by {\bf e}_i, i=1, \cdots, d the standard orthonormal basis in \mathbb{R}^d. We view \mathbb{Z}^d as the vertex set of a graph and any point {\bf x=(x_1, \cdots, x_d)} is connected by edges to 2d neighbors {\bf x} \pm {\bf e}_i. For the simple random walk we have p({\bf x}, {\bf x} \pm {\bf e}_i) \,=\, \frac{1}{2d} and all the others p({\bf x},{\bf y})=0.
Example 4.6 (Branching process) The branching process, also known as the Galton-Watson process, models the evolution of populations over time.
In a unit of time every individual in the population dies and leaves behind a random number of descendants.
To describe the Markov chain we will use IID random variables Z_n^{(k)} indexed by n=0,1,2,\cdots and k=1,2,\cdots.
The branching process is given by
X_{n+1} \,=\, \sum_{k=1}^{X_n} Z_n^{(k)}
which simply says that each of the X_n individuals in the population at time n has a random number Z_n^{(k)} of descendants. It is not convenient to write down the transition probabilities explicitly, but we will study this process later using its moment generating function.
We introduce the corresponding definitions of transience and recurrence of a state. Recall that the return time to state i is given by
\tau(i) = \min \{ n \ge 1 \,;\, X_n =i \}\,.
A state i is recurrent if the Markov chain starting in i will eventually return to i with probability 1, i.e. if P \left\{ \tau(i) < \infty |X_0=i \right\} = 1 .
A state i is transient if it is not recurrent, that is starting in i the Markov chain return to i with probability q<1, i.e., if P \left\{ \tau(i)< \infty |X_0=i \right\} = q < 1 .
Recall also the random variable Y(i) which counts the number of visits to state i: Y(i) = \sum_{k=0}^\infty {\bf 1}_{\{X_k=i\}} \quad \textrm{ with expectation } E[Y(i)|X_0=j] = \sum_{n=0}^\infty P^n(j,i)
Theorem 4.1 (Transience/recurrence dichotomy)
A state i is recurrent \displaystyle \iff P\{ Y(i)=\infty | X_0=i \}=1 \iff \sum_{n=0}^\infty P^{n}(i,i) = \infty.
Moreover if i is recurrent and i \leftrightsquigarrow j then j is recurrent and we have \sum_{n=0}^\infty P^n(i,j) = \infty and P \left\{ \tau(j) < \infty |X_0= i \right\} = 1
The state i is transient \displaystyle \iff P\{ Y(i)<\infty | X_0=i \} =1 \iff \sum_{n=0}^\infty P^{n}(i,i) < \infty Moreover if i is transient and i \leftrightsquigarrow j then j is transient and we have \sum_{n=0}^\infty P^n(i,j) < \infty and P \left\{ \tau(j) < \infty |X_0= i \right\} < 1\,.
Proof. If i is recurrent the Markov chain starting from i will return to i with probability 1, and then by the Markov property will return a second time with probability 1 and therefore infinitely many times with probability 1. This means that Y(i)=\infty almost surely and so \sum_{k} P^{k}(i,i) = + \infty.
If i \leftrightsquigarrow j then we can find times l and m such that P^l(i,j)>0 and P^m(j,i)>0 and so \sum_{n=0}^\infty P^{n}(j,j) \ge \sum_{n=0}^\infty P^{n+l+m}(j,j) \ge P^m(j,i) \sum_{n=0}^\infty P^n(i,i) P^l(i,j) = \infty\,,
and j is recurrent. A similar argument shows that \sum_{n=0}^\infty P^n(i,j) = \infty.
It is a consequence of irreducibility that P \left\{\tau(i) < \tau(j) |X_0= j \right\} > 0. (Argue by contradiction: if this probability were 0 then, by the Markov property, the chain would never visit i starting from j.) As a consequence 0\,=\, P \left\{ \tau(j) = \infty |X_0= j \right\} \,\ge\, P \left\{ \tau(i)< \tau(j) |X_0= j \right\} P \left\{ \tau(j) = \infty |X_0= i \right\} and therefore P \left\{ \tau(j) < \infty |X_0= i \right\} =1.
On the other hand, if i is transient, by the Markov property again, the random variable Y(i) is a geometric random variable with success probability 1-q (after each visit to i the chain returns to i with probability q), which implies that E[Y(i)|X_0=i]=\sum_{k} P^{k}(i,i) = \frac{1}{1-q} <\infty. This implies all the other equivalences stated. \quad \blacksquare
We analyze the recurrence and transience properties of the simple random walk on \mathbb{Z}^d (see Example 4.5).
As we will see this depends on the dimension. To prove recurrence or transience here we compute/estimate directly \sum_{n=0}^\infty P^n(0,0) = \sum_{n=0}^\infty P^{2n}(0,0) since X_n is periodic with period 2.
d=2: In dimension 2, to return to 0 in 2n steps the Markov chain must take exactly k steps to the left and k steps to the right, and n-k steps up and n-k steps down. Therefore P^{2n}(0,0) = \sum_{k=0}^n \frac{ (2n)!}{ k! k! (n-k)! (n-k)!} \frac{1}{4^{2n}} = \frac{1}{4^{2n}} {2n \choose n} \sum_{k=0}^n {n \choose k} {n \choose n-k} Now we claim that \sum_{k=0}^n {n \choose k} {n \choose n-k}= {2n \choose n} as can be seen by counting the number of ways that a team of n can be formed out of n boys and n girls. Therefore P^{2n}(0,0) = {2n \choose n}^2 \frac{1}{4^{2n}} \sim \frac{1}{\pi n} and thus the simple random walk is recurrent for d=2 since \sum\frac{1}{n} diverges.
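The exact formula P^{2n}(0,0) = {2n \choose n}^2 4^{-2n} and its comparison with 1/(\pi n) are easy to check numerically (a sketch):

```python
# Sketch for d = 2: the exact return probabilities P^{2n}(0,0) = C(2n,n)^2/4^{2n}
# behave like 1/(pi n), so their sum diverges (slowly) and the walk is recurrent.
from math import comb, pi

for n in [10, 100, 1000]:
    p2n = comb(2*n, n)**2 / 4**(2*n)
    print(n, p2n, 1/(pi*n))                       # the two values get close

partial = sum(comb(2*n, n)**2 / 4**(2*n) for n in range(1, 1001))
print(partial)                                    # partial sums keep growing, without bound
```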
d=3: Similarly as in dimension 2 we find P^{2n}(0,0) = \sum_{k,j : k+j \le n} \frac{ (2n)!}{ j! j! k! k! (n-k-j)! (n-k-j)!} \frac{1}{6^{2n}} = \frac{1}{2^{2n}} {2n \choose n} \sum_{k,j : k+j \le n} \left( \frac{1}{3^n} \frac{n!}{j! k! (n-j-k)!} \right)^2
To analyze this quantity we note that, by the multinomial theorem,
\sum_{k,j : k+j \le n} \frac{1}{3^n} \frac{n!}{j! k! (n-j-k)!} =1
Moreover we have \sum_{i} q_i =1 \implies \sum_{i}q_i^2 \le \max_{i} q_i and thus we only need to find the maximum of \displaystyle \frac{n!}{j! k! (n-j-k)!}. If the maximum is attained at j_0,k_0 then we must have, for example, \frac{n!}{(j_0-1)! k_0! (n-j_0-k_0+1)!} \le \frac{n!}{(j_0)! k_0! (n-j_0-k_0)!} \implies 2j_0 \le n -k_0 +1 \,. Repeating the same computation with j_0\to j_0+1, k_0\to k_0-1, k_0 \to k_0+1 gives the set of inequalities n- j_0 -1 \le 2 k_0 \le n- j_0 +1 \, \quad \textrm{ and } \quad n- k_0 -1 \le 2 j_0 \le n- k_0 +1 which implies that \frac n3 -1 \le j_0, k_0 \le \frac n3 +1 i.e. j_0 and k_0 are of order n/3. Using Stirling’s formula P^{2n}(0,0) \le \frac{1}{2^{2n}} {2n \choose n} \frac{1}{3^n} \frac{n!}{ (n/3)! (n/3)! (n/3)! } \sim \frac{3 \sqrt{3}}{2} \frac{1}{ (\pi n)^{3/2}} which shows that the random walk is transient in dimension 3.
Continuing with Example 4.4 we consider the return time to state 0, \tau(0), whose distribution we can compute explicitly since to return to 0 the only possible paths are 0 \rightarrow 0, 0 \rightarrow 1 \rightarrow 0, 0 \rightarrow 1 \rightarrow 2 \rightarrow 0, and so on. We set u_n \equiv p_0p_1 \cdots p_{n-1} (with u_0=1) and find P( \tau(0) = k|X_0=0) = p_0 p_1 \cdots p_{k-2}q_{k-1}= p_0 p_1 \cdots p_{k-2}(1-p_{k-1}) = u_{k-1} - u_k\,. Therefore P( \tau(0) \le n |X_0=0) = \sum_{k=1}^n P( \tau(0) = k|X_0=0) = (u_0-u_1) + (u_1-u_2) + \cdots + (u_{n-1} - u_n) = 1 - u_n and P( \tau(0) < \infty|X_0=0) = 1 - \lim_{n\to \infty} u_n and we obtain that \textrm{The success run chain is recurrent if and only if } \lim_{n \to \infty} u_n = \lim_{n \to \infty} p_0 \cdots p_{n-1} =0 To have a better handle on this criterion we need a little result from analysis about infinite products.
Lemma 4.1 With q_k=1-p_k and u_n= \prod_{k=0}^{n-1} p_k
\lim_{n \to \infty} u_n= \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k = 0 \iff \sum_{k=0}^\infty {q_k} = \infty \lim_{n \to \infty} u_n= \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k > 0 \iff \sum_{k=0}^\infty {q_k} < \infty
Proof. We have
\displaystyle \lim_{n\to\infty}\prod_{k=0}^{n-1} p_k > 0 \iff \infty > - \log \left(\lim_{n\to\infty}\prod_{k=0}^{n-1} p_k \right) = \lim_{n\to\infty}\left( - \sum_{k=0}^{n-1} \log (1-q_k) \right), that is,
\lim_{n\to \infty} \prod_{k=0}^{n-1} p_k > 0 \iff \sum_{k=0}^\infty \log (1-q_k) \textrm{ converges.}
For this to happen we must have \lim_{k \to \infty} q_k =0. But since \lim_{x \to 0} \log(1-x)/x =1 (by L'Hôpital's rule), the series \sum_{k=0}^\infty \log (1-q_k) converges if and only if \sum_{k=0}^\infty q_k converges. \quad \blacksquare
For the chain to be transient q_k must go to 0 fast enough. If say q_i= \frac{1}{i} then the chain is recurrent while if q_i= \frac{1}{i^2} then it is transient.
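A numerical illustration of this criterion (a sketch; taking q_k = 1/(k+2) versus q_k = 1/(k+2)^2, where the shift by 2 just avoids the degenerate value q_k = 1):

```python
# Sketch: numerical check of the recurrence criterion for the success run chain.
# q_k = 1/(k+2): divergent sum, u_n -> 0, recurrent.
# q_k = 1/(k+2)^2: convergent sum, u_n -> positive limit, transient.
import numpy as np

def u(n, qfun):
    """u_n = p_0 p_1 ... p_{n-1} with p_k = 1 - q_k."""
    return np.prod([1 - qfun(k) for k in range(n)])

for n in [10, 100, 1000, 10000]:
    print(n, u(n, lambda k: 1/(k + 2)), u(n, lambda k: 1/(k + 2)**2))
# the first column tends to 0, the second stabilizes at a positive value
```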
We add one more method to establish transience. For this pick a reference state j and consider the hitting time to state j, \sigma(j) (not the return time, but we will play with both): \alpha(i)\,=\, P \left\{ \sigma(j)< \infty | X_0= i \right\}. By definition we have \alpha(j)=1 since \sigma(j)=0 if X_{0} = j. If the chain is transient we must have \alpha(i) < 1 \quad \textrm { for } i \not= j \,, since by Theorem 4.1 for i \not=j we have \alpha(i)= P \left\{ \tau(j) < \infty |X_0= i \right\} < 1.
Let us derive an equation for \alpha(i) by conditioning on the first step. For i \not=j \begin{aligned} \alpha(i) &= P \left( \sigma(j) < \infty | X_0= i \right) = P \left( \tau(j) < \infty | X_0= i \right) \\ &= \sum_{k \in S} P \left( \tau(j) < \infty | X_1=k \right) P(i,k) = \sum_{k \in S} P \left( \sigma(j)< \infty | X_0=k \right) P(i,k) \\ &= \sum_{k \in S} P(i,k) \alpha(k) \end{aligned} and thus \alpha(i) satisfies the equation P\alpha(i) = \alpha(i) \,, \quad i \not= j \,.
Theorem 4.2 (A criterion for transience) An irreducible Markov chain X_n is transient if and only if for some state j_0 there exists a solution for the equation P\alpha(i) = \alpha(i) \textrm { for } i \not= j_0 \tag{4.1} such that \alpha(j_0)=1 \quad \textrm { and } \quad 0 < \alpha(i) < 1 \quad \textrm { for } i \not= j_0 \tag{4.2}
Proof. We have already established the necessity. In order to show the sufficiency assume that we have found a solution of Equation 4.1 and Equation 4.2. Then for i \not= j_0 we have, using repeatedly the equation P\alpha(i)=\alpha(i), \begin{aligned} 1 &> \alpha(i) = P\alpha(i) = P(i, j_0) \alpha(j_0) + \sum_{j \not= j_0} P(i, j) \alpha(j) = P(i, j_0) + \sum_{j \not= j_0} P(i, j) P\alpha(j) \\ &= P(i, j_0) + \sum_{j \not= j_0} P(i, j) P(j,j_0) + \sum_{j \not= j_0, k \not= j_0} P(i, j) P(j,k) \alpha(k) \\ &\ge P(i, j_0) + \sum_{j \not= j_0} P(i, j) P(j,j_0) + \sum_{j,k \not= j_0} P(i, j) P(j,k) P(k,j_0) + \cdots \\ &= P(\tau(j_0)=1|X_0=i) + P(\tau(j_0)=2|X_0=i) + P(\tau(j_0)=3|X_0=i) + \cdots \\ &= P(\tau(j_0)< \infty|X_0=i)\,, \end{aligned} so that P(\tau(j_0)< \infty|X_0=i) < 1 for i \not= j_0, which establishes transience. \quad \blacksquare
A finite state irreducible Markov chain is always recurrent, P(\tau(i) < \infty |X_0=j)=1, and we have proved Kac’s formula \pi(i) = E[ \tau(i)|X_0=i]^{-1} for the stationary distribution; in particular the random variable \tau(i) has finite expectation.
For a countable state space it is possible for a Markov chain to be recurrent but that \tau(i) does not have finite expectation. This motivates the following definitions.
A state i is positive recurrent if E[ \tau(i)|X_0=i] < \infty
A state i is null recurrent if it is recurrent but not positive recurrent
We first investigate the relation between recurrence and the existence of invariant measures, and show that if one state j is positive recurrent then there exists a stationary distribution. The basic idea is to decompose any path of the Markov chain into successive visits to the state j. To build up our intuition: if a stationary distribution were to exist it should measure the amount of time spent in state i, and to measure this we introduce \mu(i) = E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] \quad = \textrm{expected number of visits to } i \textrm{ between two successive visits to } j.
Note that \mu(j)=1 since X_0=j, and if j is positive recurrent \sum_{i} \mu(i) = \sum_{i\in S} E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] \,=\, E\left[ \tau(j) \,|\, X_0=j \right]
and thus \pi(i)= \frac{\mu(i)}{E[ \tau(j)|X_0=j]} is a probability distribution.
Theorem 5.1 For a recurrent irreducible Markov chain X_n and a fixed state j, \mu(i) = E[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j ] is stationary in the sense that
\mu P = \mu
If the state j is positive recurrent then \mu can be normalized to a stationary distribution \pi.
Proof. The chain visits j at time 0 and then only again at time \tau(j) and thus we have the two formulas \begin{aligned} \mu(i) &= E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] = \sum_{n=0}^\infty P( X_{n}=i, \tau(j)> n | X_0=j)\\ &= E\left[ \sum_{n=1}^{\tau(j)} {\bf 1}_{\{X_n=i\}}| X_0=j \right] = \sum_{n=1}^{\infty} P( X_n=i, \tau(j) \ge n | X_0=j ) \end{aligned}
We have then, by conditioning on the last step (the state at time n-1), and using \mu(j)=1 \begin{aligned} \mu(i) &= \sum_{n=1}^{\infty} P( X_n=i, \tau(j)\ge n | X_0=j ) = P(j,i) + \sum_{n=2}^\infty P( X_n =i, \tau(j)\ge n | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P( X_n =i, X_{n-1}=k, \tau(j)\ge n | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P(k,i) P( X_{n-1}=k, \tau(j)\ge n | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P(k,i) P( X_{n-1}=k, \tau(j)> n-1 | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{m=1}^\infty P(k,i) P( X_{m}=k, \tau(j)> m | X_0=j) \\ &= \mu(j) P(j,i) + \sum_{k\not=j} E\left[ \sum_{m=1}^{\tau(j)-1} {\bf 1}_{\{X_m=k\}}| X_0=j \right] P(k,i) \\ &= \mu(j) P(j,i) + \sum_{k\not=j} \mu(k) P(k,i) \,=\, \sum_{k} \mu(k) P(k,i) \end{aligned} which proves the invariance of \mu. If the chain is positive recurrent we have already seen that \mu is normalizable. \quad \blacksquare.
Theorem 5.2 Assume the irreducible Markov chain has a stationary distribution \pi. Then \pi(i)>0 for any i and we have Kac’s formula \pi(i) = E \left[ \tau(i) | X_0=i \right]^{-1}. In particular all states are positive recurrent and the stationary distribution is unique.
Proof. Let us assume that \pi is invariant. We first show that the chain must be recurrent. If the chain were transient then we would have P^n(i,j) \to 0 as n \to \infty and so by dominated convergence \pi(i) = \sum_{j} \pi(j) P^n(j,i) \to 0 \textrm{ as } n \to \infty \,. which is impossible.
The fact that \pi(i)>0 for all i\in S is proved as in the finite state space case; see the argument in Theorem 2.2.
To prove positive recurrence we use a clever argument involving the time reversed chain (more on time reversal in Chapter 8 and in the exercises). Consider the Markov chain with transition matrix Q(i,j)=\frac{\pi(j) P(j,i)} {\pi(i)}. It is easy to verify that Q(i,j) is a transition matrix and that \pi is stationary for Q, \pi Q=\pi. We denote by Y_n the Markov chain with transition matrix Q, since \pi(i)>0 this Markov chain has the same communication structure as X_n and is irreducible since X_n is. By the previous argument Y_n must be recurrent.
Next we write \pi(i) E[ \tau(i) |X_0=i] \,=\, \pi(i) \sum_{n=1}^\infty P (\tau(i) \ge n |X_0=i) The event \{\tau(i) \ge n \} conditioned on \{ X_0=i\} corresponds to a sequence of states i_0=i, i_1, \cdots, i_{n-1}, i_n=j where i_1, \cdots, i_{n-1} cannot be equal to i and j can be any state. Using repeatedly the relation \pi(i)P(i,j)= \pi(j)Q(j,i) the probability of such an event can be written as \begin{aligned} \pi(i) P(i,i_1) \cdots P(i_{n-1}, i_n) &= \pi(j) Q(j,i_{n-1}) \cdots Q(i_1, i) \end{aligned} For the Markov chain Y_n this corresponds to a path starting in j and hitting i for the first time after exactly n steps. Therefore we find
\begin{aligned} \pi(i) E[ \tau(i) |X_0=i] = \pi(i) \sum_{n =1}^\infty P (\tau(i) \ge n |X_0=i) &= \sum_{j \in S} \pi(j)\sum_{n=1}^\infty P ( \tau(i) =n | Y_0 = j) \\ &= \sum_{j \in S} \pi(j) P( \tau(i) < \infty|Y_0 = j) =1 \,. \end{aligned} where for the last equality we have used the recurrence of the time reversed chain. This shows Kac’s formula which implies the uniqueness of the stationary distribution and that all states are positive recurrent. \quad \blacksquare
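As a quick illustration (our own sketch, not part of the notes), one can check Kac’s formula numerically for a small hypothetical 3-state chain: estimate E[\tau(i)|X_0=i] by simulating return times and compare its inverse with the stationary distribution obtained by power iteration.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# stationary distribution by power iteration pi <- pi P
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ P

def mean_return_time(i, num_excursions=20000):
    # average the return time tau(i) over many excursions started at i
    total = 0
    for _ in range(num_excursions):
        state, steps = i, 0
        while True:
            state = rng.choice(3, p=P[state])
            steps += 1
            if state == i:
                break
        total += steps
    return total / num_excursions

for i in range(3):
    print(i, pi[i], 1 / mean_return_time(i))   # the two columns should agree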
Theorem 5.3 (Ergodic theorem for countable state space Markov chains) Suppose X_n is an irreducible positive recurrent Markov chain. Then X_n has a unique stationary distribution \pi, given by Kac’s formula \pi(i)=E[\tau(i)|X_0=i]^{-1}, and for any initial distribution \mu we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mu P^k(j) \,=\, \pi(j).
Proof. We have actually already proved all of it. Positive recurrence implies the existence of the stationary distribution (Theorem 5.1) and Kac’s formula is from Theorem 5.2, which implies uniqueness of the stationary distribution. We can now repeat the proof of Theorem 2.5 to show that if X_0=i with i arbitrary we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} = \pi(j) \,. \tag{5.1} The reader should verify that the proof of Theorem 2.5 only uses positive recurrence and not the finiteness of the state space. Taking now the expectation of Equation 5.1 and averaging over the initial condition we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mu P^k(j) \,=\, \pi(j). \quad \blacksquare
We continue our theoretical considerations by proving that if the chain is aperiodic then the distribution of X_n converges to \pi.
Theorem 5.4 Suppose X_n is an irreducible positive recurrent aperiodic Markov chain. Then for any initial distribution \mu we have \lim_{n \to \infty} \mu P^n(j) \,=\, \pi(j) \,.
Proof. We will use a coupling argument: we take two independent copies X_n and Y_n of the Markov chain where X_n is starting in the initial distribution \mu while Y_n is starting in the stationary distribution \pi.
The idea is to consider the coupling time \sigma = \inf\{ n \ge 1 ; X_n = Y_n\} \,. At the (random) time \sigma, X_n and Y_n are in the same state and after that time X_n and Y_n must have the same distribution by the (strong) Markov property. But since Y_n is distributed according to \pi so must X_n be as well and thus, at the coupling time \sigma, X_n has reached stationarity.
Let us now consider the chain Z_n=(X_n,Y_n) with transition probabilities P( Z_{n+1} = (k,l) \,|\, Z_n=(i,j) ) = P(i,k) P(j,l) and stationary distribution \pi( i,j) = \pi(i) \pi(j). Since X_n and Y_n are aperiodic, given states i,j,k,l we can find n_0 such that for every n \ge n_0 we have P^n(i,k) > 0 and P^n(j,l)>0. This implies that Z_n is irreducible and thus, since a stationary measure exists, by Theorem 5.3 the chain Z_n is positive recurrent. Since the coupling time is the first time the Markov chain Z_n hits a state of the form (j,j), recurrence of Z_n implies that P( \sigma < \infty)=1 and thus P(\sigma >n ) \to 0.
To conclude \begin{aligned} \left| \mu P^n(j) - \pi(j)\right| & = \left| P( X_n=j) - P(Y_n =j) \right| \\ & \le \left| P( X_n=j , \sigma \le n ) - P(Y_n =j , \sigma\le n) \right| + \left| P( X_n=j , \sigma > n ) - P(Y_n =j, \sigma > n) \right| \\ & = \left| P( X_n=j , \sigma > n ) - P(Y_n =j, \sigma > n) \right| \\ & = \left| E[ ({\bf 1}_{\{X_n=j\}} - {\bf 1}_{\{Y_n=j\}}) {\bf 1}_{\{\sigma > n\}} ] \right| \\ & \le E[ {\bf 1}_{\{\sigma > n\}}] = P( \sigma > n) \rightarrow 0 \textrm{ as } n \to \infty \end{aligned} and this proves the convergence. \quad \blacksquare
Elementary properties of \phi_X(s):
\phi_X(s) is an increasing function of s with \phi_X(0)=P\{X=0\} and \phi_X(1)=1.
Differentiating gives
\phi_X'(s) = \sum_{k=1}^\infty k s^{k-1} P\{X=k\}\,, \quad \quad \phi_X''(s) =\sum_{k=1}^\infty k(k-1) s^{k-2} P\{X=k\}
and thus \phi_X'(1)=E[X].
If P\{X \ge 2\}>0 then \phi_X''(s)>0 for 0< s \le 1 and \phi_X is strictly convex.
If X_1, X_2, \cdots, X_m are independent random variables then \phi_{X_1 + \cdots + X_m}(s) = \phi_{X_1}(s) \cdots \phi_{X_m}(s)
To avoid trivial cases we assume that p(0)>0 (the population can go extinct) and p(0)+p(1)< 1 (the population can grow). We define the extinction probabilities \begin{aligned} a_n(k) &= P\{ X_n=0 | X_0=k\} \\ a(k) & = \lim_{n \to \infty} P\{ X_n=0 | X_0=k\} = P \left\{ \textrm{population goes extinct} | X_0=k \right\} \end{aligned}
Since for the population to go extinct all branches must go extinct, by independence, we have a(k) = a(1)^k \quad \textrm{ so we set } a \equiv a(1) and we assume from now on that X_0=1. Note this formula is also correct for k=0 since a(0)=1.
Using these computations we are ready to derive the main result
Theorem 6.1 Let Z be a random variable describing the distribution of descendants of a single individual in a branching process and assume p(0)>0 and p(0)+p(1)<1. Then if X_0=1 the extinction probability is the smallest root of the equation \phi_Z(a)=a.
If \mu \le 1 then a=1 and the population eventually dies out.
If \mu>1 then the extinction probability a<1 is the unique root of the equation \phi_Z(s)=s with 0 < s < 1.
Proof. We have already established that the extinction probability a is a root of \phi_Z(s)=s. But we also know that 1 is a root since \phi_Z(1)=1, and the slope of \phi_Z is equal to \mu at s=1. Since p(0)+p(1)<1, \phi_Z(s) is strictly convex and thus \phi_Z(s)=s has at most two roots. We have the following cases (illustrated in Figure 6.1).
If \mu< 1 the equation \phi_Z(s)=s has two roots, s=1 and a root s>1, so the extinction probability (the smallest root in [0,1]) is 1.
If \mu= 1 the line s is tangent to the curve \phi_Z(s) at s=1, so \phi_Z(s)=s has the single root s=1 and the extinction probability is 1.
Finally if \mu> 1 the equation \phi_Z(s)=s has two roots 1 and a<1. Since \phi_Z(0)=p(0)>0 the second root satisfies a>0. To show that the smallest root is the extinction probability note that we have
a_n(1)=P\left\{X_n=0|X_0=1\right\}=\phi_Z^n(0)
By induction we show that a_n(1)\le a. True for n=0 since a_0(1)=0. Assuming that a_{n-1}(1)\le a we have
a_n(1)= P \left\{ X_n=0|X_0=1 \right\}= \phi_Z^n(0) = \phi_Z( \phi_Z^{n-1}(0)) =\phi_Z(a_{n-1}(1)) \le \phi_Z(a) =a
where we used that \phi_Z is increasing. This shows that the smallest root is the extinction probability. \quad \blacksquare
If p(0)=\frac14, p(1)=\frac14, p(2)=\frac12 \implies \phi(s)=\frac14 + \frac14 s + \frac12 s^2 then \mu=\frac{5}{4}>1 and solving \phi(a)=a gives a=1,\frac12 so the extinction probability is 1/2.
If p(0)=\frac14, p(1)=\frac12, p(2)=\frac14 \implies \phi(s)=\frac14 + \frac12 s + \frac14 s^2 then \mu=1 and solving \phi(a)=a gives a=1 so the extinction probability is 1.
If p(0)=\frac12, p(1)=\frac14, p(2)=\frac14 \implies \phi(s)=\frac12 + \frac14 s + \frac14 s^2 then \mu=\frac{3}{4}<1 and solving \phi(a)=a gives a=1,2 so the extinction probability is 1.
Remark: The theorem provides a numerical algorithm to find the extinction probability (of interest mainly when \mu > 1). Indeed we have shown that the sequence 0, \phi_Z(0), \phi_Z^2(0), \phi_Z^3(0), \cdots converges to a.
def f(s):
    # generating function phi_Z(s) = 1/8 + 3/8 s + 1/8 s^2 + 1/8 s^3 + 2/8 s^4
    return (1/8) + (3/8)*s + (1/8)*s**2 + (1/8)*s**3 + (2/8)*s**4

s = 0   # initial condition
N = 20  # the number of iterations
for i in range(N):
    s = f(s)     # iterate s -> phi_Z(s); the sequence increases to the smallest fixed point
    print(s)
0.125
0.17413330078125
0.19498016827133297
0.20415762609873456
0.20826713424631457
0.2101216307743601
0.21096146654108697
0.21134240863922932
0.21151532651001245
0.2115938436492296
0.21162950143053777
0.21164569616414142
0.2116530515720635
0.21165639233633957
0.21165790969301684
0.21165859887003846
0.21165891189175637
0.21165905406517702
0.2116591186398882
0.21165914796951835
Recall the Markov chain in Example 4.3. To establish if the system is positive recurrent we try to solve \pi P=\pi and find the system of equations \begin{aligned} \pi(0)&= \pi(0)a_0 + \pi(1)a_0 \\ \pi(1)&= \pi(0)a_1 + \pi(1) a_1 + \pi(2)a_0 \\ \pi(2)& = \pi(0)a_2 + \pi(1)a_2 +\pi(2) a_1 + \pi(3)a_0 \\ & \vdots \\ \pi(n)& = \pi(0)a_n + \sum_{j=1}^{n+1} \pi(j) a_{n+1-j} \end{aligned}
We solve this using the generating functions \psi(s)=\sum_{n=0}^\infty s^n \pi(n) and \phi(s)=\sum_{k=0}^\infty s^k a_k: \begin{aligned} \psi(s) = \sum_{n=0}^\infty s^n\pi(n) &= \pi(0) \sum_{n=0}^\infty s^n a_n + \sum_{n=0}^\infty s^n \sum_{j=1}^{n+1} \pi(j) a_{n+1-j}\\ & = \pi(0) \sum_{n=0}^\infty s^n a_n + s^{-1} \sum_{j=1}^\infty \pi(j) s^j \sum_{n=j-1}^\infty s^{n+1-j} a_{n+1-j} \\ &=\pi(0) \phi(s) + s^{-1}(\psi(s)-\pi(0)) \phi(s) \end{aligned}
Solving for \psi(s) we find, after some algebra, the equation \psi(s) = \frac{\pi(0) \phi(s)}{1- \frac{1- \phi(s)}{1-s}}
To see if we can find \pi(0) such that this equation can be solved we take s \to 1. We have \psi(1)=\phi(1)=1 and \displaystyle \lim_{s\to 1}\frac{1- \phi(s)}{1-s}=\phi'(1)=\sum_{k=0}^\infty k a_k=\mu which is the mean number of objects arriving at the repair shop in a single day. We find the equation 1= \frac{\pi(0)}{1- \mu} and so we can find a solution 0 < \pi(0) \le 1 if and only if \mu< 1, in which case \pi(0)=1-\mu. \textrm{ The repair shop Markov chain is positive recurrent iff } \mu < 1
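The following small numerical check (our own sketch; the arrival distribution a_k is made up) truncates the repair shop chain to \{0,\dots,M\}, computes its stationary distribution by power iteration and compares \pi(0) with 1-\mu.
import numpy as np

a = np.array([0.5, 0.3, 0.2])            # a_k = P(k arrivals), so mu = 0.7 < 1
mu = sum(k * ak for k, ak in enumerate(a))

M = 200                                   # truncation level
P = np.zeros((M + 1, M + 1))
for k, ak in enumerate(a):
    P[0, k] += ak                         # from state 0: jump to k
    for i in range(1, M + 1):
        P[i, min(i - 1 + k, M)] += ak     # from state i >= 1: jump to i-1+k (capped at M)

pi = np.ones(M + 1) / (M + 1)
for _ in range(20000):                    # power iteration pi <- pi P
    pi = pi @ P
print(pi[0], 1 - mu)                      # these should be close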
To study transience we use Theorem 4.2 and try to find a solution of P \alpha(j) =\alpha(j) for j\ge 1 with \alpha(0)=1 and 0 < \alpha(j) <1 for j \ge 1. We find the system of equations \begin{aligned} \alpha(1) &= a_0 \alpha(0) + a_1 \alpha(1) + a_2 \alpha(2) + \cdots \\ \alpha(2) &= a_0 \alpha(1) + a_1 \alpha(2) + a_2 \alpha(3) + \cdots \\ \alpha(3) &= a_0 \alpha(2) + a_1 \alpha(3) + a_2 \alpha(4) + \cdots \\ & \vdots \\ \alpha(n) & = \sum_{j=0}^\infty a_j \alpha(j+n-1) \end{aligned} We try for a solution of the form \alpha(j)=s^j which gives s^n = \sum_{j=0}^\infty a_j s^{j+n-1} = s^{n-1} \phi(s) \implies \phi(s)=s and from our analysis of branching processes we know that a solution with s<1 exists iff \mu= \sum_{k=0}^\infty k a_k >1. So we get \textrm{ The repair shop Markov chain is transient iff } \mu > 1 and it is null-recurrent if \mu=1.
Consider a Markov chain with transition probabilities P(i,j) and stationary distribution \pi(i). We can rewrite the equation for stationarity,
\pi(i)=\sum_{j} \pi(j) P(j,i)
as
\sum_{j} \pi(i) P(i,j) \,=\, \sum_{j} \pi(j) P(j,i) \,.
\tag{7.1} which we are going to interpret as a balance equation.
We introduce the (stationary) probability current from i to j as J(i,j) \equiv \pi(i) P(i,j) \tag{7.2} and Equation 7.1 can be rewritten as \underbrace{\sum_{ j} J(i,j)}_{\textrm{total current out of state } i} \,=\, \underbrace{\sum_j J(j,i) }_{\textrm{total current into state } i} \, \quad \quad \textrm{ balance equation } \tag{7.3} i.e., to be stationary the total probability current out of i must be equal to the total probability current into i.
A stronger condition for stationarity can be expressed in terms of the balance between the currents J(i,j). A Markov chain X_n satisfies detailed balance if there exists \pi(i)\ge 0 with \sum_{i} \pi(i)= 1 such that for all i,j we have
\pi(i) P(i,j) \,=\, \pi(j) P(j,i) \quad \textrm{ or equivalently } \quad J(i,j) = J(j,i)
\tag{7.4} This means that for every pair i,j the probability currents J(i,j) and J(j,i) balance each other.
Clearly Equation 7.4 is a stronger condition than Equation 7.3 and thus we have
Lemma 7.1 If the Markov chain satisfies detailed balance for a probability distribution \pi then \pi is a stationary distribution.
The converse is not true: detailed balance is a strictly stronger condition than stationarity. The property of detailed balance is often also called (time-)reversibility since we have the following result, which states that the probability of any sequence of states is the same as the probability of the time reversed sequence.
Theorem 7.1 (Time reversibility) Suppose the Markov chain X_n satisfies detailed balance and assume that the initial distribution is the stationary distribution \pi. Then for any sequence of states i_0, \cdots, i_n we have P \left\{ X_{0}= i_0 \,, X_1=i_1\,, \cdots \, , X_n =i_n \right\} \,=\, P \left\{ X_{0}= i_n \,, X_1 = i_{n-1}\,, \cdots \, , X_n =i_0 \right\}
Proof. Using Equation 7.4 repeatedly we find \begin{aligned} P \left\{ X_{0}= i_0 \,, X_1=i_1\,, \cdots \, , X_n =i_n \right\} & \,=\, \pi(i_0) P(i_0, i_1) P(i_1, i_2) \cdots P(i_{n-1}, i_n) \\ & \,=\, P(i_1, i_0) \pi(i_1) P(i_1, i_2) \cdots P(i_{n-1}, i_n) \\ & \,=\, P(i_1, i_0) P(i_2, i_1) \pi(i_2) \cdots P(i_{n-1}, i_n) \\ & \,=\, \cdots \\ & \,=\, P(i_1, i_0) P(i_2, i_1) \cdots \pi(i_{n-1}) P(i_{n-1}, i_n) \\ &\ \,=\, P(i_1, i_0) P(i_2, i_1) \cdots P(i_n, i_{n-1}) \pi(i_n) \\ &\,=\,P \left\{ X_{0}= i_n \,, X_1 = i_{n-1}\,, \cdots \, , X_n =i_0 \right\} \end{aligned} \blacksquare
The next result is very easy and very useful.
Theorem 7.2 Suppose X_n is a Markov chain with finite state space S and with a symmetric transition matrix, i.e., P(i,j)=P(j,i). Then X_n satisfies detailed balance with \pi(j) = 1 /|S|, i.e., the stationary distribution is uniform on S.
Random walk on the hypercube \{0,1\}^m. The state space S is S= \{0,1\}^m \,=\, \left\{ \sigma = ( \sigma_1, \cdots, \sigma_m) \,;\, \sigma_i \in \{0,1\} \right\} To define the move of the random walk, just pick one coordinate j \in \{ 1, \cdots, m\} and flip the j^{th} coordinate, i.e., \sigma_j \to 1- \sigma_j. We have thus P(\sigma, \sigma') =\left\{ \begin{array}{cl} \frac{1}{m} & {\rm if~}\sigma {\rm~and~}\sigma' {\rm~differ~by~one~coordinate} \\ 0 &{\rm otherwise} \end{array} \right. Clearly P is symmetric and thus \pi(\sigma) \,=\, 1/ 2^m.
Random walk on the graph G=(E,V) has transition probabilities P(v, w) = \frac{1}{ {\rm deg}(v)} for w adjacent to v. This Markov chain satisfies detailed balance with the (unnormalized) measure \mu(v) = {\rm deg}(v). Indeed we have P(v,w) >0 \iff P(w,v)>0 and thus if P(v,w)>0 we have
\mu(v) P(v, w) \,=\, {\rm deg}(v) \frac{1}{{\rm deg} (v)} = 1 \,=\, {\rm deg}(w) \frac{1}{{\rm deg} (w)}= \mu(w) P(w, v) \,.
This is slightly easier to verify than the stationarity equation \pi P = \pi. After normalization we find \pi(v) = {\rm deg}(v)/2|E|.
For example for the simple random walk on \{0, 1, \cdots, N\} with reflecting boundary conditions we find \pi \,=\, \left( \frac{1}{2N}, \frac{2}{2N}, \cdots, \frac{2}{2N}, \frac{1}{2N} \right).
Birth and death Markov chains always satisfy detailed balance. Indeed the only non trivial detailed balance conditions are \pi(j)P(j,j+1) = \pi(j+1) P(j+1,j) \implies \pi(j) p_j \,=\, \pi(j+1) q_{j+1} \,, \quad \textrm{ for } j=0, \cdots, N-1 \,. and this can be solved recursively. We obtain \begin{aligned} \pi(1) \,&=\, \pi(0) \frac{p_0}{q_1} \\ \pi(2) \,&=\, \pi(1) \frac{p_1}{q_2} \,=\, \pi(0) \frac{p_0 p_1 }{q_1 q_2} \\ & \vdots \\ \pi(N) \,&=\, \pi(0) \frac{p_0 p_1 \cdots p_{N-1}}{q_1 q_2 \cdots q_{N}} \end{aligned} and with normalization \pi(j) \,=\, \frac{ \prod_{k=1}^j \frac{p_{k-1}}{q_k}}{ \sum_{l=0}^N \prod_{k=1}^l \frac{p_{k-1}}{q_k}}. If N is infinite we need the sum in the denominator to be finite.
For example the Ehrenfest urn of the earlier example has p_j \,=\, \frac{N-j}{N} \,, \quad q_j \,=\, \frac{j}{N} and thus we obtain \pi(j) \,=\, \pi(0) \frac{ \frac{N}{N} \frac{N-1}{N} \cdots \frac{N-(j-1)}{N}}{ \frac{1}{N} \frac{2}{N} \cdots \frac{j}{N}} \,=\, \pi(0) { N \choose j } \quad \textrm{ with } \quad \pi(0)^{-1} =\sum_{j=0}^N {N \choose j} \,=\, 2^N \,, \textrm{ i.e. } \pi(j) = {N \choose j} 2^{-N}.
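As a sanity check, here is a short sketch (using the Ehrenfest rates above) that computes the stationary distribution from the birth and death product formula and compares it with the binomial distribution \binom{N}{j}2^{-N}.
import numpy as np
from math import comb

N = 10
p = [(N - j) / N for j in range(N)]        # p_j for j = 0,...,N-1
q = [j / N for j in range(1, N + 1)]       # q_j for j = 1,...,N

weights = np.ones(N + 1)
for j in range(1, N + 1):
    weights[j] = weights[j - 1] * p[j - 1] / q[j - 1]   # prod_{k=1}^j p_{k-1}/q_k
pi = weights / weights.sum()

binomial = np.array([comb(N, j) / 2**N for j in range(N + 1)])
print(np.allclose(pi, binomial))           # True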
Suppose you are given a certain probability distribution \pi on a set S and your goal is to generate a sample from this distribution. The Markov chain Monte Carlo (MCMC) method consists in constructing an irreducible Markov chain X_n whose stationary distribution is \pi. To sample from \pi one then simply runs the Markov chain X_n long enough that it is close to its stationary distribution. It turns out that the detailed balance condition is a very useful tool to construct such a Markov chain.
A priori this method might seem an unduly complicated way to sample from \pi: why not simply sample from \pi directly using one of the methods of Section [Simulation]?
To dispel this impression we will consider some concrete examples; for instance we will see that one often wants to generate the uniform distribution on a set S whose cardinality |S| is very difficult to compute.
A large class of models are so-called “energy models”, which are described by an explicit function f:S \to \mathbb{R} interpreted as the energy (or weight) of a state. One is then interested in the stationary distribution \pi(i) = Z^{-1} e^{-f(i)} \quad \textrm { where } Z= \sum_{i}e^{-f(i)} \textrm{ is a normalization constant} As we will see, computing the normalization constant Z can be very difficult and so one cannot directly simulate from \pi!
The measure \pi assigns the largest probability to the states i where f(i) is smallest, that is to the states with the smallest energy.
Often (e.g. in economics) the measure is written with a different sign convention, \pi(i) = Z^{-1}e^{f(i)}, which now favors the states with the highest values of f.
For a graph G=(E,V) a proper q-coloring consists of assigning to each vertex v of the graph one of q colors subject to the constraint that if two vertices are linked by an edge they have different colors. The set S' of all proper q-colorings is a subset of S \,=\, \left\{ 1, \cdots, q\right\}^V. We denote the elements of S by \sigma = \{ \sigma(v)\}_{v \in V} with \sigma(v) \in \{1, \cdots, q\}. The uniform distribution on all such proper colorings is \pi(\sigma) = 1/|S'| for all \sigma \in S'. Even for a moderately complicated graph it can be very difficult to compute |S'|.
Consider the following transition rules:
1. Choose a vertex v at random and choose a color a at random. 2. Set \sigma'(v) = a and \sigma'(w)=\sigma(w) for w \not=v. 3. If \sigma' is a proper q-coloring then set X_{n+1}=\sigma'. Otherwise set X_{n+1}=\sigma.
The transition probabilities are then given by \begin{aligned} P(\sigma, \sigma') =\left\{ \begin{array}{cl} \frac{1}{q |V|} & \textrm{ if } \sigma \textrm{ and } \sigma' \textrm{ differ at exactly one vertex} \\ 0 & \textrm{ if }\sigma \textrm{ and } \sigma' \textrm{ differ at more than one vertex} \\ 1 - \sum_{\sigma' \not = \sigma} P(\sigma, \sigma') & \textrm{ if } \sigma'=\sigma \end{array} \right. \end{aligned} Note that P(\sigma, \sigma) is not known explicitly either but is also not used to run the algorithm. The cardinality |S'| is also not needed.
In order to check that the uniform distribution is stationary for this Markov chain, note that P(\sigma, \sigma') is a symmetric matrix: if one can change \sigma into \sigma' by changing one color then one can do the reverse transformation too.
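Here is a minimal sketch of one step of this coloring chain, for a hypothetical graph given as an adjacency list (a dict mapping each vertex to its neighbours); the names and the example graph are ours.
import random

def coloring_step(sigma, graph, q):
    # sigma: dict vertex -> color in {0,...,q-1}, assumed to be a proper coloring
    v = random.choice(list(graph))          # pick a vertex at random
    a = random.randrange(q)                 # pick a color at random
    if all(sigma[w] != a for w in graph[v]):
        sigma = dict(sigma)
        sigma[v] = a                        # accept: the new coloring is still proper
    return sigma                            # otherwise stay put

# usage: a 4-cycle with q = 3 colors
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
sigma = {0: 0, 1: 1, 2: 0, 3: 1}
for _ in range(1000):
    sigma = coloring_step(sigma, graph, q=3)
print(sigma)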
This is a classical optimization problem. You own m books and the i^{th} book has weight w_i lb and is worth $ v_i. In your knapsack you can put at most a total of b pounds and you are looking to pack the most valuable knapsack possible.
To formulate the problem mathematically we introduce \begin{aligned} w \,&=\, (w_1, \cdots w_m) \in {\bf R}^m &\textrm{ weight vector} \\ v \,&=\, (v_1, \cdots v_m) \in {\bf R}^m &\textrm{ value vector} \\ \sigma\,&=\, (\sigma_1, \cdots \sigma_m) \in \{0,1\}^m &\textrm{ decision vector} \end{aligned} where we think that \sigma_i =1 if the i^{th} book is in the knapsack. The state space is S' \,=\, \left\{ \sigma \in \{0,1\}^{m} \,;\, \sigma \cdot w \le b \right\} and the optimization problem is \textrm{Maximize } v \cdot \sigma \textrm{ subject to the constraint } \sigma \in S'
As a first step we generate a random element of S' using the following simple algorithm. If X_n=\sigma, pick a book i at random and flip \sigma_i (i.e. add or remove the i^{th} book); if the resulting \sigma' satisfies the weight constraint \sigma' \cdot w \le b set X_{n+1}=\sigma', otherwise set X_{n+1}=\sigma.
In other words, choose a random book. If it is in the sack already, remove it. If it is not in the sack, add it provided you do not exceed the maximum weight. Note that the Markov chain X_n is irreducible, since each state communicates with the state \sigma=(0, \cdots, 0). It is aperiodic except in the uninteresting case where \sum_{i} w_i \le b. Finally the transition probabilities are symmetric and thus the uniform distribution on S' is the unique stationary distribution.
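A minimal sketch of this random walk on S' follows; the weights and the capacity are made-up illustration data.
import random

w = [3.0, 2.0, 5.0, 4.0, 1.0]     # weights of the books
b = 8.0                           # knapsack capacity

def knapsack_step(sigma):
    # sigma: list of 0/1 decisions; flip a random book if the weight allows it
    i = random.randrange(len(sigma))
    proposal = list(sigma)
    proposal[i] = 1 - proposal[i]
    if sum(wi * si for wi, si in zip(w, proposal)) <= b:
        return proposal
    return sigma                  # proposal too heavy: stay put

sigma = [0] * len(w)
for _ in range(10000):
    sigma = knapsack_step(sigma)
print(sigma)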
In the knapsack problem we want to maximize the function f(\sigma)=\sigma\cdot v on the state space. One possible approach would be to generate a uniform distribution on the state space and then look for the maximum value of the function. But it is a better idea to sample from a distribution which assigns higher probability to the states we are interested in, the ones with a high value of f.
Let S be the state space and let f \,:\, S \to \mathbb{R} be a function. It is convenient to introduce the probability distributions defined for \beta>0 by \pi_\beta(i) \,=\, \frac{e^{\beta f(i)}}{Z_\beta} \, \quad \textrm{ with } \quad Z_\beta =\sum_{j \in S} e^{\beta f(j)} \,. Clearly \pi_\beta assigns higher weight to the states i with larger values of f(i). Let us define S^* \,=\, \left\{ i \in S \,;\, f(i) = f^* \equiv \max_{j \in S} f(j) \right\} \,. If \beta=0 then \pi_0 is simply the uniform distribution on S. For \beta \to \infty we have \lim_{\beta \to \infty} \pi_\beta(i) \,=\, \lim_{\beta \to \infty} \frac{ e^{\beta (f(i)- f^*)}} { |S^*| + \sum_{j \in S\setminus S^*} e^{\beta (f(j) - f^* )}} \,=\, \left\{ \begin{array}{cl} \frac{1}{|S^*|} & {\rm ~if~} i \in S^* \\ 0 & {\rm ~if~} i \notin S^* \\ \end{array} \right. \,, i.e., for large \beta, \pi_\beta is concentrated on the global maxima of f.
A fairly general method to generate a distribution \pi on the state space S is given by the Metropolis algorithm. This algorithm assumes that you already know how to generate the uniform distribution on S using a symmetric transition matrix Q.
Theorem 8.1 (Metropolis algorithm) Let \pi be a probability distribution on S with \pi(i)>0 for all i and let Q be a symmetric transition matrix. Consider the Markov chain with the following transition rule (the Metropolis algorithm). If X_n=i, propose a state j with probability Q(i,j) and accept the move with probability \alpha = \min \left\{ 1, \frac{\pi(j)}{\pi(i)} \right\}; if the move is accepted set X_{n+1}=j, otherwise set X_{n+1}=i.
If Q is an irreducible transition probability matrix on S then the Metropolis algorithm defines an irreducible Markov chain on S which satisfies detailed balance with stationary distribution \pi.
Proof. Let P(i,j) be the transition probabilities for the Metropolis Markov chain. Then we have P(i,j) \,=\, Q(i,j) \alpha \,=\, Q(i,j) \min \left\{ 1 \,,\, \frac{\pi(j)}{\pi(i)} \right\} \,.
Since we assume \pi(i)>0 for all states i, the acceptance probability \alpha never vanishes. Thus P(i,j)>0 whenever Q(i,j)>0, and so P is irreducible if Q is irreducible.
In order to check the reversibility we note that \pi(i) P(i,j) \,=\, Q(i,j) \pi(i) \min \left\{ 1, \frac{\pi(j)}{\pi(i)} \right\} \,=\, Q(i,j) \min \left\{\pi(i)\,,\, \pi(j) \right\} and the r.h.s is symmetric in i,j and thus \pi(i) P(i,j) =\pi(j) P(j,i). \quad \blacksquare
Note that only the ratios \frac{\pi(j)}{\pi(i)} are needed to run the algorithm; in particular we do not need the normalization constant. This is a very important feature of the Metropolis algorithm.
We could have chosen another acceptance probability \alpha(i,j). By inspection of the proof it is enough to pick \alpha(i,j) such that \alpha(i,j) \le 1 and \pi(i) \alpha(i,j) = \pi(j) \alpha(j,i) is symmetric. Some such examples will be considered in the homework and the choice given in Theorem 8.1 is actually optimal in the sense it gives the highest acceptance probability.
The general case with a non-symmetric proposal matrix Q is called the Metropolis-Hastings algorithm. In this case the acceptance probability is chosen to be \alpha(i,j) = \min \left\{ 1, \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)} \right\} and it is not difficult to check (see Homework) that the Metropolis-Hastings algorithm yields a reversible Markov chain with stationary distribution \pi.
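A generic sketch of one Metropolis-Hastings step is given below; the functions propose (which samples j from Q(i,\cdot)), Q (the proposal probabilities) and pi_w (the possibly unnormalized weights of \pi) are assumed to be supplied by the user.
import random

def metropolis_hastings_step(i, propose, Q, pi_w):
    j = propose(i)                # propose j with probability Q(i, j)
    alpha = min(1.0, (pi_w(j) * Q(j, i)) / (pi_w(i) * Q(i, j)))
    return j if random.random() < alpha else i   # accept with probability alpha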
Suppose we are given a graph G=(E,V). Often, in applications to computer networks or social networks, the graph is not fully known: only local information is available, i.e. given a vertex one knows the vertices it is connected to. Nonetheless one can run a random walk on the graph by choosing an edge at random and moving to the corresponding vertex, i.e. Q(v,w) = \frac{1}{\textrm{ deg}(v)} \textrm{ if } v \sim w and the stationary distribution is \pi(v) \propto {\rm deg}(v).
Suppose you wish to generate the uniform distribution on the vertices of the graph. Then we can use the Metropolis-Hastings algorithm to generate a Markov chain with a uniform stationary distribution. We have \alpha(v,w) = \min\left\{ 1, \frac{Q(w,v)}{Q(v,w)}\right\} =\min\left\{ 1, \frac{\textrm{ deg}(v)}{\textrm{ deg}(w)}\right\} \textrm{ if } v \sim w and thus the transition matrix P(v,w) = Q(v,w) \alpha(v,w) = \frac{1}{\textrm{ deg}(v)} \min\left\{ 1, \frac{\textrm{ deg}(v)}{\textrm{ deg}(w)}\right\} = \min\left\{ \frac{1}{\textrm{ deg}(v)}, \frac{1}{\textrm{ deg}(w)} \right\} generates a uniform distribution on the graph.
Consider the distribution \pi_\beta(\sigma) \,=\, \frac{e^{\beta v \cdot \sigma}}{ Z_\beta} where the normalization constant Z_\beta = \sum_{\sigma \in S'} e^{\beta v \cdot \sigma} is almost always impossible to compute in practice. However the ratio \frac{\pi_\beta(\sigma')}{\pi_\beta(\sigma)} = e^{\beta v \cdot (\sigma'-\sigma)} does not involve Z_\beta and is easy to compute.
For this distribution we take as proposal matrix the matrix Q used to generate a uniform distribution on the allowed states of the knapsack (see the knapsack problem above), and the Metropolis algorithm now reads as follows: if X_n=\sigma, pick a book at random and propose the state \sigma' obtained by adding or removing that book (keeping the weight constraint), then accept the move with probability \min\left\{1, e^{\beta v \cdot (\sigma'-\sigma)}\right\}.
In short, if you can add a book to your knapsack you always do so, while you remove a book i only with probability e^{-\beta v_i}, which is exponentially small in the value of the book.
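A sketch of this Metropolis chain for \pi_\beta on the knapsack state space is given below (weights, values and the capacity are made-up); note that only the ratio e^{\beta v_i} is ever evaluated.
import math
import random

w = [3.0, 2.0, 5.0, 4.0, 1.0]     # weights
v = [4.0, 3.0, 6.0, 5.0, 1.0]     # values
b = 8.0                           # capacity
beta = 2.0

def knapsack_metropolis_step(sigma):
    i = random.randrange(len(sigma))
    proposal = list(sigma)
    proposal[i] = 1 - proposal[i]
    if sum(wi * si for wi, si in zip(w, proposal)) > b:
        return sigma                              # not an allowed state: stay put
    # acceptance ratio pi(sigma')/pi(sigma) = exp(beta * v_i * (sigma'_i - sigma_i))
    ratio = math.exp(beta * v[i] * (proposal[i] - sigma[i]))
    return proposal if random.random() < min(1.0, ratio) else sigma

sigma = [0] * len(w)
for _ in range(20000):
    sigma = knapsack_metropolis_step(sigma)
print(sigma, sum(vi * si for vi, si in zip(v, sigma)))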
Another algorithm which is widely used for Markov chain Monte Carlo is the Glauber algorithm, which appears in the literature under a variety of other names: Gibbs sampler in statistical applications, logit rule in economics and the social sciences, heat bath in physics, and undoubtedly various others.
The Glauber algorithm is not quite as general as the Metropolis algorithm. Indeed we assume that the state space S has the following structure S \subset \Omega^V where both \Omega and V are finite sets. For example S \subset \{0,1\}^m in the case of the knapsack problem or S \subset \{ 1, \cdots, q\}^V for the case of the proper q-coloring of a graph. We denote by \sigma \,=\, \{ \sigma(v) \}_{v \in V} \,, \quad \sigma(v) \in \Omega \,. the elements of S.
It is useful to introduce the notation \sigma_{-v} \,=\, \{ \sigma(w) \}_{w \in V, w\not =v} and we write \sigma \,=\, ( \sigma_{-v}, \sigma(v) ) \,. to single out the v entry of the vector \sigma.
Theorem 8.2 (Glauber algorithm) Let \pi be a probability distribution on S\subset \Omega^V and extend \pi to \Omega^V by setting \pi(\sigma)=0 if \sigma \in \Omega^V \setminus S. If X_n \,=\, \sigma, pick a vertex v \in V uniformly at random and update \sigma(v) to the value b \in \Omega with probability \frac{ \pi( \sigma_{-v}, b) }{\sum_{ b' \in \Omega} \pi( \sigma_{-v}, b')}, leaving all other coordinates unchanged. The resulting Markov chain satisfies detailed balance with stationary distribution \pi.
The irreducibility of the algorithm is not guaranteed a-priori and needs to be checked on a case-by-case basis.
Proof. The transition probabilities are given by
P(\sigma, \sigma') = \left\{ \begin{array}{cl} \frac{1}{ |V|} \frac{ \pi( \sigma_{-v}, \sigma'(v)) } {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} & \textrm{ if } \sigma' \not= \sigma \textrm{ and } \sigma_{-v} = \sigma'_{-v} {\rm~for~some~}v \\ 0 & \textrm{ if } \sigma_{-v} \not= \sigma'_{-v} {\rm~for~all~}v \\ 1 - \sum_{\sigma' \not= \sigma} P(\sigma, \sigma') & \textrm{ if } \sigma=\sigma' \end{array} \right.
To check detailed balance we note that if P(\sigma, \sigma')\not=0 and \sigma \not= \sigma', with v the vertex where they differ, \pi(\sigma) P(\sigma, \sigma') \,=\, \frac{1}{|V|}\frac{\pi(\sigma) \pi(\sigma')} {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} \,, and this is symmetric in \sigma and \sigma'. \quad \blacksquare.
Let G=(E,V) be a graph and let S= \{-1,1\}^V. That is, to each vertex we assign the value \pm 1; you can think of a magnet at each vertex pointing either upward (+1) or downward (-1). To each \sigma \in S we assign an “energy” H(\sigma) given by H(\sigma) \,=\, - \sum_{ e=(v,w) \in E} \sigma(v) \sigma(w) \,. The contribution of an edge (v,w) to the energy is minimal if \sigma(v) \sigma(w) =1, i.e., if the magnets at v and w are aligned. Let us consider the probability distribution \pi_\beta (\sigma) \,=\, \frac{e^{-\beta H(\sigma)} }{Z_\beta} \,, \quad Z_\beta \,=\, \sum_{\sigma} e^{-\beta H(\sigma)} \,. The distribution \pi_\beta is concentrated around the minima of H(\sigma). To describe the Glauber dynamics note that H ( \sigma_{-v}, 1) - H( \sigma_{-v}, -1) \,=\, -2 \sum_{ w\,;\, w \sim v} \sigma(w) and this can be computed simply by looking at the vertices connected to v and not at the whole graph. So the transition probabilities for the Glauber algorithm are given by picking a vertex at random and then updating with probabilities \frac{ \pi( \sigma_{-v}, \pm 1)} { \pi( \sigma_{-v}, 1) + \pi( \sigma_{-v}, -1) } \,=\, \frac{1}{ 1 + e^{\pm \beta \left[ H ( \sigma_{-v}, 1) - H( \sigma_{-v}, -1)\right]}} \,=\, \frac{1}{1 + e^{\mp 2 \beta \sum_{ w\,;\, w \sim v} \sigma(w)}} \,.
By comparison, for the Metropolis algorithm we pick a vertex at random, propose to switch \sigma(v) to -\sigma(v), and accept the move with probability \min\left\{ 1, \frac{ \pi( \sigma_{-v}, - \sigma(v))}{\pi( \sigma_{-v}, \sigma(v))} \right\} \,=\, \min \left\{ 1, e^{ -2\beta \sigma(v) \sum_{ w\,;\, w \sim v} \sigma(w)} \right\} \,.
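For concreteness, here is a sketch of both updates for the Ising model on a hypothetical graph given as an adjacency list; spins take the values +1/-1 and the function names are ours.
import math
import random

def glauber_step(sigma, graph, beta):
    v = random.choice(list(graph))
    local_field = sum(sigma[w] for w in graph[v])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * local_field))  # P(new spin at v is +1)
    sigma[v] = 1 if random.random() < p_plus else -1
    return sigma

def metropolis_step(sigma, graph, beta):
    v = random.choice(list(graph))
    local_field = sum(sigma[w] for w in graph[v])
    # acceptance ratio pi(flipped)/pi(current) = exp(-2 beta sigma(v) * local_field)
    if random.random() < min(1.0, math.exp(-2.0 * beta * sigma[v] * local_field)):
        sigma[v] = -sigma[v]
    return sigma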
The idea of simulated annealing comes from physics. The concept of annealing is that to obtain a low energy state of a solid (typically a crystal) one first heats it up to reach a liquid state and then, slowly, decreases the temperature to let the particles arrange themselves.
For a Markov chain the idea is to pick a temperature schedule T_1 > T_2 > T_3 > \cdots \quad \textrm{ with } \lim_{k \to \infty} T_k =0 with T_1 sufficiently large.
The following algorithm is a Markov chain with nonstationary transition probabilities.
Simulated annealing algorithm:
+ Initialise the Markov chain X_0 and the temperature T_1.
+ For each k run N_k steps of the Metropolis or Gibbs sampler with invariant distribution \pi_{T_k}.
+ Update the temperature to T_{k+1} starting with the final configuration.
A nice result about the Metropolis sampler can be found for example in Hajek (1988) (a more precise version is given there).
Theorem 8.3 (Convergence of simulated annealing) For the simulated annealing of the Metropolis algorithm (N_k=1) there exists a constant d^* such that we have \lim_{n \to \infty} P\{X_n \in S^*\} = 1 if and only if \sum_{k=1}^\infty e^{-\frac{d^*}{T_k}} =\infty The constant d^* measures the depth of the local maxima of f which are not global maxima, where locality is measured in terms of the proposal matrix Q.
The theorem tells us that if we decrease the temperature very slowly, e.g. like
T_k = \frac{c}{\log(k)}
with c> d^*, then the Markov chain will converge, with probability 1, to the set S^* of global maxima.
This is an extremely slow cooling schedule, which makes it not very practical; in practice other schedules are used, such as T_{k+1} = 0.99\, T_k.
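A minimal simulated annealing sketch with the geometric schedule just mentioned is given below; the objective f, the symmetric proposal neighbour(x) and the initial state x0 are assumed to be supplied by the user, and we use the convention \pi_T \propto e^{f/T} (maximizing f) as in the definition of \pi_\beta above.
import math
import random

def simulated_annealing(f, neighbour, x0, T1=10.0, cooling=0.99, num_steps=100000):
    x, T = x0, T1
    best = x
    for _ in range(num_steps):
        y = neighbour(x)
        # Metropolis acceptance for pi_T proportional to exp(f/T)
        if random.random() < min(1.0, math.exp((f(y) - f(x)) / T)):
            x = y
        if f(x) > f(best):
            best = x
        T *= cooling               # geometric schedule T_{k+1} = 0.99 T_k
    return best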
The state space is the product of K copies of S, S^{(K)} = \underbrace{S \times \cdots \times S}_{\textrm{K times}}, and we denote a state by the vector \mathbf{i} = (i_1, \cdots, i_K).
One picks a set of temperatures T_1 < T_2 < \cdots < T_K and a probability distribution
\pi^{(K)} ({\bf i}) =\prod_{l=1}^K \pi_{T_l}(i_l)= \prod_{l=1}^K Z_{T_l}^{-1} e^{- \frac{f(i_l)}{T_l}}
which is the product of the distributions at the different temperatures.
The parallel tempering algorithm consists of two kinds of moves. The parallel move updates each component of \mathbf{X}_n independently with a Metropolis step at its own temperature, and the swapping move exchanges a pair of components of the state vector \mathbf{X}_n in such a way as not to disturb the invariant measure. The component at the lowest temperature can be used to find the desired minimum.
Parallel tempering algorithm:
+ Suppose the state is \mathbf{X}_n=\mathbf{i} and pick a random number U.
+ If U < \alpha we do the parallel step and update each component of \mathbf{X}_n according to a Metropolis move at the corresponding temperature T_l.
+ If U \ge \alpha we do the swapping step. We randomly choose a neighboring pair l,l+1 and propose to swap the components X_{n,l} and X_{n,l+1}. We accept the swap with probability \min\left\{ 1, \frac{ \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) }{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1})} \right\}
The parallel moves clearly satisfy detailed balance since each component does. As for a swap move which exchanges the components i_l and i_{l+1} in the state vector, we also have detailed balance since \begin{aligned} & \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1}) (1-\alpha) \frac{1}{K-1} \min\left\{ 1, \frac{ \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) }{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1})} \right\} \\ & = (1-\alpha) \frac{1}{K-1} \min \left\{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1}) , \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) \right\} \end{aligned} which is symmetric in i_l,i_{l+1}.
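A sketch of the swap move is given below, with the convention \pi_T(i) \propto e^{-f(i)/T} used in the product distribution above; X is the list of the K components and T the list of temperatures (both names are ours).
import math
import random

def swap_move(X, T, f):
    l = random.randrange(len(X) - 1)              # pick a neighbouring pair (l, l+1)
    # log of the ratio pi_{T_l}(i_{l+1}) pi_{T_{l+1}}(i_l) / (pi_{T_l}(i_l) pi_{T_{l+1}}(i_{l+1}))
    log_ratio = (f(X[l]) - f(X[l + 1])) * (1.0 / T[l] - 1.0 / T[l + 1])
    if random.random() < min(1.0, math.exp(log_ratio)):
        X[l], X[l + 1] = X[l + 1], X[l]           # accept the swap
    return X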
Given two probability measures \mu and \nu on S, the total variation distance between \mu and \nu is given by \|\mu - \nu\|_{\textrm{TV}}= \sup_{A \subset S}|\mu(A) - \nu(A)|, that is, the largest difference between the measures of sets.
The supremum over all sets is not convenient to compute, but we have the following formula.
Theorem 9.1 We have the formula \begin{aligned} \|\mu - \nu\|_{\textrm{TV}} &= \sum_{i: \mu(i) \ge \nu(i)} (\mu(i) - \nu(i)) = \frac{1}{2} \sum_{i} | \mu(i) - \nu(i)| \end{aligned} \tag{9.1}
Proof. First note that the second equality in Equation 9.1 follows from the first. Indeed, if the first equality holds, then by interchanging \mu and \nu we also have \|\mu - \nu\|_{\textrm{TV}} = \sum_{i: \mu(i) \ge \nu(i)} (\mu(i) - \nu(i)) = \sum_{i: \nu(i) \ge \mu(i)} (\nu(i) - \mu(i)) \,, and adding the two expressions gives 2\|\mu - \nu\|_{\textrm{TV}} = \sum_{i} |\mu(i) - \nu(i)|, which proves the second equality.
To prove the first equality in Equation 9.1 we consider the set B=\{i:\mu(i) \ge \nu(i)\}. For any event A we have \begin{aligned} \mu(A) - \nu(A) = \sum_{i \in A } \mu(i)-\nu(i) \le \sum_{i \in A \cap B} \mu(i)-\nu(i) \le \sum_{i \in B} \mu(i)-\nu(i) =\mu(B)-\nu(B) \end{aligned} By interchanging the role of \mu and \nu we find \begin{aligned} \nu(A) - \mu(A) &\le \nu(B^c)-\mu(B^c) = \mu(B) -\nu(B) \end{aligned} and thus for any set A we have |\mu(A)-\nu(A)|\le \mu(B) - \nu(B). \quad\blacksquare
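Numerically the formula is a one-liner; here is a tiny sketch with made-up distributions.
import numpy as np

def total_variation(mu, nu):
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

print(total_variation([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]))   # 0.3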
The total variation is also intimately related to the notion of coupling between probability measures. A coupling between the probability measures \mu and \nu is a probability measure q(i,j) on the product space S \times S such that \sum_{j} q(i,j) = \mu(i) \quad \textrm { and } \quad \sum_{i} q(i,j) = \nu(j) \,, i.e. the marginals of q are \mu and \nu.
Couplings are nicely expressed in terms of random variables: we can think of q(i,j) as the joint distribution of a pair of random variables (X,Y) where X has distribution \mu and Y has distribution \nu.
There always exists a coupling since we can always choose X and Y independent, i.e. q(i,j)=\mu(i)\nu(j).
At the opposite extreme, if \mu=\nu then we can pick X=Y as a coupling, i.e. q(i,i)=\mu(i)=\nu(i) and q(i,j)=0 if i\not=j.
Theorem 9.2 (Coupling representation of total variation) We have \|\mu -\nu\|_{\textrm{TV}}= \inf \left\{ P\{X \not =Y\}\,; (X,Y) \textrm{ coupling of } \mu \textrm { and } \nu \right\} \tag{9.2}
Proof. We first prove an inequality: \begin{aligned} \mu(A)-\nu(A) &= P\{X \in A\} - P\{Y \in A\} \le P\{X \in A\}- P\{X \in A, Y \in A\} \\& = P\{X \in A, Y \notin A\} \le P\{ X \not=Y\} \end{aligned} and thus \|\mu-\nu\|_{\textrm{TV}} \le \inf P \{ X \not=Y\}.
To show the equality we construct an optimal coupling.
Recall from Theorem 9.1 that \displaystyle \|\mu-\nu\|_{\textrm{TV}}=\sum_{\mu(i)\ge \nu(i)}\mu(i)-\nu(i) = \sum_{\nu(i)\ge \mu(i)}\nu(i)-\mu(i) (see regions A and B in Figure 9.1) and we set p= \sum_{i} \mu(i) \wedge \nu(i) = 1 - \|\mu-\nu\|_{\textrm{TV}} (see region C in Figure 9.1).
Consider now the coupling defined as follows. Set \gamma_C(i) = \frac{\mu(i)\wedge \nu(i)}{p}, \gamma_A(i) = \frac{(\mu(i)-\nu(i))^+}{1-p}, and \gamma_B(i) = \frac{(\nu(i)-\mu(i))^+}{1-p}. Pick a random number U uniform on [0,1]: if U \le p draw X from \gamma_C and set Y=X; if U > p draw X from \gamma_A and, independently, Y from \gamma_B.
Since p \gamma_C + (1-p) \gamma_A = \mu and p \gamma_C + (1-p) \gamma_B = \nu this defines a coupling and we have P\{ X\not=Y\}=1-p =\|\mu-\nu\|_{\textrm{TV}}. \quad \blacksquare.
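A sketch of the optimal coupling constructed in the proof (with made-up \mu and \nu): with probability p we draw X=Y from \gamma_C, otherwise we draw X from \gamma_A and Y from \gamma_B independently, and the empirical frequency of \{X \not= Y\} approximates \|\mu-\nu\|_{\textrm{TV}}.
import numpy as np

rng = np.random.default_rng(1)

def sample_optimal_coupling(mu, nu):
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    common = np.minimum(mu, nu)
    p = common.sum()                          # p = 1 - ||mu - nu||_TV
    if rng.random() < p:
        i = rng.choice(len(mu), p=common / p)
        return i, i                           # X = Y, drawn from gamma_C
    gamma_A = np.clip(mu - nu, 0, None) / (1 - p)
    gamma_B = np.clip(nu - mu, 0, None) / (1 - p)
    return rng.choice(len(mu), p=gamma_A), rng.choice(len(nu), p=gamma_B)

mu, nu = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
samples = [sample_optimal_coupling(mu, nu) for _ in range(100000)]
print(sum(x != y for x, y in samples) / len(samples))   # approximately 0.3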
A coupling of a Markov chain is a stochastic process (X_n,Y_n) with state space S\times S such that both X_n and Y_n are Markov chains with transition matrix P but with possibly different initial conditions.
A Markovian coupling is a coupling such that the joint (X_n,Y_n) is itself a Markov chain with some transition matrix Q((i,j), (k,l)) which must then satisfy \sum_{k} Q((i,j), (k,l)) = P(j,l) \quad \textrm{ and } \sum_{l} Q((i,j), (k,l)) = P(i,k) \,.
Recall Theorem 2.3, where we proved the convergence of aperiodic irreducible Markov chains to their stationary distribution using an independent coupling. We will expand upon this idea by considering other, more interesting, couplings.
The coupling time is defined by \sigma= \inf \{ n \ge 0 \,:\, X_n=Y_n \}, that is, it is the first time the two chains are in the same state.
After the coupling time \sigma, X_n and Y_n have the same distribution (by the strong Markov property), so we can always modify a coupling so that, after time \sigma, X_n and Y_n move together, i.e. X_n=Y_n for all n \ge \sigma. We will always assume this in what follows.
Theorem 9.3 Suppose (X_n,Y_n) is a coupling of a Markov chain such that X_0=i and Y_0=j and \sigma is the coupling time. Then we have \|P^n(i,\cdot) - P^n(j,\cdot)\|_{ \textrm{TV} } \le P \left\{ \sigma>n| X_0=i, Y_0=j\right\}
Proof. We have P\{X_n=l|X_0=i\}=P^n(i,l) and P\{Y_n=l|Y_0=j\}=P^n(j,l) and therefore (X_n, Y_n) is a coupling of the probability distributions P^n(i,\cdot) and P^n(j,\cdot). So by Theorem 9.2 we have \|P^n(i,\cdot) - P^n(j,\cdot)\|_{\textrm{TV}} \le P \left\{ X_n \not= Y_n | X_0=i,Y_0=j\right\} = P \left\{ \sigma > n | X_0=i,Y_0=j\right\}\,. \blacksquare
We can use this result to bound the distance to the stationary measure
Theorem 9.4 We have \sup_{i \in S } \| P^n(i,\cdot) - \pi\|_{\textrm{TV}} \le \sup_{i,j \in S } \|P^n(i,\cdot) - P^n(j,\cdot)\|_{\textrm{TV}}
Proof. Using the stationarity and the triangle inequality we have \begin{aligned} \| P^n(i,\cdot) - \pi\|_{\textrm{TV}} &= \sup_{A} |P^n(i,A) - \pi(A)| \\ & = \sup_{A}| \sum_{j} \pi(j) (P^n(i,A)-P^n(j,A))| \\ %& \le \sup_{A} \sum_{j} \pi(j) |P^n(i,A)-P^n(j,A)| \\ & \le \sum_{j} \pi(j) \sup_{A} |P^n(i,A)-P^n(j,A)| \\ & = \sum_{j} \pi(j) \|P^n(i,\cdot) - P^n(j,\cdot)\|_{TV} \\ & \le \sup_{i,j\in S} \|P^n(i,\cdot) - P^n(j,\cdot)\|_{TV} \quad \blacksquare \end{aligned}
We start with a countable state space example, the success run chain which is very easy to analyze.
Parenthetically, it should be noted that the supremum over i in d(n) is often not well suited for countable state spaces: it may often happen that the number of steps it takes to be close to the stationary distribution depends on where you start.
We consider a special case of Example 4.4 with constant success probability: P(n,0)=1-p\,, \quad P(n,n+1)= p\,, \qquad n=0,1,2,\cdots
Suppose X_0=i and Y_0=j (with i\not=j). We couple the two chains by moving them together:
Pick a random number U. If U < 1-p then set X_1=Y_1=0; the coupling time is \sigma=1.
If U \ge 1-p then move both chains up: X_1=i+1, Y_1=j+1.
Clearly we have
P( \sigma >n| X_0=i, Y_0=j) = p^n
and thus we find for the mixing time
d(n)\le p^n < \varepsilon \iff n > \frac{\ln \varepsilon}{\ln p}
Recall that for the random walk on the hypercube (see Example 1.4) a state is a d-bit string \sigma=(\sigma_1, \cdots, \sigma_d) with \sigma_j\in \{0,1\}.
The Markov chain is periodic with period 2, so to make it aperiodic we consider its “lazy” version in which we do not move with probability 1/2, that is we consider instead the transition matrix \frac{P+I}{2}, which is aperiodic. The Markov chain then moves as follows: we pick a coordinate \sigma_j at random and replace it by a random bit 0 or 1.
To couple the Markov chains we simply move them together: if X_n=\sigma and Y_n=\sigma' we pick a coordinate j \in \{1,\cdots, d\} at random and then replace both \sigma_j and \sigma'_j by the same random bit. This is a coupling, and after such a move the chosen coordinates coincide.
Under this coupling X_n and Y_n get closer to each other whenever we select a j such that \sigma_j \not= \sigma'_j, and the chains have surely coupled once every coordinate has been selected at least once. The coupling time is thus dominated by the time \tau needed to collect all d toys in the coupon collector problem.
We now get a bound on the tail as follows. We denote by A_i the event that coordinate i has not been selected in the first n steps. We have, using the inequality (1-x)\le e^{-x}, \begin{aligned} P\{\sigma > n\} \le P\left(\bigcup_{i=1}^d A_i\right) \le \sum_{i=1}^d P(A_i) = \sum_{i=1}^d \left(1- \frac{1}{d}\right)^n \le d e^{-\frac{n}{d}} \end{aligned} So if we pick n = d\ln(d) + c d we find P\{\sigma > d\ln(d) + c d\} \le d e^{-\frac{d \ln(d) + cd}{d}}= e^{-c} \implies t_{\textrm{mix}}(\varepsilon) \le d\ln(d) + \ln(\varepsilon^{-1})d\,.
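A quick simulation sketch (ours) of the coupling just described: the coupling time is bounded by the first time every coordinate has been selected, and we can compare its empirical tail at n = d\ln(d)+cd with the bound e^{-c}.
import math
import random

def all_coordinates_selected_time(d):
    # first time every coordinate has been picked at least once (coupon collector)
    selected, n = set(), 0
    while len(selected) < d:
        selected.add(random.randrange(d))
        n += 1
    return n

d, c = 50, 2.0
threshold = d * math.log(d) + c * d
trials = [all_coordinates_selected_time(d) for _ in range(10000)]
print(sum(t > threshold for t in trials) / len(trials), math.exp(-c))   # tail <= e^{-c}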