Generalities

Generalities#

Markov Chains#

A Markov process is a stockastic process in which events at the step \(N\) only depends on events at the step \(N-1\). It can be represented by a trasition matrix

\[T\left(x \rightarrow y\right) ~~~~~ with ~~~~~ \int dy~T\left(x \rightarrow y\right) = 1 ~~ \forall x\]

Sarting from a single point \(\hat{x}\) the probability distribution \(p^{(n)}(x)\) after \(n\) steps is

\[p^{(0)}(x) = \delta(x - \hat{x}), ~~~~~~~ p^{(n)}(x) = \int dx~T\left(y \rightarrow x\right)~p^{(n-1)}(y)\]

If we consider the chain, i.e. the union of all points after \(n\) steps, its probability distribution is

\[C^{(n)}(x) = \frac1n \sum_{k=0}^n~p^{(n)}(x)\]

which under transition behaves like

\[\int dx~T\left(y \rightarrow x\right)~C^{(n)}(y) = \frac1n \sum_{k=0}^n~p^{(n+1)}(x) = C^{(n)}(x) + \frac1n \left( p^{(n+1)}(x) - p^{(0)}(x) \right)\]

Thus the limiting chain \(C(x) \equiv C^{(\infty)}(x)\) satisfy the stationarity equation

(1)#\[\int dx~T\left(y \rightarrow x\right)~C(y) = C(x)\]

Metropolis Hastings#

Metropolis Hastings [1] is an algorithm that uses Markov chains to provide sampling of a given distribution \(P(x)\).

Given a proposal transition matrix \(Q\)

\[Q\left(x \rightarrow y\right) ~~~~~ with ~~~~~ \int dy~Q\left(x \rightarrow y\right) = 1 ~~ \forall x\]

Let’s define the acceptance as

\[\mathcal{A}(x, y) = \min\left(1, A(x, y)\right) ~~~~~~~ A(x, y) = \frac{P(y)~Q\left(y \rightarrow x\right)}{P(x)~Q\left(x \rightarrow y\right)}\]

Let’s consider the following Markov transition matrix

\[T\left(x \rightarrow y\right) = Q\left(x \rightarrow y\right)~\mathcal{A}(x, y) + \delta\left(x - y\right)~\int dz ~Q\left(x \rightarrow z\right)~\left(1 - \mathcal{A}(x, y)\right)\]

One can show that \(P(x)\) satisfies the stationarity equation for \(T\). Thus, if a large number of steps is performed, the chain resulting from \(T\) provides a good sample of \(P(x)\).

The resulting algorithm is as follows:

# Algorithm 1

X[n]                           # Point at the n-th markov iteration

Y, Fwd, Bkd = Proposal(X[n])   # Propose a point Y sampling Q(X(n) -> Y)
                               # Fwd = Q(X[n] -> Y)
                               # Bkd = Q(Y -> X[n])

Z = Uniform(0, 1)              # Get a random Z between 0 and 1

A = P(Y) / P(X[n]) * Bkd / Fwd # Acceptance

if Z > A:

    X[n + 1] = X[n]            # Refused

else:

    X[n + 1] = Y               # Accepted

# Loop to get X[n + 2]

Multistep MCMC#

Consider a Markov process which transition matrix \(T\) is the composition of \(M\) matrices \(T^{(i)}, i=0,\dots,M-1\). This can be seen as a multi step process in which the transition matrices cycle during the Markov chain steps.

Sarting from a single point \(\hat{x}\) the probability distribution after \(n\) cycles and additional \(k=0, \dots, M-1\) steps is

\[p^{(0)}(x) = \delta(x - \hat{x}), ~~~~~~~ p^{(Mn + k + 1)}(x) = \int dx~T^{(k)}\left(y \rightarrow x\right)~p^{(Mn + k)}(y)\]

and the partial chain, i.e. the union of all points at the ::k step of the cycle, after \(n\) cycles

\[C^{(k, n)}(x) = \frac1n \sum_{i=0}^n~p^{(Mn + k)}(x),~~~~~~~C^{(k)}(x)\equiv C^{(k, \infty)}(x)\]

The analogous of stationarity equation above is the following set of stationarity equations

(2)#\[\int dx~T^{(k)}\left(y \rightarrow x\right)~C^{(k)}(y) = C^{(k + 1) \% M}(y),~~~~~~ k=0, \dots, M-1\]

The case we are interested in is when performing a multi step Metropolis Hasting MCMC sampling where all steps transition matrices correspond to the same posterior \(P\) and different proposals \(Q^{(i)}\). In this case it is easy to see that all \(C^{(k)}\) in the above equation sample P. This will be underlying mechanism of most of the algorithms described below (proposal mixing, parallel tempering, ensemble sampling).

Mixing Proposals#

While performing MCMC sampling, in order to better explore target posterior, one often alternates different proposals according to some “weights”.

To see how this works, suppose you want to sample a given posterior \(P(x)\) on some \(d\)-dimensional space. And suppose you have a set of \(N\) proposals \(Q_i\) that you want to alternate according to the discrete probability distribution \(p_i\)

\[Q_i\left(x \rightarrow y\right)~~~~~~p_i~~~~~~i=1, \dots, N~~~~~~~\sum_{i=1}^{N}p_i = 1\]

In order to do this consider sampling the posterior \(\tilde{P}\) on the \((x, i)\) space

\[\tilde{P}(x, i) = P(x)~p_i\]

and perform a 2-steps MCMC as follows

Step1: starting from \((x, i)\) point propose a new point with proposal

\[\tilde{Q}\left(x, i \rightarrow y, j \right) = p_j~\delta(x - y)\]

in this case, obviously, \({x=y}\) and the acceptance is

\[A = \frac{\tilde{P}(x, j)}{\tilde{P}(x, i)}\frac{p_i}{p_j} = 1\]

thus the new point \((x, j)\) is always accepted.
Step2: propose a new point with proposal

\[\hat{Q}\left(x, j \rightarrow y, k \right) = \delta_{jk}~Q_j\left(x \rightarrow y\right)\]

in this cas, of course, \({j=k}\) and the acceptance is

\[A = \frac{\tilde{P}(x, j)}{\tilde{P}(x, j)} \frac{Q_j\left(y \rightarrow x\right)}{Q_j\left(x \rightarrow y\right)} = \frac{P(y)}{P(x)}\frac{Q_j\left(y \rightarrow x\right)}{Q_j\left(x \rightarrow y\right)}\]

In this way we sample \(\tilde{P}\) and, by considering only the \(x\) variable in the chain, we marginalize over \(i\) obtaining \(P\).

Here is how the algorithm goes in practice:

# Algorithm 2

X[n]                           # Point at the n-th markov iteration

w                              # Weights of the proposals

Proposals                      # Set of proposal

i = choose(1, N, weights=w)    # Choose one random index according to weights

Proposal = Proposals[1]

# Run Algorithm 1 to get X[n + 1]

# Loop to get X[n + 2]

Parallel MCMC#

When performing a multistep MCMC one can apply a mechanism of reordering and grouping the steps that allows us to partially parallelize the execution.

Suppose you have a \(N\)-steps MCMC sampling on a \(d\)-dimensional space with variable \(\mathbf{x}=(x_1,\dots,x_d)\) and consider the acceptance \(A^{(i)}, i=1,\dots,N\) for each steps.

Consider having consecutive steps that are mutually independent e.g.

\[T^{(i)}\left(x_1, \dots, x_k, x_i \rightarrow y_i\right)\Pi_{j \neq i}\delta(x_j - y_j)~~~~~~~i=k, \dots, d\]

They can be executed in parallel.

Note that the result of the sampling, the target posterior, is not affected by changing the order of steps in the cycle or by duplicating some of the steps. So one can rearrange the multi-steps MCMC so to have sub sequencies of steps that are independent and can thus be executed in parallel. We can see how this mechanism is the basis of parallel tempering and some walker mixing proposals (i.e. stretch) in ensemble sampling.