target PDF, π(x). For this to hold, π(x) must be aninvariant distribution of the chain, meaning that
π(x) = Z
π(x0)T(x0, x)dx (6.2)
A sufficient condition for invariance is that of detailed balance, which for con-tinuous spaces is
Z
A
Z
B
π(x)T(x, x0)dx0dx= Z
B
Z
A
π(x0)T(x0, x)dxdx0 (6.3)
i.e. the probability of making a transition from some point in Ato some point inBis the same as the other way around. It is easy to see (through integration) that if (6.3) holds, (6.2) will hold.
The second condition is that the chain must be ergodic, which basically means that the starting point is inconsequential, so that the sampling may start in any place and still converge to generating samples from the target distribution.
The target distribution here is of course the conditional parameter posterior for a given hemodynamic model,π(·) =p(θ|D). Bayes’ rule allows us to rewrite the posterior in terms of the likelihood, p(D|θ), the prior, p(θ) and a normalizing factor,p(D):
p(θ|D) = p(D|θ)p(θ)
p(D) (6.4)
The proposal is accepted as the next discrete sample according to theacceptance ratio:
r= p(θ0|D)
p(θ|D) = p(D|θ0)p(θ0)
p(D|θ)p(θ) (6.5)
The proposal distribution does not figure in this ratio, since it is symmetric, p(θ0|θ) =p(θ)|θ0); also note that the normalizing factor, p(D), cancels out. If r ≥ 1 the proposal is accepted, and the next sample generated is θi+1 = θ0. If r < 1, the proposal is accepted with probability r. If the proposal is not accepted, the next sample is simply kept at the current value, θi+1 = θ. In other words, theacceptance function is
A(θ, θ0),min(1, p(θ0|D)/p(θ|D)) (6.6) The transition probability function of the Metropolis-Hastings algorithm may thus be expressed as a product of the proposal PDF and the acceptance function, T(θ, θ0) =p(θ0|θ)A(θ, θ0).
It is easy to see that detailed balance holds for this transition probability func-tion. Letting the integrand in (6.3) representπ(x) =p(θ|D),
π(x)T(x, x0) =π(x)p(x0|x)A(x, x0)
=p(x0|x)π(x) min(1, π(x0)/π(x))
=p(x0|x) min(π(x), π(x0))
=p(x|x0)π(x0) min(1, π(x)/π(x0))
=π(x0)T(x0, x)
(6.7)
since the proposal is symmetric.
In practice, of course, it suffices to compare a randomly generated variable that is uniformly distributed from 0 to 1 with r to determine wether or not the proposal is accepted, and no minimum need be taken.
The acceptance rate ρ is defined as the percentage of proposes samples that are accepted. Generally, for higher dimensional problems, it is necessary to use higher acceptance rates ([58]), but for the current problems of dimension around 10, an acceptance rate betweenPmin = 0.2 and Pmax = 0.5 was found to give good ’mixing’ (convergence).
6.2.1 Automated proposal generation
The shape and scale of the proposal distribution Σ is critical to the conver-gence of the algorithm, i.e. how many samples need be generated before the approximation (6.1) is sufficiently accurate. It is therefore normal to choose Σ by trial and error, manually. However, with the application at hand, where a lot of approximations must be done with different models and different data sets, this is not satisfactory. Therefore, an automated procedure for finding a good proposal was implemented.
The sampling is started with an arbitrary normal distribution of dimension dim(θ). Then, several short ’scout’ sampling runs are then performed, each withN samples. After each of these short runs, the covariance of the generated samplesθN is used as the new proposal covariance,
Σ0= 1 N
XN n=1
(θn−µθ)(θn−µθ)T (6.8)
whereµθ is the empirical mean,
µθ= XN n=1
θn
After each update, it is necessary to scale this new proposal, i.e. to find a scalingσso that Σ =σΣ0achieves the desired acceptance rate. This is a search problem, which has been solved by a simple bisection method, see figure 6.1.
First, an upper boundσ0is found by doublingσuntil the acceptance rate is less thanPmin. Bisection is then carried out until the estimated acceptance rate is in the desired range.
After a set number (usually around 10) of these initial iterations, the main sam-pling run is performed keeping the proposal distribution constant (the samples of the initial runs are not used further).
It was found that 100 samples in each of these short runs was sufficient to find good proposals. As small a number as possible is of course desirable for speed, but too small numbers give too high of a variance in the covariance estimates (6.8) and may also lead to divergence during the bisection algorithm, as the variance of the estimated ρ is also increased with small sample size. Usually from 1 to 5 iterations are needed for each scaling.
Figure 6.1: Bisection is used to find an appropriate scaling for the proposal distribution, so that the acceptance rateρlies betweenPmin andPmax.
6.2.2 Finding a good starting point
Another condition for reaching convergence as quickly as possible is to start the sampling from a ’good’ starting point. ’Good’ here means a point with a high posterior probability p(θ|D). Starting far from regions of high posterior probability means that it might take a long time to move to high probability region, and also the proposals found in such regions might not be optimal for sampling around the major modes, which is what the algorithm should be doing.
Several alternative policies exist for finding a good starting point. It is impor-tant that such a point can be found quickly, as otherwise the point of speeding up convergence is defeated. The method used here might be called an ’itera-tive univariate search’, and is very quick and simple. Starting from parameter values that are expected a priori to be likely, each parameter in turn is iterated through to equally distributed values across a range covering the bulk of the corresponding prior distribution. It is then set to the value giving the highest likelihood. This is then repeated for the next parameter, keeping the optimal value for the previous parameter, until all parameters have been ’optimized’
in this way. The whole procedure is then repeated, starting with the optimal values from the previous iteration. It was found that 2 iterations of this proce-dure with 4 points in each range was sufficient for finding a good starting guess across different data sets and models. This simple procedure could be improved in countless ways (e.g. refining the search range after each iteration), but serves its purpose very well in practice.
As an example of the benefit of this method, the log likelihood of the stochastic version of the augmented balloon model in typical runs was around 600 with the initial parameter setting, around 1500 after the starting point procedure, and around 2600 after 3000 simulated annealing iterations (data set 2).