A Guide to Exact Simulation

Xeni K. Dimakos

Department of Mathematics, University of Oslo, Box 1053 Blindern, 0316 Oslo, Norway

Summary

Markov Chain Monte Carlo (MCMC) methods are used to sample from complicated multivariate distributions with normalizing constants that may not be computable in practice and from which direct sampling is not feasible. A fundamental problem is to determine convergence of the chains. Propp & Wilson (1996) devised a Markov chain algorithm called Coupling From The Past (CFTP) that solves this problem, as it produces exact samples from the target distribution and determines automatically how long it needs to run. Exact sampling by CFTP and other methods is currently a thriving research topic.

This paper gives a review of some of these ideas, with emphasis on the CFTP algorithm. The concepts of coupling and monotone CFTP are introduced, and results on the running time of the algorithm presented. The interruptible method of Fill (1998) and the method of Murdoch & Green (1998) for exact sampling from continuous distributions are presented. Novel simulation experiments are reported for exact sampling from the Ising model in the setting of Bayesian image restoration, and the results are compared to standard MCMC. The results show that CFTP works at least as well as standard MCMC, with convergence monitored by the method of Raftery & Lewis (1992, 1996).

Key words: Coupling from the past; Efficient sampling; Exact simulation; Fill's algorithm; Forward coupling; Ising model; MCMC.

1 Introduction

Markov Chain Monte Carlo (MCMC) is an iterative algorithm used when direct sampling from the distribution of interest π is not feasible. This is typically the case in image analysis, spatial statistics and graphical and Bayesian hierarchical modelling, where models are often complex, multivariate and difficult to normalize. Instead, an ergodic Markov chain with the target distribution π as stationary distribution is devised. Starting from an arbitrary initial state, the chain is run until it is believed to be close to equilibrium, and the final state is considered as a sample from π.

Assisted by methods for convergence diagnostics (Cowles & Carlin, 1996; Brooks & Roberts, 1999; Mengersen et al., 1999), the MCMC user has to determine how long to run the simulation. However, as the chain is simulated in finite time and cannot be started in π, the method always produces an approximate sample. In contrast, a Markov chain simulation is said to be exact if the true equilibrium is reached and the final sample is drawn from it. We have chosen to use the term "exact", while some authors prefer (see Kendall & Møller (2000) for an argument) the term "perfect" simulation.

This paper assumes the reader to be familiar with the idea of standard MCMC simulation and the most common samplers such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) and the Gibbs sampler (Geman & Geman, 1984). We will focus on the recently developed methods for exact simulation, which organize MCMC simulations cleverly to produce exact samples from the distribution of interest. The purpose of this paper is to serve as an introduction with pointers to further reading. We also report on our experiments with exact simulation and comparisons with standard MCMC.


Exact simulation is not new. For simple distributions exact samples are obtained by inversion, tabulation, transformation and composition methods (Ripley, 1987, Chapter 3). Rejection sampling (Ripley, 1987, pp. 60-63) is in principle possible for any distribution and does not demand computation of complicated normalizing constants. For practical purposes, rejection sampling is fast enough only if we are able to sample easily from a proposal envelope distribution that is reasonably close to the target distribution, something which is however difficult to design in multivariate settings.

Some of the basic ideas leading to the new developments in exact simulation can be traced back several decades, and important contributions have also appeared recently. Propp & Wilson (1998a) write "the conceptual ingredients of CFTP were in the air" when Propp & Wilson invented it, and mention several precursors. Asmussen et al. (1992) were the first to show that it is possible to obtain exact samples from a finite chain, provided the chain is irreducible, the state space is finite and the number of states known. Lovász & Winkler (1995) devised a rule for stopping an irreducible Markov chain on a finite state space in a state whose probability distribution is exactly the target distribution, but with an enormous running time for chains with more than a few states. In contrast, the methods that we present in this paper work for complicated distributions and are fast enough to be useful in practice.

In Section 2 we introduce the idea of coupling and show how forward coupling produces biased samples. Section 3 is devoted to the original CFTP algorithm of Propp & Wilson (1996) and the simpler monotone CFTP that works under certain monotonicity requirements, and Section 4 gives some results on the running time of the CFTP algorithm. An iid sample from the target distribution may be obtained by running independent CFTP runs, but this is computationally expensive. Section 5 includes a presentation of several sampling schemes that are more efficient, in the sense that they utilize more of the generated values. In Section 6 we present a simulation experiment for the Ising model, in which the running time of the CFTP algorithm is compared to standard MCMC with convergence determined by the method of Raftery & Lewis (1992, 1996). We conclude that CFTP is the faster method. Section 7 presents an extension of CFTP to continuous state spaces by Murdoch & Green (1998) and Green & Murdoch (1999), while Section 8 introduces the algorithm of Fill (1998), which is based on rejection sampling and has a different structure than CFTP. The paper ends with a discussion and a guide to sources of information on exact simulation available on the Internet.

2 Forward Coupling

Consider a Markov chain defined on a discrete finite state space S, and suppose that two copies of the chain, {X^t}_{t>=0} and {Y^t}_{t>=0}, are started from two different initial states. Here and in the following, boldface is used to indicate vector-valued variables. Assume that the chains are coupled. Coupling is described as a tool in probability theory in Lindvall (1992). Examples of couplings used with standard MCMC are Johnson (1996) and Frigessi & Den Hollander (1993). In the current context, we say that two chains are coupled if they use the same sequence of random numbers for the transitions. If the trajectories of the chains meet at time T, the chains will merge and proceed together or coalesce, that is, X^t = Y^t for all t >= T, due to the coupling. If we similarly couple chains started from all states of S, and wait until all the chains have coalesced, then the initialization bias has in some sense worn off. It may then be intuitively plausible to expect that the common value of the chains at the time of coalescence is an unbiased sample from the stationary distribution, but this impression is false, as the following two examples show. The first example, taken from Kendall (1998) and Fill (1998), involves a simple two-state Markov chain, while the second example, from an early preprint of Häggström & Nelander (1999) that appeared in Nelander (1998), fits in the Gibbs sampler framework. For both examples it is convenient to consider the transitions of the chains via an update function. The update function is simply an expression for the transition mechanism that, given the previous value of the chain and the necessary random numbers, produces the new state of the chain.


Definition 1. The transitions of a Markov chain (X^t)_{t=0}^∞ with transition matrix P can be described by a deterministic update function φ as

X^{t+1} = φ(X^t, U^{t+1}) if P(x, y) = P(φ(x, U) = y) for all x, y in S,

for a vector of random numbers U (in the simplest case a single uniform random number), where (U^t)_{t=1}^∞ are iid with the same distribution as U.

Example 1 (A 1-dimensional target). Assume that we have a Markov chain on S = {0, 1} that from state 0 moves to 0 or 1 with probability 1/2 each, and from state 1 always moves to state 0. The transitions of the chain may be written in terms of the update function φ as X^{t+1} = φ(X^t, U^{t+1}), where

φ(x, u) = 1 if x = 0 and u > 1/2, and φ(x, u) = 0 otherwise,

for uniform random numbers U^t. The stationary distribution of this chain is π(0) = 2/3 and π(1) = 1/3. Starting one chain in each of the two states and applying the same random numbers, it is easy to see that the chains always coalesce in state 0 (see Figure 1, left panel), and hence the distribution at the time of coalescence is not the target distribution π. □

Figure 1. The left panel illustrates forward coupling for Example 1, and the right panel illustrates CFTP for the same example. In both panels the same sequence of random numbers is used, and we see how forward coupling always will end in state 0, while with CFTP the chains coalesce in state 0, but the state at time 0 may be either 0 or 1. For instance, for U^0 <= 0.5 then X^0 would be 0. To understand the right panel, the reader should try starting from times t = -1, -2, etc. The solid line indicates the final path.
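The bias of forward coupling in Example 1 is easy to check numerically. The sketch below (plain Python; the function names are ours, not the paper's) couples two chains through one shared uniform number per step and records the state in which they meet:

```python
import random

def update(x, u):
    # Update function of Example 1: from state 0 move to 1 iff u > 1/2,
    # from state 1 always move back to 0.
    return 1 if (x == 0 and u > 0.5) else 0

def forward_coupling():
    # Run two coupled chains forward from states 0 and 1, sharing the
    # same uniform number at every step, until they meet.
    x, y = 0, 1
    while x != y:
        u = random.random()
        x, y = update(x, u), update(y, u)
    return x

random.seed(0)
samples = [forward_coupling() for _ in range(10_000)]
# The chains can only meet in state 0, so state 0 has empirical
# frequency 1 instead of its stationary probability 2/3.
print(sum(s == 0 for s in samples) / len(samples))  # prints 1.0
```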

Example 2 (A 2-dimensional target π). Let X = (X_1, X_2) and S = {(i, j) : i = 0, 1, 2; j = 0, 1, 2}, and take π to be uniformly distributed on the subset {(0, 0), (0, 1), (2, 1), (2, 2)}, for which the values of the two elements differ by at most 1. For this distribution π a Gibbs sampler that updates the elements in random order is expressed by an update function in which V^t and U^t are independent uniform random numbers that determine which element is updated and the new value for this element.

Assume that chains are started from (0, 0), (0, 1), (2, 1) and (2, 2). Figure 2 shows how the chains coalesce according to which element is updated. Careful inspection of the figure reveals that the chains only coalesce in the states (0, 1) and (2, 1), with probability 1/2 each. □ These examples show that coalescence of two forward coupled chains fails to identify when the chains reach stationarity.

Figure 2. Illustration of coalescence in Example 2. The left and right panels show the updating of the first and second element of X respectively. The dotted lines show the updating for U <= 1/2 and the solid lines for U > 1/2. The same random numbers are applied for all 4 chains, and so for each update we only get dotted or solid lines.

3 Propp and Wilson's Coupling from the Past

3.1 Coupling from the Past

Propp & Wilson (1996) devised a surprisingly simple algorithm which uses coupling from the past, also called backward coupling, to produce exact samples from a stationary distribution. Consider a finite state space S with M states and an ergodic Markov chain with limit distribution π that can be described by a deterministic update function φ. In standard MCMC the update function φ may be any function satisfying Definition 1. However, in CFTP as described in Algorithm 1 below, the choice of update function is essential. As the following example illustrates, there are update functions that make coalescence impossible, and as will become clear, this in turn implies that the CFTP algorithm never terminates.


Example 3. Consider the Markov chain with state space {0, 1}, transition matrix

P(0, 0) = P(0, 1) = P(1, 0) = P(1, 1) = 1/2,

and stationary distribution π = (1/2, 1/2). Assume that a chain is started in each of the two states and that the same random numbers are used for the updates. With the update function

φ(x, u) = x if u <= 1/2, and φ(x, u) = 1 - x if u > 1/2,

for u in [0, 1] and x in {0, 1}, coalescence never happens. On the other hand, using

φ(x, u) = 0 if u <= 1/2, and φ(x, u) = 1 if u > 1/2,

the chains coalesce after one step. □
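Both update rules in Example 3 realize the same transition matrix, but only the second can ever coalesce. A small numerical check (our own illustration, not from the paper):

```python
import random

def phi_flip(x, u):
    # First rule: keep the state if u <= 1/2, flip it otherwise.
    # Coupled chains started in 0 and 1 stay distinct forever.
    return x if u <= 0.5 else 1 - x

def phi_merge(x, u):
    # Second rule: ignore the current state entirely, so every chain
    # lands in the same state after a single shared update.
    return 0 if u <= 0.5 else 1

random.seed(1)
for _ in range(1_000):
    u = random.random()
    assert phi_flip(0, u) != phi_flip(1, u)    # never coalesces
    assert phi_merge(0, u) == phi_merge(1, u)  # coalesces in one step
```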

Instead of running the chain from the present and into the future, we will run it from the past to the present; that is, we assume that at time -∞ one chain is started in each of the states of S and run to time 0. Also, we use the same sequence of random numbers for all these chains. We will assume that φ is chosen according to the condition of Theorem 1, so that all the chains will coalesce a.s. and be stationary by time 0 if started in the infinite past. Of course, starting a chain in the infinite past is not possible, and the following strategy is applied instead. One chain is started in each state of S at time -1. For each such chain the same random number U^0 is applied. If the chains are not all coalesced at time 0, we repeat, but now the chains are started from -2. New random numbers U^{-1} determine the transitions from time -2 to -1. In the transitions from -1 to 0, the same random numbers are used as in the preceding trial. We continue going back in time in this manner, reusing the random numbers generated in the previous steps of the procedure, until the chains have all coalesced by time 0. Note that even if we only check for coalescence at time 0, the chains may have merged before this time point.

Let the mth state of S be denoted x_m, m = 1, . . . , M, and denote by X^{t2}(t1, x_m) the state at time t2 of a chain started in state x_m at time t1 < t2. Then by definition

X^0(-t, x_m) = φ(φ(· · · φ(x_m, U^{-t+1}), . . . , U^{-1}), U^0),

and we introduce the notation

X^{t2}(t1, x_m) = φ^{t2}(x_m, U^{t1+1}, . . . , U^{t2})

to exhibit the dependency on the update function and the sequence of random numbers. Define the events

A_{t1}^{t2} = {φ^{t2}(x_m, U^{t1+1}, . . . , U^{t2}) equals one common value for all x_m in S}.

Using this notation the CFTP algorithm may be expressed in terms of successively trying larger and larger values of t until A_{-t}^0 occurs. The algorithm can be expressed as follows.

ALGORITHM 1 (CFTP).

1. Set the starting value for the time to go back, T0 ← -1.
2. Generate a random vector U^{T0+1}.
3. Start a chain in each state x_m, m = 1, . . . , M, of S at time T0, and run the chains X^{t+1}(T0, x_m) = φ(X^t(T0, x_m), U^{t+1}) to time 0, t = T0, T0 + 1, . . . , -1.
4. Check for coalescence at time 0, that is, check whether the X^0(T0, x_m) occupy the same state for all m. If so, this common value X^0 is returned. Otherwise let T0 ← (T0 - 1) and continue from step 2.
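For the two-state chain of Example 1, Algorithm 1 fits in a few lines. The sketch below (our illustration) keeps the generated random numbers in a list so that going one step further back only prepends a single new value, exactly the reuse that step 3 requires:

```python
import random

def update(x, u):
    # Update function of Example 1; stationary distribution (2/3, 1/3).
    return 1 if (x == 0 and u > 0.5) else 0

def cftp(rng):
    # Coupling From The Past with start times -1, -2, -3, ...
    us = []                          # us[0] is the earliest random number
    while True:
        us.insert(0, rng.random())   # fresh U for the new earliest step
        states = {0, 1}              # one chain per state of S
        for u in us:                 # run every chain forward to time 0
            states = {update(x, u) for x in states}
        if len(states) == 1:         # coalescence by time 0
            return states.pop()      # exact draw from pi

rng = random.Random(42)
samples = [cftp(rng) for _ in range(30_000)]
print(sum(s == 0 for s in samples) / len(samples))  # should be close to 2/3
```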

The following theorem from Propp & Wilson (1996) guarantees that the value X^0 returned by Algorithm 1 is distributed according to the stationary distribution. The proof we give resembles that of Häggström & Nelander (1999) but uses a different notation. The original proof is found in Propp & Wilson (1996).

THEOREM 1. Suppose an ergodic Markov chain is constructed so that there exists an L < ∞ such that P(A_{-L}^0) > 0. Then (i) with probability 1 the CFTP algorithm (Algorithm 1) returns a value, and (ii) this value is a realization of a random variable distributed according to the stationary distribution of the Markov chain.

Proof:

(i) Since the events A_{-kL}^{-(k-1)L}, k = 1, 2, . . . , are independent and by the assumption of the theorem have the same positive probability of occurring, the probability that at least one of them will occur is 1. This implies that the algorithm terminates with probability 1.

(ii) Let T* be the smallest t for which A_{-t}^0 occurs. If the event occurs then also A_{-t'}^0 occurs for t' > T*, and moreover X^0(-T*, x_m) = X^0(-t', x_m) for all states x_m, because the same random numbers are applied (in accordance with step 3 of the algorithm) when evolving the chain from time -T* to 0. This holds for any t' > T*, in particular as t' → ∞, and thus X^0(-T*, x_m) ~ π, since it is the state visited by an infinitely long trajectory of an ergodic Markov chain with stationary distribution π. The CFTP algorithm finds T*, and the returned value has the correct distribution. □

Kendall & Thönnes (1999) described the CFTP algorithm as a virtual simulation from time -∞, since it allows us to sample an infinitely long simulation by reconstructing it over a finite time interval. We may regard the method as a search algorithm that reconstructs the part of the infinite sequence it needs to obtain a correctly distributed sample.

Care must be taken when implementing the CFTP algorithm, so that when the chains are started from -t2 < -t1 the same realizations of the random numbers U^{-t1+1}, . . . , U^0 are used for the steps from -t1 to time 0. In practice this is done either by saving the random numbers as they are generated, or, in order to save storage space, by resetting the seed for the random number generator so that the same random numbers are produced.
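The seed-resetting variant can be sketched as follows (a minimal illustration, again for the two-state chain of Example 1). Draw number k of the regenerated stream plays the role of U^{-k}, so each restart reproduces exactly the same numbers for the steps closest to time 0:

```python
import random

def update(x, u):
    # Two-state chain of Example 1, used purely for illustration.
    return 1 if (x == 0 and u > 0.5) else 0

def cftp_with_seed(states0, seed):
    # Instead of storing the random numbers, reset the generator on every
    # restart and regenerate the stream; draw k is used as U^{-k}, so the
    # steps near time 0 reuse the same numbers on every attempt.
    T = 1
    while True:
        rng = random.Random(seed)
        us = [rng.random() for _ in range(T)]  # us[k] plays the role of U^{-k}
        states = set(states0)
        for u in reversed(us):                 # apply from time -T+1 up to 0
            states = {update(x, u) for x in states}
        if len(states) == 1:
            return states.pop()
        T += 1

print(cftp_with_seed({0, 1}, seed=2024))
```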

In Algorithm 1 the times -1, -2, -3, . . . are successively used as starting points for the chains. From the proof it is evident that Theorem 1 holds for any decreasing sequence of starting points and also for any value of T0 in step 1. In many situations it would be highly inefficient to try all integer time points t < 0. Propp & Wilson (1996) recommended to take simulation start times -T_k = -2^k, so that in each iteration the starting time is doubled. Propp & Wilson (1996) showed that this choice minimizes the worst-case number of steps and is close to minimizing the expected number of steps in the search. The argument is as follows. A natural choice for the sequence of starting points (all start times are of course negative, but we omit the sign in this discussion about the number of steps) is T1 = rT0, T2 = rT1, . . . for an initial value T0 and ratio r. If M chains are started each time, the number of required steps is

S_+ = M T0 + M r T0 + · · · + M r^k T0,

where T_+ is the smallest time so that all the chains have coalesced by time 0, and where we assume that T0 <= T_+ and k is such that r^{k-1} T0 < T_+ <= r^k T0. The smallest number of steps possible, S_min = M T_+, would emerge if we were able to guess exactly right and let T0 = T_+. The ratio of the upper bound of S_+ and S_min is r^2/(r - 1), which is minimized by r = 2, and hence the doubling. Similar arguments show that to minimize the expected number of steps, we should let r = e. With this ratio, the expected number of steps is bounded above by M e T_+, while the similar bound with r = 2 is M 2.89 T_+. Thus, in this respect very little is lost by applying the simpler doubling of time.
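With the doubling schedule, only the search over start times changes; the randomness for the steps nearest time 0 is still reused. A sketch (our illustration, reusing the Example 1 update function):

```python
import random

def update(x, u):
    # Two-state chain of Example 1; stationary distribution (2/3, 1/3).
    return 1 if (x == 0 and u > 0.5) else 0

def cftp_doubling(states0, rng):
    # Start times -1, -2, -4, -8, ...; the list of stored numbers is
    # extended at the front, so earlier trials' numbers are reused for
    # the steps closest to time 0.
    us = []                    # us[0] is the earliest random number
    T = 1
    while True:
        while len(us) < T:     # prepend fresh U's for the extra steps back
            us.insert(0, rng.random())
        states = set(states0)
        for u in us:
            states = {update(x, u) for x in states}
        if len(states) == 1:
            return states.pop()
        T *= 2

rng = random.Random(7)
print(cftp_doubling({0, 1}, rng))
```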

An important feature of the CFTP algorithm is that the running time is a random variable that is typically difficult to predict prior to the run. Also, the coalescence time T_+ and the returned value X^0 are dependent. In fact, a CFTP run is determined by the sequence of random numbers only, and the algorithm stops when it has reached T_+ steps into the past, and then outputs the value at time 0. Thus, both the coalescence time and the sampled value are functions of the same random numbers. This means that an impatient user who aborts long runs will produce a biased sample, sampled from something like π(· | short runs only). The problem is similar in standard MCMC sampling, where there is a tendency to restart long runs that do not meet a chosen convergence criterion after a certain number of "unlucky" iterations. The method of Fill, reviewed in Section 8, is, like classical rejection sampling, interruptible, so that runs may be aborted without biasing the samples.

Figure 1 illustrates the difference between forward coupling and CFTP in Example 1. Another nice illustration is provided by Kendall & Thönnes (1999), who compared forward simulation and CFTP for the so-called Dead Leaves process. The two approaches are illustrated in an animated simulation on the website http://www.warwick.ac.uk/statsdept/Staff/WSK/dead.html.

3.2 Monotone CFTP

An unappealing feature of Algorithm 1 is the need to start a chain from every state for a possibly huge state space S. However, if the state space and the sampler possess certain monotonicity properties, this can be avoided and the CFTP algorithm may be simplified considerably.

We first assume a partial order on the state space S. We say that S admits the natural componentwise partial order x <= y if x_j <= y_j for j = 1, . . . , N, for x, y in S.

Definition 2. A transition rule expressed by a deterministic update function φ, as in Definition 1, is monotone with respect to the componentwise partial order if φ(x, u) <= φ(y, u) for all u when x <= y.

Proposition 1. Denote by x_min and x_max the minimum and maximum elements of S with the componentwise partial order, so that x_min <= x <= x_max for all x. Assume that the update function φ of the Markov chain is monotone in the sense of Definition 2. Then one needs only consider the two chains started in x_min and x_max in the CFTP algorithm, since all the other trajectories will be sandwiched between these.

When the CFTP algorithm can be modified in this way, so that only two chains need to be run, we shall call it monotone CFTP. Several models π and algorithms φ have been shown to fulfil these monotonicity requirements. Propp & Wilson (1996) considered the class of attractive spin systems from statistical physics. This is a subclass of Markov random field (MRF) models (see Häggström & Nelander (1999) for a general definition) that includes the famous Ising model. These models are difficult to sample from, and MCMC algorithms must be used for this.

Definition 3. Consider a finite set V of vertices or sites. A spin system is a configuration X = {X_u; u in V} which assigns a spin X_u to each site u. The spin X_u takes as value either +1 (up) or -1 (down).

The set of all possible spin configurations S then admits the componentwise partial order. The conditional distribution of assigning spin up to X_u in site u, given the spins x_{-u} at all other sites, is

π(X_u = 1 | x_{-u}) = (1 + π(X_u = -1, x_{-u})/π(X_u = 1, x_{-u}))^{-1}.

Definition 4. When π(X_u = 1 | x_{-u}) is an increasing function in x_{-u}, or equivalently in x, in the componentwise partial order, the spin system π is called attractive. Attractiveness implies that configurations where neighbouring sites tend to have the same spin are preferred.

The Gibbs sampler is particularly useful for sampling from spin models. As Theorem 2 states, the Gibbs sampler is monotone for attractive models, and hence monotone CFTP is possible.

THEOREM 2. The Gibbs sampler is monotone according to the componentwise partial order for attractive spin systems π.

Proof. For spin systems, the Gibbs sampler may be defined through the update function φ given by

X^{t+1} = φ(x^t, v, U^{t+1}) = (X_v = 1, x^t_{-v}) if U^{t+1} <= π(X_v = 1 | x^t_{-v}), and (X_v = -1, x^t_{-v}) otherwise.   (1)

The site to be updated, v, is either chosen at random or found from a deterministic scanning rule. If the system is attractive, then π(X_v = 1 | x^t_{-v}) <= π(Y_v = 1 | y^t_{-v}) for x^t <= y^t. Thus if u <= π(X_v = 1 | x^t_{-v}) for a realization u of the random variable U^{t+1}, then also u <= π(Y_v = 1 | y^t_{-v}), and hence X_v^{t+1} = Y_v^{t+1} = 1. If π(X_v = 1 | x^t_{-v}) < u <= π(Y_v = 1 | y^t_{-v}), then X_v^{t+1} = -1 and Y_v^{t+1} = 1. Finally, if π(Y_v = 1 | y^t_{-v}) < u, then X_v^{t+1} = Y_v^{t+1} = -1. In conclusion, φ(x^t, u) <= φ(y^t, u) for all u when x^t <= y^t, which is the definition of monotonicity. □
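Update rule (1) and the sandwiching argument can be checked numerically. Below is a sketch for a toy attractive model, a ferromagnetic Ising model on a 2 × 2 grid with free boundaries (the grid size, β value and function names are our own choices, not the paper's):

```python
import math
import random

BETA = 0.4   # interaction strength; beta > 0 makes the model attractive

# 2x2 grid, sites 0..3, first-order neighbours with free boundaries.
NEIGHBOURS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

def cond_prob_up(x, v):
    # pi(X_v = +1 | x_{-v}) for the ferromagnetic Ising model (Example 4).
    s = sum(x[j] for j in NEIGHBOURS[v])
    return 1.0 / (1.0 + math.exp(-2.0 * BETA * s))

def gibbs_update(x, v, u):
    # Update rule (1): set site v to +1 iff u <= pi(X_v = 1 | x_{-v}).
    y = list(x)
    y[v] = 1 if u <= cond_prob_up(x, v) else -1
    return y

# Monotonicity check: if x <= y componentwise, a shared (v, u) preserves it.
random.seed(3)
for _ in range(1_000):
    x = [random.choice([-1, 1]) for _ in range(4)]
    y = [max(xi, random.choice([-1, 1])) for xi in x]  # y >= x by construction
    v, u = random.randrange(4), random.random()
    xn, yn = gibbs_update(x, v, u), gibbs_update(y, v, u)
    assert all(a <= b for a, b in zip(xn, yn))
```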

Example 4 (Monotonicity of the Gibbs sampler for the Ising model). The Ising model was introduced in statistical physics as a model for spontaneous magnetization of ferromagnetic materials. The model describes a set of interacting magnets, sometimes in the presence of an external magnetic field.

We consider an Ising model on a finite two-dimensional rectangular grid V, where X = {X_u; u in V}, X_u in {-1, 1}, is a configuration of the spins. The interaction between sites (called pixels in image analysis applications that we will come back to in Section 6) is described by a first-order neighbour system, so that the neighbours of a site are the sites above and below, and those to the right and to the left. (We assume free boundary conditions. For instance, in Figure 3 the central site has 4 neighbour sites, the sites in the corners have 2, and the remaining sites have 3.) The distribution of the Ising model without external field and with a constant interaction β is

π(x) = exp{β Σ_{i~j} x_i x_j}/Z_β,

where i ~ j indicates that i and j are a neighbouring pair and Z_β is the normalizing constant. For this model we have that

π(X_u = 1 | x_{-u}) = (1 + exp{-2β Σ_{j~u} x_j})^{-1}.

This quantity is increasing in x_{-u}, i.e. attractive, only for β > 0, which induces clustering of sites with the same spin value. This model is called ferromagnetic, while the opposite case, β < 0, is

referred to as the anti-ferromagnetic, anti-monotone or repulsive Ising model. Under the influence of a site-varying external field α, the model admits the form

π(x) = exp{β Σ_{i~j} x_i x_j + Σ_u α_u x_u}/Z_{β,α}.

Since

π(X_u = 1 | x_{-u}) = (1 + exp{-2β Σ_{j~u} x_j - 2α_u})^{-1}

is increasing in x_{-u}, the Gibbs sampler is also monotone for the Ising model with an external field. □

When consulting the literature on exact simulation, the reader will find that most attention is given to the Gibbs sampler, which is a result of the important monotonicity property of this sampler. There is however nothing in the CFTP algorithm that says that the chain has to be a Gibbs sampler. A popular alternative is the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), and the MCMC literature contains a lot of discussion on what chain is most suitable in various settings; see Tierney (1994) for a discussion. The Metropolis-Hastings algorithm is however not generally monotone for the componentwise partial order. As Theorem 3 shows, the Ising model is an exception when only one site is updated in each iteration.
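Putting the pieces together, monotone CFTP for a small ferromagnetic Ising model needs only the two chains started from the all -1 and all +1 configurations. A sketch (our illustration: 2 × 2 grid with free boundaries, β = 0.3, random-scan Gibbs updates and doubling start times):

```python
import math
import random

BETA = 0.3
N = 4   # 2x2 grid, sites 0..3
NEIGHBOURS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

def gibbs_update(x, v, u):
    # Monotone Gibbs update for the ferromagnetic Ising model.
    s = sum(x[j] for j in NEIGHBOURS[v])
    p_up = 1.0 / (1.0 + math.exp(-2.0 * BETA * s))
    y = list(x)
    y[v] = 1 if u <= p_up else -1
    return y

def monotone_cftp(rng):
    # Only the minimum (all -1) and maximum (all +1) chains are tracked;
    # every other trajectory is sandwiched between them.
    moves = []          # (site, u) pairs; moves[0] is the earliest move
    T = 1
    while True:
        while len(moves) < T:   # prepend fresh randomness for the new steps
            moves.insert(0, (rng.randrange(N), rng.random()))
        lo, hi = [-1] * N, [1] * N
        for v, u in moves:
            lo, hi = gibbs_update(lo, v, u), gibbs_update(hi, v, u)
        if lo == hi:            # bounding chains met: everything coalesced
            return lo
        T *= 2

rng = random.Random(11)
sample = monotone_cftp(rng)   # an exact draw from the 2x2 Ising model
```

Because the draw is exact, empirical configuration frequencies can be checked against the Boltzmann probabilities obtained by brute-force enumeration of the 16 configurations.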

THEOREM 3. The Metropolis-Hastings algorithm with uniform proposals that updates one site in each iteration is monotone with respect to the componentwise partial order for the ferromagnetic Ising model on a two-dimensional grid, with or without a site-varying external field.

Proof. Let Z_v be a candidate value for site v in the Metropolis-Hastings algorithm, drawn uniformly from {-1, 1}. The update function is

X^{t+1} = φ(x^t, z_v, U^{t+1}) = (z_v, x^t_{-v}) if U^{t+1} <= r(x^t, z_v), and x^t otherwise,

where r(x^t, z_v) = π(z_v, x^t_{-v})/π(x^t) = exp{(z_v - x^t_v)(β Σ_{j~v} x^t_j + α_v)}. For realizations z_v and u of Z_v and U^{t+1} the algorithm is monotone if φ(x^t, z_v, u) <= φ(y^t, z_v, u) for x^t <= y^t. When z_v = 1 (z_v = -1) this is equivalent to r(x^t, z_v) <= r(y^t, z_v) (r(x^t, z_v) >= r(y^t, z_v)).

We first consider the case when x^t_v = y^t_v. If z_v = 1, then by Definition 4, r(x^t, z_v) = π(1 | x^t_{-v})/π(x^t_v | x^t_{-v}) <= π(1 | y^t_{-v})/π(y^t_v | y^t_{-v}) = r(y^t, z_v). From the same definition we have that r(x^t, z_v) = π(-1 | x^t_{-v})/π(x^t_v | x^t_{-v}) >= π(-1 | y^t_{-v})/π(y^t_v | y^t_{-v}) = r(y^t, z_v) if z_v = -1. We then consider the case when x^t_v < y^t_v (i.e. x^t_v = -1, y^t_v = 1). If z_v = 1, then r(y^t, z_v) = 1, and if z_v = -1, then r(x^t, z_v) = 1, and in both cases the order is preserved. This proof holds for all values of α_v, and thus monotonicity for the Ising model with and without a site-varying external field is proved. □

The above theorem holds (with the same proof) for any distribution π(x) for which the value of each element x_u can take one of two values in S_u = {a, b}, a < b, and which is monotone in the sense that π(X_u = b | x_{-u}) is increasing in x_{-u}.

A componentwise partial ordering does not represent a full ordering of a state space S. Given two arbitrary elements of S, it will often not be possible to say which is the larger. For multivariate models like the Ising model, this means that samplers that update blocks (more than one element per iteration) cannot be monotone with respect to the componentwise partial order. The following example gives two illustrations.


Example 5 (Non-monotonicity of the Metropolis-Hastings sampler).

(a) Consider the three configurations of the Ising model in Figure 3(a), for which z <= x' <= y' in the partial order. Assume that each site is numbered from the top left to the bottom right, and that an external field is present. We find that π(z)/π(x') = exp{4β - 2α_8} and π(z)/π(y') = exp{12β - 2(α_8 + α_9)}. Thus for several realizations u and values of β, α_8 and α_9 we have π(z)/π(x') <= u <= π(z)/π(y') (try β = 1/3, α_8 = α_9 = 1, u = 0.8), and thus φ(x', z, u) >= φ(y', z, u), and the order is reversed.

(b) For the configurations in Figure 3(b) we have that π(z)/π(x') = exp{12β} > π(z)/π(y') = exp{-2β} for an Ising model without external field. Thus for β > 0 we may find realizations u in (exp{-2β}, exp{12β}] so that φ(x', z, u) = z and φ(y', z, u) = y', which cannot be ordered with respect to the componentwise partial order. □

Figure 3. Configurations z, x' and y' from the 2D Ising model on a 3 × 3 grid. In (a) z <= x' <= y' with respect to the componentwise partial order. In (b) x' <= y', while z cannot be ordered.

The componentwise partial order is not the only possible ordering of the state space. Proposition 1 is stated for this ordering, but it is easy to realize that monotone CFTP is valid for any other ordering of the state space with a minimum and maximum element and an update function that is monotone with respect to that ordering. Here is an example: assume that candidate values z in the Metropolis-Hastings algorithm are drawn from a distribution q independent of the current state of the chain and accepted with probability a(x', z) = min{1, (π(z)q(x'))/(π(x')q(z))}. According to the ordering <=, where x <= y ⇔ π(x)q(y) >= π(y)q(x), the Metropolis-Hastings algorithm is monotone for all target distributions. This ordering was suggested and its monotonicity proved by Corcoran & Tweedie (1998). In most situations this ordering is not directly applicable, since one would need to know the minimum and maximum of the target distribution to implement CFTP.

4 Results on Coalescence Time

As in Section 3.1 we let X^{t2}(t1, x_m) be the state at time t2 of a Markov chain started from state x_m at time t1, t1 < t2, and we let X_j^{t2}(t1, x_m) denote the jth element of this vector, j = 1, . . . , N. We will need some definitions in order to make precise statements.

Definition 5. Let the forward coalescence time T_c be defined as the time to coalescence in a forward simulation where chains are started at time 0. If the sampler is monotone with minimum and maximum elements x_min and x_max, then T_c = min{t >= 0 : X_j^t(0, x_min) = X_j^t(0, x_max) for all j}.


Definition 6. Let the backward coalescence time T_+ be the smallest value of t so that chains started at time -t have coalesced by time 0 in CFTP. For monotone CFTP it is T_+ = min{t >= 0 : X_j^0(-t, x_min) = X_j^0(-t, x_max) for all j}.

In Example 1 it is not difficult to see that the distribution of both T_c and T_+ is geometric with parameter 1/2. This is not a coincidence, since by the nature of the two algorithms T_c and T_+ will always be governed by the same probability distribution. Many of the results on the coalescence time of the CFTP algorithm are derived by focusing on the conceptually simpler forward coalescence time.
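For Example 1, both the geometric law and the equality of the forward and backward coalescence-time distributions are easy to verify empirically (our illustration):

```python
import random

def update(x, u):
    # Example 1 chain: phi(0, u) = 1 iff u > 1/2, phi(1, u) = 0.
    return 1 if (x == 0 and u > 0.5) else 0

def forward_time(rng):
    # T_c: first time coupled chains started from 0 and 1 at time 0 meet.
    x, y, t = 0, 1, 0
    while x != y:
        u = rng.random()
        x, y, t = update(x, u), update(y, u), t + 1
    return t

def backward_time(rng):
    # T_+: smallest t such that chains started at time -t coalesce by 0.
    us, t = [], 0
    while True:
        us.insert(0, rng.random())
        t += 1
        states = {0, 1}
        for u in us:
            states = {update(x, u) for x in states}
        if len(states) == 1:
            return t

rng = random.Random(5)
fw = [forward_time(rng) for _ in range(20_000)]
bw = [backward_time(rng) for _ in range(20_000)]
# Both means should be near 2, the mean of a Geometric(1/2) variable.
print(sum(fw) / len(fw), sum(bw) / len(bw))
```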

THEOREM 4. The distributions of the forward and backward coalescence times and TMp are identical.

Proof. The proof is straightforward with the observation that chains started at −t and run to time 0 yield the same result as if the chains were shifted in time to start at 0 and end at time t:

P(T_B > t) = P(X_j^0(−t, x_min) ≠ X_j^0(−t, x_max) for at least one j)
= P(X_j^t(0, x_min) ≠ X_j^t(0, x_max) for at least one j)
= P(T_F > t).   □
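Theorem 4 can be illustrated empirically. The sketch below uses a toy reflected random walk (our example, not one from the paper): it estimates the mean of T_F by forward coupling of the bottom and top chains, and the mean of T_B by stepping back one unit at a time and reusing random numbers, as in CFTP. The two agree up to Monte Carlo error:

```python
import random

rng = random.Random(1)

LO, HI = 0, 5   # minimum and maximum states of a toy monotone chain

def step(x, u):
    """Monotone update: reflected +/-1 random walk on {0,...,5}."""
    x = x + 1 if u < 0.5 else x - 1
    return min(max(x, LO), HI)

def forward_time():
    """T_F: run bottom and top chains forward with shared randomness."""
    lo, hi, t = LO, HI, 0
    while lo != hi:
        u = rng.random()
        lo, hi, t = step(lo, u), step(hi, u), t + 1
    return t

def backward_time():
    """T_B: smallest t such that chains from time -t coalesce by 0 (CFTP)."""
    us = []   # us[k] drives the transition from time -(k+1) to -k
    t = 0
    while True:
        t += 1
        us.append(rng.random())   # new number for the earliest step; reuse the rest
        lo, hi = LO, HI
        for k in range(t - 1, -1, -1):   # apply updates from time -t up to 0
            lo, hi = step(lo, us[k]), step(hi, us[k])
        if lo == hi:
            return t

n = 3000
mean_f = sum(forward_time() for _ in range(n)) / n
mean_b = sum(backward_time() for _ in range(n)) / n
print(mean_f, mean_b)   # close: T_F and T_B share the same distribution
```

Note that `backward_time` stores the random numbers so that extending the start further into the past reuses the randomness already generated, exactly the bookkeeping CFTP requires.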

In practical situations it would be of great interest to have knowledge about the distribution of the coalescence time prior to initiating a CFTP run. Apart from providing us with information about probable running times, this could allow us to construct a more efficient search for T_B. General results on the distribution of the running time are not easily obtained except for very simple Markov chains. However, Propp & Wilson (1996) provided an upper bound for the expected coalescence time. Let τ = min{t > 0 : δ(t) ≤ 1/e} be the variation threshold of a Markov chain with transition matrix P, where δ(t) = max_{x,y ∈ S} ||P^t(x, ·) − P^t(y, ·)|| is the maximum variation distance. For a sampler that preserves the order of a partially ordered state space and that updates one of the N components of X at a time (such as the Gibbs sampler or Metropolis-Hastings algorithm for an attractive spin model), Propp & Wilson (1996) show that

E(T_B) ≤ 2τ(1 + log N).   (3)

Roughly speaking, the expected time to coalescence will be small if the underlying sampler converges quickly in total variation distance. In a certain sense this inequality says that the time to coalescence in CFTP will not be much larger than the time to convergence for standard forward MCMC. For more complicated models, N is replaced by the length of the largest totally ordered subset of the partially ordered state space S. In Propp & Wilson (1998b) other results on the running times for different CFTP algorithms are given.

5 Exact Sampling for Inference

In the previous sections we have been concerned with the problem of generating a single sample from a target distribution π. We now investigate how CFTP may be used for statistical inference, which often requires repeated samples from π or the calculation of expectations θ = E_π(f(X)) for a computable function f and X ∼ π. When iid samples X^1, ..., X^K from π are available, the standard estimate is θ̂ = Σ_{k=1}^K f(X^k)/K. When standard MCMC is used to obtain the K samples, these are both dependent and only asymptotically π distributed. A common procedure is to run a long chain, discard the first part of the trajectories (the burn-in) and take the average of the remaining samples.

It is possible to obtain an iid sample from π by repeating independent CFTP runs. We call this approach Independent CFTP. The approach requires a large computational effort, and uses only the last value in each CFTP run. Other alternatives make more use of the generated values. All the methods produce exact samples, but the samples are dependent to various degrees. We will not argue as to which method is to be preferred, as this will depend on the sampler and on f.

LONG CFTP: Run CFTP to obtain a sample X^0 ∼ π. Run the chain forward in time from X^0 for T steps. The sampled values X^1, ..., X^T are then all distributed according to π, but are dependent.

RANDOM CFTP: Run CFTP to obtain a sample X^0 ∼ π. Run the chain forward in time from X^0 for T steps to produce X^1, ..., X^T. Draw K values at random from these T samples and use these correctly π distributed variables as starting values for K forward simulations of length T_0. The resulting values X^{1,k}, ..., X^{T_0,k}, k = 1, ..., K are then π distributed, and possibly less dependent than the values obtained by Long CFTP.

REPEATED CFTP: Run CFTP to obtain a sample X^{0,1} ∼ π. Run the chain forwards in time for T_0 steps, to produce X^{1,1}, ..., X^{T_0,1}. Repeat independently to produce X^{1,k}, ..., X^{T_0,k}, k = 1, ..., K. This gives K·T_0 samples from π, where the T_0 values from one CFTP run are dependent, but different runs are independent.

CONCATENATED CFTP: Run CFTP to obtain a sample X^{0,0} ∼ π. Run another independent CFTP to obtain X^{0,1}, and let T_1 be the time to coalescence for this run. Look at the trajectory of this last run started in X^{0,0} at time −T_1 and take it as the first T_1 sampled values, i.e. the path from X^{0,0} to X^{0,1} that we denote X^{1,1}, ..., X^{T_1,1}, counting X^{0,1} = X^{T_1,1} but not X^{0,0}. Continue with an independent CFTP run, taking the next T_2 values to be the path from X^{0,1} to X^{0,2}, etc. This produces samples X^{1,k}, ..., X^{T_k,k} of random size T_k, k = 1, ..., K.

GUARANTEE TIME CFTP: Fix a time T_g to use as an initial guess for the coalescence time in CFTP. Follow the scheme of Concatenated CFTP, but for each CFTP run, collect only the T_g last values in each path. Thus, the procedure is the same as for Concatenated CFTP, but each run generates a fixed number T_k = T_g, k = 1, ..., K, of samples.

The methods Repeated CFTP, Concatenated CFTP and Guarantee Time CFTP were proposed by Murdoch & Rosenthal (2000). Murdoch & Rosenthal (2000) also consider forward coupling as well as a modification of Concatenated CFTP where the resulting value of a CFTP run is used as starting point to pick a path in the same run. This approach may look appealing, but due to the dependency between the output value and the coalescence time it will yield a biased sample and the distribution π may be lost.

The estimator θ̂ = Σ_{k=1}^{K} Σ_{t=1}^{T_k} f(X^{t,k})/(K T_k) (with T_k = T_0 or T_g) is unbiased and consistent for all methods except for Concatenated CFTP. Because of the randomness of the sample sizes in Concatenated CFTP, the estimator θ̂* = Σ_{k=1}^{K} Σ_{t=1}^{T_k} f(X^{t,k}) / Σ_{k=1}^{K} T_k must be used instead to assure consistency. Murdoch & Rosenthal (2000) approximate the computational cost for Repeated CFTP, Concatenated CFTP and Guarantee Time CFTP (as well as forward coupling), and compared these strategies for a random walk on the integers from 0 to 20 with equal probability of stepping up and down a unit, and steps outside the range being rejected. It would be interesting to compare the various approaches in other practical problems.
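The distinction between the two estimators is easy to state in code. This minimal sketch uses stand-in f-values for K = 3 runs of unequal length; with unequal T_k the average-of-averages θ̂ and the pooled estimator θ̂* differ, which is why the latter is needed for Concatenated CFTP:

```python
# Hypothetical f-values from K CFTP runs of unequal length T_k
# (stand-in numbers; real runs would come from a sampler).
runs = [
    [1.0, 2.0, 3.0],          # T_1 = 3
    [10.0],                   # T_2 = 1
    [4.0, 4.0, 4.0, 4.0],     # T_3 = 4
]

K = len(runs)

# Estimator for Long/Random/Repeated/Guarantee Time CFTP (fixed T_k):
# the average of the per-run averages.
theta_hat = sum(sum(r) / len(r) for r in runs) / K

# Estimator required for Concatenated CFTP (random T_k):
# the pooled sum divided by the total number of samples.
theta_star = sum(sum(r) for r in runs) / sum(len(r) for r in runs)

print(theta_hat, theta_star)   # 5.333... vs 4.0
```

With equal run lengths the two estimators coincide; the short run of length 1 is what drags θ̂ away from the pooled average here.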

6 CFTP vs. Standard MCMC: Simulation Experiments

Our aim is to do Bayesian image restoration by applying the CFTP algorithm with Gibbs sampling to sample the exact posterior distribution. We study a two dimensional binary image with sites or pixels that are either black (+1) or white (−1). We assume that our true image is degraded by noise, and that we only observe the noisy image. We will compare the results on overall running time and misclassification rates of CFTP and classical forward Gibbs sampling. Convergence of the Gibbs sampler is here determined by the method of Raftery & Lewis (1992, 1996). Such a comparison will give us an impression of what price we pay (if any), in terms of running time, to obtain exact samples. Experiments in the same spirit are done by Møller & Nicholls (1999), who compare the running time in CPU seconds of two exact simulation algorithms (standard rejection sampling and "perfect tempering" based on simulated tempering and CFTP) with standard MCMC.

Let X be the true image, and assume a priori that X is distributed according to the Ising model (Example 4). Let y be the observed image. We assume that the noise is Bernoulli, i.e. with probability ε the color of a pixel is changed, and with probability 1 − ε we observe the true color, independently of the other pixels. The probability of observing y when x is the true image is

f(y|x) = Π_j ε^{I(x_j ≠ y_j)} (1 − ε)^{I(x_j = y_j)} = ε^N ((1 − ε)/ε)^{Σ_j I(x_j = y_j)},

where I is the indicator function and the pixels are numbered j = 1, ..., N. Thus, the posterior is

π(x|y) ∝ f(y|x)π(x) ∝ exp(β Σ_{i∼j} x_i x_j + (log((1 − ε)/ε)/2) Σ_j x_j y_j),

which may also be regarded as an Ising model with a site-varying external field (see (2)). The Gibbs sampler for this model can be expressed by an update function as in (1), with

π(X_j = 1 | x_{−j}, y) = (1 + exp(−2β Σ_{i∼j} x_i − log((1 − ε)/ε) y_j))^{−1}.

In our simulation study the true image is a 40×40 realization of the Ising model with parameter β = 0.45 (that we obtained by CFTP, see Figure 5). Noise is added for ε = {0.1, 0.2, 0.3, 0.4}, and for each noisy image restorations are done for β = 0.45 only. Note that for β = 0.45 restoration is quite difficult, due to the occasional single pixels that are surrounded by pixels of the other color. In a situation with unknown β one would have to simulate over a grid of ε and β values or estimate ε and β in some way.

Since the Gibbs sampler is monotone with respect to the componentwise partial order for an Ising model with external field (see Definition 2 and Example 4), we can apply monotone CFTP with the upper chain started from the image with only black pixels and the lower chain from the image with only white pixels. In the Gibbs sampler one pixel is updated at a time, but we define one time step to be a full sweep where all pixels are updated in random permutation order, and we only check for coalescence after full sweeps. As an estimate of the true image we choose the Marginal Posterior Mode (MPM), see Winkler (1995, p. 219). To approximate it we use 500 Independent CFTP restorations of the same noisy image. The misclassification rate is the fraction of pixels of the MPM image that are different from the true image.
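The whole restoration sampler can be sketched in a few dozen lines. The following is a minimal monotone CFTP with the doubling-of-time scheme on a small 8×8 grid (the grid size, seeding convention and data are illustrative assumptions; the paper uses 40×40 images). Reseeding the generator with a fixed offset per time point is one simple way to honour the CFTP requirement that the randomness at time t be reused across restarts:

```python
import math
import random

N = 8                      # small N x N image for illustration
BETA, EPS = 0.45, 0.2
MASTER = 12345             # fixes the random field U_t for all times t

def neighbours(j):
    r, c = divmod(j, N)
    out = []
    if r > 0: out.append(j - N)
    if r < N - 1: out.append(j + N)
    if c > 0: out.append(j - 1)
    if c < N - 1: out.append(j + 1)
    return out

NB = [neighbours(j) for j in range(N * N)]

def p_black(s, yj):
    """Full conditional P(X_j = +1 | neighbour sum s, datum yj)."""
    return 1.0 / (1.0 + math.exp(-2 * BETA * s - math.log((1 - EPS) / EPS) * yj))

def sweep(x, y, t):
    """One full Gibbs sweep at time t, in random permutation order.
    Reseeding with MASTER + t makes the randomness at time t reusable."""
    rng = random.Random(MASTER + t)
    order = list(range(N * N))
    rng.shuffle(order)
    for j in order:
        s = sum(x[i] for i in NB[j])
        x[j] = 1 if rng.random() < p_black(s, y[j]) else -1

def cftp(y):
    """Monotone CFTP with the doubling-of-time scheme."""
    T = 1
    while True:
        hi = [1] * (N * N)    # all black: maximal image
        lo = [-1] * (N * N)   # all white: minimal image
        for t in range(-T, 0):
            sweep(hi, y, t)
            sweep(lo, y, t)
        if hi == lo:          # coalesced by time 0: exact posterior draw
            return hi
        T *= 2

# Hypothetical noisy data: a half-black, half-white image.
y = [1] * (N * N // 2) + [-1] * (N * N // 2)
sample = cftp(y)
print(sum(sample))    # magnetization of one exact posterior sample
```

Because both chains at time t are driven by the same seeded stream, the upper chain dominates the lower one throughout, so checking equality of the two extreme chains after full sweeps suffices to detect coalescence of all chains.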

In order to compare standard forward Gibbs sampling with CFTP, we need to estimate the speed of convergence of the Gibbs sampler chain. The method of Raftery & Lewis is widely applied, with computer code being available on the Internet. We used gibbsit from Statlib (http://lib.stat.cmu.edu/), but an implementation in CODA (Best et al., 1995) is also available. As for most practical convergence diagnostics, several arguments have been raised against this method. It has been pointed out that given different pilot chains for the same problem, the method can produce very variable estimates of the number of required iterations. Also, the method is univariate and does not give information about the full joint posterior, which is often the aim of the analysis. See for instance Mengersen et al. (1999) and Cowles & Carlin (1996) for a discussion and details regarding the method. For the purpose of obtaining estimates of the time to convergence in a simple comparison


study, we find the method suitable because it is characteristic of what is used in practice. We choose to monitor the magnetization μ = Σ_j X_j/N and the interaction statistic ρ = Σ_{i∼j} X_i X_j/(2N), estimated from the Gibbs sampler chain X^1, X^2, ..., X^T. The method of Raftery & Lewis provides several diagnostics, but we will use only the estimated burn-in, that is the number of sweeps necessary to eliminate the effect of the starting value. The method is designed for estimation of posterior quantiles of the statistics of interest. To estimate the burn-in, the method needs as input a pilot sample, the quantile q of interest, and a precision parameter δ (for the allowed distance between estimated transition probabilities and the estimated stationary distribution in a two-state chain constructed from the input, see Raftery & Lewis (1992, 1996) for details). Raftery & Lewis (1996) recommended that gibbsit is run for several values of q, and we have chosen to run with q = (0.025, 0.5, 0.975) for both monitored statistics. We choose as precision δ = 0.01, which is a weaker requirement than the default of 0.001, and pilot samples of length 3000. In our experience the method can be quite sensitive to the input parameters, and occasionally the estimated burn-ins are very large. To remove a dominating effect of such instabilities and to exhibit some of the characteristics of the method, we estimate the burn-in of one run by the 0.75 and 0.90 quantiles of the estimated burn-ins over both monitored statistics and over all the values of q.

Our final estimates are the mean and standard deviation of 500 independent posterior simulations with different initial images. An estimate of the MPM was obtained using 500 simulations following the most conservative (0.90 quantile) burn-in estimate. The misclassification rates were found for this MPM estimate.

Table 1(a) shows the mean and the standard deviation of the time to coalescence based on the 500 independent CFTP runs. The numbers are based on the exact coalescence times, in the sense that the simulations are done by going back in time in steps of size one. Figure 4 shows density estimates of the distribution of T_B for the different noise levels. The distributions are all heavily skewed with a tail that increases exponentially in the noise. In Table 1(b) the mean and standard deviation of the number of full sweeps required for CFTP and standard MCMC convergence are compared. In this comparison we have used the "doubling of time" approach for CFTP. Here we measure computation time in terms of the number of steps taken. Since our CFTP algorithm consists of standard MCMC steps, and hence the steps require the same computational costs, this comparison is appropriate.

According to our convergence criteria, CFTP does not require more computation time than standard MCMC, which is not so surprising in light of (3). We have tried to make the comparison fair, and the presented results from standard MCMC are indeed similar to results we got from inspecting plots of the monitored statistics. The true and noisy images and the MPM restorations and their misclassification rates are shown in Figure 5. We see that the misclassification rates are very similar, the largest difference being for ε = 0.4, for which both estimates are quite far from the true image. This is an indication that standard Gibbs sampling with Raftery & Lewis diagnostics works well. We conclude that CFTP is an appealing algorithm for simple problems in Bayesian image restoration, despite requiring more work in implementation.

7 Exact Sampling for Continuous State Spaces

For a continuous state space the idea of coupling from the past sometimes needs specific care, since otherwise trajectories from different initial states may not coalesce. Methods that deal with this problem have been constructed, and exact sampling for some continuous state space distributions is


Table 1
The left table (a) displays the mean and standard deviation of the time to coalescence based on 500 independent CFTP runs for the noisy images in Figure 5. The right table (b) shows the number of full sweeps required by the CFTP algorithm and standard MCMC. For standard MCMC the estimates are based on the 0.75 and 0.90 quantiles over all the monitored statistics and values of q, and for CFTP the numbers are based on a "doubling of the time" approach.

Figure 4. Density estimates of the time to coalescence for the Ising β = 0.45 prior degraded by different amounts of noise.


Figure 5. The top image shows a simulation of the Ising model with β = 0.45 using CFTP with the Gibbs sampler. The second row of images shows the top image degraded by various levels of noise (ε = 0.1, 0.2, 0.3, 0.4). The third row shows the modes of 500 simulations from the posterior using CFTP and the Gibbs sampler with β = 0.45; the misclassification rates, shown in the headings, are 0.064, 0.096, 0.13 and 0.20. The bottom row shows the modes and misclassification rates of 500 standard MCMC simulations following the conservative burn-in estimates of Table 1(b); the misclassification rates are 0.068, 0.097, 0.13 and 0.16.


possible. The range of models that can be handled is still limited, but ongoing research might lead to extensions.

Murdoch & Green (1998) and Green & Murdoch (1999) have developed several exact sampling algorithms which introduce a discreteness into the state space by updating sets of states into one single state. Their methods are based on ideas related to coupling that can be traced back to Doeblin (1938) (see also Lindvall (1992, pp. 235-236)), with more recent applications in Meyn & Tweedie (1993), Lindvall (1992) and Reutter & Johnson (1995). We present the simplest of their algorithms, the multigamma coupler, to illustrate the general idea, and refer to their papers for other algorithms and detailed pseudo-codes.

For notational simplicity we will in the following description assume S = ℝ, and we let (X_t) be a Markov chain on S with a stationary density π. The transition kernel of the chain, f(y|x) ≥ 0, is defined by

∫ π(x) f(y|x) dx = π(y).

As in the discrete case the transitions of the chain are described by the update function φ defined so that

P(φ(x, U) ≤ y) = ∫_{−∞}^{y} f(v|x) dv.

Two chains (X_t^1) and (X_t^2) started in two different initial states will coalesce at time t if φ(X_{t−1}^1, U) = φ(X_{t−1}^2, U). The idea is to construct an algorithm that, with probability 1, makes a transition at some random time point such that φ(x, U) = φ(U) and all chains coalesce into a single state independently of where they originate.

In the multigamma coupler the transition kernel is assumed to be bounded below so that f(y|x) ≥ r(y) ∀ x, y ∈ S, for some nonnegative function r. Define p = ∫ r(y) dy and

R(y) = p^{−1} ∫_{−∞}^{y} r(v) dv,
Q(y|x) = (1 − p)^{−1} ∫_{−∞}^{y} (f(v|x) − r(v)) dv.

Define the update function

φ(x, U) = R^{−1}(U_2) if U_1 ≤ p, and φ(x, U) = Q^{−1}(U_2|x) otherwise,

where U = (U_1, U_2) are independent uniform random numbers. To see that this update function provides draws from the right transition kernel, note that

P(φ(x, U) ≤ y) = p R(y) + (1 − p) Q(y|x) = ∫_{−∞}^{y} f(v|x) dv.

If R_t denotes the set of states occupied by the Markov chains at time t, the multigamma coupler may be outlined as follows. As for CFTP we start at a time −T with R_{−T} = S and run to time 0. At each time point random numbers U^t are drawn. If U_1^t < p the updating is done via R^{−1}(U_2^t), and the first time this happens R_t will shrink to a single state. Then from this state a single chain is run forward to time 0, using R^{−1} or Q^{−1} for the update according to whether U_1^t is smaller or greater than p. Until U_1^t < p (before R_t is reduced to a single state), R_t is set to be S. If the number of states at time 0 is 1, this state is the returned sample and the algorithm terminates. Otherwise, the algorithm is restarted from time −2T (or any other timepoint −T' < −T), reusing the previously generated random numbers.
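The outline above can be sketched for a concrete kernel. In the example below (our construction, not one from the paper) the kernel is an explicit mixture f(y|x) = p·1_{[0,1]}(y) + (1 − p)·Uniform(x/2, x/2 + 1/2)(y), so r(y) = p on [0,1] and both R^{−1} and Q^{−1} are available in closed form; the stationary mean of this kernel works out to 0.5, which the empirical average of exact draws reproduces:

```python
import random

P = 0.3   # p = integral of r(y): weight of the regeneration density

def R_inv(u):
    """Inverse CDF of r(y)/p, here Uniform(0, 1)."""
    return u

def Q_inv(u, x):
    """Inverse CDF of the residual kernel Q(.|x): Uniform(x/2, x/2 + 1/2)."""
    return x / 2 + u / 2

def multigamma_coupler(rng):
    """One exact draw from the stationary law of the mixture kernel."""
    T = 1
    us = {}                      # us[t] = (U1, U2) for the step at time t
    while True:
        for t in range(-T, 0):
            if t not in us:      # reuse old numbers, draw only the new ones
                us[t] = (rng.random(), rng.random())
        # times in [-T, 0) at which U1 < p: the chains regenerate there
        regen = [t for t in range(-T, 0) if us[t][0] < P]
        if regen:
            t0 = regen[0]
            x = R_inv(us[t0][1])          # all chains coalesce at time t0
            for t in range(t0 + 1, 0):    # run the single chain to time 0
                u1, u2 = us[t]
                x = R_inv(u2) if u1 < P else Q_inv(u2, x)
            return x
        T *= 2                   # no coalescence in [-T, 0): go further back

rng = random.Random(7)
draws = [multigamma_coupler(rng) for _ in range(2000)]
print(sum(draws) / len(draws))   # near 0.5, the stationary mean of this kernel
```

Each restart only adds the random numbers for the newly exposed earlier times, exactly as in CFTP; once any U_1^t falls below p the set of occupied states collapses to the single value R^{−1}(U_2^t), regardless of where the chains started.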
