
Chapter 14

Reference Priors

The simplest approach is to set $p(\theta) = \pi_t(\theta \mid m)$ for all $m$. However, there might be cases where only essential features such as $E[\theta]$ and possibly $\mathrm{Var}[\theta]$ should be extracted from $p(\theta)$ and incorporated with confidence into $\pi_t(\theta, m)$ in (14.03). The problem then arises how to piece together a prior from $E[\theta]$ and $\mathrm{Var}[\theta]$ and similar fragments of information relevant to the distribution of $m$. More generally, we are looking for a method which can point out the particular prior that reflects our state of knowledge optimally.
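To make the idea of incorporating fragments such as $E[\theta]$ and $\mathrm{Var}[\theta]$ tangible, one possible (though by no means prescribed) device is to moment-match a Beta density to the two values. The sketch below does exactly that; the Beta form and the numerical values are purely illustrative assumptions, not part of the reference-prior development in this chapter.

```python
# Minimal sketch: turn assumed values of E[theta] and Var[theta] into a
# moment-matched Beta(a, b) prior for theta. The Beta form and the numbers
# are illustrative assumptions only.

def beta_from_moments(mean, var):
    """Solve E[theta] = a/(a+b) and Var[theta] = ab/((a+b)^2 (a+b+1))."""
    if not (0.0 < mean < 1.0) or var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("moments are not compatible with a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

a, b = beta_from_moments(mean=0.1, var=0.01)   # hypothetical elicited moments
print(f"moment-matched prior: Beta({a:.3f}, {b:.3f})")
```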

The number of possible states of knowledge is infinite, but the spectrum of states can be considered to be bounded from above and below by two extreme states as sketched in figure 14.01 below. The state "complete knowledge" refers to the unique state where we a priori are absolutely certain about the true values of $(m, \theta)$. The opposite of "complete knowledge" might be termed "complete ignorance". Consequently, the state "complete ignorance" denotes a kind of zero point as indicated in figure 14.01.

Fig. 14.01. Spectrum of states of previous knowledge about the binomial parameters $(m, \theta)$, ranging from "complete ignorance" (the zero point) through a spectrum of intermediate states up to "complete knowledge".

To derive a prior which corresponds to the zero point in fig. 14.01, we need a clarification of the phrase "complete ignorance". In the present context we will use the phrase "complete ignorance" when our previous information about the parameters of interest is negligible relative to the information an experiment or observation can provide [Box and Tiao, 1973]. Thus in our search for a prior which reflects "complete ignorance" we will be looking for a probability distribution whose influence on the posterior distribution is marginal, that is, the posterior distribution should be dominated by the likelihood function, as this is the factor through which observations modify our prior knowledge. Prior distributions guaranteed to play a minimal role in the posterior distribution are generally termed noninformative priors, a term which has already been used several times in the present report. Various approaches to generate noninformative priors are available, as much work has been done in this area [see for example Bernardo, 1979; Robert, 1994; Yang and Berger, 1998].

The derivation of a noninformative prior is of central importance but not sufficient in the present context as we want priors matching the intermediate states in fig. 14.01 as well.

This, together with the fact that the parameter space of $(m, \theta)$ is $\mathbb{N} \times\, ]0;1[$, makes the identification of suitable priors a challenging task. An intuitively appealing approach introducing so-called reference priors is due to Bernardo [Bernardo, 1979; Bernardo et al., 1994]. Bernardo's reference priors refer to a class of priors which in a certain sense maximize the information gained from observations. The derivation of a reference prior is in the general case technically demanding. However, the reference prior approach is adaptable to a variety of situations, and we will therefore base our derivation of prior distributions on Bernardo's concept of reference priors.

The introduction and derivation of reference priors in the remainder of the present chapter will cover the following topics: In paragraph 14.2, Bernardo's definition of reference priors for the general one-dimensional case is introduced. In paragraph 14.3, we set up the constrained functional which determines the two-dimensional reference prior $\pi_t(\theta, m) = \pi_t(\theta \mid m)\,\pi_t(m)$. In paragraph 14.4, the reference prior corresponding to the "zero-point" state from fig. 14.01 is presented without proof. In paragraph 14.5, the joint posterior $\pi_t(\theta, m \mid z)$ based on the reference prior from paragraph 14.4 is shown and compared with the likelihood function $p(z \mid m, \theta)$. In paragraph 14.6 we discuss how to set up reference priors when partial information is available. Paragraph 14.7 closes with concluding remarks. Appendix B contains technical details as to the derivation of the reference prior.

14.2. The Reference Prior Concept

To introduce the approach suggested by Bernardo, let $X$ be some random variable taking values in some sample space, where $p(X = x)$ depends on the value of a scalar parameter $\theta$, that is, $p(X = x) = p(x \mid \theta)$. Let furthermore $\pi(\theta)$ denote the prior distribution of $\theta$.

Assume now that an experiment $e$ provides a single observation $x$. Let $\pi(\theta \mid x)$ denote the corresponding posterior distribution of $\theta$. To quantify the information gained from the observation $x$ about $\theta$, Bernardo makes use of the Kullback-Leibler entropy distance $K[\pi(\theta \mid x), \pi(\theta)]$ defined as

$$K[\pi(\theta \mid x), \pi(\theta)] = \int \pi(\theta \mid x) \log\!\left[\frac{\pi(\theta \mid x)}{\pi(\theta)}\right] d\theta. \qquad (14.04)$$

In general the Kullback-Leibler entropy distance for two normalized density functions $f(x)$ and $g(x)$ is defined as [Kullback, 1959]:

$$K[f(x), g(x)] = \int f(x) \log\!\left[\frac{f(x)}{g(x)}\right] dx. \qquad (14.05)$$

The use of the Kullback-Leibler entropy distance as a measure of information makes intuitive sense. That is, if a decision maker's previous knowledge about the true value of $\theta$ is accurate, the information gained from performing an experiment will be relatively low. Put another way: if the accurate previous knowledge is reflected in the prior $\pi(\theta)$, the posterior $\pi(\theta \mid x)$ will almost certainly resemble the prior distribution. This induces a low value of $K[\pi(\theta \mid x), \pi(\theta)]$. If the decision maker on the other hand is ignorant about the true value of $\theta$, the information gained from performing an experiment will be high. In this case the posterior distribution will be dominated by the likelihood function. This usually implies that the posterior distribution and the prior distribution are far apart, which in turn generates a large value of $K[\pi(\theta \mid x), \pi(\theta)]$.

It can be shown that the Kullback-Leibler entropy distance is always non-negative and equals zero if and only if $f(x) = g(x)$ [Lehmann et al., 1998, p. 47]. In (14.05) the variable $x$ is for convenience assumed to be a continuous variable but might as well be discrete, in which case the integration is replaced by a summation.
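To make the definition (14.05) and the non-negativity property concrete, the short sketch below evaluates the Kullback-Leibler distance numerically on a grid. The two densities are illustrative assumptions only (a Beta(3, 7) playing the role of a posterior against a uniform prior on $]0;1[$) and are not taken from the chapter.

```python
# Numerical sketch of the Kullback-Leibler distance (14.05) on a grid over ]0;1[.
# The densities are illustrative assumptions: f is a Beta(3, 7) "posterior",
# g is the uniform "prior"; K[f, g] > 0 here, and K[f, f] = 0 as a check.

import numpy as np
from scipy.stats import beta

x = np.linspace(1e-6, 1.0 - 1e-6, 20_000)   # grid avoiding the endpoints 0 and 1
dx = x[1] - x[0]
f = beta.pdf(x, 3, 7)                       # f(x)
g = np.ones_like(x)                         # g(x) = 1 on ]0;1[

def kl(f, g, dx):
    """K[f, g] = integral of f*log(f/g) dx, via a simple Riemann sum."""
    return np.sum(f * np.log(f / g)) * dx

print(f"K[f, g] ≈ {kl(f, g, dx):.4f}   (positive since f differs from g)")
print(f"K[f, f] ≈ {kl(f, f, dx):.4f}   (zero when the densities coincide)")
```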

The entropy distance $K[\pi(\theta \mid x), \pi(\theta)]$ depends on the particular observation $x$. The expected information $I(e, \pi(\theta))$ provided by a single observation is obtained by averaging (14.04) over the marginal distribution of $x$:

$$I(e, \pi(\theta)) = \int K[\pi(\theta \mid x), \pi(\theta)]\, p(x)\, dx, \qquad (14.06)$$

where $p(x)$ is given as

$$p(x) = \int p(x \mid \theta)\, \pi(\theta)\, d\theta. \qquad (14.07)$$
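As a small illustration of (14.06)-(14.07), the sketch below computes the expected information from a single observation under the assumption (made here purely for illustration; it is not the chapter's model) of a Bernoulli observation $x \in \{0, 1\}$ with success probability $\theta$ and a uniform Beta(1, 1) prior, so that the outer integral over $x$ becomes a sum over the two possible outcomes.

```python
# Sketch of the expected information (14.06) for one observation, assuming a
# Bernoulli likelihood p(x|theta) = theta^x (1-theta)^(1-x) and a Beta prior.
# All concrete choices (model, prior parameters) are illustrative assumptions.

import numpy as np
from scipy.stats import beta

a0, b0 = 1.0, 1.0                                   # hypothetical Beta prior
theta = np.linspace(1e-6, 1.0 - 1e-6, 20_000)
dtheta = theta[1] - theta[0]
prior = beta.pdf(theta, a0, b0)                     # pi(theta)

expected_info = 0.0
for x in (0, 1):                                    # sum replaces the dx-integral
    like = theta**x * (1.0 - theta)**(1 - x)        # p(x | theta)
    p_x = np.sum(like * prior) * dtheta             # marginal p(x), eq. (14.07)
    post = like * prior / p_x                       # pi(theta | x)
    kl = np.sum(post * np.log(post / prior)) * dtheta   # K[pi(.|x), pi(.)], (14.04)
    expected_info += kl * p_x                       # average over p(x), eq. (14.06)

print(f"I(e, pi) ≈ {expected_info:.4f} nats")
```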

Consider now a hypothetical experiment $e(k)$ yielding $k$ independent observations. The expected information $I(e(k), \pi(\theta))$ can be calculated as

$$I(e(k), \pi(\theta)) = \int K[\pi(\theta \mid c_k), \pi(\theta)]\, p(c_k)\, dc_k, \qquad (14.08)$$

where $c_k = (x_1, x_2, \ldots, x_k)$, $dc_k = dx_1\, dx_2 \cdots dx_k$, and

$$p(c_k) = \int p(c_k \mid \theta)\, \pi(\theta)\, d\theta = \int \prod_{i=1}^{k} p(x_i \mid \theta)\, \pi(\theta)\, d\theta. \qquad (14.09)$$
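Continuing the Bernoulli/Beta illustration from above (again an assumption for demonstration, not the chapter's model), the expected information from $k$ observations in (14.08) can be evaluated by grouping the $2^k$ sequences $c_k$ by their number of successes, since the likelihood (14.09) then depends on $c_k$ only through that count. The sketch also shows how the information grows with $k$.

```python
# Sketch of I(e(k), pi) in (14.08) for the assumed Bernoulli/Beta illustration.
# Sequences c_k with the same number of successes s share one likelihood term,
# so the sum over the 2^k sequences collapses to a sum over s = 0, ..., k.

import numpy as np
from math import comb
from scipy.stats import beta

a0, b0 = 1.0, 1.0                                   # hypothetical Beta prior
theta = np.linspace(1e-6, 1.0 - 1e-6, 20_000)
dtheta = theta[1] - theta[0]
prior = beta.pdf(theta, a0, b0)

def expected_info(k):
    total = 0.0
    for s in range(k + 1):
        like = theta**s * (1.0 - theta)**(k - s)    # p(c_k | theta), eq. (14.09)
        p_ck = np.sum(like * prior) * dtheta        # p(c_k)
        post = np.maximum(like * prior / p_ck, 1e-300)   # clip to avoid log(0)
        kl = np.sum(post * np.log(post / prior)) * dtheta
        total += comb(k, s) * p_ck * kl             # comb(k, s) sequences share s
    return total

for k in (1, 5, 25, 100):
    print(f"k = {k:4d}:  I(e(k), pi) ≈ {expected_info(k):.3f} nats")
```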

In the limit $k \to \infty$ we will eventually obtain perfect information about the true value of $\theta$. The corresponding quantity $I(e(\infty), \pi(\theta))$ defined as

$$I(e(\infty), \pi(\theta)) = \lim_{k \to \infty} I(e(k), \pi(\theta)) \qquad (14.10)$$

measures, if it exists, our missing information about $\theta$. The missing information depends on the function $\pi(\theta)$ and is therefore referred to as a missing information functional. If we search for a prior distribution containing negligible information about $\theta$ relative to what an observation can provide, the particular prior which maximizes the missing information functional appears to be the optimal choice. Bernardo terms the maximizing prior the reference prior [Bernardo et al., 1994]. Thus the determination of a noninformative prior has been transformed into a maximization problem of an information functional.

Even though the approach outlined above appears straightforward, the actual derivation of reference priors might get involved in specific cases. If the parameter of interest, say $\theta$, can take only a finite number of values, the quantity $I(e(\infty), \pi(\theta))$ is always finite. As a consequence, the reference prior for $\theta$ can be derived directly from (14.08). For a continuous $\theta$, however, $I(e(\infty), \pi(\theta))$ is typically infinite. To circumvent this problem, an asymptotic expansion of the information $I(e(k), \pi(\theta))$ might be derived, from which the maximizing prior can be identified.
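For the finite case mentioned above, the maximization can indeed be carried out directly from (14.08). As an illustration only (the three-point parameter set, the Bernoulli model and the candidate priors below are assumptions, not part of the chapter), the following sketch evaluates $I(e(k), \pi)$ for a few candidate priors so that the candidate yielding the largest expected information can be read off numerically.

```python
# Sketch: with a finite parameter set (illustrative assumption: theta takes the
# three values below in a Bernoulli model), I(e(k), pi) from (14.08) is finite
# and can be evaluated exactly, so a maximizing prior can be sought numerically.

import numpy as np
from math import comb

thetas = np.array([0.2, 0.5, 0.8])                  # hypothetical finite support

def info(prior, k):
    """Expected information I(e(k), pi) for k Bernoulli observations."""
    total = 0.0
    for s in range(k + 1):                          # group sequences by successes
        like = thetas**s * (1.0 - thetas)**(k - s)  # p(c_k | theta_j)
        p_ck = np.dot(like, prior)                  # marginal of one sequence
        post = like * prior / p_ck                  # pi(theta_j | c_k)
        kl = np.sum(np.where(post > 0, post * np.log(post / prior), 0.0))
        total += comb(k, s) * p_ck * kl
    return total

candidates = {
    "uniform (1/3, 1/3, 1/3)": np.array([1/3, 1/3, 1/3]),
    "skewed  (0.6, 0.3, 0.1)": np.array([0.6, 0.3, 0.1]),
    "peaked  (0.1, 0.8, 0.1)": np.array([0.1, 0.8, 0.1]),
}
for name, prior in candidates.items():
    print(f"{name}:  I(e(25), pi) ≈ {info(prior, 25):.3f} nats")
```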