

3.4 Multi Point Analysis

3.4.2 Multi Point Probability Computation

The objective of multi point probability computation is to calculate a better probabilistic estimate of the inheritance pattern for each marker given all available genotype information, based on the single point probability distribution. The method we describe is inspired by [LG87], where the problem is described by means of a hidden Markov model (HMM)^21. This approach is widely used because its complexity grows only linearly with the number of markers introduced, although it grows exponentially with the size of the pedigree.

^20 See page 9 for more on crossover.
^21 See Appendix D for more on hidden Markov models.

Furthermore, we show that HMMs are too restrictive to represent the problem, but that an ordinary Bayesian network with structure similar to a HMM can be used.

Bayesian Networks

In this section we use Bayesian networks as a framework for computing the multi point probability distribution. This is consistent with the theory used in [Gud00] and [LG87]. The description of Bayesian theory used here is based on [Jen01].

Definition 12 (Bayesian Network) A Bayesian network is a tuple ⟨V, E⟩ where:

- V: a set of variables, each with a set of mutually exclusive states,
- E ⊆ V × V: a set of directed edges between variables, and
- each variable A ∈ V with parents B_1, ..., B_n has a (conditional) probability table P(A | B_1, ..., B_n) attached.

If (B, A) ∈ E then the edge is directed from B to A, and B is said to be the parent of A. We denote the states of a variable A by A = (a_0, ..., a_k).

When modeling a problem using Bayesian networks, the directed edges should represent the causality of the problem domain. That is, there is an edge from B to A if the state of B affects the state of A.

The joint probability distribution^22 over all variables in a Bayesian network is calculated using the chain rule of Bayesian networks:

Theorem 5 (Chain Rule) Let BN be a Bayesian network over the variables V = {A_1, ..., A_n}. Then the joint probability distribution P(V) is given by:

P(V) = ∏_i P(A_i | pa(A_i)),

where pa(A_i) is the parent set of A_i.

For the proof we refer to [Jen01, page 21].
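As an illustration of the chain rule, the following minimal sketch in Python computes the joint distribution for a three-variable network A → B → C; all probability values are made up for the example:

```python
import numpy as np

# P(A), P(B | A), and P(C | B) for a small network A -> B -> C.
# The numbers are illustrative, not taken from the genetics domain.
p_a = np.array([0.6, 0.4])                   # P(A)
p_b_given_a = np.array([[0.9, 0.1],          # P(B | A): rows indexed by A
                        [0.3, 0.7]])
p_c_given_b = np.array([[0.5, 0.5],          # P(C | B): rows indexed by B
                        [0.2, 0.8]])

# Chain rule: P(A, B, C) = P(A) * P(B | A) * P(C | B),
# stored as a 3-dimensional table of size |A| * |B| * |C|.
joint = (p_a[:, None, None]
         * p_b_given_a[:, :, None]
         * p_c_given_b[None, :, :])

assert np.isclose(joint.sum(), 1.0)          # a joint distribution sums to 1

# Marginalizing out A and B yields P(C).
p_c = joint.sum(axis=(0, 1))
```

This also illustrates footnote 22: with three variables the joint is a 3-dimensional table of size |A| · |B| · |C|.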

Hidden Markov models belong to a special kind of Bayesian network, namely dynamic Bayesian networks^23. The structure of a hidden Markov model as a dynamic Bayesian network is illustrated in Figure 3-10. Each step, t_i, is referred to as a time slice or time step. For every time slice i, the conditional probability table P(O_i | S_i) is the same and all transition probabilities between time slices are the same, that is, P(S_i | S_{i-1}) = P(S_j | S_{j-1}) for all i and j. This is consistent with the definition of [Jen01].

^22 If the universe, V, consists of three variables A, B, and C, then the joint probability P(V) = P(A, B, C), which is a 3-dimensional table of size |A| · |B| · |C|.

Figure 3-10: The structure of a hidden Markov model as a dynamic Bayesian network.

In the remaining part of this section we explain how multi point probability computation shares structure with hidden Markov models. Under some assumptions the problem can be described as a HMM. These assumptions state that the possible observable genotype information at each marker is the same, and that the recombination fraction between each pair of adjacent markers is the same. This is necessary in order to keep each conditional probability table identical over time slices (according to the definition of HMMs). We follow the construction introduced by Lander and Green in [LG87] and show why the constructed Bayesian network is a HMM under these assumptions only.

As introduced by [LG87], the construction of the Bayesian network goes as follows. The parent variable of time slice i, v_i, becomes a variable with states corresponding to the set of inheritance vectors. The child variable for the same time slice, G_i, has states corresponding to the possible genotype information. In this way we keep the causality that the inheritance vectors change the probability of observable genotype information. The i-th probability distribution over inheritance vectors refers to the inheritance distribution for the i-th marker on the chromosome. The a priori probability table of v_1 is the standard Mendelian inheritance distribution stating that, initially, all inheritance vectors have equal probability. The conditional probability table of v_j for j > 1 is given by:

P(v_j = v | v_{j-1} = w) = θ_{j-1}^d · (1 − θ_{j-1})^{n−d},

where n is the length of the inheritance vector, d is the Hamming distance between v and w, and θ_{j-1} is the recombination fraction between markers M_{j-1} and M_j. This is because the contribution from adjacent markers is given by the number of crossovers that have occurred, which is exactly the Hamming distance between the inheritance vectors at the two markers.
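The transition table P(v_j | v_{j-1}) can be sketched as below. The integer encoding of inheritance vectors and the `transition_table` helper are illustrative assumptions, not part of the original construction; the formula itself is the one above:

```python
import numpy as np

# Sketch of P(v_j = v | v_{j-1} = w) = theta^d * (1 - theta)^(n - d),
# where d is the Hamming distance between the inheritance vectors v and w.
# Inheritance vectors of length n are encoded as the integers 0 .. 2^n - 1,
# so the Hamming distance is the popcount of the XOR of the two encodings.

def transition_table(n, theta):
    """Return the 2^n x 2^n table P(v_j | v_{j-1}) for recombination fraction theta."""
    size = 2 ** n
    table = np.empty((size, size))
    for w in range(size):
        for v in range(size):
            d = bin(w ^ v).count("1")        # d = Ham(v, w)
            table[w, v] = theta ** d * (1 - theta) ** (n - d)
    return table

t = transition_table(n=4, theta=0.1)
# Each row is a probability distribution over the next inheritance vector,
# since sum_d C(n, d) * theta^d * (1 - theta)^(n - d) = 1.
assert np.allclose(t.sum(axis=1), 1.0)
```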

The structure is similar to that of a HMM of length m, where m is the number of markers, but since both P(v_j | v_{j-1}) and P(G_j | v_j) differ at each time slice this is not a HMM.

Figure 3-11: Multi point probability computation represented as a Bayesian network with similar structure to a hidden Markov model.

The parent node of each time slice is a variable representing all inheritance vectors, whereas the child node is an instance of genotype information, that is, a state of the genotype variable. This is not consistent with the definition of Bayesian networks, but we introduce it to symbolise that the genotype information is always observable, that is, we know the state of the genotype variable. This observation might be that no genotype information is available. Furthermore, to propagate evidence in the form of genotype information we need only P(G_i | v_i). In this sense, the state space in each parent node remains the same, but the observations and probabilities change over each time slice.

Using this model we are able to compute the probability of all markers given the available genotype information. Let G_i denote the observed genotype information at locus i and let G_all = {G_i | 1 ≤ i ≤ m} denote the set of all genotype information. We can then compute P(v_1, ..., v_m | G_all) by applying Bayes' rule:

P(v_1, ..., v_m | G_all) = P(v_1, ..., v_m, G_all) / P(G_all).

Notice that the denominator is just the normalization factor, so we need only calculate the numerator. By applying the chain rule for Bayesian networks the numerator is given as:

P(v_1, ..., v_m, G_all) = P(v_1)P(G_1 | v_1)P(v_2 | v_1)P(G_2 | v_2)P(v_3 | v_2) ··· P(v_m | v_{m-1})P(G_m | v_m).
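To make the cost of this expansion concrete, here is a brute-force sketch that evaluates the numerator for every assignment of inheritance vectors. The tables are random stand-ins for the real CPTs, and the state space is kept tiny; with 2^|v| inheritance vectors per marker, the number of assignments is (2^|v|)^m:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, states = 3, 4                 # m markers, 2^|v| = 4 inheritance vectors

prior = np.full(states, 1.0 / states)     # Mendelian prior P(v_1): uniform
# trans[j] is a random stand-in for P(v_{j+2} | v_{j+1}); rows sum to 1.
trans = [rng.dirichlet(np.ones(states), size=states) for _ in range(m - 1)]
# obs[i] is a stand-in for the likelihoods P(G_i | v_i) of the observed G_i.
obs = [rng.random(states) for _ in range(m)]

def numerator(assignment):
    """P(v_1)P(G_1|v_1)P(v_2|v_1)P(G_2|v_2)...P(v_m|v_{m-1})P(G_m|v_m)."""
    p = prior[assignment[0]] * obs[0][assignment[0]]
    for j in range(1, m):
        p *= trans[j - 1][assignment[j - 1], assignment[j]] * obs[j][assignment[j]]
    return p

# P(G_all) is the sum over all states^m assignments -- exponential in m.
p_g_all = sum(numerator(a) for a in product(range(states), repeat=m))
```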

Finding the conditional probability distributions for the individual v_i is just a matter of marginalizing out the other variables.

This approach is inefficient since P(v_1, ..., v_m, G_all) has 2^{|v|·m} entries, where |v| is the length of the inheritance vector. The joint distribution can, however, be rewritten as:

P(v_1, ..., v_m, G_all) ∝ P(v_1 | G_1)P(v_2 | v_1)P(v_2 | G_2)P(v_3 | v_2) ··· P(v_m | v_{m-1})P(v_m | G_m),

since P(v_1) is the Mendelian probability distribution and from (3-4) in single point probability computation we have that P(G_i | v_i) ∝ P(v_i | G_i). So far we have achieved no reduction in size, but the reduction becomes apparent when we want to calculate the probability of a single variable given the genotype information. The reason is that we can use the commutative and distributive properties of marginalization [Jen01, page 16], which means that to marginalize out a variable we need only consider the probability tables containing the variable. This enables us to define the left-conditioned probability, P^L_i, and the right-conditioned probability, P^R_i, recursively as:

P^L_{i+1} = P^L_i · Σ_{v_i} P(v_i | G_i)P(v_{i+1} | v_i), for 1 ≤ i < m, and    (3-18)

P^R_{i-1} = P^R_i · Σ_{v_i} P(v_i | G_i)P(v_i | v_{i-1}), for 1 < i ≤ m,    (3-19)

with base cases P^L_1 = 1 = P^R_m. From this it follows that the probability of the variable v_i conditioned on all genotype information is:

P(v_i | G_all) ∝ P^L_i · P(v_i | G_i) · P^R_i.    (3-20)
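Equations (3-18)–(3-20) can be sketched as the following recursion. The single point distributions P(v_i | G_i) and the transition tables are random stand-ins here; note that P^L_i and P^R_i are functions of v_i and therefore sit inside the sums:

```python
import numpy as np

rng = np.random.default_rng(1)
m, states = 5, 4                 # m markers, a tiny inheritance-vector space

single = [rng.dirichlet(np.ones(states)) for _ in range(m)]  # P(v_i | G_i)
# trans[j][w, v] stands in for P(v_{j+2} = v | v_{j+1} = w); rows sum to 1.
trans = [rng.dirichlet(np.ones(states), size=states) for _ in range(m - 1)]

# Base cases: P^L_1 = 1 = P^R_m.
left = [np.ones(states)] + [None] * (m - 1)
right = [None] * (m - 1) + [np.ones(states)]

# (3-18): P^L_{i+1}(v_{i+1}) = sum_{v_i} P^L_i(v_i) P(v_i | G_i) P(v_{i+1} | v_i)
for i in range(m - 1):
    left[i + 1] = (left[i] * single[i]) @ trans[i]

# (3-19): P^R_{i-1}(v_{i-1}) = sum_{v_i} P^R_i(v_i) P(v_i | G_i) P(v_i | v_{i-1})
for i in range(m - 1, 0, -1):
    right[i - 1] = trans[i - 1] @ (right[i] * single[i])

def posterior(i):
    """P(v_i | G_all), proportional to P^L_i * P(v_i | G_i) * P^R_i  (3-20)."""
    p = left[i] * single[i] * right[i]
    return p / p.sum()           # the denominator is just the normalization factor

# Every marginal is a proper distribution over the inheritance vectors.
assert all(np.isclose(posterior(i).sum(), 1.0) for i in range(m))
```

Each step touches only the tables containing the marginalized variable, so the cost is linear in m rather than exponential.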

Figure 3-12: One step in the left-conditioned probability calculations, showing the transition P(w_3 | v_i) = θ_1^{d_i} · (1 − θ_1)^{n−d_i}, where d_i = Ham(v_i, w_3).

Intuitively, we first calculate all the single point probabilities at each marker locus. We then transfer the updated probabilities one step at a time across the chromosome towards the marker for which we want to calculate P(v_i | G_all). Figure 3-12 shows one such step between markers 1 and 2; in the figure, v_2 = w_3 is updated according to all states of v_1.

The idea behind multi point analysis is that we exploit the fact that few crossovers are most likely to occur between two markers, since the recombination fraction between two adjacent markers is always less than 1/2. For instance, if we have observed the exact state of v_1 as v_3, that is, P(v_1 = v_3 | G_1) = 1, then all inheritance vectors at v_2 close^24 to v_3 get increased probability, whereas inheritance vectors far^25 from v_3 get decreased probability. More precisely, the contribution from any state v_i of v_1 is given by the probability of the inheritance vector at marker 1 conditioned on the genotype information at that locus, P(v_i | G_1), times the transition probability P(w_3 | v_i). The transition probability is determined by the Hamming distance between the two inheritance vectors, which indicates the number of odd crossovers that have occurred between the two markers on a chromosome. The probability of d crossovers occurring between two markers with an inheritance vector of length n is θ_1^d · (1 − θ_1)^{n−d}.

We then sum over the contribution to w_3 of every vector v_i, which corresponds to marginalizing out v_1. The sum is then multiplied with the conditional probability of w_3 given G_2. This product is proportional to P(w_3 | G_1, G_2). We compute this product for every inheritance vector at marker M_2 to get the full probability distribution, P(v_2 | G_1, G_2)^26. The updated probability distribution at marker 2 can then be used to calculate the left-conditioned probability at marker 3, and so forth. The right-conditioned probability is calculated in a similar fashion.

Using this procedure we can compute P(v_i | G_all) for any v_i. These values can then be used in the scoring functions, e.g. the LOD score, to determine linkage between markers and traits.