3.4 Multi Point Analysis
3.4.2 Multi Point Probability Computation
The objective of multi point probability computation is to calculate a better probabilistic estimate of the inheritance pattern for each marker, given all available genotype information, based on the single point probability distribution. The method we describe is inspired by [LG87], where the problem is described by means of a hidden Markov model (HMM)²¹. This approach is widely used due to the fact that complexity grows only linearly with the number of markers introduced, but grows exponentially with the size of the pedigree.
20. See page 9 for more on crossover.
21. See Appendix D for more on hidden Markov models.
Furthermore, we show that HMMs are too restrictive to represent the problem, but that an ordinary Bayesian network with structure similar to a HMM can be used.
Bayesian Networks
In this section we use Bayesian networks as a framework for computing the multi point probability distribution. This is consistent with the theory used in [Gud00] and [LG87]. The description of Bayesian theory used here is based on [Jen01].
Definition 12 (Bayesian Network) A Bayesian network is a tuple $\langle V, E \rangle$ where:

- $V$: a set of variables, each with a set of mutually exclusive states,
- $E \subseteq V \times V$: a set of directed edges between variables, and
- each variable $A \in V$ with parents $B_1, \ldots, B_n$ has a (conditional) probability table $P(A \mid B_1, \ldots, B_n)$ attached.

If $(B, A) \in E$ then the edge is directed from $B$ to $A$, and $B$ is said to be the parent of $A$. We denote the states of a variable $A$ by $A = (a_0, \ldots, a_k)$
. When modeling a problem using Bayesian networks, the directed edges should represent the causality of the problem domain. That is, there is an edge from $B$ to $A$ if the state of $B$ affects the state of $A$.
The joint probability distribution²² over all variables in a Bayesian network is calculated using the chain rule of Bayesian networks:

Theorem 5 (Chain Rule) Let $BN$ be a Bayesian network over the variables $V = \{A_1, \ldots, A_n\}$. Then the joint probability distribution $P(V)$ is given by:

$$P(V) = \prod_i P(A_i \mid \mathrm{pa}(A_i)),$$

where $\mathrm{pa}(A_i)$ is the parent set of $A_i$. For the proof we refer to [Jen01, page 21].
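To make the chain rule concrete, the following sketch evaluates the joint distribution of a small chain $A \rightarrow B \rightarrow C$. All probability tables and variable names are invented for illustration and do not come from the genetics problem:

```python
# Chain rule for a toy network A -> B -> C with binary variables:
# P(A, B, C) = P(A) * P(B | A) * P(C | B).  All numbers are illustrative.

P_A = [0.6, 0.4]                 # P(A = a0), P(A = a1)
P_B_given_A = [[0.7, 0.3],       # row a0: P(B | A = a0)
               [0.2, 0.8]]       # row a1: P(B | A = a1)
P_C_given_B = [[0.9, 0.1],       # row b0: P(C | B = b0)
               [0.5, 0.5]]       # row b1: P(C | B = b1)

def joint(a, b, c):
    """One entry of P(A, B, C), computed with the chain rule."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

# The joint is a 3-dimensional table with |A| * |B| * |C| = 8 entries,
# and the entries sum to 1.
total = sum(joint(a, b, c) for a in range(2)
                           for b in range(2)
                           for c in range(2))
print(round(total, 10))  # 1.0
```

Marginalizing variables out of such a joint table is the basic operation used in the multi point computation below.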
Hidden Markov models belong to a special kind of Bayesian network, namely dynamic Bayesian networks²³. The structure of a hidden Markov model as a dynamic Bayesian network is illustrated in Figure 3-10. Each step, $t_i$, is referred to as a time slice or time step. For every time slice $i$, the conditional probability table $P(O_i \mid S_i)$ is the same, and all transitional probabilities between time slices are the same, that is, $P(S_i \mid S_{i-1}) = P(S_j \mid S_{j-1})$ for all $i$ and $j$. This is consistent with the definition of [Jen01].

22. If the universe, $V$, consists of three variables $A$, $B$, and $C$, then the joint probability is $P(V) = P(A, B, C)$, which is a 3-dimensional table of size $|A| \cdot |B| \cdot |C|$.

Figure 3-10: The structure of a hidden Markov model as a dynamic Bayesian network. (Time slices $t_1, t_2, t_3$ with hidden variables $S_1, S_2, S_3$ and observation variables $O_1, O_2, O_3$.)
In the remaining part of this section we explain how multi point probability computation shares structure with hidden Markov models. Under some assumptions the problem can be described as a HMM. These assumptions state that the possible observable genotype information at each marker is the same, and that the recombination fraction between each pair of adjacent markers is the same. This is necessary in order to keep each conditional probability table identical over time slices (according to the definition of HMMs). We follow the construction introduced by Lander and Green in [LG87] and show why the constructed Bayesian network is a HMM only under these assumptions.
As introduced by [LG87], the construction of the Bayesian network goes as follows. The parent variable of time slice $i$, $v_i$, becomes a variable with states corresponding to the set of inheritance vectors. The child variable for the same time slice, $G_i$, has states corresponding to the possible genotype information. In this way we keep the causality that the inheritance vectors change the probability of observable genotype information. The $i$-th probability distribution over inheritance vectors refers to the inheritance distribution for the $i$-th marker on the chromosome. The a priori probability table of $v_1$ is the standard Mendelian inheritance distribution stating that, initially, all inheritance vectors have equal probability. The conditional probability table of $v_j$ for $j > 1$ is given by:

$$P(v_j = v \mid v_{j-1} = w) = \theta_{j-1}^{d} \cdot (1 - \theta_{j-1})^{n-d},$$

where $n$ is the length of the inheritance vector, $d$ is the Hamming distance between $v$ and $w$, and $\theta_{j-1}$ is the recombination fraction between markers $M_{j-1}$ and $M_j$. This is because the contribution from adjacent markers is given by the number of crossovers that have occurred, which is exactly the Hamming distance between the two inheritance vectors.
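This conditional probability table can be generated directly from Hamming distances. The sketch below is illustrative only: the vector length $n = 3$, the recombination fraction $\theta = 0.1$, and the function names are choices made here, not part of [LG87]:

```python
# Transition probability between inheritance vectors at adjacent markers:
# P(v_j = v | v_{j-1} = w) = theta^d * (1 - theta)^(n - d),
# where d is the Hamming distance between v and w, and n the vector length.
# n = 3 and theta = 0.1 are illustrative values.

from itertools import product

def hamming(v, w):
    """Number of positions where the two inheritance vectors differ."""
    return sum(a != b for a, b in zip(v, w))

def transition_prob(v, w, theta):
    n = len(v)
    d = hamming(v, w)
    return theta ** d * (1 - theta) ** (n - d)

n, theta = 3, 0.1
vectors = list(product([0, 1], repeat=n))  # all 2^n inheritance vectors

# Each row of the transition table is a probability distribution over the
# 2^n inheritance vectors at the next marker, so it sums to 1.
w = (0, 1, 0)
row_sum = sum(transition_prob(v, w, theta) for v in vectors)
print(round(row_sum, 10))  # 1.0
```

That each row sums to 1 follows from the binomial theorem, since there are $\binom{n}{d}$ vectors at Hamming distance $d$ from any fixed vector.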
The structure is similar to that of a HMM of length $m$, where $m$ is the number of markers, but since both $P(v_j \mid v_{j-1})$ and $P(G_j \mid v_j)$ differ at each time slice, this is not a HMM. The parent node of each time slice is a variable representing all inheritance vectors, whereas the child node is an instance of genotype information, that is, a state of the genotype variable. This is not consistent with the definition of Bayesian networks, but we introduce it to symbolise that the genotype information is always observable, that is, we know the state of the genotype variable. This observation might be that no genotype information is available. Furthermore, to propagate evidence in the form of genotype information we need only $P(G_i \mid v_i)$. In this sense, the state space in each parent node remains the same, but the observations and probabilities change over each time slice.

Figure 3-11: Multi point probability computation represented as a Bayesian network with similar structure to a hidden Markov model. (Time slices $t_1, t_2, \ldots, t_m$ with inheritance-vector variables $v_1, v_2, \ldots, v_m$ and genotype nodes $G_1, G_2, \ldots, G_m$.)
Using this model we are able to compute the probability of all markers given the available genotype information. Let $G_i$ denote the observed genotype information at locus $i$ and let $G_{all} = \{G_i \mid 1 \leq i \leq m\}$ denote the set of all genotype information. We can then compute $P(v_1, \ldots, v_m \mid G_{all})$ by applying Bayes' rule:

$$P(v_1, \ldots, v_m \mid G_{all}) = \frac{P(v_1, \ldots, v_m, G_{all})}{P(G_{all})}.$$

Notice that the denominator is just the normalization factor, and we need only calculate the numerator. By applying the chain rule for Bayesian networks the numerator is given as:

$$P(v_1, \ldots, v_m, G_{all}) = P(v_1)P(G_1 \mid v_1)P(v_2 \mid v_1)P(G_2 \mid v_2)P(v_3 \mid v_2) \cdots P(v_m \mid v_{m-1})P(G_m \mid v_m).$$
Finding the conditional probability distributions for the individual $v_i$ is just a matter of marginalizing out the other variables.
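That marginalization can be sketched by brute force, enumerating every assignment of $(v_1, \ldots, v_m)$. All tables (prior, transition, genotype likelihoods) and the tiny state space of two inheritance vectors are invented here for illustration:

```python
# Brute-force computation of P(v_i | G_all): enumerate every assignment of
# (v_1, ..., v_m) in the joint P(v_1, ..., v_m, G_all) and marginalize.
# Toy setup: two inheritance vectors, m = 3 markers, invented numbers.

from itertools import product

m = 3
states = [0, 1]                        # two inheritance vectors for brevity
prior = [0.5, 0.5]                     # Mendelian prior P(v_1)
theta = 0.1
trans = [[1 - theta, theta],           # P(v_j | v_{j-1}): stay or switch
         [theta, 1 - theta]]
like = [[0.9, 0.2],                    # P(G_j | v_j): likelihood of the
        [0.4, 0.6],                    # observed genotype at each marker
        [0.7, 0.3]]

def joint(assign):
    """P(v_1 = assign[0], ..., v_m = assign[m-1], G_all) via the chain rule."""
    p = prior[assign[0]] * like[0][assign[0]]
    for j in range(1, m):
        p *= trans[assign[j - 1]][assign[j]] * like[j][assign[j]]
    return p

# Marginalize out every variable except v_2 (index 1), then normalize.
marg = [0.0, 0.0]
for assign in product(states, repeat=m):
    marg[assign[1]] += joint(assign)
total = sum(marg)
posterior = [p / total for p in marg]
print(posterior)  # P(v_2 | G_all); the two entries sum to 1
```

The loop touches one term per assignment, so the work grows exponentially with $m$, which is the inefficiency addressed next.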
This approach is inefficient since $P(v_1, \ldots, v_m, G_{all})$ has $2^{|v| \cdot m}$ entries, where $|v|$ is the length of the inheritance vector. The numerator can be rewritten as:

$$P(v_1, \ldots, v_m, G_{all}) \propto P(v_1 \mid G_1)P(v_2 \mid v_1)P(v_2 \mid G_2)P(v_3 \mid v_2) \cdots P(v_m \mid v_{m-1})P(v_m \mid G_m),$$

since $P(v_1)$ is the Mendelian probability distribution and from (3-4) in single point probability computation we have that $P(G_i \mid v_i) \propto P(v_i \mid G_i)$. So far we have achieved no reduction in size, but the reduction becomes apparent when we want to calculate the probability of a single variable given the genotype information. The reason is that we can use the commutative and distributive properties of marginalization [Jen01, page 16], which means that to marginalize out a variable we need only consider the probability tables containing the
variable. This enables us to define the left-conditioned probability, $P_L^i$, and the right-conditioned probability, $P_R^i$, recursively as:

$$P_L^{i+1} = P_L^i \cdot \sum_{v_i} P(v_i \mid G_i)P(v_{i+1} \mid v_i), \quad 1 \leq i < m, \qquad (3\text{-}18)$$

$$P_R^{i-1} = P_R^i \cdot \sum_{v_i} P(v_i \mid G_i)P(v_i \mid v_{i-1}), \quad 1 < i \leq m, \qquad (3\text{-}19)$$

with base cases $P_L^1 = 1 = P_R^m$. From this it follows that the probability of the variable $v_i$ conditioned on all genotype information is:

$$P(v_i \mid G_{all}) \propto P_L^i \cdot P(v_i \mid G_i) \cdot P_R^i. \qquad (3\text{-}20)$$
Figure 3-12: One step in the left-conditioned probability calculations. (States $v_1, \ldots, v_m$ at marker $M_1$ and $w_1, \ldots, w_m$ at marker $M_2$; the transition shown is $P(w_3 \mid v_i) = \theta_1^{d_i} \cdot (1 - \theta_1)^{n - d_i}$, where $d_i = \mathrm{Ham}(v_i, w_3)$.)
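Equations (3-18) and (3-19) amount to one forward and one backward sweep over the markers. The sketch below uses invented single point posteriors $P(v_i \mid G_i)$ and a toy transition table with two inheritance vectors and $m = 3$ markers; it is an illustration of the recursions, not the thesis implementation:

```python
# Left- and right-conditioned probabilities, equations (3-18) and (3-19),
# computed as a forward-backward pass. All tables are illustrative toy values.

m = 3
K = 2                                      # number of inheritance vectors
theta = 0.1
trans = [[1 - theta, theta],               # P(v_j | v_{j-1})
         [theta, 1 - theta]]
single = [[0.8, 0.2],                      # single point posteriors
          [0.3, 0.7],                      # P(v_i | G_i) for each marker
          [0.5, 0.5]]

# Forward sweep: P_L[i][w] holds P_L^{i+1} evaluated at v_{i+1} = w,
# with base case P_L^1 = 1 (equation (3-18)).
P_L = [[1.0] * K for _ in range(m)]
for i in range(m - 1):
    for w in range(K):
        P_L[i + 1][w] = sum(P_L[i][v] * single[i][v] * trans[v][w]
                            for v in range(K))

# Backward sweep: base case P_R^m = 1 (equation (3-19)).
P_R = [[1.0] * K for _ in range(m)]
for i in range(m - 1, 0, -1):
    for w in range(K):
        P_R[i - 1][w] = sum(P_R[i][v] * single[i][v] * trans[w][v]
                            for v in range(K))

def posterior(i):
    """Equation (3-20): P(v_i | G_all), obtained by normalizing the product."""
    unnorm = [P_L[i][v] * single[i][v] * P_R[i][v] for v in range(K)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

print(posterior(1))  # P(v_2 | G_all); the entries sum to 1
```

Each sweep does $O(m \cdot K^2)$ work, which is the linear-in-$m$ behaviour claimed at the start of the section.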
Intuitively, we first calculate all the single point probabilities at each marker locus. We then transfer the updated probabilities one step at a time across the chromosome towards the marker for which we want to calculate $P(v_i \mid G_{all})$. Figure 3-12 illustrates one such step between marker 1 and 2. In the figure, $v_2 = w_3$ is updated according to all states of $v_1$
. The idea behind multi point analysis is that we exploit the fact that few crossovers are most likely to occur between two markers, since the recombination fraction between two adjacent markers is always less than $\frac{1}{2}$. For instance, if we have observed the exact state of $v_1$ as $v_3$, that is $P(v_1 = v_3 \mid G_1) = 1$, then inheritance vectors at $v_2$ close²⁴ to $v_3$ get increased probability, whereas inheritance vectors far²⁵ from $v_3$ get decreased probability. More precisely, the contribution from any state $v_i$ of $v_1$ is given by the probability of the inheritance vector at marker 1 conditioned on the genotype information at that locus, $P(v_i \mid G_1)$, times the transition probability $P(w_3 \mid v_i)$. The transition probability is given by the Hamming distance between the two inheritance vectors, which indicates the number of odd crossovers that have occurred between two markers on a chromosome. The probability of $d$ crossovers occurring between two markers with an inheritance vector of length $n$ is $\theta_1^d \cdot (1 - \theta_1)^{n-d}$
. We then sum over the contribution to $w_3$ of every vector $v_i$, which corresponds to marginalizing out $v_1$. The sum is then multiplied with the conditional probability of $w_3$ given $G_2$. This product is proportional to $P(w_3 \mid G_1, G_2)$. We compute this product for every inheritance vector at marker $M_2$ to get the full probability distribution, $P(v_2 \mid G_1, G_2)$²⁶. The updated probability distribution at marker 2 can then be used to calculate the left-conditioned probability at marker 3, and so forth. The right-conditioned probability is calculated in a similar fashion.
Using this procedure we can compute $P(v_i \mid G_{all})$ for any $v_i$. These values can then be used in the scoring functions, e.g. the LOD score, to determine linkage between markers and traits.