B R ICS R S -02-7 In g ´olfsd ´ottir et a l.: A F o rmalization o f L in k age An alysis
BRICS
Basic Research in Computer Science
A Formalization of Linkage Analysis
Anna Ing´olfsd´ottir
Anders Lyhne Christensen Jens Alsted Hansen
Jacob Johnsen John Knudsen
Jacob Illum Rasmussen
BRICS Report Series RS-02-7
Copyright c 2002, Anna Ing´olfsd´ottir & Anders Lyhne Christensen
& Jens Alsted Hansen & Jacob Johnsen & John Knudsen & Jacob Illum Rasmussen.
BRICS, Department of Computer Science University of Aarhus. All rights reserved.
Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.
See back inner page for a list of recent BRICS Report Series publications.
Copies may be obtained by contacting:
BRICS
Department of Computer Science University of Aarhus
Ny Munkegade, building 540 DK–8000 Aarhus C
Denmark
Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk
BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:
http://www.brics.dk ftp://ftp.brics.dk
This document in subdirectory RS/02/7/
Anna Ingólfsdóttir Anders Lyhne Christensen
Jens Alsted Hansen Jacob Johnsen John Knudsen
Jacob IllumRasmussen
February 2000
Inthisreportaformalizationofgeneticlinkageanalysisisintroduced. Linkage
analysisisa computationallyhardbiomathematical method,which purposeis
tolocategenesonthehumangenome. Itisrootedinthenewareaofbioinfor-
maticsand noformalization of themethodhaspreviously beenestablished.
Initially,thebiologicalmodelispresented. Onthebasisofthisbiologicalmodel
we establish a formalization that enables reasoning about algorithms used in
linkage analysis. The formalization applies both for single and multi point
linkage analysis. We illustrate the usage of the formalization in correctness
proofsof central algorithmsand optimisationsfor linkage analysis.
A further use of the formalization is to reason about alternative methods for
linkage analysis. We discuss the use of MTBDDs and PDGs in linkage anal-
ysis,since they have proven ecient for other computationallyhard problems
involvinglarge state spaces.
We conclude that none of the techniques discussed are directly applicable to
linkage analysis, however further research is needed in order to investigated
whethera modiedversionof oneor more of theseareapplicable.
List of Symbols v
1 Introduction 1
1.1 This report . . . 3
1.2 Acknowledgments . . . 3
2 Introduction to Human Genetics 5 2.1 Chromosome . . . 5
2.2 DNA . . . 6
2.3 Alleles . . . 7
2.4 Meiosis. . . 8
2.5 Locating traitgenes . . . 12
2.6 Summary . . . 13
2.7 Furtherreading . . . 13
3 Linkage Analysis 15 3.1 Introduction to Linkage Analysis . . . 15
3.1.1 Performing Linkage Analysis . . . 16
3.1.2 Missing andIncomplete Data . . . 17
3.2 Single Point Analysis . . . 19
3.2.1 FormalModel . . . 19
3.2.2 Single Point ProbabilityComputation . . . 26
3.3 Algorithms for Single Point Probability Computation . . . 28
3.3.1 Founderalleleassignment inpropositional logic . . . 29
3.3.2 Kruglyak'sAlgorithmfor SinglePointProbabilityCom- putation . . . 33
3.3.3 Single Point ProbabilityComputation inAllegro . . . . 38
3.4 Multi Point Analysis . . . 55
3.4.1 FormalModel . . . 56
3.4.2 Multi Point Probability Computation . . . 56
3.4.3 Multi point calculationusing Fourier Transforms . . . . 61
3.5 Summary . . . 65
4 Towards an Improved Method 67 4.1 Levelof Focus . . . 68
4.1.1 Methods Properties. . . 69
4.2 Hardness ofLinkage Analysis . . . 70
4.3 Symbolic Representation . . . 75
4.3.1 EvaluatingMTBDDs . . . 75
4.3.2 EvaluatingPDG's . . . 79
4.3.3 FutureHeuristics . . . 80
4.4 Utilising theStructural Informationof Pedigrees . . . 80
4.5 Summary . . . 83
5 Conclusion 85 A Linkage Analysis Tools 87 B LOD and NPL score 92 B.0.1 LOD score. . . 92
B.0.2 NPL score . . . 93
C Basic Probability Theory 94
D Markov Models 97
References 101
List of Figures 103
List of Tables 106
Glossary 107
P
PedigreeP = h F, N, f ather, mother i
,seeDenition 1.f ather
Fatherfunctionf ather : N → V
,seeDenition 1.mother
Motherfunctionmother : N → V
,seeDenition 1.V
Set of individuals in a pedigree (P
),V = N ∪ F
, seeDenition1.
F
Setof founders ina pedigree (P
),F ⊆ V
, seeDenition1.
N
Setof non-founders in a pedigree (P
),N ⊂ V
, see De-nition 1.
ancestor
Ancestorfunctionancestor : V → 2 V
,seeDenition 1.G
PedigreegraphG = h P, →i
, seeDenition 2.A
Setofalleles for a pedigree, seeDenition 3.M
AmarkerM = h A M , π M i
,seeDenition 4.A M
Theset of allelicstatesfor the markerM
,seeDenition4.
π M
A functionπ M : A M → [0, 1]
, maps the allelic states ofA M
to itsallele frequency,seeDenition 4.G
Genotype informationG = hL , astates i
, see Denition3.2.1.
L
Setofgenotypedindividualsforapedigree,seeDenition 3.2.1.astates
Allelicstate functionastates : L → {{a 1 , a 2 } | a 1 , a 2 ∈ A M }
,see Denition3.2.1.v
Inheritance vector function:v : N × { p, m } → { p, m }
,seeDenition 6.
v
Theset of allinheritance vectorsfor a pedigree, see Sec- tion3.2.2.p
Parameterfortheinheritancevectorfunction,denotesthe paternal allele,seeDenition 6.m
Parameterfortheinheritancevectorfunction,denotesthe maternal allele,seeDenition 6.Z M
Afounderalleleassignment,Z M : F × { p, m } → A M
. forthe marker
M
,see Denition 7.F
Allele assignment functionF : (N × { p, m } → { p, m } ) → ((V ×{ p, m } ) → (F ×{ p, m } ))
,whichgivenaninheritance vector gives thefounderalleles received byan individualonthe maternal or paternal side,see Denition8.
F v F v = F (v)
,seeDenition 8.P (v |G )
The probability of the inheritance vectorv
given somegenotype information
G
,seeSection 3.2.2.P ( G| v)
Theprobabilityofsome genotypeinformationG
givenaninheritance vector
v
, seeSection 3.2.2.P (Z M )
Theprobabilityofthefounderalleleassignment,seeSec- tion 3.2.2.F G v
Thesetofallcompatible founderalleleassignmentsgiven theinheritancevectorv
,andthegenotypeinformationG
,seeSection 3.2.2.
f, f 1 , f 2 , . . .
Founder alleles.a 1 , a 2 , . . .
Allelicstates.$
Denotes eitherp
for paternal orm
for maternal ($ ∈ { p, m }
).f ( n,$ )
The founderallelefor foundern ∈ F
,seeSection 3.3.1.f F v ( n,$ )
The founder allele inherited byn ∈ V
given the inheri-tance vector
v
,seeSection 3.3.1.ϕ v G
Founderalleleassignmentformulafortheinheritancevec- torv
and thegenotype informationG
,seeTheorem 1.CC
The setof connected componentsfor a graph.U
The setof unassignedfounderalleles, seepage 42.E
Thesetofambiguouslyassignedfounderalleles, seepage 42.A
The set of (unambiguously) assigned founder alleles, see page 42.θ i
The recombination fraction between markersM i
andM i +1
,see Section3.4.Ham(v, w)
The Hamming distance between two inheritance vectorsv
andw
,seeDenition 11.Introduction
Linkageanalysisisawellestablishedmethodusedtolocategenesinthehuman
genome. To be more exact, the gene, which is under investigation, is a gene
that is responsible for some trait that an individual possesses. Examples of
traitsunderinvestigationrangefromthe moresimple,suchasbloodtypeand
eyecolor,tothemoreseriousthatmight predisposean individualfor adeadly
disease.
Diseasecausinggenesforsomemajordiseaseshavealreadybeendiscovered
byperforming linkage analysis, e.g. Parkinson's disease, obesity,and anxiety,
[dNC01]. Although environmental factors inuence the disease development
onsome ofthese diseases,itwould be ignorant not to have their geneticcom-
poundsinmind for the following reasons, [dPGD01 ]:
•
Their causes arenot fullyunderstood.•
Current treatmentsare oflimitedeectiveness.•
There is currently no means oftailoring treatment to thecause.•
Genetic diseases are getting an proportionally increased inuence on peoples health in the western world, e.g. the percentage of childhooddeaths inUnited Kingdom hospitals attributable to geneticcauses have
increased from 16.5%in1914 to 50% in1976, [JCBW99 ].
Linkage analysis is a statistical method, which might locate certain trait
causinggenes,basedonsomebiologicalmodel. However,duetothecomplexity
of current algorithms incorporated in linkage analysis, the analysis is only
tractable on moderately sized data sets. Analysis of larger data sets bestows
theanalystwithanimprovedlikelihoodoflocatingthegenesforthetraitunder
investigation.
Ournalgoalistoenableareductionofthelinkageanalysiscomplexityby
thenumber of individuals involved in theanalysis. Dealing withexponential
state spaces is one of the areas where formal methods applied in the eld
of verication has proven successful in the past decades, and the advances
have often resulted in several orders of magnitude reductions in complexity.
These formal methods include symmetry reduction, partial order reduction,
abstraction,and compositionality.
To reach our nal goal, we dene three major sub goals. The rst is the
establishment of a formalmodelthat enables reasoning about thecorrectness
andcomplexity ofdierent approachesto linkage analysis. Secondly,we anal-
ysethestrategieswhichseem fruitfulto pursuefurther,inorderto reducethe
complexity of linkage analysis. The third and nal sub goal is to construct a
newmethod for linkage analysiswhich enables linkage analysisof larger data
sets.
We have withthis project moved into the relatively new area of bioinfor-
matics,which spansa numberof dierent sciences,suchasbiology,statistics,
mathematics,andcomputerscience. We areunawareofanyonewhohavepre-
viouslyattemptedtoformalizelinkageanalysisinacomputerscienticcontext.
Currently,several dierentsoftwarepackages exist,which canperform linkage
analysis. A brief overview of the major packages are given in Appendix A.
However, we have not been able to nd any formal framework, in which it
is possible to reason about the correctness and complexity of existing algo-
rithms. Furthermore, aformal framework would beconvenient for evaluating
new methods and for investigating the deeper theoretical nature of linkage
analysis.
The work represented here is based on a collaboration with the Icelandic
company deCODE Genetics (http://www.decode.is). deCODE Genetics is a
company conducting research into inherited causes of common diseases. The
research at deCODE is based on the Icelandic population for three major
reasons,the original geneticpoolis small,literaturedocuments thefamilyre-
lationsmanygenerationsback,andnallythe juridicalfoundationshavebeen
established. deCODE currently employs more than 600 people divided into
several departments ranging fromtheResearch Laboratory andtheDatabase
Division,to the Statistics Department with whichwe have collaborated. The
Statistics Department is, among other areas, responsible for developing soft-
warecapable ofdoing statistical analysisof genetic data. Part of our project
group has spent time working with Ph.D. Daniel Fannar Gudbjartsson, the
creatorof Allegro,which is thesoftwarepackage used at deCODE for linkage
analysis (See Appendix A). The stay gave the group a chance to verify and
extend our understanding of the methods currently used within the eld of
linkage analysis.
Wehave investigated to whatextent theareaof linkage analysisis formalized
inacomputerscienticcontext. Toour knowledge,noformalframeworkexist
todescribelinkageanalysis. Forthisreasonwedevelopsuchaframework,tobe
able to reason about the correctness of current and future algorithms. After
establishing the formal framework, we attempt to determine the complexity
of linkage analysis and based on this investigation we discuss approaches to
reduce the complexity in the average case. More specically, we investigate
howwell MTBDDsandPDGsworkforsymbolicrepresentationof probability
distributions inourproblem domain.
This reportis meant to be a self contained document. Thus readers with
a mathematical, engineering, or computerscience background should be able
toreadandunderstanditwithlittleor nopriorknowledgeonlinkage analysis
and genetics. Readers that do not have this background might experience
diculties in understanding the more computer scientic aspects, although
someintroductionisgivenintoselectedareasintheappendices. Inthebackof
thisreport we have provided a glossary(Seepage 107)of themost important
geneticterms, for quickreference, anda listof symbolscan befound on page
v.
The structure of the report is as follows: The rst part is an introduc-
tionto thebiological model. This partis not exhaustive, however it provides
a sucient background for understanding the motivation behind the two fol-
lowing parts. In the second part we build a formal framework, for proving
the correctness of the algorithms used in linkage analysis. In the nal part
ofthe reportwe state the properties which a newmethod should preserve, in
order to produce the same results as well established methods. Furthermore,
wediscussthe performanceofalternativestrategiesfor therepresentation and
computationaltasksinlinkage analysis.
1.2 Acknowledgments
First we would like to acknowledge Daniel Fannar Gudbjartsson, the creator
of Allegro, for his help and guidance as well as other people inthe Statistics
Department, and SverrirÞorvaldssonand GisliMàsson, at deCODE for their
patienceandwillingness to answerquestionsandgive explanations.
We would also like to thank Professor of Computer Science Finn Verner
JensenandAssistantProfessorofComputer ScienceThomasNielsen,Aalborg
Universityfor their helpwithBayesian theoryand hiddenMarkovmodels.
We would like to thank the following people for inspiration: Professor of
ComputerScienceKimG.LarsenandAssociateProfessorofComputerScience
MathematicsDepartmentat AalborgUniversity,ResearchAssistantProfessor
atBRICSAarhusChristianN.StormPedersen,ProfessorofComputerScience
at the University of Birmingham Martha Kwiatkowska and Ph.D. student at
theUniversityofBirmingham David Parker.
Introduction to Human
Genetics
This chapter introduces the reader to the terms and concepts of human ge-
neticswhich areessential forunderstanding linkage analysis. Readers whoare
familiarwithgeneticconcepts likealleles,mendelian inheritance,linkageanal-
ysis,andsoonmaywanttoskipthischapter. Weremindthereaderunfamiliar
with any of these concepts, that there is a glossary on page 107, where it is
possible to nda shortexplanation anda referenceto manyof thetermsand
concepts present inthis chapter.
Thisintroductionisbasedonvarioussources, suchas,[SR99],[Ott99 ],and
[KC97 ]. Forfurtherreading,we recommendthe readerto seeSection 2.7fora
shortdiscussionof some literature.
The structure is asfollows. We begin witha rather high leveldescription
ofgenetics,byintroducingthe twomajor concepts ofchromosomesandDNA.
Thisisfollowedbythe introductionofalleles andallelicstates, whichdescribe
thedierentformsagenecanassume. Twoimportant conceptsininheritance,
meiosis and recombination, are introduced. Furthermore, we introduce the
conceptof linkage analysis andnally summarizethechapter.
2.1 Chromosome
Insideeachhumancellthereare23pairsofchromosomes,22pairsofautosomes
common for both males and females, and one pair of sex-chromosome which
diers between the two sexes. A chromosome consists of two identical sister
chromatides - each a doublehelix DNA, strand, [SR99, page 30], thus a total
of92 DNAstrands, constituting 46chromosomes. During fertilization afetus
receivesonechromosomefromeachparent(actuallyitisonechromatide,which
lateris transformed into achromosome,but inthis context can bethought of
Thecomplete setof geneticmaterial is calledThe Human Genome.
2.2 DNA
Themolecule thatencodes geneticinformation of humans isdeoxyribonucleic
acid(DNA).DNAisadoublehelixmoleculethatconsistsofasugar-phosphate
backboneandfournitrogenousbases-adenine,thymine,cytosine,andguanine
(A, T, C, and G), [SR99, Chapter 1]. The nitrogenous bases are also called
base pairs (bps), as they always pair: A with T, and C with G. The base
pairsaresituatedbetween thetwo helixes,therebybinding thetwobackbones
together.
A combination ofthree bps are termeda codon 1
. Each codon speciesan
amino acid or a stop code (see Figure 2-1). After a stop code, the bps can
possibly specify a non-coding region, which for example could be hundreds
of C-T combinations. Thus, 3 bps code a codon, codons code the sequence
of amino acids, which again determines the genetic code for a single protein
molecule(agene),[Lan97 ,page246]. Proteins functionasenzymes,hormones,
etc. and therefore inuence many physiological and psychological traits, e.g.
eye-color and schizophrenia. Thus, DNA is the recipe for proteins where the
genetic code consists of nitrogenous bases interpreted in sequences of three.
When the genetic code is translated to a protein, each codon species an
amino acid. Amino acids are the building blocks of proteins. When a stop
codon is reached the translation process is ended and the protein is nished
andreadyto perform itsfunction,such asto signalother cellstostart orstop
theproduction of enzymes,incasetheprotein justbuilt isa hormone.
Some traitsare moredependent on outside stimuli than others, e.g. some
traitsmightnotshowatall,duetotheenvironmentalfactors,whileothertraits
only show in the presence of one or more outside stimuli. For example, it is
believedthatageneexists,thatpredisposesindividualsforobesity 2
. However,
theobesitytraitisonly showninthe casewhere plentiful nutrients have been
availablefor an individualwiththedisease gene.
The sequenceof base pairs encodinga single protein is calleda gene,and
thesection or position onthe chromosome harboringa geneisdenoted asthe
locus (plural: loci)of the gene.
Awellcharacterized locuscanserveasageneticmarker,bywellcharacter-
izedwemeanthat thelocusof thegeneisknownand thatthedierent allelic
statesof the geneis known (allelic stateswill be explained shortly). Markers
1
Also termedatriplet,[KC97 ,page325].
2
Welearnedthatagroupofstatisticiansandbiologistsarecurrentlyworkingonisolating
thegeneforthis disease,during ourstayat deCODEGenetics. Someresults havealready
beenreleasedasnews,[dNC01 ].
A single codone C
A C T
G
G A A
T T T
A C G C
G A T
T
T T
T A
A A C C A
C
C
G C G
G G
G Backbone
(DNA strand/chromatide) A single bp (A−T)
Figure 2-1: A subsection of a chromatide, which is also known as a
(subsectionofa)strand. ThemolecularstructureisaDNA
molecule.NoticethepairingAwithTandCwithGcom-
posing the18bps,thatarecoding6codons.
canbeusedasreferencepointsonachromosome andthey canbeanalysed in
order to determine thegenotype of an individual at a given marker, [Gud00,
page 9].
2.3 Alleles
An individual has two alleles at each locus of a gene, since one chromosome
is received from each parent. An allelic state refers to a specic encoding
(in codons) of a specic allele. The allelic states on the two chromosomes
arenot necessarily dierent, but they can be, hence the sequences of codons
codingthe alleles,andtherebygenes, aredierent. Anindividualissaidto be
homozygous ifthe maternal and the paternal allelicstates(the allelereceived
fromthemotherandtheallelereceivedfromthefather)atalocusareidentical.
If the allelic states are dierent, the individual is said to be heterozygous at
thatlocus.
Two alleles at a locus make up the genotype of that locus for a specic
individual. Ifapersonisheterozygousitispossiblethatonlyone ofthealleles
areexpressed. In this case, the allele expressedis called dominant, while the
allele not expressed is calledrecessive. If both alleles are expressed, they are
saidto becodominant. The expressionof agenotypeis calledthephenotype.
Example 1 As an example consider the human blood type. The gene for the
human blood type is located on chromosome 9 at band q34, [Lan97, page 2]
(band refers to a position on the chromosome). Three allelic states exists for
blood type:
A
,B
, and0
. Each person has two alleles, however there existsonlyfourblood types 3
(phenotypes):
A
,B
,A / B
, and0
,even though there areare six possible combinations of allelic states:
A / A
,B / B
,0 / 0
,A / B
,A / 0
,and
B / 0
(note that the ordering is irrelevant). If an individual has the allelic 3For simplicity we haveomittedthe fact, that apersoncaneitherberesus negative or
states
A / A
,B / B
, or0 / 0
the individual is homozygous. In the event that an individualhas the allelic statesA / 0
the individualis heterozygous, andfor theseallelesA
isdominantand0
recessive,sinceonlytheA
alleleisexpressed,givingthepersonthe
A
bloodtype, justasifthepersonhadthegenotypeA / A
.B
is also dominant with respect to0
, since a person will have blood typeB
if the person has either the genotype
B / B
orB / 0
. However,A
andB
arecodominant, since anindividual withthegenotype
A / B
willhave neitherbloodtype just
A
orB
, butA / B
. The nalbloodtype0
isonlyexpressed ifapersonhas the genotype
0 / 0
.2.4 Meiosis
Inthissectionwepresentthemeiosisinasimpliedversion,however,sucient
for understanding the later chapters. Due to thecomplexity of thisbiological
processwe consider itout ofscopeto explain itindetail 4
.
Meiosisisanessentialprocess ofthehumanreproduction. Abstractly,itis
a cell division process that produces gametes, female egg cellsor male sperm
cells,whichcombineinreproductionandbecometheseedforanewindividual.
Meiosis isa process where a cell withone chromosome pair (a diploid cell)is
dividedinto fourcellswithonechromatideineach(fourhaploidcells). Thisis
illustratedintheleft half(beforethesecond
⇒
) ofFigure 2-2.Selection based on Mendels first law (random) Paternal
Maternal chromosome chromosome
a new individual Chromosome pair of Similar process
Paternal diploid cell
Reproduction
gamete a maternal that creates Paternal gametes (4 haploid cells)
Meiosis
Figure 2-2: Meiosisillustratedforasinglechromosomepairof amale
and(partially)afemale. Theresultsoftheirmeiosisislater
combinedinreproductiontoformthechromosomepairof
anew individual. Note that this is simplied, in reality
there are 22 more chromosomes in each cell, to which a
similarprocesstakesplacein parallel.
The probability that agiven gamete is passed on to a child isassumed to
be random. Hence, probability
1 4
for each of the four combinations. A single 4See[KC97,Chapter2]foradetaileddescriptionofmeiosis.
2-2). Since a given allele is present in precisely two of the four gametes, the
probabilitythatallele beingtransmittedto anospringis
1 2
(from2 · 1 4
). Thishypothesis,thatthesourceof inheritance fora singlealleleisequallylikely to
beeitherofthetwo grandparents, wasstated byGregorJohann Mendelinthe
18'thcentury and isknownasMendel's rst law,[KC97 , Chapter3].
Thepairofallelesreceivedfromboththeparentsatagivenlocusconstitute
achild'sgenotypeatthatgivenlocus. AsanexampleseethepedigreeinFigure
2-3.
Female individual Male individual Genotype information
Two alleles at two loci on Information for one locus
the same chromosome
X x A a
Grandfather Grandmother
Father Mother
Son Daughter
X X
A A a a
x x
A a
X x x x
a a
A a A a
x x X x
Legend:
Figure 2-3: Anexampleofapedigree. Thefamilyisconstitutedbytwo
grandparents,twoparents,andtwochildren. Boldletters
nexttotheindividualsindicategenotypedinformation,e.g.
the daughter is
A/a
at the rst marker andX/x
at asecondmarker.
Recombination
Recombination refersto new combinations of genes. This phenomenon arises
during meiosis when the two chromosomes of a pair, one inherited from the
fatherandone fromthemother,are(mostprobably)recombined toformfour
newunique chromatides, asdepicted inFigure 2-2.
The cause of a recombination is the biological event called crossover 5
. A
crossoverinvolvesphysical breakage ofthe chromosome into one paternal and
one maternal chromatide, and a joining of the paternal and maternal ends,
[SR99, page 40], we have illustrated this in Figure 2-4. Note that we write
A / a
whenapersonhastheallelicstatesA
anda
at agivenlocus,whereaswewrite
A X a
x
whenA
andX
resideononechromosomewhilea
andx
resideontheother. When two alleles at two dierent loci reside on the same chromosome
exampleweassumetoknowwhich allelesresideonwhichchromosomes, hence
thephases of thealleles areassumedto beknown. Whenwe do notknowthe
phasesof a set ofalleles, as itis oftenthe case for linkage analysis, thephase
isunknown.
Example 2 IntheinheritanceexampleofFigure2-3thesonisarecombinant,
whereas the daughter isa nonrecombinant, with respect tothe two loci shown.
Inthe example theson isthe recombinant, since thealleleson thefather's two
chromosomes,
A X
anda
x
,have recombined toformthen newA
x
combination,soA
originates froma dierent chromosome thanx
.Meiosis assuresthatthe number ofchromosomesina humanindividualis
constant, since only oneof the four gametes ispassed on to anew individual.
Thisgamete isthencopied inorder to constitute afull chromosome withtwo
sisterchromatides for thenew individual. Furthermore, meiosisand recombi-
nationisthe reason thatthechild's chromosomes, constituting achromosome
pair,aremostlikelynot identicaltoanyofthefourchromosomesourcesofthe
twoparents. Recombinationsensuresdiversityanduniquenessinapopulation.
a
b
C
Resulting 4 strands B
C A
B
C
A a
b
c
b
c A
B
C
A a
B
c a
b
C
B
C A
b
c
A a
B
c a
b
c
2 chromosomes
Crossover
Start meiosis End meiosis
4 strands
1 chromosome pair
Figure 2-4: The gure illustrates how a crossover might occur on a
chromosome,wherethreemarkergenesareresiding. Hence
this meiosis leads to the gametes {
A, B, C
}, {A, b, c
},{
a, B, c
}, and {a, b, C
}, where all but {A, B, C
} are re-combinations,[Ott99,page10]. Theelementsatthestart-
ingpointisonechromosomepair,twochromosomes,which
eachconsistsoftwostrands(DNAmolecules).
Theprobabilitythatanoddnumberofcrossoversoccurbetweentwolociis
termedthe recombination fraction. Note thatifan even number of crossovers
same parental chromosome. Thus the individual inheriting this chromosome
have inherited the genesat this pairof locifrom thesameancestral origin.
Example 3 Assume that a new chromatide is assembled fromone end tothe
other. Assume that the rst part of the chromatide is based on one of the
paternal chromatides, a crossover occurs, hence the chromatide is now being
constructed from a maternal chromatide. If a new crossover occurs (thus we
have had two crossovers so far) the source of the assembly is now back to a
paternal chromatide. Multiple switchescan occurbetweenthe paternalandthe
maternalchromatides.
Inasense,recombinationsdonotoccurcompletelyatrandom, [SR99,page
271]. The concept of interference states that in the immediate vicinity of a
crossover, it is virtually impossible for another crossover to occur. Thus the
recombination fractionremains very close to zero for some distancealong the
chromosome near a crossover. The recombination fraction between two loci
is related to their physical distance. The mathematical relationship between
physical distanceand the recombination fraction is given by a map function.
ThemostwidelyusedmapfunctionistheHaldanemapfunction,[Gud00,page
7]:
d = − 1
2 ln(1 − 2θ), θ = 1
2 1 − e −2 d ,
where
θ
is the recombination fraction andd
is the distance in Morgans.On average, approximately one recombination occur for each
10 8
bps. Thedistancebetweentwolociisdenedas1Morgan(M),ifonerecombination can
be expectedbetween themduring meiosis(hence
1
M isroughlyequivalent to10 8
bps). Thus, there is a correspondence between the distance between two lociandthelikelihoodofarecombination. Thegreater thedistance,theclosertherecombination fractionapproaches
1 2
,[Ott99 ,page 14]. Notethatbetweentwo loci residing on dierent chromosome pairs the recombination fractionis
1 2
,aseachmeiosisoneitherchromosomeareindependent. Twolociaresaidto be linked ifthe recombination fractionbetween themis lessthan1
2
,otherwisethe two loci are unlinked. This is also an observation made by Mendel, in
Mendel's second law, he states that unlinked, segregating gene pairs assort
independently atmeiosis,[Ott99 ,page 38]. Wewillreferto Mendel'srstand
secondlawasMendelian inheritance 6
.
withrespecttophysicaldistance. Itcandeviatealongachromosome anddur-
ingfemalemeiosisthe recombination fractionison average twicethat ofmale
meiosis. In the present study we assume that the recombination fraction is
constant, throughout the genome andthe same for both male and female 7
.
2.5 Locating trait genes
The Human Genome project deals with sequencing the chromosomes sothat
theorder of genes is established. However, when this work is completed, the
taskstill remains to uncoverwhichgenes areresponsible for whattrait 8
.
Doctors, physicians, psychologists, psychiatrist, etc. might suspectthat a
traitwithintheireldiscausedbysomegeneticmutation. Thesuspicionmight
arisefromthefactthatthetraitisoftencarriedbymanyinsomefamiliesand
bynone inotherfamilies. Ifthe trait causing gene(s) couldbe located, i.e. if
thelocus or loci could be identied, it is possible to infer the protein(s) that
causea given trait. Thisknowledgecan beusedfor:
1. Diagnosis: If the locus and the allelic state which is responsible for
some trait is known, a blood sample would be sucient to diagnose a
person,since theresultof asequencing ofthepersonsDNAat thelocus
could be matched against thesequenceof thetrait causing allelicstate.
It would also be possible to assess the chance of a person developing a
trait, by knowing the personsgenotypeat thetrait locusor loci.
2. Prescribeeectivedrugs: Somepeoplerespondtosomedrugs,while
others donot. If the responsiveness isdependent on one or more genes,
then a determination of a persons allelic state for those genes can lead
to the correct drugprescription.
3. Cure: If the proteins responsible for a trait are known, it might be
possible to develop a drug which surpassesthe eect of the proteins or
the proteinsthemselves.
Two kinds of genetic studies exist inthe literature. Both aim at locating
genes causing traits: Association studies and linkage analysis, [Cur01 ]. In
associationstudiestheaimistolocateatraitgenebylookingformarkeralleles
more frequent ina group of unrelated individuals carrying a trait, than to a
groupof people not carrying the trait. E.g. iftwo groups of 1000 individuals
are studied, and 80% of the individuals in one group - the group of people
7
Gudbjartssonmakesthesameassumptionsin,[Gud00 ].
8
Currently,thereisquitealargeamountofdataavailable,[BO98],howeverthecompu-
tationalpowerisinsucienttoprocessthisdata,[JIS93 ].
by30%ofindividualsintheothergroupofpeoplenotcarryingthetrait,there
issomeevidencethatthetraitandthealleleareclosetooneanother(i.e. they
are linked). Association studies are only useful when the trait causing gene
is located very close to a marker gene, because it is based on a rather sharp
divisionof thetwo groups(eithertheyhave thetrait andthemarkerallele,or
theydo not have either).
In linkage analysis related people are studied and dierent pedigrees are
studied independently. The aim inlinkage analysisis to nd the locusof one
or more genes causing a trait, by looking at inheritance patterns of markers
andcomparingthesewiththeinheritance patternsofthetraitinquestion. By
analysinghowdierentmarkersaresegregatedcompared tothesetof eected
individualsthelocationofthetraitgenecanbeapproximated. Linkageanalysis
isthefocusofthe remaining partof this report.
2.6 Summary
Inthis chapter we have introduced the reader to thebiologicalbackground of
thisreport, inorderto giveabasisfor understandingthebiologicalcontext of
therest of this report. We have explained the structure ofa DNA molecules,
andhowthey combine to the 23 chromosomes thatencodes thegeneticinfor-
mationofanindividual. Furthermore,we introducedthereader totheprocess
ofgeneticinheritance. Howtheallelesareinheritedisbasedonsomeprobabil-
ity(Mendelianinheritance), whichagain isinuenced bybiologicalproperties
likerecombinationfraction,interference,etc. Themaintaskoflinkageanalysis
isto locatethe geneonthe humangenome,which causesthetrait of interest.
From now on we will abstract away from biological details. The focus is
now on wholegenes (markers), which issucient to describelinkage analysis
formally. Another prime concept is that of pedigrees. As stated earlier, the
glossaryonpage 107 can provide quickreference.
We now concentrate onthe taskoflinkage analysis,thus locating thepos-
sibleposition of atraitcausing gene ona single chromosome. The Mendelian
laws of inheritance have a major role in linkage analysis. It is safe only to
considera single chromosome pair, since genes ondierent chromosome pairs
assortindependently.
2.7 Further reading
Theliteratureis dividedinto two categories:
Biologyorientated: Thisliteratureincludes[KC97]and[SR99],which
puter scientist, much of the information in these books is superuous
for understanding the biological modelassumed for thealgorithms pre-
sentedinthefollowing chapters. However,theycanserveasanextensive
reference.
Statistics orientated: This literature includes [Lan97] and [Ott99 ],
which arewritten by peoplewith a background in mathematics and/or
statistics. All of the books we have read in this category have an in-
troductorysectioncontaining abrief descriptionof thebiologicalmodel
and terms used. We have found that this kind of literature are inade-
quate to the readerwitha background incomputerscience, because the
maintopic ofthesebooks,thestatisticsmodels,outweighsthebiological
descriptions.
In our case the process of learning bioinformatics has largely been an it-
erative process from biologyto statistics, back to biology, etc. Thisis mainly
becausewehave foundlittle literatureon thistopic,whichwaswrittenfroma
computerscienticperspective.
We have found two references, thatis not biology and statisticsoriented.
Bioinformatics - A Practical Guide to the Analysis of Genes and Proteins,
[BO98 ], islargely a guide bookfor people interested inlocating software and
data for genetic analysis. It contains user guides for various genetics analy-
sis software, as well as guides on how to nd and extract data from public
available databases. Another exception is Faster Sequential Genetic Linkage
Computations, [JIS93]. This is probably the most computer science oriented
paper,itincorporatestwosynthesisofthebiologicalmodelandfourimplemen-
tationlevelspeed ups of central algorithms. Something thatisomitted inthe
paperisadiscussionofthe previouslymentionedformaltechniques,forspeed-
ingup computations. This is most probably because the paper is rather old
(1993),compared to thedevelopment ofthesetechniqueswithinthecomputer
scientic community.
Linkage Analysis
Thepurposeof this chapter is to describe the conceptof linkage analysisina
formalfashion. First, we motivate linkage analysis by giving an informal de-
scriptionofthe concept,basedonhowitisperformed. Afterthiswe formalize
theunderlying biologicalmodel, this includesformalizing pedigrees, genotype
information, inheritance vectors, and founder allele assignments. The meth-
ods implemented in the software currently used to solve the linkage analysis
problemconsists of two algorithms, namely one to solve single point analysis
and one to solve multi point analysis. After introducing the biologicalmodel
we rstfocusonsingle point linkage analysis. We state theproblemformally,
followed by a formal description and proof of the correctness and complexity
ofthealgorithms implemented inGenehunter andAllegro 1
. Sincenoprevious
framework has been established we prove these two applied algorithms. The
purpose ofmulti point isto gain moreaccurate estimates of linkage. Ourfor-
malizationisexpandedinthedescriptionofmultipoint analysis. We describe
multi point probability calculation as a Bayesian network and not as a hid-
denMarkovmodel, aspreviouslydescribedintheliterature,e.g. aspresented
in [LG87 ]. Finally, we describe two reductions, one that is applied in both
Genehunterand Allegroand theother onlyinAllegro.
3.1 Introduction to Linkage Analysis
Linkage analysisis aprocess for constructing genetic linkagemaps. A genetic
linkagemapisadescriptionofhowanumberoflocionachromosomerelatein
termsof genetic distance. Hence, a genetic linkage mapconsists ofa number
ofloci for which theorder and relative distance(recombination fraction 2
) are
known.
1
SeeAppendixAforalistofsoftwareusedtoanalysegeneticdata.
2
Seepage10.
of detailed genetic linkage maps based on molecular markers 3
(from now on
just marker). A marker is a polymorphic DNA or protein sequence deriving
from a single chromosomal location, [SR99, page 551]. Historically, markers
were limited to observable traits known to follow the mendelian 4
inheritance
pattern, e.g. as blood groups. Today, however, certain highly polymorphic
structuresofthehumanDNAcanbetypeddirectlyonlargescalebyautomated
equipment. Thesestructuresarecalledmicrosatellites which aredi-, tri-,and
tetraneucleotide repeats,and singlenucleorite polymorphisms 5
.
In the following we give a overview of how linkage analysis is performed,
andwhich dataisavailable.
3.1.1 Performing Linkage Analysis
Thedata neededinorder to perform linkage analysisis:
•
A geneticlinkage map.•
A numberof pedigrees.•
Information on theinheritance ofmarkers.•
Trait statusfor someor all pedigree members.Figure 3-1 depicts a pedigree with the information for one marker. The
information is usedto infer evidence thatthe inheritance pattern of the trait
resembles the inheritance pattern of a marker. In Figure 3-1 the inheritance
patternofthetraitfollowstheinheritanceofthepaternalalleleofindividual1.
Thereasonfor thiscouldbethatthetrait causinggeneissituatedcloseto the
marker,i.e. they arelinkedwhich means recombination occurs lessfrequently
than hadthey been unlinked.
Sincecrossoversoccur,itisnotalwaysthecasethattheinheritancepattern
ofa traitfollows exactlythat ofa marker. For this reason we need ameasure
ofhow closelythe inheritancepatterns ofatrait and amarkerresembles each
other. Suchmeasuresareexpressedthroughscoringfunctions. TheLODscore
isone suchscoringfunction. It isout of the scope ofthis reportto dene any
scoring functions, but for the sake of completeness Appendix B gives formal
denitionsoftwocommonlyusedscoringfunctions,theLOD andNPLscores.
Linkage analysis is performed for a number of pedigrees on a subset of
markers across the genome. If the scoring between the trait and a marker
indicates potential linkage, then that region is further analysed by including
3
Seepage6.
4
Seepage9and11.
5
See[SR99,page273]foradetaileddescriptionofthehistory,use,andtypingofmarkers.
Legend:
a|a
Female individual Male individual
5
3 4
2 1
Genotype information
Alleles Right: Maternal Left: Paternal
Affected individual (phase known)
A|a
A|A a|a
A|a
A|a
Figure 3-1: Apedigreewithtraitstatus andexactinformationonin-
heritanceofalleles. Notethatin realitythephaseisoften
notknown.
more markers from the genetic linkage map. Given that trait causing gene
is located in theregion under investigation, the new analysis estimates more
preciselythelocation of thetraitcausing gene.
This process is iterated until no more markers are available in the prox-
imity of the trait causing gene. The genetic linkage map is then updated by
introducinga locusrepresenting thetrait causing gene.
Usingexactknowledgeoftheinheritancepatternsofmarkers,linkageanal-
ysis is straight forward, but inreality many pedigree members are not geno-
typed making it impossible to infer the precise pattern of inheritance. This
andother problemsarediscussed inthe following section.
3.1.2 Missing and Incomplete Data
Asstatedabove,markerswereoriginallylimitedtosimplecharacteristic traits,
such as blood type or eye color. Recent techniques have made it possible to
determinetheallelicstatesofmicrosatellitesinhumanDNA.Thismethodhas
onlybeen usedfora fewdecades,thereforegenotypeinformation isoftenonly
available for the youngest generations. Since pedigreesare partlyconstructed
fromwrittendocuments, suchaschurchrecords,whichcontainnoinformation
about genetic markers, some individuals in the pedigree might not be geno-
typed at all. This is often the case for pedigree founders. Furthermore, the
phaseofthe observedalleles for a markercannotbedetermined.
Due to these problems, the inheritance pattern for each marker needs to
be estimated on behalf of the genotyped individuals. Since the genotype in-
pattern becomes impossible to infer. To illustrate the problem we introduce
the concepts of identical/identity by descent(IBD) and identical/identity by
state(IBS).Two individualsareIBDifthey shareinheritance ofsome founder
allele. Two individuals are IBS if they have the same allelic state for some
marker,these alleles need not be IBD.
Legend:
A/a
A/A a/a
a/a
Female individual Male individual
3 4
2 1
A/a Genotype information (phase unknown) Alleles
Right: Maternal Left: Paternal
A/a
A/a 5 6
Figure 3-2: IBS vs. IBD. Individuals 5 and 6are IBS, but since the
motherishomozygoustheyhavenotnecessarilyinherited
thesamematernalallele. Infact inthis casetheyarenot
IDB.
Figure 3-2 is a example of two siblings which are IBS, but not IBD. The
individuals 5 and 6 share inheritance of their paternal alleles (A). The allelic
statesof their maternal alleles (a) areidentical, but they arenot identical by
descent becauseof thepattern ofinheritance depicted inthegure.
For thereasonsgiven,thepreciseinheritance patternisimpossible toinfer
from the given data, that is, several inheritance patterns can give rise to the
sameobservedgenotypeinformation. Someoftheseinheritancepatternsmight
bemorelikelythanothers,thusweneedtocomputetheindividualprobability
ofeach. Thesecomputationsare the topicof therest ofthis chapter.
Eachpossibleinheritancepatterncanthenbescoredagainsttheinheritance
pattern of the trait, using for instance the LOD score. The contribution to
the nal score of the marker from each inheritance pattern is based on the
probability ofthat pattern.
Computing the probabilities of possible inheritance pattern is performed
intwo steps, single point analysis and multi point analysis. These topics are
covered inSections3.2and 3.4, respectively.
Singlepointanalysisistheprocessofdetermininglinkagebetweensinglemark-
ers and traits using only the genotype information available at each marker.
The primary task is computing probability distributions of inheritance pat-
terns at some locus given the available genotype information at that locus.
Thisiscalledsingle point probability computation.
Tocomputetheprobabilitydistributionoverinheritancepatterns,weintro-
duceaformalframeworkforthebiologicalconceptsintroducedinthepreceding
chapter. After the introduction of the formal framework we explain formally
howthe probability distributionis calculated.
3.2.1 Formal Model
Pedigree
InSection 2.4 the concept of a pedigree was introduced. In thefollowing we
formalize thisconcept.
Denition 1 (Pedigree) Apedigree isa4-tuple
P = h F, N, f ather, mother i
witha set of individuals (nodes)
V = F ∪ N
, whereF
isthe founder set andN
is the non-founder set, andtwo functions:-
f ather : N → V
.-
mother : N → V
, satisfying the following constraints:•
A nodecannot be both a founderand a non-founder, that isF ∩ N = ∅
.•
No node isboth a father anda mother, that is:∀ n, n 0 ∈ N
andn 00 ∈ V . mother(n) = n 00 = ⇒ f ather(n 0 ) 6 = n 00
; and∀ n, n 0 ∈ N
andn 00 ∈ V . f ather(n) = n 00 = ⇒ mother(n 0 ) 6 = n 00 .
•
A nodeis never its own ancestor, that is:∀ n ∈ V . n / ∈ ancestor(n),
where
ancestor : V → 2 V
isdened recursively as:Anynon-founderhas itsfatheranditsmotherasancestors together
with the ancestors of both parents,that is:
∀ n ∈ N . ancestor(n) = ancestor(f ather(n))
∪ ancestor(mother(n))
∪ f ather(n) ∪ mother(n).
∀ n ∈ F . ancestor(n) = ∅ .
Theconstraintsonthepedigreeareintroducedtoconsistentlydescribethe
underlyingbiological model.
Denition 2 (Pedigree Graph) Thegraph
G = h P, →i
consistingofapedi-gree
P
anda transition relation→ ⊂ V × N
, where:n → n 0
if ather(n 0 ) = n
ormother(n 0 ) = n,
iscalled the pedigree graph generated by
P = h F, N, mother, f ather i
.Weusetherelations
n −−−−→ f ather n 0
andn −−−−→ mother n 0
todenotef ather(n 0 ) = n
and
mother(n 0 ) = n
,respectively. From this denitionitfollows that:→ = { (n, n 0 ) | (n, n 0 ) ∈ −−−−→ ∨ f ather (n, n 0 ) ∈ −−−−→} mother = −−−−→ ∪ f ather −−−−→ mother .
Remark: Thisimpliesthe following relationship between ancestors andthe
→
relation:
n → + n 0 ⇔ n ∈ ancestor(n 0 ),
(3-1)where
n → + n 0
denotes thatthereisa pathinthepedigreegraphoflength atleast1from
n
ton 0
.Example 4 Figure 3-4 is the representation of the example pedigree in Fig-
ure 3-3 under the formal framework. In this example the pedigree graph,
G
,has the followingstructure:
G = hP, →i,
where:P = h F, N, mother, f ather i , F = { 1, 2, 4 } ,
N = { 3, 5 } , mother(3) = 2,
f ather(3) = 1, mother(5) = 4,
andf ather(5) = 3.
Figure 3-4 illustrates how the individuals of the example above are related
interms of the
f ather
andmother
functions.Lemma 1 A pedigree graph isa directed, acyclic graph (DAG).
Legend:
1
3
5
4 2
A/a A/a
A/A a/a
a/a
A/a
Female individual Male individual Genotype information
(phase unknown)
Figure 3-3: Anexamplepedigreewithgenotypeinformationforwhich
thephaseisunknown.
1
3 father
2
mother
5 father
4
mother
Figure 3-4: AformalrepresentationofthepedigreeinFigure3-3. The
transitions havebeenlabeledto clarifywhether a parent
is
f ather
ormother
.node would be its own ancestor. Assume that a pedigree graph
G = h P, →i
contains a cycleconsistingof thenodes
V 0 ⊆ V
,where:∀ n ∈ V 0 .n → + n.
Itfollowsdirectlyfrom(3-1)onpage20,that
n ∈ ancestor(n)
,whichcon-tradictsDenition 1 ofa pedigree.
♦
Allele
Denition 3 (Allele) The set of allelesin a pedigree
P
isdened as:A = { (n, $) | n ∈ V
and$ ∈ { p, m }} .
We call the subsets
A N = { (n, $) | n ∈ N
and$ ∈ { p, m }}
andA F = { (n, $) | n ∈ F
and$ ∈ { p, m }}
the set of non-founder alleles and the set of founder alleles, respectively.(n, p)
and(n, m)
represent the paternal andmaternalalleles.
Marker
Denition 4 (Marker) Amarker
6
M
isdenedasa2-tupleM = h A M , π M i
,where:
• A M = {a 1 , a 2 , . . . , a k }
is set of allelic statesfor the markerM
, and• π M : A M → [0, 1]
is the allele frequency7 function, such that:The sumof allallele frequencies is1, that is:
X
a ∈ A M
π M (a) = 1.
Theallele frequenciesarebasedon statisticalinformation fora givenpop-
ulation.
Notice the dierence between the terms allele and allelic state. An allele
istheterm usedfor the section (also denoted position) ofa chromosome rep-
resenting a marker. An allele can have a numberof states, depending on the
allelic states for the marker. Thus, an allele can be thought of as a variable
whichcanbeassignedavalue,thatis,anallelecanbeassignedanallelicstate.
6
Seepage6.
7
Amoreappropriatetermwouldbeallelicstatefrequency,howevertoremainconsistent
withthecurrentpracticewecallitallelefrequency.
pedigree, and we always work witha single marker, which we call
M
. Thus,whenwe write
A M
orπ M
werefer to allelicstates ortheallele frequenciesfor themarkerM
.Genotype information
Denition 5 (Genotype Information) Genotype 8
information,
G = hL , astates i
, for a gene,g
, consists of a set of genotyped individualsL
anda function
astates
, where:-
L ⊆ V
, and-
astates : L → {{ a 1 , a 2 } | a 1 , a 2 ∈ A M }
.Noticethat thedenitionindicates thatthe phase isunknown since there
isnoreferencetomaternalandpaternalalleles. Ifanindividualishomozygous
atagivenlocus,the
astates
functionreturnsasetcontainingjustoneelement.Example 5 The pedigree graph inFigure3-3 showsgenotypic informationfor
allindividuals. Thusthe setofgenotypedindividuals
L
isequal tothecompletenode set
V
. The allelic states of individual 5 are:astates(5) = {A, a}
.Inheritance Vector
Aninheritancevector 9
isacompletespecicationonhowtheallelesoffounders
areinherited by each non-founder. Thisis a formal way of describing inheri-
tancepatterns. Weformalizethenotionofinheritancevectors inthefollowing
denition:
Denition 6 (Inheritance Vector) An inheritance vector
v
for some pedi-gree graph
P
isa function:v : N × { p, m } → { p, m } .
Theinheritance vectorisusedtogiveinformationabouttheinheritanceofthe
paternal,
p
, and maternal,m
, alleles for every non-founder (See the example below).8
SeeSection2.3 forthebiologicalmeaningofgenotypes andalleles.
9
The notion of inheritance vectors was introduced in [LG87 ] and further used in
[KDRDL96].
be:
v(3, p) = p, v(3, m) = m,
v(5, p) = p,
andv(5, m) = p.
Thiswould result in the following inheritance pattern (see Figure 3-5):
Individual 3 has inherited his paternal allele from the father of individual
1 and his maternal allele from the mother of individual 2. The two alleles
of individual 1 and the two alleles of individual 2 are founder alleles. Hence,
individual 3 has inherited the paternal allele of a founder, individual 1, and
the maternal allele of another founder, individual 2. Note that the father of
individual 5 (individual 3) does have parents in the pedigree. To determine
whichfounderallele individual5has inherited, weusetheinformationthatthe
paternal allele istransmitted fromthefather's father. Thismeans,that wecan
ndthe transmitted founderallele by searching the graphrecursively using the
inheritanceinformationofindividual3. Thusthepaternalalleleofindividual5
isthe sameas the paternal allele of individual 3. Since
v(5, m) = p
individual5 has inherited the paternalallele of individual 4, who is a founder.
Legend:
A/a A/a
A/A a/a
a/a
Female individual Male individual
5
3 4
2 1
A/a Genotype information (phase unknown) Alleles
Right: Maternal Left: Paternal
v(5,m)=p v(5,p)=p
v(3,p)=p v(3,m)=m
Figure 3-5: Showshowitisdeterminedwhichfounderallelesthenon-
foundersreceivegivenaninheritancepattern.
Founder Allele Assignment
Weneedadenitionofafounderalleleassignment,which assignsallelicstates
to all founder alleles. Such a denition is necessary inorder to reason about