pedigrees actually receives the allele IBD from individual 4 and individual 5,
respectively. The case where
v(4, p) = v(5, p)
is symmetrical.For the inheritance patterns in the set
v I
where the two sub-pedigrees are independentinthe examplepedigree foundin Figure 4-7,the singlepointprob-ability isthe product of the probability of the inheritance patterns for the
sub-pedigree below individual 4 and the probability of the inheritance patterns for
the sub-pedigree below individual 5 16
.
From Theorem 8 it follows that the single point probability distributions
fortheinheritance patterns
v I ⊂ v
,forwhich two subsetsofnon-foundersare inheritance independent, can be represented as two probability distributionsinsteadofone. Thisreduces thespacecomplexity,since
v I
canberepresented inO(2 b 0 )
space, usinga atrepresentation, forthesub-pedigreewiththe non-foundersinN 0
plusO(2 b 00 )
space for thenon-foundersinN 00
,whereb 0
andb 00
arethenumberofbitsneededtorepresentthetwosub-pedigrees. Ifthesubset
v I
ofv
was represented in one distribution it would requireO(2 b 0 + b 00 )
spaceinsteadof
O(2 b 0 + 2 b 00 )
space.This could open for a more compact representation for parts of the state
space, where subsets of a pedigree are independent. The principle of
decom-positioncan alsobeapplied recursivelyto independent subsets. Howwell this
workswillofcoursedependonthe pedigreestructure. Weconsidertheideaof
decomposition based on inheritance dependence a strategy which might lead
toan improved algorithm for linkage analysis. However, this requires:
1. A data structure which can represent the state space symbolically with
respectto the dierent inheritance dependencies for dierent subsetsof
the statespace.
2. Amultipointprobabilitycomputationalgorithmtoworkeciently with
sucha structure.
wouldimplythatthetaskofdeterminingthecompatible inheritancevectorsis
noteasierthansolvingaNP-completeproblem,ifsomereductioncanbe
estab-lishedtotestingmembershipoftwolanguagesinthesereductions. Methodsfor
makingamorecompactrepresentationofthestatespacewereanalysedaswell
asa method for exploiting inheritance independence to decompose the
pedi-gree. We found that MTBDDs areinappropriate because of thehigh number
of equivalence classes in the multi point probability computation.
Further-more, either the PDG structure must be adapted to our problem domain or
a set of ecient algorithms need to be developed. We formalized inheritance
independence,thatmightreducethecomplexityofbothsingleandmultipoint
linkage analysis, this also dependson whether an ecient structure and aset
of ecient algorithms can be developed. This is also subject for further
re-search. Thus, none of the techniques considered in this chapter seem to be
directlyapplicable to linkage analysis, but might serve asan inspirationfor a
newcustomisedmethodfor linkage analysis.
Conclusion
As computer scientists we have moved into a new area, namely the eld of
bioinformatics. More specically, we have worked with the area of linkage
analysis,whichisa recognisedwaytobuild geneticlinkage maps. Bylocating
genes responsible for traits, within the human genome, companies such as
deCODEGenetics hope to beable to developdiagnostics products, drugs, or
evencuresfordiseases. Duetotheamountofcomputation requiredtoextract
thestatisticaldata, onwhich evidence oflinkage is based, onlypedigrees ofa
moderatesize arefeasible today foranalysis.
The bioinformatics eld spans a variety of sciences. Originally, genetic
analysiswas performed by applying methods of biologyand statistics, but as
the technology for extracting genetic information has improved, the amount
of available data has introduced a need for eective algorithms. However,
currently, no formal framework in which to reason about thecorrectness and
complexityof the algorithmsof linkage analysisexist.
Today's literature onlinkage analysisdisplay quitea diversityinterms of
notation and denitions. Since further developments within bioinformatics,
and especially linkage analysis, require people with dierent backgrounds to
communicate and cooperate,thelackofaformalframework makesithard for
thesepeopletoexpressthemselvesprecisely. Wehavebuildaformalframework
inwhich it is possible to express and reason about linkage analysis precisely.
The formalmodel presented herecould hopefully be the formalmodel which
will serve as a common base for the community, as it is true to the widely
adoptedbiologicalmodel.
Within our framework we have been able to prove and express the
cor-rectness of two algorithms for single point probability computation, namely
Kruglyakand FastTreeTraversal - thetwo algorithmsusedinthe
soft-ware packages Genehunter and Allegro, respectively. We have also identied
thatthecorrectframeworkinwhichtodescribemultipointanalysisisBayesian
nosingleP-timesolvableformthatcanexpressthetaskofndingthe
compat-ibleinheritance vectors. Wehave madesome initial attempts to investigating
whetherthetaskofndingthecompatible inheritancevectorsisNP-complete.
However,more researchisneededtodetermine thecomplexityclassoflinkage
analysisandits sub-tasks.
We have considered two common data structures for symbolic
represen-tation to represent probability distributions over inheritance vectors, namely
MTBDDs and PDGs. Both data structuresdo not seem to work well for our
problemandwouldrequiresomemodicationsinordertoreducethe
complex-ityof linkage analysis. MTBDDs seem applicable insingle point analysis but
failinmultipoint analysis,duetodiversityoftheprobabilities inthe
distribu-tionoverinheritancevectors. PDGsseem morepromising forrepresenting the
intermediateresults,but manyunanswered questionsneedsto beresolved.
Now that we have a formal framework, we are able to reason about new
algorithms. Until now, we have not been able to nd an existing data
struc-ture which will reduce the complexity of single and multi point analysis in
practice. Anatural future work aspectis to try to developa customised data
structureforlinkage analysis,andmaybe tobasethisstructureon inheritance
dependencies. Another future research direction is investigating the deeper
theoretical nature of the complexity of linkage analysis. Finally, we mention
approximationas possible futureresearch,since approximations to problems,
knownto be computationally hard,insome cases yieldusable results.
Linkage Analysis Tools
Thisappendix contains descriptions of some the programs available for
com-putationon geneticdataand isbased on[Ott99 ]and [oGAS00 ].
Using computers in linkage analysis is not new. Initially computers were
usedtostore handcalculatedLODscore,howeverassoonas1955LOD scores
were calculated by vacuum tube based computers. Computers has
tradition-ally been exploited as powerful calculators and as storage media, but due to
vast amounts of genetic data, researchers turn to more (computer) scientic
approacheslike datamining, [Ram01].
Inthefollowingthemostpopularprogramsforlinkageanalysisandrelated
geneticcomputationsarepresentedinalphabetical ordering.
Allegro
Allegrois a tool for multi point linkage analysis, that isable to perform both
parametriclinkage analysis and analysis based on allele sharing models. The
program provides features like estimation of total number of recombinations
amongmarkers, itcomputesposteriorIBDsharingprobabilities andisableto
reconstructhaplotypes. LiteratureandsourcecodeofAllegrowaskindlymade
availabletousbydeCODEGenetics,andprovidedinsighttothecomputations
neededfor linkage analysis.
AllegroiswritteninC++byDanielF. Gudbjartsson, KristjanJonasson,
Au-gustineKong,Michael L.Frigge,deCODE Genetics, Inc. Iceland.
ASPEX
ASPEX is a package of programs for performing multi point exclusion
map-ping of aected sibling pair data for discrete traits. The tool is able to use
allele frequencies to reconstruct missinginformation, and is tailored for data
tancefrom sibpair data. Itcan do transmissiondisequilibrium testing andis
ableto verifying the degreeof relatednessof individuals withina family.
Thepackage iswritten inANSI CbyProfessor Neil Risch,Stanford.
Thesource codefor ASPEX, andpre-compiled binaries areavailable at:
ftp://lahmed.stanford.edu/pub/aspex
CRI-MAP
CRI-MAPallowsrapid,largelyautomatedconstructionofgeneticlinkagemaps,
generatesLODtables,anddetectsdataerrors. Thetoolwasoriginallydesigned
to handlecodominant lociscored on pedigrees"without missingindividuals",
suchasnuclear families, but can nowbeused ongeneral pedigrees, and some
diseaseloci.
CRI-MAPisimplementedbyProfessorPhillip Green,WashingtonUniversity.
Version 2.8 of CRI-MAP is distributedas source code in thelanguage C and
isavailable at:
http://compgen.rutgers.edu/multimap/crimap/
Genehunter
Genehunter provides a wide range of tools for performing linkage and
dise-quilibrium analyses. The package is able to perform multi point analysis on
moderate size pedigree, computing LOD and NPL scores. Genehunter can
be used to do sib pair analysis. In addition, tools are available to search for
association or disequilibrium in addition to linkage. Genehunter provide
cal-culations involving dozens of markers, even in pedigrees with inbreeding and
marriage loops. Additionally, the multi point inheritance information allows
thereconstructionofmaximum-likelihoodhaplotypesfor allindividualsinthe
pedigreeandinformation content mapping whichmeasures thefractionof the
totalinheritance information extracted fromthemarker data.
Genehunter hasa C source code and is developed byLeonid Kruglyak, Mark
Daly, Mary Pat Reeve-Daly, Eric Lander Whitehead Center for Genome
Re-search,MIT.
Version 2wasreleased 1998 andis available at:
http://www.fhcrc.org/labs/kruglyak/Downloads/
HOMOZ
HOMOZ is a program for rapid multi point mapping of recessively inherited
diseasegenesinnuclear familiesincluding homozygositymapping. Itincludes
anecientalgorithmforcomputationofmultipointLODscoresinsmall
pedi-greesincludingthosewithinbreeding loopsandmissinggenotypeinformation.
WhiteheadCenterfor GenomeResearch,MIT.
HOMOZcan bedownloadedat:
ftp://ftp-genome.wi.mit.edu/distribution/software/homoz/
LINKAGE
Thecoreofthe LINKAGEpackage isaseriesofprogramsformaximum
likeli-hoodestimation ofrecombination rates, calculation of LOD score tables, and
analysisof genetic risks. The analysis programs are divided into two groups.
Therstgroupcan beusedforgeneralpedigreeswithmarkeranddiseaseloci.
Programsinthesecondgroupareforthree-generationfamiliesandcodominant
marker loci, and are primarily intended for the construction of genetic maps
fromdataonreferencefamilies. LINKAGEiscapableofmakingtwo-pointand
multi point analysis. LINKAGE is a Pascal program of Dr. Mark Lathrop,
Centre Nationalde Genotypage, EvryCedex,France.
Version 5.5available at:
ftp://linkage.rockefeller.edu/software/linkage/
LIPED
The LIPED program carries out genetic linkage analysis. It estimates the
recombinationfraction,bycalculatingpedigreelikelihoodsforvariousassumed
valuesoftherecombinationfraction. Theprogramisabletohandleageofonset
data. Onlytwo locican behandled at atime.
LIPED is written inFortran IV (1974) byProfessor Jurg Ott, Laboratory of
StatisticalGenetics, RockefellerUniversity.
Latest versionisfrom June 1995 andis availableat:
ftp://linkage.rockefeller.edu/software/liped
Loki
Loki analyses a quantitative trait observed on large pedigrees using Markov
chains, Monte Carlo, multi point linkage, and segregation analysis. The trait
maybe determined bymultiple loci.
Loki is implemented inC by Postdoctoral Researcher Dr Simon Heath,
Uni-versityof Washington.
LokiVersion2.3 wasreleasedNovember2000 and isavailable at:
ftp://ftp.u.washington.edu/pub/user-supported/pangaea/PANGAEA/Lok i
MAP+
The MAP+ program is written to construct high resolution linkage maps.
MAP+ is Fortran based and implemented by Dr. A. Collins, J. Teague and
ProfessorN.E. Morton, UniversityofSouthampton.
MAP+wasreleased in1996 andis available at:
http://cedar.genetics.soton.ac.uk/pub/PROGRAMS/map+
MAPMAKER
TheMAPMAKERsoftwarepackageisconsistingoftwopartsMAPMAKER/EXP
and MAPMAKER/QTL. MAPMAKER/EXP performs full multi point
link-ageanalysisandestimatesrecombinationfractionsfordominant,recessive,and
codominant markers. MAPMAKER/QTL is a companion program to
MAP-MAKER/EXP which allows mapping of genes controlling polygenetic
quan-titative traits in F2 inter-crosses and BC1 back-crosses relative to a genetic
linkage map.
MAPMAKERisprogrammedinCbyEricLander,Ph. D.,Directorof
White-headCenterfor Genome Research,MIT.
Version 3.0ofthe toolfrom 1992 isavailable at:
ftp://ftp-genome.wi.mit.edu/distribution/software/mapmaker3
MENDEL
MENDELdoesgeneticanalysisofhumanpedigreedataundermodelsinvolving
a small number of loci. MENDEL is useful for segregation analysis, linkage
calculations,geneticcounseling, allelefrequencyestimation, and relatedkinds
of problems. MENDEL is able to handle looping pedigrees eciently, which
tendsto be problematic inother tools.
MENDELisimplementedinFortran77 by ProfessorKenneth Lange, UCLA.
For registration anddownload visit:
http://www.biomath.medsch.ucla.edu/faculty/klange/software.html
Pedigree Analysis Package(PAP)
PAPmaybeusedforsegregationanalysis,variancecomponentsanalysis,
link-ageanalysis,measuredgenotypeanalysis,orgeneticmodeltting. Thegenetic
modelmay contain anynumber of lociand alleles. Phenotypes maybe
simu-latedassuming any modelavailable inPAP.Linkage analysison fourloci.
PAP comprisesa setofFortran77 programsproduced byAssociateProfessor
SandraJ.Hasstedt,Department ofHuman Genetics,University ofUtah.
Revision4.0 fromMarch 1994 isavailable at:
ftp://ftp.genetics.utah.edu/pub/software/pap
VITESSE is a software package that computes likelihoods. VITESSE uses
thealgorithms of set-recodingand fuzzy inheritance to reduce thenumber of
genotypesneededforexactcomputationofthelikelihood,whichacceleratesthe
calculation. It also representsmulti locus genotypes locus-by-locus to reduce
thememory requirements.
VITESSE is developed in ANSI C by Assistant Professors Jerey O'Connell
andAssociate ProfessorDanE. Weeksat the University ofPittsburgh.
Version 1.0publishedin1995 isavailableat:
ftp://watson.hgen.pitt.edu/pub/vitesse/
LOD and NPL score
Ascoring function isa measure ofthe likelihood of linkage between a marker
andatrait,forwhichthelocusofthemarkerisknown,butforwhichthelocus
orlociofthetraitcausing gene(s)areunknown. Notethatwhenoneislooking
forthe geneticcauseof atraita numberof pedigrees areused.
It isnot within the scope of thisprojectto go into detailwiththescoring
functions,asthey rely the statisticalpart oflinkage analysiswhich is not our
focus. Wewill,however,brieydescribetwocommonlyusedscoringfunctions:
The Log Likelihood of Odds (LOD) score, and the Non-Parametric Linkage
(NPL)score.
B.0.1 LOD score
TheLOD scoreis ameasure ofhowplausible an observed setof dataisgiven
a model, [Ott99 , page 38]. It is the logarithm of the odds of the observed
data if linkage is assumed (recombination fraction
θ < 1 2
) compared to thenull-hypothesis (no linkage)where therecombination fraction
θ
isequal to1 2
.Formallythe LOD scoreis, [Ott99 ,page 39]:
LOD(θ) = log 10 L(θ) L( 1 2 )
,
(B-1)where
L
is the likelihood of a hypothesisH
given some observation dataF
,dened as:
L(H) = P(F | H)
. Theodds infavor one hypothesisH 1
versusH 2
areexpressedbythe likelihoodratio:
R = L(H 1 ) L(H 2 ) .
Maximum likelihood estimation is used to nd a
θ
which maximises theLOD-scorefor allmarkers(See[Ott99 ,Chapter 3.3]for moreon thestatistics
ofLOD scoresand maximum likelihood estimation). The probability of each
inheritance vector
v ∈ v
is used in the expression for the score (See [Gud00]fora more detailedpresentation oftheLOD score computation).