Summary

pedigrees actually receives the allele IBD from individual 4 and individual 5,

respectively. The case where

v(4, p) = v(5, p)

^is symmetrical.

For the inheritance patterns in the set

v _I

^where ^the ^two sub-pedigrees are independentinthe examplepedigree foundin Figure 4-7,the singlepoint

prob-ability isthe product of the probability of the inheritance patterns for the

sub-pedigree below individual 4 and the probability of the inheritance patterns for

the sub-pedigree below individual 5 16

From Theorem 8 it follows that the single point probability distributions

fortheinheritance patterns

v _I ⊂ v

^,^for^which ^two ^subsets^ofnon-foundersare inheritance independent, can be represented as two probability distributions

insteadofone. Thisreduces thespacecomplexity,since

v _I

^can^berepresented in

O(2 ^b ⁰ )

^space, ^using^a ^atrepresentation, forthesub-pedigreewiththe non-foundersin

N ⁰

^plus

O(2 ^b ⁰⁰ )

^space ^for ^thenon-foundersin

N ⁰⁰

^,^where

b ⁰

^and

b ⁰⁰

arethenumberofbitsneededtorepresentthetwosub-pedigrees. Ifthesubset

v _I

^of

v

^was represented in one distribution it would require

O(2 ^b ⁰ ⁺ ^b ⁰⁰ )

^space

insteadof

O(2 ^b ⁰ + 2 ^b ⁰⁰ )

^space.

This could open for a more compact representation for parts of the state

space, where subsets of a pedigree are independent. The principle of

decom-positioncan alsobeapplied recursivelyto independent subsets. Howwell this

workswillofcoursedependonthe pedigreestructure. Weconsidertheideaof

decomposition based on inheritance dependence a strategy which might lead

toan improved algorithm for linkage analysis. However, this requires:

1. A data structure which can represent the state space symbolically with

respectto the dierent inheritance dependencies for dierent subsetsof

the statespace.

2. Amultipointprobabilitycomputationalgorithmtoworkeciently with

sucha structure.

wouldimplythatthetaskofdeterminingthecompatible inheritancevectorsis

noteasierthansolvingaNP-completeproblem,ifsomereductioncanbe

estab-lishedtotestingmembershipoftwolanguagesinthesereductions. Methodsfor

makingamorecompactrepresentationofthestatespacewereanalysedaswell

asa method for exploiting inheritance independence to decompose the

pedi-gree. We found that MTBDDs areinappropriate because of thehigh number

of equivalence classes in the multi point probability computation.

Further-more, either the PDG structure must be adapted to our problem domain or

a set of ecient algorithms need to be developed. We formalized inheritance

independence,thatmightreducethecomplexityofbothsingleandmultipoint

linkage analysis, this also dependson whether an ecient structure and aset

of ecient algorithms can be developed. This is also subject for further

re-search. Thus, none of the techniques considered in this chapter seem to be

directlyapplicable to linkage analysis, but might serve asan inspirationfor a

newcustomisedmethodfor linkage analysis.

Conclusion

As computer scientists we have moved into a new area, namely the eld of

bioinformatics. More specically, we have worked with the area of linkage

analysis,whichisa recognisedwaytobuild geneticlinkage maps. Bylocating

genes responsible for traits, within the human genome, companies such as

deCODEGenetics hope to beable to developdiagnostics products, drugs, or

evencuresfordiseases. Duetotheamountofcomputation requiredtoextract

thestatisticaldata, onwhich evidence oflinkage is based, onlypedigrees ofa

moderatesize arefeasible today foranalysis.

The bioinformatics eld spans a variety of sciences. Originally, genetic

analysiswas performed by applying methods of biologyand statistics, but as

the technology for extracting genetic information has improved, the amount

of available data has introduced a need for eective algorithms. However,

currently, no formal framework in which to reason about thecorrectness and

complexityof the algorithmsof linkage analysisexist.

Today's literature onlinkage analysisdisplay quitea diversityinterms of

notation and denitions. Since further developments within bioinformatics,

and especially linkage analysis, require people with dierent backgrounds to

communicate and cooperate,thelackofaformalframework makesithard for

thesepeopletoexpressthemselvesprecisely. Wehavebuildaformalframework

inwhich it is possible to express and reason about linkage analysis precisely.

The formalmodel presented herecould hopefully be the formalmodel which

will serve as a common base for the community, as it is true to the widely

adoptedbiologicalmodel.

Within our framework we have been able to prove and express the

cor-rectness of two algorithms for single point probability computation, namely

Kruglyakand FastTreeTraversal - thetwo algorithmsusedinthe

soft-ware packages Genehunter and Allegro, respectively. We have also identied

thatthecorrectframeworkinwhichtodescribemultipointanalysisisBayesian

nosingleP-timesolvableformthatcanexpressthetaskofndingthe

compat-ibleinheritance vectors. Wehave madesome initial attempts to investigating

whetherthetaskofndingthecompatible inheritancevectorsisNP-complete.

However,more researchisneededtodetermine thecomplexityclassoflinkage

analysisandits sub-tasks.

We have considered two common data structures for symbolic

represen-tation to represent probability distributions over inheritance vectors, namely

MTBDDs and PDGs. Both data structuresdo not seem to work well for our

problemandwouldrequiresomemodicationsinordertoreducethe

complex-ityof linkage analysis. MTBDDs seem applicable insingle point analysis but

failinmultipoint analysis,duetodiversityoftheprobabilities inthe

distribu-tionoverinheritancevectors. PDGsseem morepromising forrepresenting the

intermediateresults,but manyunanswered questionsneedsto beresolved.

Now that we have a formal framework, we are able to reason about new

algorithms. Until now, we have not been able to nd an existing data

struc-ture which will reduce the complexity of single and multi point analysis in

practice. Anatural future work aspectis to try to developa customised data

structureforlinkage analysis,andmaybe tobasethisstructureon inheritance

dependencies. Another future research direction is investigating the deeper

theoretical nature of the complexity of linkage analysis. Finally, we mention

approximationas possible futureresearch,since approximations to problems,

knownto be computationally hard,insome cases yieldusable results.

Linkage Analysis Tools

Thisappendix contains descriptions of some the programs available for

com-putationon geneticdataand isbased on[Ott99 ]and [oGAS00 ].

Using computers in linkage analysis is not new. Initially computers were

usedtostore handcalculatedLODscore,howeverassoonas1955LOD scores

were calculated by vacuum tube based computers. Computers has

tradition-ally been exploited as powerful calculators and as storage media, but due to

vast amounts of genetic data, researchers turn to more (computer) scientic

approacheslike datamining, [Ram01].

Inthefollowingthemostpopularprogramsforlinkageanalysisandrelated

geneticcomputationsarepresentedinalphabetical ordering.

Allegro

Allegrois a tool for multi point linkage analysis, that isable to perform both

parametriclinkage analysis and analysis based on allele sharing models. The

program provides features like estimation of total number of recombinations

amongmarkers, itcomputesposteriorIBDsharingprobabilities andisableto

reconstructhaplotypes. LiteratureandsourcecodeofAllegrowaskindlymade

availabletousbydeCODEGenetics,andprovidedinsighttothecomputations

neededfor linkage analysis.

AllegroiswritteninC++byDanielF. Gudbjartsson, KristjanJonasson,

Au-gustineKong,Michael L.Frigge,deCODE Genetics, Inc. Iceland.

ASPEX

ASPEX is a package of programs for performing multi point exclusion

map-ping of aected sibling pair data for discrete traits. The tool is able to use

allele frequencies to reconstruct missinginformation, and is tailored for data

tancefrom sibpair data. Itcan do transmissiondisequilibrium testing andis

ableto verifying the degreeof relatednessof individuals withina family.

Thepackage iswritten inANSI CbyProfessor Neil Risch,Stanford.

Thesource codefor ASPEX, andpre-compiled binaries areavailable at:

ftp://lahmed.stanford.edu/pub/aspex

CRI-MAP

CRI-MAPallowsrapid,largelyautomatedconstructionofgeneticlinkagemaps,

generatesLODtables,anddetectsdataerrors. Thetoolwasoriginallydesigned

to handlecodominant lociscored on pedigrees"without missingindividuals",

suchasnuclear families, but can nowbeused ongeneral pedigrees, and some

diseaseloci.

CRI-MAPisimplementedbyProfessorPhillip Green,WashingtonUniversity.

Version 2.8 of CRI-MAP is distributedas source code in thelanguage C and

isavailable at:

http://compgen.rutgers.edu/multimap/crimap/

Genehunter

Genehunter provides a wide range of tools for performing linkage and

dise-quilibrium analyses. The package is able to perform multi point analysis on

moderate size pedigree, computing LOD and NPL scores. Genehunter can

be used to do sib pair analysis. In addition, tools are available to search for

association or disequilibrium in addition to linkage. Genehunter provide

cal-culations involving dozens of markers, even in pedigrees with inbreeding and

marriage loops. Additionally, the multi point inheritance information allows

thereconstructionofmaximum-likelihoodhaplotypesfor allindividualsinthe

pedigreeandinformation content mapping whichmeasures thefractionof the

totalinheritance information extracted fromthemarker data.

Genehunter hasa C source code and is developed byLeonid Kruglyak, Mark

Daly, Mary Pat Reeve-Daly, Eric Lander Whitehead Center for Genome

Re-search,MIT.

Version 2wasreleased 1998 andis available at:

http://www.fhcrc.org/labs/kruglyak/Downloads/

HOMOZ

HOMOZ is a program for rapid multi point mapping of recessively inherited

diseasegenesinnuclear familiesincluding homozygositymapping. Itincludes

anecientalgorithmforcomputationofmultipointLODscoresinsmall

pedi-greesincludingthosewithinbreeding loopsandmissinggenotypeinformation.

WhiteheadCenterfor GenomeResearch,MIT.

HOMOZcan bedownloadedat:

ftp://ftp-genome.wi.mit.edu/distribution/software/homoz/

LINKAGE

Thecoreofthe LINKAGEpackage isaseriesofprogramsformaximum

likeli-hoodestimation ofrecombination rates, calculation of LOD score tables, and

analysisof genetic risks. The analysis programs are divided into two groups.

Therstgroupcan beusedforgeneralpedigreeswithmarkeranddiseaseloci.

Programsinthesecondgroupareforthree-generationfamiliesandcodominant

marker loci, and are primarily intended for the construction of genetic maps

fromdataonreferencefamilies. LINKAGEiscapableofmakingtwo-pointand

multi point analysis. LINKAGE is a Pascal program of Dr. Mark Lathrop,

Centre Nationalde Genotypage, EvryCedex,France.

Version 5.5available at:

ftp://linkage.rockefeller.edu/software/linkage/

LIPED

The LIPED program carries out genetic linkage analysis. It estimates the

recombinationfraction,bycalculatingpedigreelikelihoodsforvariousassumed

valuesoftherecombinationfraction. Theprogramisabletohandleageofonset

data. Onlytwo locican behandled at atime.

LIPED is written inFortran IV (1974) byProfessor Jurg Ott, Laboratory of

StatisticalGenetics, RockefellerUniversity.

Latest versionisfrom June 1995 andis availableat:

ftp://linkage.rockefeller.edu/software/liped

Loki

Loki analyses a quantitative trait observed on large pedigrees using Markov

chains, Monte Carlo, multi point linkage, and segregation analysis. The trait

maybe determined bymultiple loci.

Loki is implemented inC by Postdoctoral Researcher Dr Simon Heath,

Uni-versityof Washington.

LokiVersion2.3 wasreleasedNovember2000 and isavailable at:

ftp://ftp.u.washington.edu/pub/user-supported/pangaea/PANGAEA/Lok i

MAP+

The MAP+ program is written to construct high resolution linkage maps.

MAP+ is Fortran based and implemented by Dr. A. Collins, J. Teague and

ProfessorN.E. Morton, UniversityofSouthampton.

MAP+wasreleased in1996 andis available at:

http://cedar.genetics.soton.ac.uk/pub/PROGRAMS/map+

MAPMAKER

TheMAPMAKERsoftwarepackageisconsistingoftwopartsMAPMAKER/EXP

and MAPMAKER/QTL. MAPMAKER/EXP performs full multi point

link-ageanalysisandestimatesrecombinationfractionsfordominant,recessive,and

codominant markers. MAPMAKER/QTL is a companion program to

MAP-MAKER/EXP which allows mapping of genes controlling polygenetic

quan-titative traits in F2 inter-crosses and BC1 back-crosses relative to a genetic

linkage map.

MAPMAKERisprogrammedinCbyEricLander,Ph. D.,Directorof

White-headCenterfor Genome Research,MIT.

Version 3.0ofthe toolfrom 1992 isavailable at:

ftp://ftp-genome.wi.mit.edu/distribution/software/mapmaker3

MENDEL

MENDELdoesgeneticanalysisofhumanpedigreedataundermodelsinvolving

a small number of loci. MENDEL is useful for segregation analysis, linkage

calculations,geneticcounseling, allelefrequencyestimation, and relatedkinds

of problems. MENDEL is able to handle looping pedigrees eciently, which

tendsto be problematic inother tools.

MENDELisimplementedinFortran77 by ProfessorKenneth Lange, UCLA.

For registration anddownload visit:

http://www.biomath.medsch.ucla.edu/faculty/klange/software.html

Pedigree Analysis Package(PAP)

PAPmaybeusedforsegregationanalysis,variancecomponentsanalysis,

link-ageanalysis,measuredgenotypeanalysis,orgeneticmodeltting. Thegenetic

modelmay contain anynumber of lociand alleles. Phenotypes maybe

simu-latedassuming any modelavailable inPAP.Linkage analysison fourloci.

PAP comprisesa setofFortran77 programsproduced byAssociateProfessor

SandraJ.Hasstedt,Department ofHuman Genetics,University ofUtah.

Revision4.0 fromMarch 1994 isavailable at:

ftp://ftp.genetics.utah.edu/pub/software/pap

VITESSE is a software package that computes likelihoods. VITESSE uses

thealgorithms of set-recodingand fuzzy inheritance to reduce thenumber of

genotypesneededforexactcomputationofthelikelihood,whichacceleratesthe

calculation. It also representsmulti locus genotypes locus-by-locus to reduce

thememory requirements.

VITESSE is developed in ANSI C by Assistant Professors Jerey O'Connell

andAssociate ProfessorDanE. Weeksat the University ofPittsburgh.

Version 1.0publishedin1995 isavailableat:

ftp://watson.hgen.pitt.edu/pub/vitesse/

LOD and NPL score

Ascoring function isa measure ofthe likelihood of linkage between a marker

andatrait,forwhichthelocusofthemarkerisknown,butforwhichthelocus

orlociofthetraitcausing gene(s)areunknown. Notethatwhenoneislooking

forthe geneticcauseof atraita numberof pedigrees areused.

It isnot within the scope of thisprojectto go into detailwiththescoring

functions,asthey rely the statisticalpart oflinkage analysiswhich is not our

focus. Wewill,however,brieydescribetwocommonlyusedscoringfunctions:

The Log Likelihood of Odds (LOD) score, and the Non-Parametric Linkage

(NPL)score.

B.0.1 LOD score

TheLOD scoreis ameasure ofhowplausible an observed setof dataisgiven

a model, [Ott99 , page 38]. It is the logarithm of the odds of the observed

data if linkage is assumed (recombination fraction

θ < ¹ ₂

⁾ ^compared ^to ^the

null-hypothesis (no linkage)where therecombination fraction

θ

^is^equal ^to

¹ ₂

Formallythe LOD scoreis, [Ott99 ,page 39]:

LOD(θ) = log ₁₀ L(θ) L( ¹ ₂ )

,

^(B-1)

where

L

^is ^the ^likelihood ^of ^a ^hypothesis

H

^given ^some observation data

F

dened as:

L(H) = P(F | H)

^. ^The^odds ⁱⁿ^favor ^one ^hypothesis

H ₁

^versus

H ₂

areexpressedbythe likelihoodratio:

R = L(H ₁ ) L(H ₂ ) .

Maximum likelihood estimation is used to nd a

θ

^which ^maximises ^the

LOD-scorefor allmarkers(See[Ott99 ,Chapter 3.3]for moreon thestatistics

ofLOD scoresand maximum likelihood estimation). The probability of each

inheritance vector

v ∈ v

^is ^used ⁱⁿ ^the ^expression ^for ^the ^score ^(See ^[Gud00]

fora more detailedpresentation oftheLOD score computation).

In document BRICS Basic Research in Computer Science (Sider 91-101)

v(4, p) = v(5, p)

v I

v I ⊂ v

v I

O(2 b 0 )

N 0

O(2 b 00 )

N 00

b 0

b 00

v I

v

O(2 b 0 + b 00 )

O(2 b 0 + 2 b 00 )

θ < 1 2

θ

1 2

LOD(θ) = log 10 L(θ) L( 1 2 )

,

L

H

F

L(H) = P(F | H)

H 1

H 2

R = L(H 1 ) L(H 2 ) .

θ

v ∈ v

v _I

v _I ⊂ v

v _I

O(2 ^b ⁰ )

N ⁰

O(2 ^b ⁰⁰ )

N ⁰⁰

b ⁰

b ⁰⁰

v _I

O(2 ^b ⁰ ⁺ ^b ⁰⁰ )

O(2 ^b ⁰ + 2 ^b ⁰⁰ )

θ < ¹ ₂

¹ ₂

LOD(θ) = log ₁₀ L(θ) L( ¹ ₂ )

H ₁

H ₂

R = L(H ₁ ) L(H ₂ ) .