BRICS Basic Research in Computer Science

(1)

B R ICS R S -02-7 In g ´olfsd ´ottir et a l.: A F o rmalization o f L in k age An alysis

BRICS

Basic Research in Computer Science

A Formalization of Linkage Analysis

Anna Ing´olfsd´ottir

Anders Lyhne Christensen Jens Alsted Hansen

Jacob Johnsen John Knudsen

Jacob Illum Rasmussen

BRICS Report Series RS-02-7

(2)

Copyright c 2002, Anna Ing´olfsd´ottir & Anders Lyhne Christensen

& Jens Alsted Hansen & Jacob Johnsen & John Knudsen & Jacob Illum Rasmussen.

BRICS, Department of Computer Science University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Report Series publications.

Copies may be obtained by contacting:

BRICS

Department of Computer Science University of Aarhus

Ny Munkegade, building 540 DK–8000 Aarhus C

Denmark

Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk ftp://ftp.brics.dk

This document in subdirectory RS/02/7/

(3)

Anna Ingólfsdóttir Anders Lyhne Christensen

Jens Alsted Hansen Jacob Johnsen John Knudsen

Jacob IllumRasmussen

February 2000

(4)

Inthisreportaformalizationofgeneticlinkageanalysisisintroduced. Linkage

analysisisa computationallyhardbiomathematical method,which purposeis

tolocategenesonthehumangenome. Itisrootedinthenewareaofbioinfor-

maticsand noformalization of themethodhaspreviously beenestablished.

Initially,thebiologicalmodelispresented. Onthebasisofthisbiologicalmodel

we establish a formalization that enables reasoning about algorithms used in

linkage analysis. The formalization applies both for single and multi point

linkage analysis. We illustrate the usage of the formalization in correctness

proofsof central algorithmsand optimisationsfor linkage analysis.

A further use of the formalization is to reason about alternative methods for

linkage analysis. We discuss the use of MTBDDs and PDGs in linkage anal-

ysis,since they have proven ecient for other computationallyhard problems

involvinglarge state spaces.

We conclude that none of the techniques discussed are directly applicable to

linkage analysis, however further research is needed in order to investigated

whethera modiedversionof oneor more of theseareapplicable.

(5)

List of Symbols v

1 Introduction 1

1.1 This report . . . 3

1.2 Acknowledgments . . . 3

2 Introduction to Human Genetics 5 2.1 Chromosome . . . 5

2.2 DNA . . . 6

2.3 Alleles . . . 7

2.4 Meiosis. . . 8

2.5 Locating traitgenes . . . 12

2.6 Summary . . . 13

2.7 Furtherreading . . . 13

3 Linkage Analysis 15 3.1 Introduction to Linkage Analysis . . . 15

3.1.1 Performing Linkage Analysis . . . 16

3.1.2 Missing andIncomplete Data . . . 17

3.2 Single Point Analysis . . . 19

3.2.1 FormalModel . . . 19

3.2.2 Single Point ProbabilityComputation . . . 26

3.3 Algorithms for Single Point Probability Computation . . . 28

3.3.1 Founderalleleassignment inpropositional logic . . . 29

3.3.2 Kruglyak'sAlgorithmfor SinglePointProbabilityCom- putation . . . 33

3.3.3 Single Point ProbabilityComputation inAllegro . . . . 38

3.4 Multi Point Analysis . . . 55

3.4.1 FormalModel . . . 56

3.4.2 Multi Point Probability Computation . . . 56

3.4.3 Multi point calculationusing Fourier Transforms . . . . 61

(6)

3.5 Summary . . . 65

4 Towards an Improved Method 67 4.1 Levelof Focus . . . 68

4.1.1 Methods Properties. . . 69

4.2 Hardness ofLinkage Analysis . . . 70

4.3 Symbolic Representation . . . 75

4.3.1 EvaluatingMTBDDs . . . 75

4.3.2 EvaluatingPDG's . . . 79

4.3.3 FutureHeuristics . . . 80

4.4 Utilising theStructural Informationof Pedigrees . . . 80

4.5 Summary . . . 83

5 Conclusion 85 A Linkage Analysis Tools 87 B LOD and NPL score 92 B.0.1 LOD score. . . 92

B.0.2 NPL score . . . 93

C Basic Probability Theory 94

D Markov Models 97

References 101

List of Figures 103

List of Tables 106

Glossary 107

(7)

P

^Pedigree

P = h F, N, f ather, mother i

^,^see^Denition ^1.

f ather

^F^ather^function

f ather : N → V

^,^see^Denition ^1.

mother

^Mother^function

mother : N → V

^,^see^Denition ^1.

V

^Set ^of individuals in a pedigree (

P

^),

V = N ∪ F

^, ^see

Denition1.

F

^Set^of ^founders ⁱⁿ^a ^pedigree ⁽

P

^),

F ⊆ V

^, ^see^Denition

1.

N

^Set^of non-founders in a pedigree (

P

^),

N ⊂ V

^, ^see ^De-

nition 1.

ancestor

^Ancestor^function

ancestor : V → 2 ^V

^,^see^Denition ^1.

G

^Pedigree^graph

G = h P, →i

^, ^see^Denition ^2.

A

^Setôfâlleles ^for â ^pedigree, ^see^Denition ^3.

M

^A^marker

M = h A M , π M i

^,^see^Denition ^4.

A M

^The^set ^of ^allelic^states^for ^the ^marker

M

^,^see^Denition

4.

π M

^A ^function

π M : A M → [0, 1]

^, ^maps ^the ^allelic ^states ^of

A M

^to ^its^allele ^frequency,^see^Denition ^4.

G

^Genotype information

G = hL , astates i

^, ^see ^Denition

3.2.1.

L

^Set^of^genotypedindividualsforapedigree,seeDenition 3.2.1.

astates

^Allelic^state ^function

astates : L → {{a ₁ , a ₂ } | a ₁ , a ₂ ∈ A _M }

^,^see ^Denition^3.2.1.

v

Inheritance vector function:

v : N × { p, m } → { p, m }

^,

seeDenition 6.

v

^The^set ^of ^allinheritance vectorsfor a pedigree, see Sec- tion3.2.2.

p

^Parameter^for^theinheritancevectorfunction,denotesthe paternal allele,seeDenition 6.

m

^Parameter^for^theinheritancevectorfunction,denotesthe maternal allele,seeDenition 6.

Z M

^A^founder^alleleassignment,

Z M : F × { p, m } → A M

^. ^for

the marker

M

^,^see ^Denition ^7.

F

^Allele ^assignment ^function

F : (N × { p, m } → { p, m } ) → ((V ×{ p, m } ) → (F ×{ p, m } ))

^,^which^given^aninheritance vector gives thefounderalleles received byan individual

onthe maternal or paternal side,see Denition8.

F ^v F ^v = F (v)

^,^see^Denition ^8.

(8)

P (v |G )

^The probability of the inheritance vector

v

^given ^some

genotype information

G

^,^see^Section ^3.2.2.

P ( G| v)

^Theprobabilityofsome genotypeinformation

G

^given^an

inheritance vector

v

^, ^see^Section ^3.2.2.

P (Z _M )

^Theprobabilityofthefounderalleleassignment,seeSec- tion 3.2.2.

F _G ^v

^The^setôfâll^compatible ^founderâlleleassignmentsgiven theinheritancevector

v

^,^and^the^genotypeinformation

G

^,

seeSection 3.2.2.

f, f ₁ , f ₂ , . . .

^F^ounder ^alleles.

a ₁ , a ₂ , . . .

^Allelic^states.

$

^Denotes ^either

p

^for ^paternal ^or

m

^for ^maternal ⁽

$ ∈ { p, m }

^).

f ₍ _n,$ ₎

^The ^founder^allele^for ^founder

n ∈ F

f _F v ( n,$ )

^The ^founder ^allele ^inherited ^by

n ∈ V

^given ^the ^inheri-

tance vector

v

ϕ ^v _G

^Fôunderâlleleâssignment^formula^for^theinheritancevector

v

^and ^the^genotype information

G

^,^see^Theorem ^1.

CC

^The ^set^of ^connected ^components^for ^a ^graph.

U

^The ^setôf ûnassigned^founderâlleles, ^see^page ^42.

E

^The^set^ofambiguouslyassignedfounderalleles, seepage 42.

A

^The ^set ^of (unambiguously) assigned founder alleles, see page 42.

θ _i

^The recombination fraction between markers

M _i

^and

M i +1

^,^see ^Section^3.4.

Ham(v, w)

^The ^Hamming ^distance ^between ^two inheritance vectors

v

^and

w

^,^see^Denition ^11.

(9)

Introduction

Linkageanalysisisawellestablishedmethodusedtolocategenesinthehuman

genome. To be more exact, the gene, which is under investigation, is a gene

that is responsible for some trait that an individual possesses. Examples of

traitsunderinvestigationrangefromthe moresimple,suchasbloodtypeand

eyecolor,tothemoreseriousthatmight predisposean individualfor adeadly

disease.

Diseasecausinggenesforsomemajordiseaseshavealreadybeendiscovered

byperforming linkage analysis, e.g. Parkinson's disease, obesity,and anxiety,

[dNC01]. Although environmental factors inuence the disease development

onsome ofthese diseases,itwould be ignorant not to have their geneticcom-

poundsinmind for the following reasons, [dPGD01 ]:

•

^Their ^causes ^are^not ^fullyunderstood.

•

^Current ^treatments^are ^of^limitedeectiveness.

•

^There ^is ^currently ^no ^means ^of^tailoring ^treatment ^to ^the^cause.

•

^Genetic ^diseases ^are ^getting ^an proportionally increased inuence on peoples health in the western world, e.g. the percentage of childhood

deaths inUnited Kingdom hospitals attributable to geneticcauses have

increased from 16.5%in1914 to 50% in1976, [JCBW99 ].

Linkage analysis is a statistical method, which might locate certain trait

causinggenes,basedonsomebiologicalmodel. However,duetothecomplexity

of current algorithms incorporated in linkage analysis, the analysis is only

tractable on moderately sized data sets. Analysis of larger data sets bestows

theanalystwithanimprovedlikelihoodoflocatingthegenesforthetraitunder

investigation.

Ournalgoalistoenableareductionofthelinkageanalysiscomplexityby

(10)

thenumber of individuals involved in theanalysis. Dealing withexponential

state spaces is one of the areas where formal methods applied in the eld

of verication has proven successful in the past decades, and the advances

have often resulted in several orders of magnitude reductions in complexity.

These formal methods include symmetry reduction, partial order reduction,

abstraction,and compositionality.

To reach our nal goal, we dene three major sub goals. The rst is the

establishment of a formalmodelthat enables reasoning about thecorrectness

andcomplexity ofdierent approachesto linkage analysis. Secondly,we anal-

ysethestrategieswhichseem fruitfulto pursuefurther,inorderto reducethe

complexity of linkage analysis. The third and nal sub goal is to construct a

newmethod for linkage analysiswhich enables linkage analysisof larger data

sets.

We have withthis project moved into the relatively new area of bioinfor-

matics,which spansa numberof dierent sciences,suchasbiology,statistics,

mathematics,andcomputerscience. We areunawareofanyonewhohavepre-

viouslyattemptedtoformalizelinkageanalysisinacomputerscienticcontext.

Currently,several dierentsoftwarepackages exist,which canperform linkage

analysis. A brief overview of the major packages are given in Appendix A.

However, we have not been able to nd any formal framework, in which it

is possible to reason about the correctness and complexity of existing algo-

rithms. Furthermore, aformal framework would beconvenient for evaluating

new methods and for investigating the deeper theoretical nature of linkage

analysis.

The work represented here is based on a collaboration with the Icelandic

company deCODE Genetics (http://www.decode.is). deCODE Genetics is a

company conducting research into inherited causes of common diseases. The

research at deCODE is based on the Icelandic population for three major

reasons,the original geneticpoolis small,literaturedocuments thefamilyre-

lationsmanygenerationsback,andnallythe juridicalfoundationshavebeen

established. deCODE currently employs more than 600 people divided into

several departments ranging fromtheResearch Laboratory andtheDatabase

Division,to the Statistics Department with whichwe have collaborated. The

Statistics Department is, among other areas, responsible for developing soft-

warecapable ofdoing statistical analysisof genetic data. Part of our project

group has spent time working with Ph.D. Daniel Fannar Gudbjartsson, the

creatorof Allegro,which is thesoftwarepackage used at deCODE for linkage

analysis (See Appendix A). The stay gave the group a chance to verify and

extend our understanding of the methods currently used within the eld of

linkage analysis.

(11)

Wehave investigated to whatextent theareaof linkage analysisis formalized

inacomputerscienticcontext. Toour knowledge,noformalframeworkexist

todescribelinkageanalysis. Forthisreasonwedevelopsuchaframework,tobe

able to reason about the correctness of current and future algorithms. After

establishing the formal framework, we attempt to determine the complexity

of linkage analysis and based on this investigation we discuss approaches to

reduce the complexity in the average case. More specically, we investigate

howwell MTBDDsandPDGsworkforsymbolicrepresentationof probability

distributions inourproblem domain.

This reportis meant to be a self contained document. Thus readers with

a mathematical, engineering, or computerscience background should be able

toreadandunderstanditwithlittleor nopriorknowledgeonlinkage analysis

and genetics. Readers that do not have this background might experience

diculties in understanding the more computer scientic aspects, although

someintroductionisgivenintoselectedareasintheappendices. Inthebackof

thisreport we have provided a glossary(Seepage 107)of themost important

geneticterms, for quickreference, anda listof symbolscan befound on page

v.

The structure of the report is as follows: The rst part is an introduc-

tionto thebiological model. This partis not exhaustive, however it provides

a sucient background for understanding the motivation behind the two fol-

lowing parts. In the second part we build a formal framework, for proving

the correctness of the algorithms used in linkage analysis. In the nal part

ofthe reportwe state the properties which a newmethod should preserve, in

order to produce the same results as well established methods. Furthermore,

wediscussthe performanceofalternativestrategiesfor therepresentation and

computationaltasksinlinkage analysis.

1.2 Acknowledgments

First we would like to acknowledge Daniel Fannar Gudbjartsson, the creator

of Allegro, for his help and guidance as well as other people inthe Statistics

Department, and SverrirÞorvaldssonand GisliMàsson, at deCODE for their

patienceandwillingness to answerquestionsandgive explanations.

We would also like to thank Professor of Computer Science Finn Verner

JensenandAssistantProfessorofComputer ScienceThomasNielsen,Aalborg

Universityfor their helpwithBayesian theoryand hiddenMarkovmodels.

We would like to thank the following people for inspiration: Professor of

ComputerScienceKimG.LarsenandAssociateProfessorofComputerScience

(12)

MathematicsDepartmentat AalborgUniversity,ResearchAssistantProfessor

atBRICSAarhusChristianN.StormPedersen,ProfessorofComputerScience

at the University of Birmingham Martha Kwiatkowska and Ph.D. student at

theUniversityofBirmingham David Parker.

(13)

Introduction to Human

Genetics

This chapter introduces the reader to the terms and concepts of human ge-

neticswhich areessential forunderstanding linkage analysis. Readers whoare

familiarwithgeneticconcepts likealleles,mendelian inheritance,linkageanal-

ysis,andsoonmaywanttoskipthischapter. Weremindthereaderunfamiliar

with any of these concepts, that there is a glossary on page 107, where it is

possible to nda shortexplanation anda referenceto manyof thetermsand

concepts present inthis chapter.

Thisintroductionisbasedonvarioussources, suchas,[SR99],[Ott99 ],and

[KC97 ]. Forfurtherreading,we recommendthe readerto seeSection 2.7fora

shortdiscussionof some literature.

The structure is asfollows. We begin witha rather high leveldescription

ofgenetics,byintroducingthe twomajor concepts ofchromosomesandDNA.

Thisisfollowedbythe introductionofalleles andallelicstates, whichdescribe

thedierentformsagenecanassume. Twoimportant conceptsininheritance,

meiosis and recombination, are introduced. Furthermore, we introduce the

conceptof linkage analysis andnally summarizethechapter.

2.1 Chromosome

Insideeachhumancellthereare23pairsofchromosomes,22pairsofautosomes

common for both males and females, and one pair of sex-chromosome which

diers between the two sexes. A chromosome consists of two identical sister

chromatides - each a doublehelix DNA, strand, [SR99, page 30], thus a total

of92 DNAstrands, constituting 46chromosomes. During fertilization afetus

receivesonechromosomefromeachparent(actuallyitisonechromatide,which

lateris transformed into achromosome,but inthis context can bethought of

(14)

Thecomplete setof geneticmaterial is calledThe Human Genome.

2.2 DNA

Themolecule thatencodes geneticinformation of humans isdeoxyribonucleic

acid(DNA).DNAisadoublehelixmoleculethatconsistsofasugar-phosphate

backboneandfournitrogenousbases-adenine,thymine,cytosine,andguanine

(A, T, C, and G), [SR99, Chapter 1]. The nitrogenous bases are also called

base pairs (bps), as they always pair: A with T, and C with G. The base

pairsaresituatedbetween thetwo helixes,therebybinding thetwobackbones

together.

A combination ofthree bps are termeda codon 1

. Each codon speciesan

amino acid or a stop code (see Figure 2-1). After a stop code, the bps can

possibly specify a non-coding region, which for example could be hundreds

of C-T combinations. Thus, 3 bps code a codon, codons code the sequence

of amino acids, which again determines the genetic code for a single protein

molecule(agene),[Lan97 ,page246]. Proteins functionasenzymes,hormones,

etc. and therefore inuence many physiological and psychological traits, e.g.

eye-color and schizophrenia. Thus, DNA is the recipe for proteins where the

genetic code consists of nitrogenous bases interpreted in sequences of three.

When the genetic code is translated to a protein, each codon species an

amino acid. Amino acids are the building blocks of proteins. When a stop

codon is reached the translation process is ended and the protein is nished

andreadyto perform itsfunction,such asto signalother cellstostart orstop

theproduction of enzymes,incasetheprotein justbuilt isa hormone.

Some traitsare moredependent on outside stimuli than others, e.g. some

traitsmightnotshowatall,duetotheenvironmentalfactors,whileothertraits

only show in the presence of one or more outside stimuli. For example, it is

believedthatageneexists,thatpredisposesindividualsforobesity 2

. However,

theobesitytraitisonly showninthe casewhere plentiful nutrients have been

availablefor an individualwiththedisease gene.

The sequenceof base pairs encodinga single protein is calleda gene,and

thesection or position onthe chromosome harboringa geneisdenoted asthe

locus (plural: loci)of the gene.

Awellcharacterized locuscanserveasageneticmarker,bywellcharacter-

izedwemeanthat thelocusof thegeneisknownand thatthedierent allelic

statesof the geneis known (allelic stateswill be explained shortly). Markers

1

Also termedatriplet,[KC97 ,page325].

2

Welearnedthatagroupofstatisticiansandbiologistsarecurrentlyworkingonisolating

thegeneforthis disease,during ourstayat deCODEGenetics. Someresults havealready

beenreleasedasnews,[dNC01 ].

(15)

A single codone C

A C T

G

G A A

T T T

A C G C

G A T

T

T T

T A

A A C C A

C

G C G

G G

G Backbone

(DNA strand/chromatide) A single bp (A−T)

Figure 2-1: A subsection of a chromatide, which is also known as a

(subsectionofa)strand. ThemolecularstructureisaDNA

molecule.NoticethepairingAwithTandCwithGcom-

posing the18bps,thatarecoding6codons.

canbeusedasreferencepointsonachromosome andthey canbeanalysed in

order to determine thegenotype of an individual at a given marker, [Gud00,

page 9].

2.3 Alleles

An individual has two alleles at each locus of a gene, since one chromosome

is received from each parent. An allelic state refers to a specic encoding

(in codons) of a specic allele. The allelic states on the two chromosomes

arenot necessarily dierent, but they can be, hence the sequences of codons

codingthe alleles,andtherebygenes, aredierent. Anindividualissaidto be

homozygous ifthe maternal and the paternal allelicstates(the allelereceived

fromthemotherandtheallelereceivedfromthefather)atalocusareidentical.

If the allelic states are dierent, the individual is said to be heterozygous at

thatlocus.

Two alleles at a locus make up the genotype of that locus for a specic

individual. Ifapersonisheterozygousitispossiblethatonlyone ofthealleles

areexpressed. In this case, the allele expressedis called dominant, while the

allele not expressed is calledrecessive. If both alleles are expressed, they are

saidto becodominant. The expressionof agenotypeis calledthephenotype.

Example 1 As an example consider the human blood type. The gene for the

human blood type is located on chromosome 9 at band q34, [Lan97, page 2]

(band refers to a position on the chromosome). Three allelic states exists for

blood type:

A

^,

B

^, ^and

0

^. Êach ^person ^has ^two âlleles, ^however ^there êxists

onlyfourblood types 3

(phenotypes):

A

^,

B

^,

A / B

^, ^and

0

^,^even ^though ^there ^are

are six possible combinations of allelic states:

A / A

^,

B / B

^,

0 / 0

^,

A / B

^,

A / 0

^,

and

B / 0

^(note ^that ^the ^ordering ^is irrelevant). If an individual has the allelic 3

For simplicity we haveomittedthe fact, that apersoncaneitherberesus negative or

(16)

states

A / A

^,

B / B

^, ^or

0 / 0

^the ^individual ^is homozygous. In the event that an individualhas the allelic states

A / 0

^the ^individual^is heterozygous, andfor thesealleles

A

^is^dominant^and

0

^r^ecessive,^since^only^the

A

âlleleîsêxpressed,

givingthepersonthe

A

^blood^type, ^just^as^if^the^person^had^the^genotype

A / A

^.

B

^is ^also ^dominant ^with ^respect ^to

0

^, ^since ^a ^person ^will ^have ^blood ^typ^e

B

if the person has either the genotype

B / B

^or

B / 0

^. ^However,

A

^and

B

^are

codominant, since anindividual withthegenotype

A / B

^will^have ^neither^blood

type just

A

^or

B

^, ^but

A / B

^. ^The ^nal^blood^type

0

îsônlyêxpressed îfâ^person

has the genotype

0 / 0

^.

2.4 Meiosis

Inthissectionwepresentthemeiosisinasimpliedversion,however,sucient

for understanding the later chapters. Due to thecomplexity of thisbiological

processwe consider itout ofscopeto explain itindetail 4

.

Meiosisisanessentialprocess ofthehumanreproduction. Abstractly,itis

a cell division process that produces gametes, female egg cellsor male sperm

cells,whichcombineinreproductionandbecometheseedforanewindividual.

Meiosis isa process where a cell withone chromosome pair (a diploid cell)is

dividedinto fourcellswithonechromatideineach(fourhaploidcells). Thisis

illustratedintheleft half(beforethesecond

⇒

⁾ ^of^Figure ^2-2.

Selection based on Mendels first law (random) Paternal

Maternal chromosome chromosome

a new individual Chromosome pair of Similar process

Paternal diploid cell

Reproduction

gamete a maternal that creates Paternal gametes (4 haploid cells)

Meiosis

Figure 2-2: Meiosisillustratedforasinglechromosomepairof amale

and(partially)afemale. Theresultsoftheirmeiosisislater

combinedinreproductiontoformthechromosomepairof

anew individual. Note that this is simplied, in reality

there are 22 more chromosomes in each cell, to which a

similarprocesstakesplacein parallel.

The probability that agiven gamete is passed on to a child isassumed to

be random. Hence, probability

1 4

^for ^each ^of ^the ^four combinations. A single 4

See[KC97,Chapter2]foradetaileddescriptionofmeiosis.

(17)

2-2). Since a given allele is present in precisely two of the four gametes, the

probabilitythatallele beingtransmittedto anospringis

1 2

^(from

2 · ¹ ₄

^). ^This

hypothesis,thatthesourceof inheritance fora singlealleleisequallylikely to

beeitherofthetwo grandparents, wasstated byGregorJohann Mendelinthe

18'thcentury and isknownasMendel's rst law,[KC97 , Chapter3].

Thepairofallelesreceivedfromboththeparentsatagivenlocusconstitute

achild'sgenotypeatthatgivenlocus. AsanexampleseethepedigreeinFigure

2-3.

Female individual Male individual Genotype information

Two alleles at two loci on Information for one locus

the same chromosome

X x A a

Grandfather Grandmother

Father Mother

Son Daughter

X X

A A a a

x x

A a

X x x x

a a

A a A a

x x X x

Legend:

Figure 2-3: Anexampleofapedigree. Thefamilyisconstitutedbytwo

grandparents,twoparents,andtwochildren. Boldletters

nexttotheindividualsindicategenotypedinformation,e.g.

the daughter is

A/a

^at ^the ^rst ^marker ^and

X/x

^at ^a

secondmarker.

Recombination

Recombination refersto new combinations of genes. This phenomenon arises

during meiosis when the two chromosomes of a pair, one inherited from the

fatherandone fromthemother,are(mostprobably)recombined toformfour

newunique chromatides, asdepicted inFigure 2-2.

The cause of a recombination is the biological event called crossover 5

. A

crossoverinvolvesphysical breakage ofthe chromosome into one paternal and

one maternal chromatide, and a joining of the paternal and maternal ends,

[SR99, page 40], we have illustrated this in Figure 2-4. Note that we write

A / a

^when^a^person^has^the^allelic^states

A

^and

a

^at ^a^given^locus,^whereas^we

write

A X a

x

^when

A

^and

X

^reside^on^one^chromosome^while

a

^and

x

^reside^on^the

other. When two alleles at two dierent loci reside on the same chromosome

(18)

exampleweassumetoknowwhich allelesresideonwhichchromosomes, hence

thephases of thealleles areassumedto beknown. Whenwe do notknowthe

phasesof a set ofalleles, as itis oftenthe case for linkage analysis, thephase

isunknown.

Example 2 IntheinheritanceexampleofFigure2-3thesonisarecombinant,

whereas the daughter isa nonrecombinant, with respect tothe two loci shown.

Inthe example theson isthe recombinant, since thealleleson thefather's two

chromosomes,

A X

^and

a

x

^,^have ^recombined ^to^form^then ^new

A

x

combination,so

A

ôriginates ^fromâ ^dierent ^chrômosome ^than

x

^.

Meiosis assuresthatthe number ofchromosomesina humanindividualis

constant, since only oneof the four gametes ispassed on to anew individual.

Thisgamete isthencopied inorder to constitute afull chromosome withtwo

sisterchromatides for thenew individual. Furthermore, meiosisand recombi-

nationisthe reason thatthechild's chromosomes, constituting achromosome

pair,aremostlikelynot identicaltoanyofthefourchromosomesourcesofthe

twoparents. Recombinationsensuresdiversityanduniquenessinapopulation.

a

b

C

Resulting 4 strands B

C A

B

C

A a

b

c

b

c A

B

C

A a

B

c a

b

C

B

C A

b

c

A a

B

c a

b

c

2 chromosomes

Crossover

Start meiosis End meiosis

4 strands

1 chromosome pair

Figure 2-4: The gure illustrates how a crossover might occur on a

chromosome,wherethreemarkergenesareresiding. Hence

this meiosis leads to the gametes {

A, B, C

^}, ^{

A, b, c

^},

{

a, B, c

^}, ^and ^{

a, b, C

^}, ^where ^all ^but ^{

A, B, C

^} ^are ^re-

combinations,[Ott99,page10]. Theelementsatthestart-

ingpointisonechromosomepair,twochromosomes,which

eachconsistsoftwostrands(DNAmolecules).

Theprobabilitythatanoddnumberofcrossoversoccurbetweentwolociis

termedthe recombination fraction. Note thatifan even number of crossovers

(19)

same parental chromosome. Thus the individual inheriting this chromosome

have inherited the genesat this pairof locifrom thesameancestral origin.

Example 3 Assume that a new chromatide is assembled fromone end tothe

other. Assume that the rst part of the chromatide is based on one of the

paternal chromatides, a crossover occurs, hence the chromatide is now being

constructed from a maternal chromatide. If a new crossover occurs (thus we

have had two crossovers so far) the source of the assembly is now back to a

paternal chromatide. Multiple switchescan occurbetweenthe paternalandthe

maternalchromatides.

Inasense,recombinationsdonotoccurcompletelyatrandom, [SR99,page

271]. The concept of interference states that in the immediate vicinity of a

crossover, it is virtually impossible for another crossover to occur. Thus the

recombination fractionremains very close to zero for some distancealong the

chromosome near a crossover. The recombination fraction between two loci

is related to their physical distance. The mathematical relationship between

physical distanceand the recombination fraction is given by a map function.

ThemostwidelyusedmapfunctionistheHaldanemapfunction,[Gud00,page

7]:

d = − 1

2 ln(1 − 2θ), θ = 1

2 1 − e ⁻² ^d ,

where

θ

^is ^the recombination fraction and

d

^is ^the ^distance ⁱⁿ ^Morgans.

On average, approximately one recombination occur for each

10 ⁸

^bps. ^The

distancebetweentwolociisdenedas1Morgan(M),ifonerecombination can

be expectedbetween themduring meiosis(hence

1

^M ^is^roughly^equivalent ^to

10 ⁸

^bps). ^Thus, ^there ^is ^a correspondence between the distance between two lociandthelikelihoodofarecombination. Thegreater thedistance,thecloser

therecombination fractionapproaches

1 2

^,^{[Ott99 ,}^page ^14]. ^Note^that^between

two loci residing on dierent chromosome pairs the recombination fractionis

1 2

^,âsêach^meiosisônêither^chromosomeâreindependent. Twolociaresaidto be linked ifthe recombination fractionbetween themis lessthan

1

2

^,^otherwise

the two loci are unlinked. This is also an observation made by Mendel, in

Mendel's second law, he states that unlinked, segregating gene pairs assort

independently atmeiosis,[Ott99 ,page 38]. Wewillreferto Mendel'srstand

secondlawasMendelian inheritance 6

.

(20)

withrespecttophysicaldistance. Itcandeviatealongachromosome anddur-

ingfemalemeiosisthe recombination fractionison average twicethat ofmale

meiosis. In the present study we assume that the recombination fraction is

constant, throughout the genome andthe same for both male and female 7

.

2.5 Locating trait genes

The Human Genome project deals with sequencing the chromosomes sothat

theorder of genes is established. However, when this work is completed, the

taskstill remains to uncoverwhichgenes areresponsible for whattrait 8

.

Doctors, physicians, psychologists, psychiatrist, etc. might suspectthat a

traitwithintheireldiscausedbysomegeneticmutation. Thesuspicionmight

arisefromthefactthatthetraitisoftencarriedbymanyinsomefamiliesand

bynone inotherfamilies. Ifthe trait causing gene(s) couldbe located, i.e. if

thelocus or loci could be identied, it is possible to infer the protein(s) that

causea given trait. Thisknowledgecan beusedfor:

1. Diagnosis: If the locus and the allelic state which is responsible for

some trait is known, a blood sample would be sucient to diagnose a

person,since theresultof asequencing ofthepersonsDNAat thelocus

could be matched against thesequenceof thetrait causing allelicstate.

It would also be possible to assess the chance of a person developing a

trait, by knowing the personsgenotypeat thetrait locusor loci.

2. Prescribeeectivedrugs: Somepeoplerespondtosomedrugs,while

others donot. If the responsiveness isdependent on one or more genes,

then a determination of a persons allelic state for those genes can lead

to the correct drugprescription.

3. Cure: If the proteins responsible for a trait are known, it might be

possible to develop a drug which surpassesthe eect of the proteins or

the proteinsthemselves.

Two kinds of genetic studies exist inthe literature. Both aim at locating

genes causing traits: Association studies and linkage analysis, [Cur01 ]. In

associationstudiestheaimistolocateatraitgenebylookingformarkeralleles

more frequent ina group of unrelated individuals carrying a trait, than to a

groupof people not carrying the trait. E.g. iftwo groups of 1000 individuals

are studied, and 80% of the individuals in one group - the group of people

7

Gudbjartssonmakesthesameassumptionsin,[Gud00 ].

8

Currently,thereisquitealargeamountofdataavailable,[BO98],howeverthecompu-

tationalpowerisinsucienttoprocessthisdata,[JIS93 ].

(21)

by30%ofindividualsintheothergroupofpeoplenotcarryingthetrait,there

issomeevidencethatthetraitandthealleleareclosetooneanother(i.e. they

are linked). Association studies are only useful when the trait causing gene

is located very close to a marker gene, because it is based on a rather sharp

divisionof thetwo groups(eithertheyhave thetrait andthemarkerallele,or

theydo not have either).

In linkage analysis related people are studied and dierent pedigrees are

studied independently. The aim inlinkage analysisis to nd the locusof one

or more genes causing a trait, by looking at inheritance patterns of markers

andcomparingthesewiththeinheritance patternsofthetraitinquestion. By

analysinghowdierentmarkersaresegregatedcompared tothesetof eected

individualsthelocationofthetraitgenecanbeapproximated. Linkageanalysis

isthefocusofthe remaining partof this report.

2.6 Summary

Inthis chapter we have introduced the reader to thebiologicalbackground of

thisreport, inorderto giveabasisfor understandingthebiologicalcontext of

therest of this report. We have explained the structure ofa DNA molecules,

andhowthey combine to the 23 chromosomes thatencodes thegeneticinfor-

mationofanindividual. Furthermore,we introducedthereader totheprocess

ofgeneticinheritance. Howtheallelesareinheritedisbasedonsomeprobabil-

ity(Mendelianinheritance), whichagain isinuenced bybiologicalproperties

likerecombinationfraction,interference,etc. Themaintaskoflinkageanalysis

isto locatethe geneonthe humangenome,which causesthetrait of interest.

From now on we will abstract away from biological details. The focus is

now on wholegenes (markers), which issucient to describelinkage analysis

formally. Another prime concept is that of pedigrees. As stated earlier, the

glossaryonpage 107 can provide quickreference.

We now concentrate onthe taskoflinkage analysis,thus locating thepos-

sibleposition of atraitcausing gene ona single chromosome. The Mendelian

laws of inheritance have a major role in linkage analysis. It is safe only to

considera single chromosome pair, since genes ondierent chromosome pairs

assortindependently.

2.7 Further reading

Theliteratureis dividedinto two categories:

Biologyorientated: Thisliteratureincludes[KC97]and[SR99],which

(22)

puter scientist, much of the information in these books is superuous

for understanding the biological modelassumed for thealgorithms pre-

sentedinthefollowing chapters. However,theycanserveasanextensive

reference.

Statistics orientated: This literature includes [Lan97] and [Ott99 ],

which arewritten by peoplewith a background in mathematics and/or

statistics. All of the books we have read in this category have an in-

troductorysectioncontaining abrief descriptionof thebiologicalmodel

and terms used. We have found that this kind of literature are inade-

quate to the readerwitha background incomputerscience, because the

maintopic ofthesebooks,thestatisticsmodels,outweighsthebiological

descriptions.

In our case the process of learning bioinformatics has largely been an it-

erative process from biologyto statistics, back to biology, etc. Thisis mainly

becausewehave foundlittle literatureon thistopic,whichwaswrittenfroma

computerscienticperspective.

We have found two references, thatis not biology and statisticsoriented.

Bioinformatics - A Practical Guide to the Analysis of Genes and Proteins,

[BO98 ], islargely a guide bookfor people interested inlocating software and

data for genetic analysis. It contains user guides for various genetics analy-

sis software, as well as guides on how to nd and extract data from public

available databases. Another exception is Faster Sequential Genetic Linkage

Computations, [JIS93]. This is probably the most computer science oriented

paper,itincorporatestwosynthesisofthebiologicalmodelandfourimplemen-

tationlevelspeed ups of central algorithms. Something thatisomitted inthe

paperisadiscussionofthe previouslymentionedformaltechniques,forspeed-

ingup computations. This is most probably because the paper is rather old

(1993),compared to thedevelopment ofthesetechniqueswithinthecomputer

scientic community.

(23)

Linkage Analysis

Thepurposeof this chapter is to describe the conceptof linkage analysisina

formalfashion. First, we motivate linkage analysis by giving an informal de-

scriptionofthe concept,basedonhowitisperformed. Afterthiswe formalize

theunderlying biologicalmodel, this includesformalizing pedigrees, genotype

information, inheritance vectors, and founder allele assignments. The meth-

ods implemented in the software currently used to solve the linkage analysis

problemconsists of two algorithms, namely one to solve single point analysis

and one to solve multi point analysis. After introducing the biologicalmodel

we rstfocusonsingle point linkage analysis. We state theproblemformally,

followed by a formal description and proof of the correctness and complexity

ofthealgorithms implemented inGenehunter andAllegro 1

. Sincenoprevious

framework has been established we prove these two applied algorithms. The

purpose ofmulti point isto gain moreaccurate estimates of linkage. Ourfor-

malizationisexpandedinthedescriptionofmultipoint analysis. We describe

multi point probability calculation as a Bayesian network and not as a hid-

denMarkovmodel, aspreviouslydescribedintheliterature,e.g. aspresented

in [LG87 ]. Finally, we describe two reductions, one that is applied in both

Genehunterand Allegroand theother onlyinAllegro.

3.1 Introduction to Linkage Analysis

Linkage analysisis aprocess for constructing genetic linkagemaps. A genetic

linkagemapisadescriptionofhowanumberoflocionachromosomerelatein

termsof genetic distance. Hence, a genetic linkage mapconsists ofa number

ofloci for which theorder and relative distance(recombination fraction 2

) are

known.

1

SeeAppendixAforalistofsoftwareusedtoanalysegeneticdata.

2

Seepage10.

(24)

of detailed genetic linkage maps based on molecular markers 3

(from now on

just marker). A marker is a polymorphic DNA or protein sequence deriving

from a single chromosomal location, [SR99, page 551]. Historically, markers

were limited to observable traits known to follow the mendelian 4

inheritance

pattern, e.g. as blood groups. Today, however, certain highly polymorphic

structuresofthehumanDNAcanbetypeddirectlyonlargescalebyautomated

equipment. Thesestructuresarecalledmicrosatellites which aredi-, tri-,and

tetraneucleotide repeats,and singlenucleorite polymorphisms 5

.

In the following we give a overview of how linkage analysis is performed,

andwhich dataisavailable.

3.1.1 Performing Linkage Analysis

Thedata neededinorder to perform linkage analysisis:

•

^A ^genetic^linkage ^map.

•

^A ^number^of ^pedigrees.

•

Information on theinheritance ofmarkers.

•

^Trait ^status^for ^some^or ^all ^pedigree ^members.

Figure 3-1 depicts a pedigree with the information for one marker. The

information is usedto infer evidence thatthe inheritance pattern of the trait

resembles the inheritance pattern of a marker. In Figure 3-1 the inheritance

patternofthetraitfollowstheinheritanceofthepaternalalleleofindividual1.

Thereasonfor thiscouldbethatthetrait causinggeneissituatedcloseto the

marker,i.e. they arelinkedwhich means recombination occurs lessfrequently

than hadthey been unlinked.

Sincecrossoversoccur,itisnotalwaysthecasethattheinheritancepattern

ofa traitfollows exactlythat ofa marker. For this reason we need ameasure

ofhow closelythe inheritancepatterns ofatrait and amarkerresembles each

other. Suchmeasuresareexpressedthroughscoringfunctions. TheLODscore

isone suchscoringfunction. It isout of the scope ofthis reportto dene any

scoring functions, but for the sake of completeness Appendix B gives formal

denitionsoftwocommonlyusedscoringfunctions,theLOD andNPLscores.

Linkage analysis is performed for a number of pedigrees on a subset of

markers across the genome. If the scoring between the trait and a marker

indicates potential linkage, then that region is further analysed by including

3

Seepage6.

4

Seepage9and11.

5

See[SR99,page273]foradetaileddescriptionofthehistory,use,andtypingofmarkers.

(25)

Legend:

a|a

Female individual Male individual

5 3 4

2 1

Genotype information

Alleles Right: Maternal Left: Paternal

Affected individual (phase known)

A|a

A|A a|a

A|a

Figure 3-1: Apedigreewithtraitstatus andexactinformationonin-

heritanceofalleles. Notethatin realitythephaseisoften

notknown.

more markers from the genetic linkage map. Given that trait causing gene

is located in theregion under investigation, the new analysis estimates more

preciselythelocation of thetraitcausing gene.

This process is iterated until no more markers are available in the prox-

imity of the trait causing gene. The genetic linkage map is then updated by

introducinga locusrepresenting thetrait causing gene.

Usingexactknowledgeoftheinheritancepatternsofmarkers,linkageanal-

ysis is straight forward, but inreality many pedigree members are not geno-

typed making it impossible to infer the precise pattern of inheritance. This

andother problemsarediscussed inthe following section.

3.1.2 Missing and Incomplete Data

Asstatedabove,markerswereoriginallylimitedtosimplecharacteristic traits,

such as blood type or eye color. Recent techniques have made it possible to

determinetheallelicstatesofmicrosatellitesinhumanDNA.Thismethodhas

onlybeen usedfora fewdecades,thereforegenotypeinformation isoftenonly

available for the youngest generations. Since pedigreesare partlyconstructed

fromwrittendocuments, suchaschurchrecords,whichcontainnoinformation

about genetic markers, some individuals in the pedigree might not be geno-

typed at all. This is often the case for pedigree founders. Furthermore, the

phaseofthe observedalleles for a markercannotbedetermined.

Due to these problems, the inheritance pattern for each marker needs to

be estimated on behalf of the genotyped individuals. Since the genotype in-

(26)

pattern becomes impossible to infer. To illustrate the problem we introduce

the concepts of identical/identity by descent(IBD) and identical/identity by

state(IBS).Two individualsareIBDifthey shareinheritance ofsome founder

allele. Two individuals are IBS if they have the same allelic state for some

marker,these alleles need not be IBD.

Legend:

A/a

A/A a/a

a/a

Female individual Male individual

3 4

2 1

A/a Genotype information (phase unknown) Alleles

Right: Maternal Left: Paternal

A/a

A/a 5 6

Figure 3-2: IBS vs. IBD. Individuals 5 and 6are IBS, but since the

motherishomozygoustheyhavenotnecessarilyinherited

thesamematernalallele. Infact inthis casetheyarenot

IDB.

Figure 3-2 is a example of two siblings which are IBS, but not IBD. The

individuals 5 and 6 share inheritance of their paternal alleles (A). The allelic

statesof their maternal alleles (a) areidentical, but they arenot identical by

descent becauseof thepattern ofinheritance depicted inthegure.

For thereasonsgiven,thepreciseinheritance patternisimpossible toinfer

from the given data, that is, several inheritance patterns can give rise to the

sameobservedgenotypeinformation. Someoftheseinheritancepatternsmight

bemorelikelythanothers,thusweneedtocomputetheindividualprobability

ofeach. Thesecomputationsare the topicof therest ofthis chapter.

Eachpossibleinheritancepatterncanthenbescoredagainsttheinheritance

pattern of the trait, using for instance the LOD score. The contribution to

the nal score of the marker from each inheritance pattern is based on the

probability ofthat pattern.

Computing the probabilities of possible inheritance pattern is performed

intwo steps, single point analysis and multi point analysis. These topics are

covered inSections3.2and 3.4, respectively.

(27)

Singlepointanalysisistheprocessofdetermininglinkagebetweensinglemark-

ers and traits using only the genotype information available at each marker.

The primary task is computing probability distributions of inheritance pat-

terns at some locus given the available genotype information at that locus.

Thisiscalledsingle point probability computation.

Tocomputetheprobabilitydistributionoverinheritancepatterns,weintro-

duceaformalframeworkforthebiologicalconceptsintroducedinthepreceding

chapter. After the introduction of the formal framework we explain formally

howthe probability distributionis calculated.

3.2.1 Formal Model

Pedigree

InSection 2.4 the concept of a pedigree was introduced. In thefollowing we

formalize thisconcept.

Denition 1 (Pedigree) Apedigree isa4-tuple

P = h F, N, f ather, mother i

witha set of individuals (nodes)

V = F ∪ N

^, ^where

F

^is^the ^founder ^set ^and

N

^is ^the non-founder set, andtwo functions:

-

f ather : N → V

^.

-

mother : N → V

^, ^satisfying ^the ^following constraints:

•

Â ^node^cannot ^be ^both â ^founderând â non-founder, that is

F ∩ N = ∅

^.

•

^No ^node îs^both â ^father ândâ ^mother, ^that îs:

∀ n, n ⁰ ∈ N

^and

n ⁰⁰ ∈ V . mother(n) = n ⁰⁰ = ⇒ f ather(n ⁰ ) 6 = n ⁰⁰

^; ^and

∀ n, n ⁰ ∈ N

^and

n ⁰⁰ ∈ V . f ather(n) = n ⁰⁰ = ⇒ mother(n ⁰ ) 6 = n ⁰⁰ .

•

Â ^nodeîs ^never îts ôwn âncestor, ^that îs:

∀ n ∈ V . n / ∈ ancestor(n),

where

ancestor : V → 2 ^V

^is^dened recursively as:

Anynon-founderhas itsfatheranditsmotherasancestors together

with the ancestors of both parents,that is:

∀ n ∈ N . ancestor(n) = ancestor(f ather(n))

∪ ancestor(mother(n))

∪ f ather(n) ∪ mother(n).

(28)

∀ n ∈ F . ancestor(n) = ∅ .

Theconstraintsonthepedigreeareintroducedtoconsistentlydescribethe

underlyingbiological model.

Denition 2 (Pedigree Graph) Thegraph

G = h P, →i

^consisting^of^a^pedi-

gree

P

^and^a ^transition ^relation

→ ⊂ V × N

^, ^where:

n → n ⁰

ⁱ

f ather(n ⁰ ) = n

^or

mother(n ⁰ ) = n,

iscalled the pedigree graph generated by

P = h F, N, mother, f ather i

^.

Weusetherelations

n −−−−→ ^{f ather} n ⁰

^and

n −−−−→ ^mother n ⁰

^to^denote

f ather(n ⁰ ) = n

and

mother(n ⁰ ) = n

^,respectively. From this denitionitfollows that:

→ = { (n, n ⁰ ) | (n, n ⁰ ) ∈ −−−−→ ∨ ^{f ather} (n, n ⁰ ) ∈ −−−−→} ^mother = −−−−→ ∪ ^{f ather} −−−−→ ^mother .

Remark: Thisimpliesthe following relationship between ancestors andthe

→

relation:

n → ⁺ n ⁰ ⇔ n ∈ ancestor(n ⁰ ),

^(3-1)

where

n → ⁺ n ⁰

^denotes ^that^thereîsâ ^pathⁱⁿ^the^pedigree^graphôf^length ât

least1from

n

^to

n ⁰

^.

Example 4 Figure 3-4 is the representation of the example pedigree in Fig-

ure 3-3 under the formal framework. In this example the pedigree graph,

G

^,

has the followingstructure:

G = hP, →i,

^where:

P = h F, N, mother, f ather i , F = { 1, 2, 4 } ,

N = { 3, 5 } , mother(3) = 2,

f ather(3) = 1, mother(5) = 4,

^and

f ather(5) = 3.

Figure 3-4 illustrates how the individuals of the example above are related

interms of the

f ather

^and

mother

^functions.

Lemma 1 A pedigree graph isa directed, acyclic graph (DAG).

(29)

Legend:

1

3

5 4 2

A/a A/a

A/A a/a

a/a

A/a

Female individual Male individual Genotype information

(phase unknown)

Figure 3-3: Anexamplepedigreewithgenotypeinformationforwhich

thephaseisunknown.

1 3 father

2 mother

5 father

4 mother

Figure 3-4: AformalrepresentationofthepedigreeinFigure3-3. The

transitions havebeenlabeledto clarifywhether a parent

is

f ather

^or

mother

^.

(30)

node would be its own ancestor. Assume that a pedigree graph

G = h P, →i

contains a cycleconsistingof thenodes

V ⁰ ⊆ V

^,^where:

∀ n ∈ V ⁰ .n → ⁺ n.

Itfollowsdirectlyfrom(3-1)onpage20,that

n ∈ ancestor(n)

^,^which^con-

tradictsDenition 1 ofa pedigree.

♦

Allele

Denition 3 (Allele) The set of allelesin a pedigree

P

^is^dened ^as:

A = { (n, $) | n ∈ V

^and

$ ∈ { p, m }} .

We call the subsets

A N = { (n, $) | n ∈ N

^and

$ ∈ { p, m }}

^and

A F = { (n, $) | n ∈ F

^and

$ ∈ { p, m }}

^the ^set ^of non-founder alleles and the set of founder alleles, respectively.

(n, p)

^and

(n, m)

^represent ^the ^paternal ^and

maternalalleles.

Marker

Denition 4 (Marker) Amarker

6

M

îs^denedâsâ^2-tuple

M = h A _M , π _M i

^,

where:

• A M = {a ₁ , a ₂ , . . . , a k }

îs ^set ôf âl^lelic ^states^for ^the ^marker

M

^, ^and

• π _M : A _M → [0, 1]

^is ^the ^allele ^frequency⁷ ^function, ^such ^that:

The sumof allallele frequencies is1, that is:

X

a ∈ A _M

π _M (a) = 1.

Theallele frequenciesarebasedon statisticalinformation fora givenpop-

ulation.

Notice the dierence between the terms allele and allelic state. An allele

istheterm usedfor the section (also denoted position) ofa chromosome rep-

resenting a marker. An allele can have a numberof states, depending on the

allelic states for the marker. Thus, an allele can be thought of as a variable

whichcanbeassignedavalue,thatis,anallelecanbeassignedanallelicstate.

6

Seepage6.

7

Amoreappropriatetermwouldbeallelicstatefrequency,howevertoremainconsistent

withthecurrentpracticewecallitallelefrequency.

(31)

pedigree, and we always work witha single marker, which we call

M

^. ^Thus,

whenwe write

A _M

^or

π _M

^we^refer ^to âllelic^states ôr^theâllele frequenciesfor themarker

M

^.

Genotype information

Denition 5 (Genotype Information) Genotype 8

information,

G = hL , astates i

^, ^for ^a ^gene,

g

^, ^consists ôf â ^set ôf ^genotyped individuals

L

anda function

astates

^, ^where:

-

L ⊆ V

^, ^and

-

astates : L → {{ a ₁ , a ₂ } | a ₁ , a ₂ ∈ A M }

^.

Noticethat thedenitionindicates thatthe phase isunknown since there

isnoreferencetomaternalandpaternalalleles. Ifanindividualishomozygous

atagivenlocus,the

astates

^function^returnsâ^set^containing^justôneêlement.

Example 5 The pedigree graph inFigure3-3 showsgenotypic informationfor

allindividuals. Thusthe setofgenotypedindividuals

L

^is^equal ^to^the^complete

node set

V

^. ^The âllelic ^states ôf îndividual ⁵ âre:

astates(5) = {A, a}

^.

Inheritance Vector

Aninheritancevector 9

isacompletespecicationonhowtheallelesoffounders

areinherited by each non-founder. Thisis a formal way of describing inheri-

tancepatterns. Weformalizethenotionofinheritancevectors inthefollowing

denition:

Denition 6 (Inheritance Vector) An inheritance vector

v

^for ^some ^pedi-

gree graph

P

^is^a ^function:

v : N × { p, m } → { p, m } .

Theinheritance vectorisusedtogiveinformationabouttheinheritanceofthe

paternal,

p

^, ^and ^maternal,

m

^, ^alleles ^for ^every non-founder (See the example below).

8

SeeSection2.3 forthebiologicalmeaningofgenotypes andalleles.

9

The notion of inheritance vectors was introduced in [LG87 ] and further used in

[KDRDL96].

(32)

be:

v(3, p) = p, v(3, m) = m,

v(5, p) = p,

^and

v(5, m) = p.

Thiswould result in the following inheritance pattern (see Figure 3-5):

Individual 3 has inherited his paternal allele from the father of individual

1 and his maternal allele from the mother of individual 2. The two alleles

of individual 1 and the two alleles of individual 2 are founder alleles. Hence,

individual 3 has inherited the paternal allele of a founder, individual 1, and

the maternal allele of another founder, individual 2. Note that the father of

individual 5 (individual 3) does have parents in the pedigree. To determine

whichfounderallele individual5has inherited, weusetheinformationthatthe

paternal allele istransmitted fromthefather's father. Thismeans,that wecan

ndthe transmitted founderallele by searching the graphrecursively using the

inheritanceinformationofindividual3. Thusthepaternalalleleofindividual5

isthe sameas the paternal allele of individual 3. Since

v(5, m) = p

^individual

5 has inherited the paternalallele of individual 4, who is a founder.

Legend:

A/a A/a

A/A a/a

a/a

Female individual Male individual

5 3 4

2 1

A/a Genotype information (phase unknown) Alleles

Right: Maternal Left: Paternal

v(5,m)=p v(5,p)=p

v(3,p)=p v(3,m)=m

Figure 3-5: Showshowitisdeterminedwhichfounderallelesthenon-

foundersreceivegivenaninheritancepattern.

Founder Allele Assignment

Weneedadenitionofafounderalleleassignment,which assignsallelicstates

to all founder alleles. Such a denition is necessary inorder to reason about

BRICS Basic Research in Computer Science

B R ICS R S -02-7 In g ´olfsd ´ottir et a l.: A F o rmalization o f L in k age An alysis

BRICS

Basic Research in Computer Science

A Formalization of Linkage Analysis

Anna Ing´olfsd´ottir

Anders Lyhne Christensen Jens Alsted Hansen

Jacob Johnsen John Knudsen

Jacob Illum Rasmussen

BRICS Report Series RS-02-7

Copyright c 2002, Anna Ing´olfsd´ottir & Anders Lyhne Christensen

& Jens Alsted Hansen & Jacob Johnsen & John Knudsen & Jacob Illum Rasmussen.

BRICS, Department of Computer Science University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Report Series publications.

Copies may be obtained by contacting:

BRICS

Department of Computer Science University of Aarhus

Ny Munkegade, building 540 DK–8000 Aarhus C

Denmark

Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk ftp://ftp.brics.dk

This document in subdirectory RS/02/7/

P

P = h F, N, f ather, mother i

f ather

f ather : N → V

mother

mother : N → V

V

P

V = N ∪ F

F

P

F ⊆ V

N

P

N ⊂ V

ancestor

ancestor : V → 2 V

G

G = h P, →i

A

M

M = h A M , π M i

A M

M

π M

π M : A M → [0, 1]

A M

G

G = hL , astates i

L

astates

astates : L → {{a 1 , a 2 } | a 1 , a 2 ∈ A M }

v

v : N × { p, m } → { p, m }

v

p

m

Z M

Z M : F × { p, m } → A M

M

F

F : (N × { p, m } → { p, m } ) → ((V ×{ p, m } ) → (F ×{ p, m } ))

F v F v = F (v)

P (v |G )

v

G

P ( G| v)

G

v

P (Z M )

F G v

v

G

f, f 1 , f 2 , . . .

a 1 , a 2 , . . .

$

ancestor : V → 2 ^V

astates : L → {{a ₁ , a ₂ } | a ₁ , a ₂ ∈ A _M }

F ^v F ^v = F (v)

P (Z _M )

F _G ^v

f, f ₁ , f ₂ , . . .

a ₁ , a ₂ , . . .

f ₍ _n,$ ₎

f _F v ( n,$ )

ϕ ^v _G

θ _i

M _i