In[Gud00 ]Gudbjartssondenesfoundercouplereduction asanadditionalway
of reducing the number of computations performed during single and multi
point linkage analysis. Intuitively, founder reduction states that we cannot
distinguish between the maternal and paternal alleles of founders, that is,
we can switch between their two alleles. Founder couple reduction states,
intuitively,thatwecannotdistinguishthemaleandfemalefounderofafounder
couple. Afounder couple istwo founderswho share ospringinthepedigree.
Westate this formallyinthefollowing theorem:
Theorem 7 (Founder Couple Reduction) Let
h, h 0 ∈ F
be the male andfemalefounder,respectively, of a founder couple insome pedigree
P = hF, N, f ather, motheri
. LetC
denote theset of childrenofh
andh 0
, thatis
C = { n ∈ N | f ather(n) = h
andmother(n) = h 0 }
. Letv
andw
be twoinheritance vectors, suchthat for any
n ∈ N
:w(n, $) =
v(n, $) : n ∈ C
v(n, $) : f ather(n) ∈ C
ormother(n) ∈ C v(n, $) :
otherwise.
Given some genotype information
G = hL, astatesi
on the pedigree whereh, h 0 ∈ L /
and that neitherh
norh 0
has any children inV
not inC
, thefollowingalways holds:
P(v|G ) = P(w|G ).
(3-24)We do not prove this, but refer the keen reader to [Gud00 , page 35] for
moreinformation on foundercouple reduction.
usingthe formalframework.
Two algorithms, Kruglyak and FastTreeTravesal, for nding
com-patiblefounderalleleassignmentsandprobabilities ofinheritance vectorswere
analysed for correctness and complexity using the formal framework. These
algorithmwereprovedbydeninginterpretationsoftheoutputfromtheto
al-gorithmsin termsoffounderallele assignments. Theseinterpretations proved
equals to the denition of compatibility introduced in the framework. The
complexityof Kruglyak is
O( | N | · 2 2·| N | )
and thecomplexityof FastTree-Traversal is
O(2 2·| N | )
. Both algorithms can be reduced in complexity beusingfounderreduction andfounder couplereduction.
The formalframework wasextended to describe multi point linkage
anal-ysis. We found that multi point linkage analysis cannot, as described in the
literature[LG87 ],bedescribed asahiddenMarkovmodel. Instead weargued
thataBayesian networkwhichsharesstructurewith hiddenMarkovmodelsis
amore accurate descriptionof the problem.
Towards an Improved Method
Aswe have documented, the current methods for linkage analysis are
compu-tationally heavy. The best known algorithm is thealgorithm used inAllegro
(SeeSection 3.3.3) 1
. The purposeof this chapter isto present a discussionof
strategiesfor the development of a method,bywhich linkage analysiscan be
performed more eciently. The strategies considered are inspired by
classi-cal concepts of computerscience such as: Compositionality, abstraction, and
symbolic representation.
Apossibilityforspeedingupthecomputation isto approximatethe
prob-ability distribution over inheritance vectors. Approximation is an important
futureworkaspectpointedoutin[Gud00]. Theapproximationapproachopens
a range of possibilities, but what in the end is needed is still a probability
distribution that is consistent with the underlying biology, so linkage can be
detected. Thus small deviations seem tolerable. However, approximation is
out of scope for this project, because our interest is not inproving statistical
properties, relative to the true probability distribution, of an approximation
method, rather we wish to apply a computer scientic approach to linkage
analysis. We believe that itis still possible to develop better methods, which
produceexact results. Thereforewe willnot pursueapproximationasa
possi-bleoptimisation any furtherinthis report.
Anotherinteresting aspectis toestablishthe computational complexityof
linkage analysis. If it turns out to be NP-complete, it makes little sense to
lookfor polynomial-time solutions. Initially, itseemsappealing to establisha
reductionfromasatisability-basedproblemtotheproblemofsortingout the
incompatibleinheritance vectors.
The structure of this chapter is as follows. First, we determine the
rel-evant focus for approaching the problem. Then, before presenting possible
1
Apaperwasrecentlybeenpublishedthatclaimsseveralordersofmagnitudein
improve-mentsoftimeand memoryrequirements,relative toGenehunterversion 2.0(Kruglyak).
an optimisation relative to other methods, e.g. when an algorithm is correct
and how to compare the complexities, is needed. We then investigate the
theoreticalhardnessof linkage analysis. Strategies for improvements are then
considered. Finally,we summarise the resultsof thechapter.
4.1 Level of Focus
Multipointlinkageanalysismightbeimprovedonseverallevels. Optimisations
andimprovements can bedone on everything fromthecodelevel to changing
thebiologicalmodel. Basically,wehave dividedthetaskintothreemainareas
(see Figure 4-1). Our interpretation of the biological model is that it is a
model, which is based purely on biological entities and their relations 2
. The
representation and algorithms levelfocuseson representing theneeded data
and providing eective algorithms. The source code level refers to
optimi-sationsthat arevery low-leveland focuseson implementing thedetails of the
algorithms.
Biological model
Representation and algorithms
Source code
Figure 4-1: Tasksofestablishingtoolsforlinkageanalysis.
Optimisationson thesourcecodelevelcouldeitherbedoneby
implement-ingtheexistingalgorithm emphasisingon code optimisations- or by
optimis-ing or distributing an existing implementation. However, neither of the two
approachesseem appealing. Optimising on thesource code levelis infeasible,
sinceitcanonlybeexpectedtoyieldminorimprovementsandcurrentcompiler
techniquesalready takes careof the most obvious optimisations. Distributing
the computation does not yield improved algorithms, in terms of the work
performed, but isa possibilityfor minimisingthetimeusedon each data set.
2
Note that [JIS93 ] refers to the representation and algorithm level as the biological
level. Wedisagreewiththis classication,becausewedonotconsidermodications ofan
algorithm or adata structure, to adierent one withequivalent semantics, as tampering
withthe biological model. Ourinterpretationofthebiological modelisthatit isamodel,
whichisbasedpurelyonbiological entitiesandtheirrelations.
beabletodevelopadierentsetofalgorithms, whichhavealowercomplexity
asthe set of algorithms currently in practice. The biological model is widely
acceptedinthebiologicalandstatisticscommunity,thereforewedonot ndit
meaningfultondsolutionsonthislevel 3
. Weaimatimprovingthemultipoint
linkage analysis on some intermediate level, while maintaining the biological
model 4
. This implies that the notions of inheritance vectors, recombination
fractions,singleandmultipointprobabilitydistributionsstillapply. However,
westill want to developa datastructureanda setofalgorithms, bywhich we
areableto reducethe computationalcomplexity ofmulti point analysis.
Since we do not intend to change the biological model, a new method
for doing the analysis should of course be correct in the sense that results
obtained using the existing method and results obtained by a new method
should be identical. We are now ready to describe the properties that an
improved method shouldposses.
4.1.1 Methods Properties
The most fundamental result of new methods for linkage analysis should be
probability distributions over inheritance vectors, given all observed genotype
data,whichmustbesuitableforfurtherstatisticalanalysis. Whatwemeanby
suitable for further statistical analysis, is that it must be possible to extract
thenecessarydatafor applyingthe scoringfunctions(seeAppendix B),which
againallowsthe potential detection oflinkage.
Theprobability distributions produced bynewmethods, should not dier
fromthe probabilities obtained bythe left andright conditioned probabilities
onpage60(which againequalsthe distributionsproduced byGenehunterand
Allegro). The reason for this is that they are defacto methods for
obtain-ing the probability distributions and adhere to the biological model. What
can be optimised arethetime and space complexities thatcurrently exist for
determiningthesedistributions. Thebestknowncurrentmethodistheone
im-plemented inAllegro(see Section 3.3.3), the worst case complexity ofAllegro
isstill exponential inthe numberof non-founders.
The software currently available, including Allegro, all use at
represen-tationsofprobability distributions overinheritance vectors(arraysof oating
point numbers indexed bythe decimal values of theinheritance vectors). An
improvement should include thesingle point probability computation to
rem-edy this representation of probability distributions, otherwise a new method
wouldhave to do the left and right conditioned probability computation step
3
Wealsoconsidertheinsightneededfortamperingwiththebiologicalmodeloutofscope
forthisproject.
4
Allegrousesapproximately75%ofthetimeincomputingthemultipointdistributions,
Transformation technique is not within the scope of this project. Hence, we
will require that a new method includes an improvement of the single point
probability computation, and produces a more compact representation of the
probability distributions, or else it is not possible to perform faster average
multipoint analysisthan inexponential time.
To clarify thegoals:
Denition 13 (Method Properties) The properties adhered to by an
im-proved method are the following(let
N EW x (y)
denote the result of calculatingy
witha new method forx
):I. Single point probability computation:
N EW singlepoint ( v i ) = X
Z M ∈F M v
P (Z M ).
II. Multi point probability computation:
N EW multipoint ( v i ) = P i L · P ( v i |G i ) · P i R .
III. Extractstatisticalinformation: TheinformationneededtocomputeLOD,
NPL, andother scoring functions mustbe available.
IV. The complexity of some new cases of the new method should be better
than the average complexity of the best known current method 5
.