BRICS R S-00-1 L yngsø & P edersen: P seudoknots in RN A S econdary Structur es
BRICS
Basic Research in Computer Science
Pseudoknots in RNA Secondary Structures
Rune B. Lyngsø
Christian N. S. Pedersen
BRICS Report Series RS-00-1
ISSN 0909-0878 January 2000
Copyright c 2000, Rune B. Lyngsø & Christian N. S. Pedersen.
BRICS, Department of Computer Science University of Aarhus. All rights reserved.
Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.
See back inner page for a list of recent BRICS Report Series publications.
Copies may be obtained by contacting:
BRICS
Department of Computer Science University of Aarhus
Ny Munkegade, building 540 DK–8000 Aarhus C
Denmark
Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk
BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:
http://www.brics.dk ftp://ftp.brics.dk
This document in subdirectory RS/00/1/
Rune B. Lyngsø
∗
Christian N. S.Pedersen
∗
Abstract
RNAmoleculesare sequences ofnucleotidesthat serveas more thanmere
intermediaries between DNAand proteins, e.g. as catalytic molecules. Com-
putational predictionof RNAsecondarystructureis among thefew structure
predictionproblemsthat canbesolved satisfactoryinpolynomialtime. Most
workhasbeendonetopredictstructuresthatdonotcontainpseudoknots. Al-
lowing pseudoknots introducemodellingand computationalproblems. In this
paperwe consider the problem of predicting RNA secondary structurewhen
certain typesof pseudoknots are allowed. We rst present analgorithm that
in time
O(n 5 )
and spaceO(n 3 )
predicts the secondary structure of an RNAsequenceoflength
n
inamodelthatallows certain kindsofpseudoknots. We then prove that the general problem of predicting RNA secondary structurecontaining pseudoknotsis NP-completefor alarge classof reasonablemodels
ofpseudoknots.
1 Introduction
An RNA molecule is a sequence of nucleotides that often is just an intermediary
betweenDNAandproteins. SomeRNAmoleculesdohoweverhavevitalimportance,
e.g.intranslationofmRNAtoproteins. ThethreedimensionalstructureofanRNA
moleculeistoalargeextentdeterminedbyinteractionsbetweenpairsofnucleotides,
calledbase pairings. Thesecondarystructure ofanRNAmoleculeistheset ofbase
pairingsinthethreedimensionalstructureofthemolecule. Thesecondarystructure
canthus be used in its own right to look for information, e.g.active sites, oras a
steppingstonetowardspredictionofhigherstructurallevels.
Ifthethreedimensional,ortertiary,structure ofanRNAmoleculeisavailableit
isofcourseeasyto determinethesecondarystructure. Butdeterminingthetertiary
structureisacomplicatedandtimeconsumingtask. Whenthetertiarystructureofan
RNA moleculeis notavailable,the authoritativewayof determining the secondary
structure of an RNA molecule is by comparative modelling. Given a number of
related RNA sequences the common secondary structure is inferred by identifying
compensatorymutations, that is, by identifying pairs of positions where mutations
ofthebase in oneofthe positions is accompaniedbya mutation ofthe basein the
otherpositiontoretaintheirbasepairingcapability. Thedrawbackofthistechnique
is that it requires several related RNA sequences to be available. Moreover, since
∗
BasicResearchInComputerScience(BRICS), Centre oftheDanishNationalResearchFoun- dation, Department of Computer Science, University of Aarhus, Ny Munkegade, 8000 Århus C,Denmark.E-mail:
{
rlyngsoe,cstorm}
@brics.dk .diculttofully automatecomparativemodelling.
Thus computational methods forpredicting the secondarystructure of an RNA
sequence are in demand. To construct such methods it is necessary to model the
biologicalrealitythat governs structureformation. Inspiredby thelawsof thermo-
dynamics this is often done in terms of energy minimisation. Using a model that
describes how to assign free energies to legal secondary structures, the secondary
structureofanRNA sequenceispredictedasthestructureof leastfreeenergy. The
biologicalrelevanceofthepredictedstructure andthecomputationalresourcessuch
astimeandspacethatareneededtocomputeit,dependentirelyonthechoiceoflegal
structuresandfreeenergies. Mostworkhasbeendevotedtoconstructalgorithmsfor
RNAsecondarystructurepredictionwhenthelegalstructuresarelimitedtosecondary
structuresthatdonotcontainpseudoknots,thatis, donotcontainoverlappingbase
pairs. Nussinovetal.in [7]presentanalgorithmusingasimplefreeenergyfunction
that is minimised when the secondary structure contains the maximum number of
complementary base pairs. The algorithm takes time
O ( n 3 )
for predictingthe sec-ondarystructureofanRNAsequenceoflength
n
. AmorecomplexmodelforthefreeenergyofsecondarystructuresisproposedbyTinocoetal.in[15]. Thismodelstates
thatthe freeenergyofasecondarystructure isthe sumofindependentenergiesfor
each loop in the structure. Based on this model of free energy, Zukerand Stiegler
in [19],and Nussinov and Jacobsenin [6], presentalgorithms that taketime
O ( n 3 )
for predicting the secondarystructure of an RNA sequence of length
n
. Since theideasof these algorithms form the basis of the widely used mfold server for RNA
secondarystructure prediction,theyare commonlyreferredto asmfoldalgorithms,
oralgorithmsofthemfoldtype.
Thereasonthat legalstructuresareoftenrequirednottocontainpseudoknotsis
notthat pseudoknots do not occur in real world structures, but rather because of
modellingandcomputationalconsiderations. Itisstillanopenquestionhowtocon-
structa reasonablemodel offree energyfor structures containingpseudoknots that
alsomakesitpossibletoconstructecientstructurepredictionalgorithms. Rivasand
Eddyin [10] presentan algorithmthat in time
O ( n 6 )
andspaceO ( n 4 )
predictsthesecondarystructureof anRNA sequence oflength
n
in amodelthat allowscertainkindsof pseudoknots. In this paper we study the problem of predictingRNA sec-
ondarystructure containingpseudoknots further. Insection 2webriey reviewthe
ideasof themfoldalgorithms. Extending onthese ideas,wein section3presentan
algorithmforpredictingRNAsecondarystructurewhencertaintypesofpseudoknots
areallowed. We compare thepresentedalgorithm with the algorithm presented by
Rivasand Eddyin [10]. Insection4weshowthat predictingRNAsecondarystruc-
turescontainingpseudoknots of arbitrary typesis NP-complete for alargeclass of
reasonablefreeenergyfunctions. Finally, in section5wediscuss theimplicationsof
theNP-completenessresult.
2 Terminology
ForanRNAsequence
s
,|s| = n
,asecondarystructureisasetS
ofbasepairsi ·j
with1 ≤ i < j ≤ n
,suchthat∀i ·j, i 0 ·j 0 ∈ S : i = i 0 ⇔ j = j 0
. Eachbasecanthustakepartinat mostonebase pair. Thebase pairsofasecondarystructuredescribethebase
i k l j
= min
r,s
i r k l j s
,
i k l j
r s
,
i k
l j r s
Figure1: GeneralrecursionschemefortheRivasandEddyRNAsecondarystructure
predictionalgorithm.
pairinginteractionsformedbyhydrogenbondinginacorrespondingtertiarystructure.
ItisusuallyassumedthatRNAsecondarystructuresdonotcontainpseudoknots. Two
basepairsformapseudoknotiftheyareoverlapping,i.e.twobasepairs
i · j, i 0 · j 0 ∈ S
formapseudoknotif
i < i 0 < j < j 0
. Thetermpseudoknotisalsousedasashorthandforotheroverlappingstructuresthanbasepairs,e.g.twohelicesofstackingbasepairs,
whenthebase pairsofthesestructures formpseudoknots.
Thereareofcoursegoodreasonsforintroducingthisrestriction,prominentamong
whichisasimplicationoflegalstructures. Thesimplicationofnotallowingpseudo-
knotsensuresthattwobase pairs
i · j, i 0 · j 0 ∈ S
areeithernested,i.e.i < i 0 < j 0 < j
,ordisjoint, i.e.
i < j < i 0 < j 0
. In many situations this allows us to rst handleonebase pairandthentheother(if theyarenested),orhandlethem independently
(if they are disjoint). The pseudoknot restriction is thus crucial in algorithms for
e.g.structure prediction [1,3,6,11,19], partition function calculations [5], compar-
ing secondarystructures [18], and simultaneous alignmentand structure prediction
of RNA sequences [2,12]. Inthe following we will exemplify this by giving a brief
summaryofanalgorithmof themfoldtypeforsecondarystructure prediction. The
summary is also aimed at introducing the terminology we will use in section 3. A
moredetailedsummary canbefoundin e.g.Turneretal.[16].
An mfold algorithm predicts secondary structures by computing minimum (or
close to minimum) energy structures in the model proposed by Tinoco et al. [14]
extendedwith simplifying assumptionsaboutthe nature of the energy function for
multibranched loops. Three arrays,
V ( i, j )
holding the minimum energy of a sec-ondarystructureon
s [ i .. j ]
withbasesi
andj
formingabasepair,WM ( i, j )
holdingtheminimum energy ofa structure on
s [ i .. j ]
that is partof a multibranched loop, andW ( i )
holdingtheminimumenergyofastructureons [1 .. i ]
,arecomputedbasedontherecursions
V ( i, j ) = min
eH ( i, j ) ,
eS ( i, j, i + 1 , j − 1) + V ( i + 1 , j − 1) , min
i<i 0 <j 0 <j i 0 − i + j − j 0 > 2
{eL ( i, j, i 0 , j 0 ) + V ( i 0 , j 0 ) },
i+1<k<j min {WM ( i + 1 , k − 1) + WM ( k, j − 1) + a}
,(1)
WM ( i, j ) = min
V ( i, j ) + b, WM ( i, j − 1) + c, WM ( i + 1 , j ) + c,
i<k≤j min {WM ( i, k − 1) + WM ( k, j ) } ,
(2)
W ( i ) = min
W ( i − 1) ,
0≤k<i min {W ( k ) + V ( k + 1 , i ) } .
(3)Theserecursionsemployenergyfunctions forhairpinloops(
eH
),stackingbasepairs(
eS
),internalloopsandbulges(eL
),andmultibranchedloops(eM ( k, k 0 ) = a + bk 0 + ck
,wherek 0
isthenumberofunpairedbasesandk
thenumberofhelicesinthemulti-branched loop). With thecurrentlyused parametersfor theenergy functions these
recursions allow for an
O( |s| 3 )
time algorithm, cf. [4,16], for computing secondarystructuresof minimumenergyforanRNAsequence
s
.3 Algorithmic Results
TheTinoco model, cf. [14] describeshowto assign energiesto secondarystructures
notcontainingpseudoknots,butdoesnotaddresshowtohandlesecondarystructures
containingpseudoknots. To develop algorithms for predicting secondarystructures
containing pseudoknots, an important step is to decide on a model, i.e. to give a
description of the types of legal secondary structures, and how to assign energies
tothesestructures. Asdevelopinganalgorithmanddecidingonamodelare closely
connectedprocesses,thedescriptionofthemodelisoftenonlyinpartgivenexplicitly.
Oftenthetypesoflegalsecondarystructuresareonlydenedimplicitlythroughthe
constructedalgorithm.
AnexampleofthisisthepseudoknotmodelusedbyRivasandEddyin[10]. This
is,toourknowledge, theonlyrigorous,energybased,polynomialtimealgorithmfor
RNA secondary structure prediction including a class of pseudoknots. In gure 1
webrieysketch theideaoftheRivasandEddyalgorithm. Arraysholding energies
ofoptimalstructuresforthesubsequencefrom
i
throughj
are maintainedsimilartoequations1to3,butwiththefurtherrestrictionthatthebasesfrom
k
throughl
areyetunpaired(toallowforfuturepseudoknotinteractions). Thegeneralrecursionscheme
foranentryinoneofthesematricesistominimiseoverallpossiblewaysofsplitting
thesubsequence withan unpaired region into twonew subsequences with unpaired
regions. Thisdenesthelegalstructuresofthemodel. Theenergyparameters,cf.[10,
table 3], used were partly ne tuned by hand and partly obtained by multiplying
similarparametersforunknottedstructuresbyaweightingparameter.
Therequirementsoftime
O( |s| 6 )
and spaceO( |s| 4 )
forthis algorithm areobser-vationsthatfollowdirectlyfromgure1. Thoughpolynomial,these timeandspace
requirementsarerathersteepandin[10]anestimateof130140basesismentioned
asa rough upper bound for the size of sequences for which the algorithm is feasi-
ble. Thoughcomputational poweris everincreasing, applying Moore's law (stating
thatcomputationalpowerdoublesevery18months)stillonlyallowssequencesof300
i j
k
l
Optimalenergy
= min
i<j<k<l {
Optimalenergyofi j k l +
Optimalenergyof
j k l i }
Figure 2: A model for a class of pseudoknots. The sequence has been drawn as a
circletohighlightthat oneofthefourpartsofthesequencemightextendacrossthe
sequenceends,hereshownwithazigzaggedline.
basestenyearsfrom nowand of650bases intwentyyears. Nevertheless, theexper-
imentsbasedon thisalgorithm reported in [10] show thefeasibilityof energy-based
predictionsofRNAsecondarystructureswithpseudoknots.
To obtain a faster algorithm, we propose a more restrictedmodel for legal sec-
ondarystructures. Thelegalsecondarystructuresofourmodelarestructureswhere
wecansplitthesequenceintofourparts(oneofwhichmightextendacrosstheends
ofthe sequence) asillustratedin gure2. The splitting into four parts divides the
sequenceinto two pairsof opposing subsequences,illustratedin gure2 aspairsof
blackandgreypartsofthesequence. Eachpairofopposingsubsequencesareallowed
toformanunknottedsecondarystructureandthepseudoknottedsecondarystructure
ariseswhen thesetwosecondarystructuresarecombined.
Tofurtherexplainthetypesofsecondarystructuresallowedinthismodel,consider
apseudoknotoftypeHasillustratedingure3. ApseudoknotoftypeHconsistsof
two overlappinghelices, each closing a hairpinloop, such that some ofthe bases in
thehairpinclosed byone ofthe helices arepart ofthe otherhelix. Asindicated in
gure3,wecansplitapseudoknotof typeH intofour partssuchthat onlybasesin
non-neighbouring,oropposing,parts formbase pairs. Themodel ingure2canbe
seenasageneralisationof pseudoknotsoftypeH where
•
theoverlappingstructures canbe arbitrary, complexsecondarystructures not containingpseudoknots.•
theloopregionsclosedbytheoverlappingstructuresdonotneedto behairpin loops. They can be part of any type of loop as long asthey are consecutivestretchesofbases.
Themodelingure2thusencompassessecondarystructureswithonepseudoknotof
typeH(oroftypeBortypeI,cf.[9,gure3])amongothers.
As just explained, ourmodel allowsonlyone (albeitverycomplex) pseudoknot,
sointhatrespectourmodelisastepbackwardcomparedtothemodelusedbyRivas
and Eddy. But if we candevelop moreecient algorithms for secondarystructure
predictionin this model, itnds itsjusticationin caseswhereusing theRivasand
pairings.
Eddy algorithm is infeasible and we only expect, or are content, to nd only one
pseudoknotinteraction. In the rest of this section we will focus on developing an
ecientalgorithmforsecondarystructurepredictioninourmodel.
Astraightforwardalgorithmtosolvethisproblemwouldbetorunthroughallthe
O( |s| 4 )
choicesofsplitsandcomputetheenergyoftheoptimalstructuresofthetwopairsof subsequences. This would requiretime
O( |s| 7 )
and spaceO( |s| 2 )
. Onecanhoweverobserve,that when wecompute theenergy of theoptimal structure ofthe
subsequencefrombase
i
tobasel
withthesubsequencefrombasej
tobasek
removed,wealsocomputetheenergyoftheoptimalstructureofthesubsequencefrom base
i 0
tobase
l 0
withthesubsequencefrom basej
tobasek
removedforalli ≤ i 0 ≤ j
andk ≤ l 0 ≤ l
. Hence,byusingtheseintermediateresultsfromthedynamicprogramming algorithm,we canreduce the time requirement toO( |s| 5 )
by just running throughallthe
O( |s| 2 )
choicesof theremoved subsequence. Unfortunately, wethen haveto storesomeintermediate resultsuntil other results become available. This increasesthespacerequirementto
O( |s| 4 )
. However,amorethoroughinvestigationshowsthat theintermediate resultscomputed withk − 1
astherightendpointof the removedsubsequenceareonlycombinedwithintermediateresultscomputedwith
k
astheleftendpointoftheremovedsubsequence. Thisallowsustosplitthecomputationinto
n
independentphases,eachrequiringonlyspace
O( |s| 3 )
,thusreducingtheoverallspacerequirementto
O( |s| 3 )
whilemaintainingtheO( |s| 5 )
timerequirement.TheformalspecicationofthesketchedalgorithmforpredictingRNA secondary
structurescontainingpseudoknotsisgiveninalgorithm1. Thespecicationisrather
abstract.Itismoreanalgorithmschemathanaready-to-implementalgorithm. More
specically,animplementationwouldrequireseveraldierentarrays,storingenergies
undervariousassumptionsofbasepairingsofankingbases. Inalgorithm1weonly
showhavetomaintainonetypeofarray(
V
). Butthesametechniquecanbeusedformaintainingseveraltypesofinterdependentarraysusedinanactualimplementation
ofthealgorithm.
The
O( |s| 5 )
runningtime ofalgorithm 1should makeit feasiblefor longerRNAsequencesthan theRivasand Eddy algorithm. Forexample, ifweassumethat the
constants hidden by the O notation are similar for the two algorithms, the 130
140basesupperbound fortheRivasandEddyalgorithmimpliesanupperboundof
350375bases for ouralgorithm. This increase mightjustify therestricted model
doknotsbasedonthemodelillustratedingure2.
/*
V j,k ( i, l )
denotes theenergyof theoptimalstructurefors [ i..j ]
concatenatedwiths [ k..l ]
. */E = ∞
for
k = 1
to|s|
do/*Fix oneof the endpointsof the excludedregion */Allocatememoryforstoringandcalculating
V j,k ( i, l )
andV k−1,l ( j, i )
fori < j <
k < l
/*Computetableswith
k
(ork − 1
)asright(orleft)endpointofexcludedregion.*/
for
j = 1
tok − 1
doComputetable
V j,k
endfor
for
l = k
to|s|
doComputetable
V k−1,l
endfor
/* Combine tables. */
for
1 ≤ i < j < k < l ≤ |s|
doE = min {E, V j,k ( i, l ) + V k−1,l+1 ( j + 1 , i − 1) }
endfor
Freeallocatedmemory
endfor
ofallowingonlyonepseudoknot. Ifthisrestrictionistosevere, wecouldextendour
modelbyallowingthesequencetobesplitintosegmentsforeachofwhichtheoptimal
secondarystructureiscalculatedusingthemodelofgure2.Suchanextendedmodel
ismorecomparabletothemodelusedbyRivasandEddyintermsoflegalstructures
(though still more restricted). It is also comparable to the model used by Rivas
and Eddy in allowing secondary structure prediction in time
O( |s| 6 )
. The spacerequirementcanstillbelimitedto
O( |s| 3 )
though.We couldkeepplayingthis game of modifying models and algorithms to obtain
thebest possiblecombinationofafastalgorithm andbroadclassof legalsecondary
structures. But for any class of secondary structures with pseudoknots we should
probably not expect to do better than the requirements of time
O( |s| 3 )
and spaceO( |s| 2 )
oftheclassicmfoldalgorithm. Furthermore,inthefollowingsectionwepro- videevidencethatweshouldnotsethopestohighfordevelopingecientalgorithmshandlingsecondarystructureswithgeneraltypesofpseudoknots.
4 Complexity Results
Inthis sectionweprovethat RNA secondarystructure predictionwithpseudoknots
is
NP
-complete in a simple nearest neighbour model, cf. denition 1. This modelmight seem too simple, and probably would be if we wanted to base a secondary
structurepredictionalgorithmonit. Butwhen provingcomplexityresults,wewant
todosoinamodelthat isassimpleaspossible. Iftheproblemin thesimplemodel
is
NP
-complete, it will remainsoin anymorecomplexand realisticmodel ifxingDenition1 (The Nearest Neighbour Pseudoknot Model) Let
S
be a sec-ondarystructureon asequence
s ∈ {A, C, G, U } ∗
,with|s| = n
,that is,S
isa setofbasepairs
i · j
where1 ≤ i < j ≤ n
and∀i · j, i 0 · j 0 ∈ S : i = i 0 ⇔ j = j 0
. Theenergyof
S
isanindependent sum ofenergies ofeach ofthe basepairs inS
,E ( S ) = X
i·j∈S
E ( i · j, i + 1 , j − 1) ,
where the energy ofabasepair
i · j
only depends on•
the basepair itself,that is,the types ofbasesformingthe pair.•
the twoneighbouring basesi + 1
andj − 1
,thatis, thetypesof thesetwobases.Furthermore, if
i + 1 · j 0 ∈ S
(ori 0 · j − 1 ∈ S
) theenergy candependonj 0
(oron
i 0
).Note that the Nearest Neighbour Pseudoknot Model allowsarbitrarily complex
pseudoknotsasthereisnorestrictionthatbasepairsarenotallowedtooverlap. The
energyofabasepairintheNearestNeighbourPseudoknotModelisallowedtodepend
onnon-neighbouringbases,butonlythroughabasepairingwithaneighbouringbase.
IfwecomparethistotheTinocomodel,cf.[14],theTinocomodelallowstheenergy
of a base pair to depend, not only on the neighbouring bases and the base pairs
they mightparticipate in, but on all bases and base pairs in the loop it closes. If
weconsiderthemodelassumedbythemfoldserver,this ismorerestrictedthanthe
Tinocomodel. Stillitallowstheenergyofabasepairtodependonthetypeofloopit
closes,thesizeoftheloop,andcoaxialstackingofbasepairsinvolvingneighbouring
bases.TheNearestNeighbourPseudoknotModelcanbeseenasafurtherrestriction
ofthis where weonly allowthe energy ofa base pairto depend on stacking eects
withunpaired neighbouringbasesand basepairsinvolvingneighbouringbases. The
valueofthesestackingeectscanhoweverdependonwhethertheinvolvedbasepairs
formahelix, anordinaryloop(a bulgeormultibranched loop),orapseudoknot.
Thus, if we compare the Nearest Neighbour Pseudoknot Model to the energy
modelusedbyRivasandEddy,cf.[10],itshouldbeoflittlesurprisethattheNearest
NeighbourPseudoknotModelis arestrictionof themodel usedbyRivasand Eddy.
TheNearest NeighbourPseudoknot Model canbe obtainedfrom the energy model
used by Rivas and Eddy by xing someof the parameters. Thus an
NP
-hardnessresultforsecondarystructurepredictionintheNearestNeighbourPseudoknotModel
immediatelyimpliesthatsecondarystructurepredictionintheenergymodelusedby
RivasandEddyis
NP
-hard.Proposition1 The problemofdeterminingwhethertheoptimal secondarystructure
inthe NearestNeighbourPseudoknot Modelhas energylower thansomeenergyvalue
E
isNP
-complete.As the problem trivially is in
NP
(guess the optimal secondary structure andverifyinpolynomialtimethatithasanenergyvaluelowerthan
E
),allweneedtodoistoprovethat itis
NP
-hard. Wewill dothisbyareductiontothespecialcaseof3satwhereeachliteraloccursatmosttwotimes,cf.[8,proposition9.3]. Throughout
theproofofthepropositionwewillallowonlyWatson-Crickbasepairs,i.e.
A
pairingwith
U
andC
pairingwithG
. Thiswillbecomeexplicitinthenalspecicationofthe basepairenergyfunction,andisonlyatechnicallimitationtoreducethecomplexityoftheproof. Beforeprovingproposition1weneedsomebuildingblocks.
Denition2 The
d
digit binary representation ofk
, where0 ≤ k ≤ 2 d − 1
, overthe alphabet
{A, U}
, is the stringb {A,U} ( k, d )
of lengthd
that interpreted as a bi- narynumber withA
representing0
andU
representing1
has the valuek
. Similarlyb {C,G} ( k, d )
isthed
digit binary representation ofk
overthe alphabet{C, G}
.The
k
'th distinct{A, U }
patternusingd
digit binaryrepresentationsisthestringA . . . A
| {z }
d+2
U b {A,U} ( k, d ) AUAb {A,U} ( k, d ) U A . . . A | {z }
d+2
.
The
k
'th distinct{C, G}
pattern usingd
digit binaryrepresentationsisdenedsimi- larly.Denition3 Forastring
s
the complementarystrings ¯
isthestring constructedby reversings
andreplacing eachA
with aU
,eachU
withanA
,eachC
withaG
,andeach
G
with aC
.Theneed forthese distinct patterns isto circumventthe fact that weonly have
fourlettersinthealphabetofRNAsequences. TheywillbeusedtoconstructanRNA
sequencecorrespondingto abooleanformulaonrestricted3satform,such thatthe
energyof an optimalsecondary structure ofthe constructed RNA sequenceimplies
whethertheformulaissatisable. TheconstructedRNAsequencewillconsistoftwo
parts,apartwheretheliteralsaregroupedaccordingtotheclausesandapartwhere
theliteralsaregroupedaccordingto thevariables.
Ifwehadanalphabetofarbitrarysizewecouldusetwosymbolstorepresenteach
occurrence of aliteral, onesymbolin theclauses partand the other symbolin the
literals part. A score of minus one could be assigned for each pairing of two such
symbolswithsomeextrapairsofsymbolsbeingusedtoformstructuresnullifyingthe
benetsofpairingmorethanonesymbolinaclause,orpairingasymbolrepresenting
avariableaswellaspairingasymbolrepresentingthisvariablesnegation.
Withoutanalphabetofarbitrarysizewewillinsteadusedistinct
{C, G}
patternsandtheir complementary stringsin the clauses and variables parts, respectively, to
representtheliteralsoftheformula. Ahelixformedbetweena
{C, G}
patternanditscomplementarystringwillindicate that thecorrespondingliteralis trueand wewill
chooseenergyparametersensuringthatsuchahelixusuallycontributesnegativelyto
thetotalenergy. Thedistinct
{A, U}
patternsandtheircomplementarystringswillbe usedtoformstructuresnullifyingbenetsofhavingmorethanonetrueliteralineachclause,andofhavingbothaliteralrepresentingavariableandaliteralrepresentingits
negationbeingtrueatthesametime. Thisisensuredbychoosingenergyparameters
suchthathelicesformedbythedistinct
{A, U}
patternsalsocontributenegativelytothetotalenergy,exceptifthecasetheyshould nullify occurs. Inthatsituation they
contributezerotothetotalenergy. Theformalspecicationoftheenergyparameters
ispostponedtilltheendofthissection.
Denition4 Let
C = l 1 ∨l 2 ∨ l 3
beabooleandisjunctionofthreeliterals. The clause blockC
ofC
usingd
digit binaryrepresentationsisthe string|{z}
S 1
|{z}
L 1
|{z} S ¯ 1
|{z}
S 2
|{z}
L 2
|{z}
S 1
|{z} S ¯ 2
|{z}
L 3
|{z}
S 2
,
where the
S i
's are distinct{A, U }
patterns usingd
digit binary representations for twodierentk
's, andtheL i
'saredistinct{C, G}
patternsusingd
digitbinary repre-sentationsfor threedierent
k
's.The rationalebehind this constructionis that we can form twohelices between
distinct
{A, U }
patterns and their complementary strings within the clause block.These twohelices will span dierent
L i
's, except for the casewhere theS 1
andS 2
anking
L 2
both form helices with their complementary string. In this case, the innermostbase pairoftheS 1
helixandtheoutermostbasepairoftheS 2
helix(andviceversa)willbeneighbouringbasepairsformingpseudoknots.
Furthermore,the
L i
's spannedbysuchahelixwill bescreened. Byscreened,wemean that at least one of the anking bases of the
L i
pattern cannot form a basepairwith a base not spanned by the helix without forming a pseudoknotwith the
innermostbasepairofthehelix. The
L i
patternthuscannotformtheintendedhelixwith its complementary string in the variable block, that we will describe shortly,
withoutintroducinga pseudoknotofneighbouringbase pairs. Withoutintroducing
neighbouringpseudoknottedbasepairs,foraclauseblockwecanthusformhelicesof
twoofthedistinct patternsstraightaway,andathirdhelix ifwecanpaironeofthe
L i
patternswithitscomplementarystringinthevariablespart.Denition5 Let
x
be a variable occurring in a boolean formula where each literaloccursatmost twice. The variable block
V
ofx
usingd
digit binaryrepresentations isthestring|{z}
S 1
|{z} P ¯ 1
|{z} P ¯ 2
|{z} S ¯ 1
|{z} N ¯ 1
|{z} N ¯ 2
|{z}
S 1
,
where
S 1
isadistinct{A, U }
pattern forsomek
,theP ¯ i
'sarecomplementarystrings tothe distinct{C, G}
patterns usedfor the atmost twopositive occurrences ofx
(ifx
occurs positive only once, one of theP ¯
patterns is omitted fromV
) and theN ¯ i
'sare complementary strings to the distinct
{C, G}
patterns used for the atmost twonegative occurrences of
x
(ifx
occurs negative only once, one of theN ¯
patterns isomittedfrom
V
).Therationalebehindthisconstructionisonceagaintouseahelixformedbyone
oftheoccurrencesof
S 1
and itscomplementarystringto screenthe complementary stringscorrespondingtoeitherthe(atmost)twopositiveoccurrencesofx
orthe(atmost)twonegativeoccurrencesof
x
. Ifwearetoavoidintroducingneighbouringbase pairsformingapseudoknot,eithernoneofthedistinctS 1
patternsformahelixwiththecomplementarystring, thecomplementary stringscorresponding tothe positive
occurrencesof
x
donotform helices,orthecomplementarystringscorrespondingto thenegativeoccurrencesofx
donotformhelices. WearenowreadytoconstructtheRNAsequencerepresentingabooleanformulaonrestricted3satform.
Denition6 Let
φ
be a boolean formula on conjunctive normal form where each clause has three literals and each literal occurs at most two times. Assume thatφ
consistsof
c
clauses andusesv
variables. The RNA sequence corresponding toφ
isthesequence
s φ = C 1 C 2 . . . C c V 1 V 2 . . . V v ,
where
C i
is the clause block usingd log 2 (3 c + v ) e
digit binary representations corre- sponding to thei
'th clause ofφ
,V i
is the variable block usingd log 2 (3 c + v ) e
digitbinary representations corresponding to the
i
'th variable ofφ
, no distinct pattern isused more than once, and the patterns corresponding to a literal and their comple-
mentarystrings occur inreverseorder.
Thechoice ofnumberofdigitsweusein thebinaryrepresentationsensuresthat
we canchoose at least
max { 3 c, 2 c + v}
dierent valuesfor distinct patterns. Eachclause block uses two distinct
{A, U}
patterns and three distinct{C, G}
patterns,whileeachvariableblockusesonedistinct
{A, U }
pattern. Thuswedonotrunoutofpatterns. Wewillusethetermcomplementarypattern forthedeliberateoccurrences
ofthecomplementarystringtoadistinctpattern, thatis, thestringsindicatedbya
barredpatternindenitions 4and5.
So far we have assumed that helices only form between a distinct pattern and
the complementary string designed to form a helix with it. Helices can of course
formbetweenpartsofdistinctpatternsnotdesignedtoformhelicestogether,butthe
followinglemmalimitsthelengthofsuchhelices.
Lemma1 Let
s φ
beanRNAsequenceconstructedfromabooleanformulaφ
accord-ingtodenition6. In anystructure
S
ofs φ
,any helixofconsecutivelystacking pairs oflengthatleast4 d + 5
,whered
isthe numberofdigitsusedfor thebinaryrepresen-tations,will have at least
2 d + 3
bases at the end of a distinct pattern forming basepairs with the intendedbasesofthe complementarypattern tothisdistinct pattern.
Proof. Byconstruction any substring of
s φ
of lengthat least4 d + 5
will containatleast
2 d + 3
bases from one of the ends of a distinct pattern or its complementary pattern. Consideroneofthetwosubstringsformingthehelix. Thiswillbeoflengthatleast
4 d + 5
and thus containat least2 d + 3
bases from adistinct patternoritscomplementary pattern. Assume withoutlossofgeneralitythat itcontainstherst
2 d + 3
basesfromthek
'thdistinct{A, U }
patternusingd
digitrepresentations,that is,itcontainsthesubstringA d+2 U b {A,U} ( k, d )
. Byconstruction,theonlyoccurrences ofd + 2
consecutiveU
's preceded byanA
ins φ
are at the ends of complementary patterns to distinct{A, U }
patterns, and thusA d+2 U b {A,U} ( k, d )
forms base pairswith
¯ b {A,U} ( k 0 , d ) AU d+2
forsomek 0
(bytheassumptionthatonlyWatson-Crickbase pairsareallowed). Asb {A,U} ( k, d )
pairswith¯ b {A,U} ( k 0 , d )
itfollowsthatk = k 0
.2
Wehavenowestablishedthatanyhelixofconsiderablelengthwillcontainatleast
partofadesignedpairing. Thenextlemmaestablishesthatthiswillbeallitcontains.
Lemma2 Let
s φ
beanRNAsequenceconstructedfromabooleanformulaφ
accordingtodenition6using
d
digitbinaryrepresentations. InanystructureS
ofs φ
,therearenohelicesofmorethan
4 d +9
consecutivelystackingbasepairscontainingonlyA
'sandU
'sorcontainingonlyC
'sandG
's. Theonlyhelices oflength4 d + 9
containingonlyA
's andU
's or containing onlyC
's andG
's are helices formed by distinct patternsandtheircomplementary pattern.
Proof. Bylemma1weknowthatahelixoflength
4 d + 9
willcontainoneoftheendsofadistinctpattern pairedwithitscomplementarypattern. All wehavetoshowis,
that we cannot extenda helixformed bya distinct pattern and its complementary
patternwithanextrastackingpairofbasesofthesametype.
If the distinct pattern is a
{C, G}
pattern this is straightforward, as it will be in a clause block and thus bordered by anA
and aU
, or by twoA
's. Similarly,the complementary pattern of a distinct
{A, U }
pattern from a variable block willbe bordered by two
G
's. Finally, the complementary pattern to a distinct{A, U }
patternfromaclauseblockwillbeborderedbyan
A
ononeside,cf.denition4. Buttakingthe
S ¯ 1
complementarypattern asexample,thisA
should form anillegal (bytheWatson-Crickbase pairassumption)base pairwith either theleftmost
A
oftheprecedingclauseblockor therightmost
C
intheL 2
patterntoextendthehelix.2
Proof (ofproposition 1). Asmentionedabovethereductionwillbefrom3satwith
therestrictionthateachliteralappearsatmosttwice. Solet
φ
beavalidformulaforthis restrictionof 3sat with
c
clauses andv
variables. Inpolynomial time, we canconstruct
s φ
accordingto denition6,andthebase pairenergyfunctionE ( X i · Y j , V i+1 , W j−1 ) =
− 1
ifV i+1 · W j−1 ∈ S
andeitherX · Y, V · W ∈ {A · U, U · A}
or
X · Y, V · W ∈ {C · G, G · C}
4 d + 7
ifX · Y ∈ {A · U, U · A, C · G, G · C}
and for
j 0 6∈ {i + 1 , . . . j − 1 }
we haveV i+1 · Z j 0 , W j−1 · Z j 0 , Z j 0 · V i+1 ,
Z j 0 · W j−1 6∈ S 4 d + 8
otherwisewhere
d
isthenumberofdigitsusedforthebinaryrepresentationsins φ
andS
isthestructureforwhichtheenergyiscalculated. Thenotation
X i
isusedasashorthandtoindicatethat the
i
'thbaseisoftypeX
.Weclaimthat theoptimalsecondarystructureof
s φ
withtheaboveenergyfunc-tionhasenergy
− (3 c + v )
ifandonlyifφ
is satisable. Bytheenergyfunction,theonlyhelices forwhich thebasepairs combinedyields anegativecontribution tothe
energyofthesecondarystructurearehelicesof atleast
4 d + 9
basepairs,base pairsthatareeither all
A
'spairingwithU
'sorallC
'spairingwithG
's. Bylemma2,theonly such helices that can be formed are between distinct patterns and their com-
plementarypatterns; these helices will consistof exactly
4 d + 9
base pairsandthuscontribute
− 1
tothetotalscoreofasecondarystructure,provided thattheinnermostbasepairofthehelixdoesnothaveaneighbouringbasepairthatformsapseudoknot.
Hence,ifadistinct patternis screenedbyahelix,it cannotform ahelixyielding a
negativecontributiontothetotalenergy.
If there is an assignment of truth values to the variables of
φ
satisfyingφ
, wecanconstruct a secondarystructure
S
ons φ
with energy− (3 c + v )
based on thisassignmentbyformingthefollowingbase pairs.
•
Foreachvariableblockformingthehelixofthedistinct{A, U }
patternandthecomplementary pattern screening the complementary patterns of the literals
thatbecomefalsebytheassignment.
•
For each clauseblockformingthehelicesbetweenthedistinct{A, U }
patternsthat leave the distinct
{C, G}
pattern of a literal that becomes true by theassignmentunscreened.
•
Forming thehelices betweentheunscreeneddistinct patternsof literals intheclausespartandtheircomplementarypatterns(thatare unscreenedastheas-
signmentsatises
φ
,andasthereverseorderrequirementindenition6ensures thetwocomplementary patterns correspondingto the sameliteral nothavingneighbouringbasepairsforming apseudoknot)in thevariablespart.
Bythe discussion following denition 4, the distinct patterns of aclause block can
form at most three helices, each yielding a contribution of
− 1
, and each variableblockintroducesonlyonenewdistinctpattern;hencetheenergyof
S
of− (3 c + v )
isoptimal.
Assume now that
s φ
has an optimal structureS
of energy− (3 c + v )
. By theaboveand the discussion following denition 4, we get that each clause block will
contain a distinct pattern corresponding to a literal forming a helix with its un-
screenedcomplementarypattern in the variablespart, andthat the complementary
patternscorrespondingeithertoavariableortoitsnegationwillbescreened. Wecan
thus inferatruthassignmentto thevariablesof
φ
satisfyingφ
from theunscreenedcomplementarypatternsofliteralsin
S
.2
Theenergyfunctionspeciedintheproofofproposition1rewardsstackingsome
base pairs, penalises loops by penalising the rst base pair in a helix, and further
penalises neighbouring base pairs that form a pseudoknot. The only two remark-
ableoddities arethedisallowanceofbase pairingsbetween
G
andU
, andpenalisingstackingan
A, U
basepairwithaC, G
basepair.Onecanobservethat wecould allow
G, U
base pairswithoutchanginganythingbutinsertinga
C
betweenthetwocomplementarypatternscorrespondingtothesame literal. AsforpenalisingstackingA, U
basepairswithC, G
basepairs,thiswaschosentoeaseestablishingthefactthatnoenergybenetsareobtainedbyextendingahelix
formedbyadistinctpatternanditscomplementarypatternbyfurtherstackingbase
pairs. Aproofwheretheenergyfunctionrewardsstackingofallcombinationsof
A, U
basepairs,
C, G
basepairs andG, U
base pairscanbeachieved by amoreinvolvedconstructionoftheclausespartof
s φ
. However,to limitthecomplexityoftheproof,wehavechosentopresenttheaboveversion.
5 Discussion
The NP-completeness of the RNA secondary structure prediction problem in the
NearestNeighbour Pseudoknot Model tells us, that any algorithm allowing energy
functions suciently general to be specialised to the energy functions in the Near-
estNeighbourPseudoknotModel,andrunninginworstcasepolynomialtime,would
implyP
=
NP. The questionwhether ornot P is equalto NP is one of the fun-damental open problems in computer science. Based on strong evidence, the large
majorityofcomputerscientistsbelievethat P
6 =
NP. The NP-completenessofthe RNAsecondary structureprediction problemin the NearestNeighbourPseudoknotModelthushintsthat thereisonlylittlehopeforaworstcasepolynomialtimealgo-
rithmforRNAsecondarystructure predictionintheNearestNeighbourPseudoknot
Model, or models extending it. Moreover, it hints that any algorithm for predict-
ingRNA secondarystructureswith generalpseudoknots mostlikelyhaveto exploit
specicpropertiesofaxedenergyfunction toobtainpolynomialrunningtime .
OneapproachtoobtainapolynomialtimealgorithmforRNAsecondarystructure
predictionwith pseudoknots is to limit the types of legal pseudoknots. This is the
approachtakenbyRivasandEddyin [10]andbyusinsection3. Anotherapproach
istakenbyTabaskaetal.in[13],whereinteractionsbetweenneighbouringbasepairs
areignored,thusreducingtheproblemofRNA secondarystructureprediction(with
pseudoknots) to computea maximal weightedmatching. If we are satisedto nd
notnecessarilythe structures ofleast free energy, then heuristics canbe applied to
searchforstructuresof lowenergy. Forexample,vanBatenburget al.in[17] report
onsuccessfulexperimentswithapplyinggeneticalgorithmstotheproblemofnding
lowenergyRNA secondarystructurescontainingpseudoknots.
References
[1] S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models.
Nucleic Acids Research,22:20792088,1994.
[2] J. Gorodkin, L. J. Heyer, and G. D. Stormo. Finding common sequence and
structuremotifsinasetofRNAsequences.InProceedingsofthe5thInternational
ConferenceonIntelligentSystemsforMolecularBiology (ISMB),pages120123,
1997.
[3] B. Knudsen and J.Hein. RNA secondarystructure predictionusing stochastic
context-free grammars and evolutionary history. Bioinformatics, 15:446454,
1999.
[4] R.B.Lyngsø,M.Zuker,andC.N.S.Pedersen.Fastevaluationofinternalloops
in RNAsecondarystructure prediction. Bioinformatics, 15(6):440445,1999.
[5] J.S.McCaskill. Theequilibriumpartitionfunction andbasepairbindingprob-
abilitiesforRNAsecondarystructure. Biopolymers,29:11051119,1990.
[6] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the secondary
structure ofsingle-stranded RNA. Proceedings of the NationalAcademy of Sci-
enceofthe USA,77(11):63096313,1980.
[7] R.Nussinov,G.Pieczenik,J.R.Griggs,andD.J.Kleitman.Algorithmsforloop
matchings. SIAMJournalonAppliedMathematics,35:6882,1978.
[8] C. H. Papadimitriou. Computational Complexity. Addison-Wesley Publishing
Company,1994.
[9] C.W.A.Pleij. RNApseudoknots.InR.F.GestelandandJ.F.Atkins,editors,
The RNAWorld. ColdSpring HarborLaboratoryPress,1993.
predictionincludingpseudoknots. Journalof MolecularBiology,285:20532068,
1999.
[11] Y.Sakakibara,M.Brown,R.Hughey,I.S.Mian,K.Sjölander,R.C.Underwood,
andD.Haussler. Stochasticcontext-freegrammarsfortRNAmodeling. Nucleic
AcidsResearch,22:51125120,1994.
[12] D. Sanko. Simultaneous solutionof theRNA folding,alignmentand protose-
quence problems. SIAMJournal onApplied Mathematics,45:810825,1985.
[13] J. E. Tabaska,R. B.Cary, H. N. Gabow, andG. D. Stormo. An RNA folding
method capable of identifying pseudoknots and base triples. Bioinformatics,
14(8):691699,1998.
[14] I. Tinoco, P. N. Borer, B. Dengler, M. D. Levine, O. C. Uhlenbeck, D. M.
Crothers,and J.Gralla. Improvedestimation ofsecondarystructurein ribonu-
cleicacids. NatureNewBiology,246:4041,1973.
[15] I.Tinoco,O.C.Uhlenbeck,andM.D.Levine.Estimationofsecondarystructure
in ribonucleicacids. Nature,230:362367,1971.
[16] D.H.Turner,N.Sugimoto,andS.M.Freier. RNAstructureprediction.Annual
Review ofBiophysics andBiophysical Chemistry,17:167192,1988.
[17] F. H. D. van Batenburg, A. P. Gultyaev, and C. W. A. Pleij. An APL-
programmedgenetic algorithm for the predictionof RNA secondarystructure.
Journalof Theoretical Biology,174(3):269280,1995.
[18] K.ZhangandD.Shasha.Simplefastalgorithmsfortheeditingdistancebetween
treesandrelatedproblems.SIAMJournalonComputing,18(6):12451262,1989.
[19] M.ZukerandP.Stiegler.OptimalcomputerfoldingoflargeRNAsequencesusing
thermodynamics andauxiliaryinformation. Nucleic Acids Research,9:133148,
1981.