BRICS Basic Research in Computer Science


Pseudoknots in RNA Secondary Structures

Rune B. Lyngsø

Christian N. S. Pedersen

BRICS Report Series RS-00-1

ISSN 0909-0878 January 2000

Copyright © 2000, Rune B. Lyngsø & Christian N. S. Pedersen.

BRICS, Department of Computer Science University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Report Series publications.

Copies may be obtained by contacting:

BRICS
Department of Computer Science
University of Aarhus
Ny Munkegade, building 540
DK–8000 Aarhus C
Denmark

Telephone: +45 8942 3360
Telefax: +45 8942 3255
Internet: BRICS@brics.dk

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk
ftp://ftp.brics.dk

This document in subdirectory RS/00/1/

Rune B. Lyngsø∗    Christian N. S. Pedersen∗

Abstract

RNA molecules are sequences of nucleotides that serve as more than mere intermediaries between DNA and proteins, e.g. as catalytic molecules. Computational prediction of RNA secondary structure is among the few structure prediction problems that can be solved satisfactorily in polynomial time. Most work has been done to predict structures that do not contain pseudoknots. Allowing pseudoknots introduces modelling and computational problems. In this paper we consider the problem of predicting RNA secondary structure when certain types of pseudoknots are allowed. We first present an algorithm that in time $O(n^5)$ and space $O(n^3)$ predicts the secondary structure of an RNA sequence of length $n$ in a model that allows certain kinds of pseudoknots. We then prove that the general problem of predicting RNA secondary structure containing pseudoknots is NP-complete for a large class of reasonable models of pseudoknots.

1 Introduction

An RNA molecule is a sequence of nucleotides that often is just an intermediary between DNA and proteins. Some RNA molecules do however have vital importance, e.g. in translation of mRNA to proteins. The three dimensional structure of an RNA molecule is to a large extent determined by interactions between pairs of nucleotides, called base pairings. The secondary structure of an RNA molecule is the set of base pairings in the three dimensional structure of the molecule. The secondary structure can thus be used in its own right to look for information, e.g. active sites, or as a stepping stone towards prediction of higher structural levels.

If the three dimensional, or tertiary, structure of an RNA molecule is available it is of course easy to determine the secondary structure. But determining the tertiary structure is a complicated and time consuming task. When the tertiary structure of an RNA molecule is not available, the authoritative way of determining the secondary structure of an RNA molecule is by comparative modelling. Given a number of related RNA sequences the common secondary structure is inferred by identifying compensatory mutations, that is, by identifying pairs of positions where a mutation of the base in one of the positions is accompanied by a mutation of the base in the other position to retain their base pairing capability. The drawback of this technique is that it requires several related RNA sequences to be available. Moreover, since [...] difficult to fully automate comparative modelling.

∗ Basic Research In Computer Science (BRICS), Centre of the Danish National Research Foundation, Department of Computer Science, University of Aarhus, Ny Munkegade, 8000 Århus C, Denmark. E-mail: {rlyngsoe,cstorm}@brics.dk.

Thus computational methods for predicting the secondary structure of an RNA sequence are in demand. To construct such methods it is necessary to model the biological reality that governs structure formation. Inspired by the laws of thermodynamics this is often done in terms of energy minimisation. Using a model that describes how to assign free energies to legal secondary structures, the secondary structure of an RNA sequence is predicted as the structure of least free energy. The biological relevance of the predicted structure and the computational resources such as time and space that are needed to compute it depend entirely on the choice of legal structures and free energies. Most work has been devoted to constructing algorithms for RNA secondary structure prediction when the legal structures are limited to secondary structures that do not contain pseudoknots, that is, do not contain overlapping base pairs. Nussinov et al. in [7] present an algorithm using a simple free energy function that is minimised when the secondary structure contains the maximum number of complementary base pairs. The algorithm takes time $O(n^3)$ for predicting the secondary structure of an RNA sequence of length $n$. A more complex model for the free energy of secondary structures is proposed by Tinoco et al. in [15]. This model states that the free energy of a secondary structure is the sum of independent energies for each loop in the structure. Based on this model of free energy, Zuker and Stiegler in [19], and Nussinov and Jacobson in [6], present algorithms that take time $O(n^3)$ for predicting the secondary structure of an RNA sequence of length $n$. Since the ideas of these algorithms form the basis of the widely used mfold server for RNA secondary structure prediction, they are commonly referred to as mfold algorithms, or algorithms of the mfold type.
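The base pair maximisation idea attributed to Nussinov et al. above can be illustrated with a small dynamic program. The following is a minimal sketch, not the formulation of [7]: the function name `max_base_pairs` is hypothetical, only Watson-Crick pairs are assumed to be complementary, and no minimum hairpin loop size is enforced.

```python
def max_base_pairs(s):
    """Maximum number of nested (pseudoknot-free) complementary base pairs in s."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}  # assumed pairing rule
    n = len(s)
    if n == 0:
        return 0
    # N[i][j] holds the optimum for the substring s[i..j] (0-based, inclusive).
    N = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]                      # base j left unpaired
            for k in range(i, j):                   # base j paired with base k
                if (s[k], s[j]) in pairs:
                    left = N[i][k - 1] if k > i else 0
                    inner = N[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            N[i][j] = best
    return N[0][n - 1]
```

The three nested loops over `span`, `i` and `k` give the cubic running time mentioned in the text. Minimising the energy in this simple model corresponds to maximising the value returned here.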

The reason that legal structures are often required not to contain pseudoknots is not that pseudoknots do not occur in real world structures, but rather because of modelling and computational considerations. It is still an open question how to construct a reasonable model of free energy for structures containing pseudoknots that also makes it possible to construct efficient structure prediction algorithms. Rivas and Eddy in [10] present an algorithm that in time $O(n^6)$ and space $O(n^4)$ predicts the secondary structure of an RNA sequence of length $n$ in a model that allows certain kinds of pseudoknots. In this paper we study the problem of predicting RNA secondary structure containing pseudoknots further. In section 2 we briefly review the ideas of the mfold algorithms. Extending on these ideas, we in section 3 present an algorithm for predicting RNA secondary structure when certain types of pseudoknots are allowed. We compare the presented algorithm with the algorithm presented by Rivas and Eddy in [10]. In section 4 we show that predicting RNA secondary structures containing pseudoknots of arbitrary types is NP-complete for a large class of reasonable free energy functions. Finally, in section 5 we discuss the implications of the NP-completeness result.

2 Terminology

For an RNA sequence $s$, $|s| = n$, a secondary structure is a set $S$ of base pairs $i \cdot j$ with $1 \le i < j \le n$, such that $\forall i \cdot j, i' \cdot j' \in S : i = i' \Leftrightarrow j = j'$. Each base can thus take part in at most one base pair. The base pairs of a secondary structure describe the base pairing interactions formed by hydrogen bonding in a corresponding tertiary structure. It is usually assumed that RNA secondary structures do not contain pseudoknots. Two base pairs form a pseudoknot if they are overlapping, i.e. two base pairs $i \cdot j, i' \cdot j' \in S$ form a pseudoknot if $i < i' < j < j'$. The term pseudoknot is also used as a shorthand for other overlapping structures than base pairs, e.g. two helices of stacking base pairs, when the base pairs of these structures form pseudoknots.

Figure 1: General recursion scheme for the Rivas and Eddy RNA secondary structure prediction algorithm.

There are of course good reasons for introducing this restriction, prominent among which is a simplification of legal structures. The simplification of not allowing pseudoknots ensures that two base pairs $i \cdot j, i' \cdot j' \in S$ are either nested, i.e. $i < i' < j' < j$, or disjoint, i.e. $i < j < i' < j'$. In many situations this allows us to first handle one base pair and then the other (if they are nested), or handle them independently (if they are disjoint). The pseudoknot restriction is thus crucial in algorithms for e.g. structure prediction [1,3,6,11,19], partition function calculations [5], comparing secondary structures [18], and simultaneous alignment and structure prediction of RNA sequences [2,12]. In the following we will exemplify this by giving a brief summary of an algorithm of the mfold type for secondary structure prediction. The summary is also aimed at introducing the terminology we will use in section 3. A more detailed summary can be found in e.g. Turner et al. [16].
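The three possible relations between two base pairs (nested, disjoint, or forming a pseudoknot) follow directly from the inequalities above. A minimal sketch, with hypothetical function names `relation` and `has_pseudoknot` and base pairs represented as tuples `(i, j)` with `i < j`:

```python
from itertools import combinations

def relation(p, q):
    """Classify two base pairs (i, j) and (i2, j2) that share no base."""
    (i, j), (i2, j2) = sorted([p, q])   # order the pairs so that i <= i2
    if i < i2 < j2 < j:
        return "nested"
    if i < j < i2 < j2:
        return "disjoint"
    if i < i2 < j < j2:
        return "pseudoknot"
    raise ValueError("base pairs share a base, so this is not a secondary structure")

def has_pseudoknot(S):
    """True if any two base pairs of the structure S overlap."""
    return any(relation(p, q) == "pseudoknot" for p, q in combinations(S, 2))
```

For a structure without pseudoknots every pair of base pairs is nested or disjoint, which is what allows the recursive decomposition used by algorithms of the mfold type.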

An mfold algorithm predicts secondary structures by computing minimum (or close to minimum) energy structures in the model proposed by Tinoco et al. [14] extended with simplifying assumptions about the nature of the energy function for multibranched loops. Three arrays, $V(i, j)$ holding the minimum energy of a secondary structure on $s[i..j]$ with bases $i$ and $j$ forming a base pair, $WM(i, j)$ holding the minimum energy of a structure on $s[i..j]$ that is part of a multibranched loop, and $W(i)$ holding the minimum energy of a structure on $s[1..i]$, are computed based on the recursions

$$V(i, j) = \min \begin{cases} eH(i, j), \\ eS(i, j, i + 1, j - 1) + V(i + 1, j - 1), \\ \min_{i < i' < j' < j,\; i' - i + j - j' > 2} \{ eL(i, j, i', j') + V(i', j') \}, \\ \min_{i + 1 < k < j} \{ WM(i + 1, k - 1) + WM(k, j - 1) + a \}, \end{cases} \tag{1}$$

$$WM(i, j) = \min \begin{cases} V(i, j) + b, \\ WM(i, j - 1) + c, \\ WM(i + 1, j) + c, \\ \min_{i < k \le j} \{ WM(i, k - 1) + WM(k, j) \}, \end{cases} \tag{2}$$

$$W(i) = \min \begin{cases} W(i - 1), \\ \min_{0 \le k < i} \{ W(k) + V(k + 1, i) \}. \end{cases} \tag{3}$$

These recursions employ energy functions for hairpin loops ($eH$), stacking base pairs ($eS$), internal loops and bulges ($eL$), and multibranched loops ($eM(k, k') = a + bk + ck'$, where $k$ is the number of helices and $k'$ the number of unpaired bases in the multibranched loop, so that each helix contributes $b$ and each unpaired base contributes $c$ as in recursion 2). With the currently used parameters for the energy functions these recursions allow for an $O(|s|^3)$ time algorithm, cf. [4, 16], for computing secondary structures of minimum energy for an RNA sequence $s$.
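To make the structure of recursions 1 to 3 concrete, the following sketch evaluates them directly with memoisation. It is an illustration under loudly stated assumptions and not the mfold implementation: the energy functions `eH`, `eS` and `eL` and the constants `a`, `b`, `c` are toy stand-ins chosen here, only Watson-Crick pairs are allowed, hairpins are assumed to need at least three unpaired bases, and the unrestricted minimisation over internal loops makes this naive version run in time $O(|s|^4)$ rather than $O(|s|^3)$.

```python
import math
from functools import lru_cache

WC = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def min_energy(s, a=3.0, b=1.0, c=0.5):
    """Evaluate W(|s|) from recursions (1)-(3) with toy energy functions."""
    n = len(s)
    INF = math.inf

    def pairs(i, j):                       # 1-based positions, as in the text
        return (s[i - 1], s[j - 1]) in WC

    # Toy stand-ins for eH, eS and eL (assumed values, not fitted parameters).
    def eH(i, j):
        return 3.0 if j - i - 1 >= 3 else INF
    def eS(i, j, i2, j2):
        return -2.0
    def eL(i, j, i2, j2):
        return 1.0 + 0.2 * ((i2 - i - 1) + (j - j2 - 1))

    @lru_cache(maxsize=None)
    def V(i, j):                           # recursion (1)
        if i >= j or not pairs(i, j):
            return INF
        best = min(eH(i, j), eS(i, j, i + 1, j - 1) + V(i + 1, j - 1))
        for i2 in range(i + 1, j):
            for j2 in range(i2 + 1, j):
                if i2 - i + j - j2 > 2:    # internal loop or bulge, not a stack
                    best = min(best, eL(i, j, i2, j2) + V(i2, j2))
        for k in range(i + 2, j):          # multibranched loop closed by i.j
            best = min(best, WM(i + 1, k - 1) + WM(k, j - 1) + a)
        return best

    @lru_cache(maxsize=None)
    def WM(i, j):                          # recursion (2)
        if i > j:
            return INF
        best = min(V(i, j) + b, WM(i, j - 1) + c, WM(i + 1, j) + c)
        for k in range(i + 1, j + 1):
            best = min(best, WM(i, k - 1) + WM(k, j))
        return best

    W = [0.0] * (n + 1)                    # recursion (3), with W(0) = 0
    for i in range(1, n + 1):
        W[i] = min([W[i - 1]] + [W[k] + V(k + 1, i) for k in range(i)])
    return W[n]
```

With these toy values, a sequence such as `GGGAAACCC` folds into a three base pair helix closing a hairpin loop, with energy $3 - 2 - 2 = -1$, while a sequence that admits no base pairs keeps the energy $0$ of the empty structure.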

3 Algorithmic Results

The Tinoco model, cf. [14], describes how to assign energies to secondary structures not containing pseudoknots, but does not address how to handle secondary structures containing pseudoknots. To develop algorithms for predicting secondary structures containing pseudoknots, an important step is to decide on a model, i.e. to give a description of the types of legal secondary structures, and how to assign energies to these structures. As developing an algorithm and deciding on a model are closely connected processes, the description of the model is often only in part given explicitly. Often the types of legal secondary structures are only defined implicitly through the constructed algorithm.

An example of this is the pseudoknot model used by Rivas and Eddy in [10]. This is, to our knowledge, the only rigorous, energy based, polynomial time algorithm for RNA secondary structure prediction including a class of pseudoknots. In figure 1 we briefly sketch the idea of the Rivas and Eddy algorithm. Arrays holding energies of optimal structures for the subsequence from $i$ through $j$ are maintained similar to equations 1 to 3, but with the further restriction that the bases from $k$ through $l$ are yet unpaired (to allow for future pseudoknot interactions). The general recursion scheme for an entry in one of these matrices is to minimise over all possible ways of splitting the subsequence with an unpaired region into two new subsequences with unpaired regions. This defines the legal structures of the model. The energy parameters, cf. [10, table 3], used were partly fine tuned by hand and partly obtained by multiplying similar parameters for unknotted structures by a weighting parameter.

Figure 2: A model for a class of pseudoknots. The optimal energy is the minimum over $i < j < k < l$ of the sum of the optimal energies of the two pairs of opposing parts of the sequence delimited by $i$, $j$, $k$ and $l$. The sequence has been drawn as a circle to highlight that one of the four parts of the sequence might extend across the sequence ends, here shown with a zigzagged line.

The requirements of time $O(|s|^6)$ and space $O(|s|^4)$ for this algorithm are observations that follow directly from figure 1. Though polynomial, these time and space requirements are rather steep and in [10] an estimate of 130–140 bases is mentioned as a rough upper bound for the size of sequences for which the algorithm is feasible. Though computational power is ever increasing, applying Moore's law (stating that computational power doubles every 18 months) still only allows sequences of 300 bases ten years from now and of 650 bases in twenty years. Nevertheless, the experiments based on this algorithm reported in [10] show the feasibility of energy-based predictions of RNA secondary structures with pseudoknots.
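The Moore's law estimate above can be reproduced with a short calculation. The sketch below assumes that running time is the binding constraint, that it scales exactly as $n^6$ with unchanged constant factors, and that the feasible amount of computation doubles every 1.5 years; the function name `feasible_length` is hypothetical.

```python
def feasible_length(n_now, years, exponent=6, doubling_years=1.5):
    """Sequence length that becomes feasible after `years` of hardware doubling."""
    speedup = 2.0 ** (years / doubling_years)     # growth in computational power
    return n_now * speedup ** (1.0 / exponent)    # n^exponent must grow by speedup

for years in (10, 20):
    low, high = feasible_length(130, years), feasible_length(140, years)
    print(f"{years} years: roughly {low:.0f} to {high:.0f} bases")
```

Under these assumptions the 130–140 base bound grows to roughly 280–300 bases after ten years and roughly 610–650 bases after twenty years, in line with the figures of 300 and 650 bases quoted in the text.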

To obtain a faster algorithm, we propose a more restricted model for legal secondary structures. The legal secondary structures of our model are structures where we can split the sequence into four parts (one of which might extend across the ends of the sequence) as illustrated in figure 2. The splitting into four parts divides the sequence into two pairs of opposing subsequences, illustrated in figure 2 as pairs of black and grey parts of the sequence. Each pair of opposing subsequences is allowed to form an unknotted secondary structure and the pseudoknotted secondary structure arises when these two secondary structures are combined.

To further explain the types of secondary structures allowed in this model, consider a pseudoknot of type H as illustrated in figure 3. A pseudoknot of type H consists of two overlapping helices, each closing a hairpin loop, such that some of the bases in the hairpin closed by one of the helices are part of the other helix. As indicated in figure 3, we can split a pseudoknot of type H into four parts such that only bases in non-neighbouring, or opposing, parts form base pairs. The model in figure 2 can be seen as a generalisation of pseudoknots of type H where

• the overlapping structures can be arbitrary, complex secondary structures not containing pseudoknots.

• the loop regions closed by the overlapping structures do not need to be hairpin loops. They can be part of any type of loop as long as they are consecutive stretches of bases.

The model in figure 2 thus encompasses secondary structures with one pseudoknot of type H (or of type B or type I, cf. [9, figure 3]) among others.

As just explained, our model allows only one (albeit very complex) pseudoknot, so in that respect our model is a step backward compared to the model used by Rivas and Eddy. But if we can develop more efficient algorithms for secondary structure prediction in this model, it finds its justification in cases where using the Rivas and Eddy algorithm is infeasible and we expect, or are content, to find only one pseudoknot interaction. In the rest of this section we will focus on developing an efficient algorithm for secondary structure prediction in our model.

Figure 3: A pseudoknot of type H [...] pairings.

A straightforward algorithm to solve this problem would be to run through all the $O(|s|^4)$ choices of splits and compute the energy of the optimal structures of the two pairs of subsequences. With an $O(|s|^3)$ time computation for each split, this would require time $O(|s|^7)$ and space $O(|s|^2)$. One can however observe that when we compute the energy of the optimal structure of the subsequence from base $i$ to base $l$ with the subsequence from base $j$ to base $k$ removed, we also compute the energy of the optimal structure of the subsequence from base $i'$ to base $l'$ with the subsequence from base $j$ to base $k$ removed for all $i \le i' \le j$ and $k \le l' \le l$. Hence, by using these intermediate results from the dynamic programming algorithm, we can reduce the time requirement to $O(|s|^5)$ by just running through all the $O(|s|^2)$ choices of the removed subsequence. Unfortunately, we then have to store some intermediate results until other results become available. This increases the space requirement to $O(|s|^4)$. However, a more thorough investigation shows that the intermediate results computed with $k - 1$ as the right endpoint of the removed subsequence are only combined with intermediate results computed with $k$ as the left endpoint of the removed subsequence. This allows us to split the computation into $n$ independent phases, each requiring only space $O(|s|^3)$, thus reducing the overall space requirement to $O(|s|^3)$ while maintaining the $O(|s|^5)$ time requirement.

The formal specification of the sketched algorithm for predicting RNA secondary structures containing pseudoknots is given in algorithm 1. The specification is rather abstract. It is more an algorithm schema than a ready-to-implement algorithm. More specifically, an implementation would require several different arrays, storing energies under various assumptions of base pairings of flanking bases. In algorithm 1 we only show how to maintain one type of array ($V$). But the same technique can be used for maintaining several types of interdependent arrays used in an actual implementation of the algorithm.

Algorithm 1: Algorithm for predicting RNA secondary structures containing pseudoknots based on the model illustrated in figure 2.

/* $V_{j,k}(i, l)$ denotes the energy of the optimal structure for $s[i..j]$ concatenated with $s[k..l]$. */
$E = \infty$
for $k = 1$ to $|s|$ do   /* Fix one of the endpoints of the excluded region */
    Allocate memory for storing and calculating $V_{j,k}(i, l)$ and $V_{k-1,l}(j, i)$ for $i < j < k < l$
    /* Compute tables with $k$ (or $k - 1$) as right (or left) endpoint of excluded region. */
    for $j = 1$ to $k - 1$ do
        Compute table $V_{j,k}$
    endfor
    for $l = k$ to $|s|$ do
        Compute table $V_{k-1,l}$
    endfor
    /* Combine tables. */
    for $1 \le i < j < k < l \le |s|$ do
        $E = \min\{E, V_{j,k}(i, l) + V_{k-1,l+1}(j + 1, i - 1)\}$
    endfor
    Free allocated memory
endfor

The $O(|s|^5)$ running time of algorithm 1 should make it feasible for longer RNA sequences than the Rivas and Eddy algorithm. For example, if we assume that the constants hidden by the $O$ notation are similar for the two algorithms, the 130–140 bases upper bound for the Rivas and Eddy algorithm implies an upper bound of 350–375 bases for our algorithm. This increase might justify the restricted model of allowing only one pseudoknot. If this restriction is too severe, we could extend our model by allowing the sequence to be split into segments for each of which the optimal secondary structure is calculated using the model of figure 2. Such an extended model is more comparable to the model used by Rivas and Eddy in terms of legal structures (though still more restricted). It is also comparable to the model used by Rivas and Eddy in allowing secondary structure prediction in time $O(|s|^6)$. The space requirement can still be limited to $O(|s|^3)$ though.
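The step from 130–140 bases to roughly 350–375 bases can be checked by equating the amount of work of the two algorithms. This is a sketch under the same assumption as in the text, namely equal hidden constants, so that a length $m$ is feasible for the $O(|s|^5)$ algorithm when $m^5 = n^6$; the function name `equal_work_length` is hypothetical.

```python
def equal_work_length(n, from_exponent=6, to_exponent=5):
    """Length m with m**to_exponent == n**from_exponent, i.e. the same work."""
    return n ** (from_exponent / to_exponent)

print(round(equal_work_length(130)), round(equal_work_length(140)))
```

The computed values are about 344 and 376 bases, which the text rounds to 350–375.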

We could keep playing this game of modifying models and algorithms to obtain the best possible combination of a fast algorithm and broad class of legal secondary structures. But for any class of secondary structures with pseudoknots we should probably not expect to do better than the requirements of time $O(|s|^3)$ and space $O(|s|^2)$ of the classic mfold algorithm. Furthermore, in the following section we provide evidence that we should not set hopes too high for developing efficient algorithms handling secondary structures with general types of pseudoknots.

4 Complexity Results

In this section we prove that RNA secondary structure prediction with pseudoknots is NP-complete in a simple nearest neighbour model, cf. definition 1. This model might seem too simple, and probably would be if we wanted to base a secondary structure prediction algorithm on it. But when proving complexity results, we want to do so in a model that is as simple as possible. If the problem in the simple model is NP-complete, it will remain so in any more complex and realistic model if fixing [...]

Definition 1 (The Nearest Neighbour Pseudoknot Model) Let $S$ be a secondary structure on a sequence $s \in \{A, C, G, U\}^*$, with $|s| = n$, that is, $S$ is a set of base pairs $i \cdot j$ where $1 \le i < j \le n$ and $\forall i \cdot j, i' \cdot j' \in S : i = i' \Leftrightarrow j = j'$. The energy of $S$ is an independent sum of energies of each of the base pairs in $S$,

$$E(S) = \sum_{i \cdot j \in S} E(i \cdot j, i + 1, j - 1),$$

where the energy of a base pair $i \cdot j$ only depends on

• the base pair itself, that is, the types of bases forming the pair.

• the two neighbouring bases $i + 1$ and $j - 1$, that is, the types of these two bases. Furthermore, if $i + 1 \cdot j' \in S$ (or $i' \cdot j - 1 \in S$) the energy can depend on $j'$ (or on $i'$).

Note that the Nearest Neighbour Pseudoknot Model allows arbitrarily complex pseudoknots as there is no restriction that base pairs are not allowed to overlap. The energy of a base pair in the Nearest Neighbour Pseudoknot Model is allowed to depend on non-neighbouring bases, but only through a base pairing with a neighbouring base. If we compare this to the Tinoco model, cf. [14], the Tinoco model allows the energy of a base pair to depend, not only on the neighbouring bases and the base pairs they might participate in, but on all bases and base pairs in the loop it closes. If we consider the model assumed by the mfold server, this is more restricted than the Tinoco model. Still it allows the energy of a base pair to depend on the type of loop it closes, the size of the loop, and coaxial stacking of base pairs involving neighbouring bases. The Nearest Neighbour Pseudoknot Model can be seen as a further restriction of this where we only allow the energy of a base pair to depend on stacking effects with unpaired neighbouring bases and base pairs involving neighbouring bases. The value of these stacking effects can however depend on whether the involved base pairs form a helix, an ordinary loop (a bulge or multibranched loop), or a pseudoknot.

Thus, if we compare the Nearest Neighbour Pseudoknot Model to the energy model used by Rivas and Eddy, cf. [10], it should be of little surprise that the Nearest Neighbour Pseudoknot Model is a restriction of the model used by Rivas and Eddy. The Nearest Neighbour Pseudoknot Model can be obtained from the energy model used by Rivas and Eddy by fixing some of the parameters. Thus an NP-hardness result for secondary structure prediction in the Nearest Neighbour Pseudoknot Model immediately implies that secondary structure prediction in the energy model used by Rivas and Eddy is NP-hard.

Proposition 1 The problem of determining whether the optimal secondary structure in the Nearest Neighbour Pseudoknot Model has energy lower than some energy value $E$ is NP-complete.

As the problem trivially is in NP (guess the optimal secondary structure and verify in polynomial time that it has an energy value lower than $E$), all we need to do is to prove that it is NP-hard. We will do this by a reduction from the special case of 3sat where each literal occurs at most two times, cf. [8, proposition 9.3]. Throughout the proof of the proposition we will allow only Watson-Crick base pairs, i.e. $A$ pairing with $U$ and $C$ pairing with $G$. This will become explicit in the final specification of the base pair energy function, and is only a technical limitation to reduce the complexity of the proof. Before proving proposition 1 we need some building blocks.

Definition 2 The $d$ digit binary representation of $k$, where $0 \le k \le 2^d - 1$, over the alphabet $\{A, U\}$, is the string $b_{\{A,U\}}(k, d)$ of length $d$ that interpreted as a binary number with $A$ representing $0$ and $U$ representing $1$ has the value $k$. Similarly $b_{\{C,G\}}(k, d)$ is the $d$ digit binary representation of $k$ over the alphabet $\{C, G\}$.

The $k$'th distinct $\{A, U\}$ pattern using $d$ digit binary representations is the string

$$\underbrace{A \ldots A}_{d+2}\, U\, b_{\{A,U\}}(k, d)\, AUA\, b_{\{A,U\}}(k, d)\, U \underbrace{A \ldots A}_{d+2}.$$

The $k$'th distinct $\{C, G\}$ pattern using $d$ digit binary representations is defined similarly.

Definition 3 For a string $s$ the complementary string $\bar{s}$ is the string constructed by reversing $s$ and replacing each $A$ with a $U$, each $U$ with an $A$, each $C$ with a $G$, and each $G$ with a $C$.
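Definitions 2 and 3 translate almost directly into code. In the sketch below the function names are hypothetical, and the $\{C, G\}$ pattern is assumed to be obtained from the $\{A, U\}$ pattern by using $C$ in the role of $A$ (digit 0) and $G$ in the role of $U$ (digit 1), which is one reading of "defined similarly".

```python
def binary_word(k, d, zero, one):
    """d digit binary representation of k, with `zero` for 0 and `one` for 1."""
    if d < 1 or not 0 <= k <= 2 ** d - 1:
        raise ValueError("k must satisfy 0 <= k <= 2^d - 1 with d >= 1")
    bits = format(k, "b").zfill(d)
    return bits.replace("0", zero).replace("1", one)

def distinct_pattern(k, d, zero="A", one="U"):
    """k'th distinct pattern: zero^(d+2) one b zero one zero b one zero^(d+2)."""
    b = binary_word(k, d, zero, one)
    return zero * (d + 2) + one + b + zero + one + zero + b + one + zero * (d + 2)

def complement(s):
    """Complementary string: reverse s and swap A with U and C with G."""
    swap = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(swap[x] for x in reversed(s))
```

A distinct pattern has length $(d + 2) + 1 + d + 3 + d + 1 + (d + 2) = 4d + 9$, and read against its complementary string every position forms a Watson-Crick pair, so the two strings can form a helix of $4d + 9$ stacking base pairs.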

The need for these distinct patterns is to circumvent the fact that we only have four letters in the alphabet of RNA sequences. They will be used to construct an RNA sequence corresponding to a boolean formula on restricted 3sat form, such that the energy of an optimal secondary structure of the constructed RNA sequence implies whether the formula is satisfiable. The constructed RNA sequence will consist of two parts, a part where the literals are grouped according to the clauses and a part where the literals are grouped according to the variables.

If we had an alphabet of arbitrary size we could use two symbols to represent each occurrence of a literal, one symbol in the clauses part and the other symbol in the literals part. A score of minus one could be assigned for each pairing of two such symbols with some extra pairs of symbols being used to form structures nullifying the benefits of pairing more than one symbol in a clause, or pairing a symbol representing a variable as well as pairing a symbol representing this variable's negation.

Without an alphabet of arbitrary size we will instead use distinct $\{C, G\}$ patterns and their complementary strings in the clauses and variables parts, respectively, to represent the literals of the formula. A helix formed between a $\{C, G\}$ pattern and its complementary string will indicate that the corresponding literal is true and we will choose energy parameters ensuring that such a helix usually contributes negatively to the total energy. The distinct $\{A, U\}$ patterns and their complementary strings will be used to form structures nullifying benefits of having more than one true literal in each clause, and of having both a literal representing a variable and a literal representing its negation being true at the same time. This is ensured by choosing energy parameters such that helices formed by the distinct $\{A, U\}$ patterns also contribute negatively to the total energy, except if the case they should nullify occurs. In that situation they contribute zero to the total energy. The formal specification of the energy parameters is postponed till the end of this section.

Definition 4 Let $C = l_1 \vee l_2 \vee l_3$ be a boolean disjunction of three literals. The clause block $\mathcal{C}$ of $C$ using $d$ digit binary representations is the string

$$S_1\, L_1\, \bar{S}_1\, S_2\, L_2\, S_1\, \bar{S}_2\, L_3\, S_2,$$

where the $S_i$'s are distinct $\{A, U\}$ patterns using $d$ digit binary representations for two different $k$'s, and the $L_i$'s are distinct $\{C, G\}$ patterns using $d$ digit binary representations for three different $k$'s.

The rationale behind this construction is that we can form two helices between distinct $\{A, U\}$ patterns and their complementary strings within the clause block. These two helices will span different $L_i$'s, except for the case where the $S_1$ and $S_2$ flanking $L_2$ both form helices with their complementary string. In this case, the innermost base pair of the $S_1$ helix and the outermost base pair of the $S_2$ helix (and vice versa) will be neighbouring base pairs forming pseudoknots.

Furthermore, the $L_i$'s spanned by such a helix will be screened. By screened, we mean that at least one of the flanking bases of the $L_i$ pattern cannot form a base pair with a base not spanned by the helix without forming a pseudoknot with the innermost base pair of the helix. The $L_i$ pattern thus cannot form the intended helix with its complementary string in the variable block, that we will describe shortly, without introducing a pseudoknot of neighbouring base pairs. Without introducing neighbouring pseudoknotted base pairs, for a clause block we can thus form helices of two of the distinct patterns straight away, and a third helix if we can pair one of the $L_i$ patterns with its complementary string in the variables part.

Definition 5 Let $x$ be a variable occurring in a boolean formula where each literal occurs at most twice. The variable block $\mathcal{V}$ of $x$ using $d$ digit binary representations is the string

$$S_1\, \bar{P}_1\, \bar{P}_2\, \bar{S}_1\, \bar{N}_1\, \bar{N}_2\, S_1,$$

where $S_1$ is a distinct $\{A, U\}$ pattern for some $k$, the $\bar{P}_i$'s are complementary strings to the distinct $\{C, G\}$ patterns used for the at most two positive occurrences of $x$ (if $x$ occurs positive only once, one of the $\bar{P}$ patterns is omitted from $\mathcal{V}$) and the $\bar{N}_i$'s are complementary strings to the distinct $\{C, G\}$ patterns used for the at most two negative occurrences of $x$ (if $x$ occurs negative only once, one of the $\bar{N}$ patterns is omitted from $\mathcal{V}$).

The rationale behind this construction is once again to use a helix formed by one of the occurrences of $S_1$ and its complementary string to screen the complementary strings corresponding to either the (at most) two positive occurrences of $x$ or the (at most) two negative occurrences of $x$. If we are to avoid introducing neighbouring base pairs forming a pseudoknot, either none of the distinct $S_1$ patterns form a helix with the complementary string, the complementary strings corresponding to the positive occurrences of $x$ do not form helices, or the complementary strings corresponding to the negative occurrences of $x$ do not form helices. We are now ready to construct the RNA sequence representing a boolean formula on restricted 3sat form.
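The clause and variable blocks of definitions 4 and 5 can be sketched as string concatenations. The sketch assumes that a block is the plain concatenation of its labelled parts in the order $S_1 L_1 \bar{S}_1 S_2 L_2 S_1 \bar{S}_2 L_3 S_2$ and $S_1 \bar{P}_1 \bar{P}_2 \bar{S}_1 \bar{N}_1 \bar{N}_2 S_1$ with no extra separator bases, and that the $\{C, G\}$ patterns mirror the $\{A, U\}$ patterns; all function names and the way pattern numbers are passed in are hypothetical.

```python
def pattern(k, d, zero, one):
    """k'th distinct pattern over {zero, one} using d digit binary representations."""
    b = format(k, "b").zfill(d).replace("0", zero).replace("1", one)
    return zero * (d + 2) + one + b + zero + one + zero + b + one + zero * (d + 2)

def complement(s):
    """Reverse s and swap A with U and C with G."""
    swap = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(swap[x] for x in reversed(s))

def clause_block(s_numbers, l_numbers, d):
    """Clause block from two {A,U} pattern numbers and three {C,G} pattern numbers."""
    s1, s2 = (pattern(k, d, "A", "U") for k in s_numbers)
    l1, l2, l3 = (pattern(k, d, "C", "G") for k in l_numbers)
    return s1 + l1 + complement(s1) + s2 + l2 + s1 + complement(s2) + l3 + s2

def variable_block(s_number, positive, negative, d):
    """Variable block; `positive` and `negative` hold the {C,G} pattern numbers
    of the (at most two) positive and negative occurrences of the variable."""
    s1 = pattern(s_number, d, "A", "U")
    pos = "".join(complement(pattern(k, d, "C", "G")) for k in positive)
    neg = "".join(complement(pattern(k, d, "C", "G")) for k in negative)
    return s1 + pos + complement(s1) + neg + s1
```

Under these assumptions a clause block consists of nine parts of length $4d + 9$ each, and a variable block of three $\{A, U\}$ parts plus one part per occurrence of the variable.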


Definition 6 Let $\phi$ be a boolean formula on conjunctive normal form where each clause has three literals and each literal occurs at most two times. Assume that $\phi$ consists of $c$ clauses and uses $v$ variables. The RNA sequence corresponding to $\phi$ is the sequence

$$s_\phi = \mathcal{C}_1 \mathcal{C}_2 \ldots \mathcal{C}_c \mathcal{V}_1 \mathcal{V}_2 \ldots \mathcal{V}_v,$$

where $\mathcal{C}_i$ is the clause block using $\lceil \log_2(3c + v) \rceil$ digit binary representations corresponding to the $i$'th clause of $\phi$, $\mathcal{V}_i$ is the variable block using $\lceil \log_2(3c + v) \rceil$ digit binary representations corresponding to the $i$'th variable of $\phi$, no distinct pattern is used more than once, and the patterns corresponding to a literal and their complementary strings occur in reverse order.

The choice of number of digits we use in the binary representations ensures that we can choose at least $\max\{3c, 2c + v\}$ different values for distinct patterns. Each clause block uses two distinct $\{A, U\}$ patterns and three distinct $\{C, G\}$ patterns, while each variable block uses one distinct $\{A, U\}$ pattern. Thus we do not run out of patterns. We will use the term complementary pattern for the deliberate occurrences of the complementary string to a distinct pattern, that is, the strings indicated by a barred pattern in definitions 4 and 5.
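The counting argument can be checked numerically. The sketch below computes $d = \lceil \log_2(3c + v) \rceil$ with integer arithmetic and compares the $2^d$ available pattern numbers with the $3c$ distinct $\{C, G\}$ patterns and $2c + v$ distinct $\{A, U\}$ patterns that are needed; the function names are hypothetical.

```python
def digits_needed(c, v):
    """ceil(log2(3c + v)), computed exactly with integer arithmetic."""
    m = 3 * c + v
    return (m - 1).bit_length()

def enough_patterns(c, v):
    """True if 2^d pattern numbers cover both the {C,G} and the {A,U} patterns."""
    d = digits_needed(c, v)
    return 2 ** d >= max(3 * c, 2 * c + v)

print(digits_needed(4, 3), enough_patterns(4, 3))
```

For a formula with 4 clauses and 3 variables this gives $d = 4$, so 16 pattern numbers are available where 12 $\{C, G\}$ patterns and 11 $\{A, U\}$ patterns are needed.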

So far we have assumed that helices only form between a distinct pattern and the complementary string designed to form a helix with it. Helices can of course form between parts of distinct patterns not designed to form helices together, but the following lemma limits the length of such helices.

Lemma 1 Let $s_\phi$ be an RNA sequence constructed from a boolean formula $\phi$ according to definition 6. In any structure $S$ of $s_\phi$, any helix of consecutively stacking pairs of length at least $4d + 5$, where $d$ is the number of digits used for the binary representations, will have at least $2d + 3$ bases at the end of a distinct pattern forming base pairs with the intended bases of the complementary pattern to this distinct pattern.

Proof. By construction any substring of $s_\phi$ of length at least $4d + 5$ will contain at least $2d + 3$ bases from one of the ends of a distinct pattern or its complementary pattern. Consider one of the two substrings forming the helix. This will be of length at least $4d + 5$ and thus contain at least $2d + 3$ bases from a distinct pattern or its complementary pattern. Assume without loss of generality that it contains the first $2d + 3$ bases from the $k$'th distinct $\{A, U\}$ pattern using $d$ digit representations, that is, it contains the substring $A^{d+2} U b_{\{A,U\}}(k, d)$. By construction, the only occurrences of $d + 2$ consecutive $U$'s preceded by an $A$ in $s_\phi$ are at the ends of complementary patterns to distinct $\{A, U\}$ patterns, and thus $A^{d+2} U b_{\{A,U\}}(k, d)$ forms base pairs with $\bar{b}_{\{A,U\}}(k', d) A U^{d+2}$ for some $k'$ (by the assumption that only Watson-Crick base pairs are allowed). As $b_{\{A,U\}}(k, d)$ pairs with $\bar{b}_{\{A,U\}}(k', d)$ it follows that $k = k'$. □

We have now established that any helix of considerable length will contain at least part of a designed pairing. The next lemma establishes that this will be all it contains.

Lemma 2 Let $s_\phi$ be an RNA sequence constructed from a boolean formula $\phi$ according to definition 6 using $d$ digit binary representations. In any structure $S$ of $s_\phi$, there are no helices of more than $4d + 9$ consecutively stacking base pairs containing only $A$'s and $U$'s or containing only $C$'s and $G$'s. The only helices of length $4d + 9$ containing only $A$'s and $U$'s or containing only $C$'s and $G$'s are helices formed by distinct patterns and their complementary pattern.

Proof. By lemma 1 we know that a helix of length $4d+9$ will contain one of the ends of a distinct pattern paired with its complementary pattern. All we have to show is that we cannot extend a helix formed by a distinct pattern and its complementary pattern with an extra stacking pair of bases of the same type. If the distinct pattern is a $\{C,G\}$ pattern this is straightforward, as it will be in a clause block and thus bordered by an $A$ and a $U$, or by two $A$'s. Similarly, the complementary pattern of a distinct $\{A,U\}$ pattern from a variable block will be bordered by two $G$'s. Finally, the complementary pattern to a distinct $\{A,U\}$ pattern from a clause block will be bordered by an $A$ on one side, cf. definition 4. But taking the $\bar{S}_1$ complementary pattern as an example, to extend the helix this $A$ would have to form an illegal (by the Watson-Crick base pair assumption) base pair with either the leftmost $A$ of the preceding clause block or the rightmost $C$ in the $L_2$ pattern. □

Proof (of proposition 1). As mentioned above the reduction will be from 3SAT with the restriction that each literal appears at most twice. So let $\phi$ be a valid formula for this restriction of 3SAT with $c$ clauses and $v$ variables. In polynomial time, we can construct $s_\phi$ according to definition 6, and the base pair energy function

\[
E(X_i \cdot Y_j, V_{i+1}, W_{j-1}) =
\begin{cases}
-1 & \text{if } V_{i+1} \cdot W_{j-1} \in S \text{ and either } X \cdot Y, V \cdot W \in \{A \cdot U, U \cdot A\} \text{ or } X \cdot Y, V \cdot W \in \{C \cdot G, G \cdot C\} \\
4d+7 & \text{if } X \cdot Y \in \{A \cdot U, U \cdot A, C \cdot G, G \cdot C\} \text{ and for all } j' \notin \{i+1, \dots, j-1\} \text{ we have } V_{i+1} \cdot Z_{j'},\ W_{j-1} \cdot Z_{j'},\ Z_{j'} \cdot V_{i+1},\ Z_{j'} \cdot W_{j-1} \notin S \\
4d+8 & \text{otherwise}
\end{cases}
\]

where $d$ is the number of digits used for the binary representations in $s_\phi$ and $S$ is the structure for which the energy is calculated. The notation $X_i$ is used as a shorthand to indicate that the $i$'th base is of type $X$.
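The case analysis of this energy function can be sketched in a few lines of Python. This is only an illustrative reading of the definition, not code from the report: the names `make_partner`, `pair_energy` and `structure_energy`, the 0-based indexing, and the representation of a structure as a list of index pairs are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

AU = {("A", "U"), ("U", "A")}
CG = {("C", "G"), ("G", "C")}


def make_partner(pairs: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map every paired position to its partner (0-based indices)."""
    partner: Dict[int, int] = {}
    for i, j in pairs:
        if i in partner or j in partner:
            raise ValueError("each base may take part in at most one pair")
        partner[i], partner[j] = j, i
    return partner


def pair_energy(seq: str, partner: Dict[int, int], i: int, j: int, d: int) -> int:
    """Energy of the base pair i.j (i < j) according to the three cases."""
    outer = (seq[i], seq[j])
    # Case 1: i.j stacks on i+1.j-1 and both pairs are of the same type.
    if i + 1 < j - 1 and partner.get(i + 1) == j - 1:
        inner = (seq[i + 1], seq[j - 1])
        if (outer in AU and inner in AU) or (outer in CG and inner in CG):
            return -1

    # Case 2: a Watson-Crick pair whose neighbours i+1 and j-1 do not
    # pair with any base outside the interval i+1 .. j-1.
    def pairs_outside(k: int) -> bool:
        p = partner.get(k)
        return p is not None and not (i + 1 <= p <= j - 1)

    if outer in AU | CG and not pairs_outside(i + 1) and not pairs_outside(j - 1):
        return 4 * d + 7
    # Case 3: everything else.
    return 4 * d + 8


def structure_energy(seq: str, pairs: Iterable[Tuple[int, int]], d: int) -> int:
    """Sum of the base pair energies over all pairs of the structure."""
    pairs = [(min(i, j), max(i, j)) for i, j in pairs]
    partner = make_partner(pairs)
    return sum(pair_energy(seq, partner, i, j, d) for i, j in pairs)


# A toy helix of 4d+9 stacked A.U pairs closed by an unpaired CCC loop.
d = 1
seq = "A" * (4 * d + 9) + "CCC" + "U" * (4 * d + 9)
helix = [(i, len(seq) - 1 - i) for i in range(4 * d + 9)]
print(structure_energy(seq, helix, d))  # prints -1
```

The toy sequence is not an $s_\phi$ from definition 6; it only exercises the three cases on a single helix of the critical length.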

We claim that the optimal secondary structure of $s_\phi$ with the above energy function has energy $-(3c+v)$ if and only if $\phi$ is satisfiable. By the energy function, the only helices for which the base pairs combined yield a negative contribution to the energy of the secondary structure are helices of at least $4d+9$ base pairs, base pairs that are either all $A$'s pairing with $U$'s or all $C$'s pairing with $G$'s. By lemma 2, the only such helices that can be formed are between distinct patterns and their complementary patterns; these helices will consist of exactly $4d+9$ base pairs and thus contribute $-1$ to the total score of a secondary structure, provided that the innermost base pair of the helix does not have a neighbouring base pair that forms a pseudoknot. Hence, if a distinct pattern is screened by a helix, it cannot form a helix yielding a negative contribution to the total energy.
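The arithmetic behind this threshold can be written out explicitly. The symbol $h$ for the number of base pairs in a helix and the name $E_{\mathrm{helix}}$ are introduced here for illustration only. Consider a helix of $h$ consecutively stacking base pairs of the same type whose innermost pair has no neighbouring base pair forming a pseudoknot. Each of the $h-1$ outer pairs stacks on a pair of the same type and contributes $-1$, while the innermost pair contributes $4d+7$:

\[
E_{\mathrm{helix}}(h) = -(h-1) + (4d+7) = 4d + 8 - h .
\]

This is negative exactly when $h \geq 4d+9$, and it equals $-1$ for $h = 4d+9$, the largest length that lemma 2 permits. If the innermost pair does have a neighbouring base pair forming a pseudoknot, it contributes $4d+8$ instead, so the total for $h = 4d+9$ becomes $0$.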

If there is an assignment of truth values to the variables of $\phi$ satisfying $\phi$, we can construct a secondary structure $S$ on $s_\phi$ with energy $-(3c+v)$ based on this assignment by forming the following base pairs.

• For each variable block, forming the helix of the distinct $\{A,U\}$ pattern and the complementary pattern screening the complementary patterns of the literals that become false by the assignment.

• For each clause block, forming the helices between the distinct $\{A,U\}$ patterns that leave the distinct $\{C,G\}$ pattern of a literal that becomes true by the assignment unscreened.

• Forming the helices between the unscreened distinct patterns of literals in the clauses part and their complementary patterns in the variables part (these complementary patterns are unscreened as the assignment satisfies $\phi$, and the reverse order requirement in definition 6 ensures that the two complementary patterns corresponding to the same literal do not have neighbouring base pairs forming a pseudoknot).

By the discussion following definition 4, the distinct patterns of a clause block can form at most three helices, each yielding a contribution of $-1$, and each variable block introduces only one new distinct pattern; hence the energy of $S$ of $-(3c+v)$ is optimal.
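Spelled out, the counting argument gives a lower bound on the energy of an arbitrary structure $S'$ of $s_\phi$ (the notation $E(S')$ for the total energy is assumed here). Only helices between a distinct pattern and its complementary pattern contribute negatively, each exactly $-1$; at most three of them involve the distinct patterns of any one of the $c$ clause blocks, and at most one more arises from each of the $v$ variable blocks:

\[
E(S') \;\geq\; (-1) \cdot 3c + (-1) \cdot v \;=\; -(3c+v) .
\]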

Assume now that $s_\phi$ has an optimal structure $S$ of energy $-(3c+v)$. By the above and the discussion following definition 4, we get that each clause block will contain a distinct pattern corresponding to a literal forming a helix with its unscreened complementary pattern in the variables part, and that the complementary patterns corresponding either to a variable or to its negation will be screened. We can thus infer a truth assignment to the variables of $\phi$ satisfying $\phi$ from the unscreened complementary patterns of literals in $S$. □

The energy function specified in the proof of proposition 1 rewards stacking some base pairs, penalises loops by penalising the first base pair in a helix, and further penalises neighbouring base pairs that form a pseudoknot. The only two remarkable oddities are the disallowance of base pairings between $G$ and $U$, and penalising stacking an $A,U$ base pair with a $C,G$ base pair.

One can observe that we could allow $G,U$ base pairs without changing anything but inserting a $C$ between the two complementary patterns corresponding to the same literal. As for penalising stacking $A,U$ base pairs with $C,G$ base pairs, this was chosen to ease establishing the fact that no energy benefits are obtained by extending a helix formed by a distinct pattern and its complementary pattern by further stacking base pairs. A proof where the energy function rewards stacking of all combinations of $A,U$ base pairs, $C,G$ base pairs and $G,U$ base pairs can be achieved by a more involved construction of the clauses part of $s_\phi$. However, to limit the complexity of the proof, we have chosen to present the above version.

5 Discussion

The NP-completeness of the RNA secondary structure prediction problem in the Nearest Neighbour Pseudoknot Model tells us that any algorithm allowing energy functions sufficiently general to be specialised to the energy functions in the Nearest Neighbour Pseudoknot Model, and running in worst case polynomial time, would imply P = NP. The question whether or not P is equal to NP is one of the fundamental open problems in computer science. Based on strong evidence, the large majority of computer scientists believe that P ≠ NP. The NP-completeness of the RNA secondary structure prediction problem in the Nearest Neighbour Pseudoknot Model thus hints that there is only little hope for a worst case polynomial time algorithm for RNA secondary structure prediction in the Nearest Neighbour Pseudoknot Model, or models extending it. Moreover, it hints that any algorithm for predicting RNA secondary structures with general pseudoknots most likely has to exploit specific properties of a fixed energy function to obtain polynomial running time.

One approach to obtain a polynomial time algorithm for RNA secondary structure prediction with pseudoknots is to limit the types of legal pseudoknots. This is the approach taken by Rivas and Eddy in [10] and by us in section 3. Another approach is taken by Tabaska et al. in [13], where interactions between neighbouring base pairs are ignored, thus reducing the problem of RNA secondary structure prediction (with pseudoknots) to computing a maximum weight matching. If we are satisfied with finding structures that do not necessarily have the least free energy, then heuristics can be applied to search for structures of low energy. For example, van Batenburg et al. in [17] report on successful experiments with applying genetic algorithms to the problem of finding low energy RNA secondary structures containing pseudoknots.
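To illustrate the matching formulation, the sketch below scores each possible base pair independently and searches for the set of disjoint pairs of maximum total weight. It is not the method of [13]: the weights (2 for an A,U pair, 3 for a C,G pair) are arbitrary illustrative values, the function names are hypothetical, and the search is an exponential-time brute force suitable only for very short sequences, whereas maximum weight matching in general graphs can be solved in polynomial time.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical pair weights, chosen only for this example.
PAIR_WEIGHT = {("A", "U"): 2, ("U", "A"): 2, ("C", "G"): 3, ("G", "C"): 3}


def watson_crick_weights(seq: str) -> Dict[Tuple[int, int], int]:
    """Weight of every possible Watson-Crick pair i.j with i < j."""
    return {
        (i, j): PAIR_WEIGHT[(seq[i], seq[j])]
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if (seq[i], seq[j]) in PAIR_WEIGHT
    }


def max_weight_matching(
    n: int, weight: Dict[Tuple[int, int], int]
) -> Tuple[int, List[Tuple[int, int]]]:
    """Brute-force maximum weight matching on bases 0 .. n-1."""

    @lru_cache(maxsize=None)
    def best(used: int) -> Tuple[int, Tuple[Tuple[int, int], ...]]:
        # Find the lowest-numbered base that has not been decided yet.
        i = 0
        while i < n and (used >> i) & 1:
            i += 1
        if i == n:
            return 0, ()
        # Option 1: leave base i unpaired.
        result = best(used | (1 << i))
        # Option 2: pair base i with a later, still unused base j.
        for j in range(i + 1, n):
            if not (used >> j) & 1 and (i, j) in weight:
                score, pairs = best(used | (1 << i) | (1 << j))
                if score + weight[(i, j)] > result[0]:
                    result = (score + weight[(i, j)], ((i, j),) + pairs)
        return result

    score, pairs = best(0)
    return score, list(pairs)


def crosses(p: Tuple[int, int], q: Tuple[int, int]) -> bool:
    """True if pairs p and q overlap as i < i' < j < j' (a pseudoknot)."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


seq = "AGUC"
score, pairs = max_weight_matching(len(seq), watson_crick_weights(seq))
print(score, pairs, crosses(pairs[0], pairs[1]))
```

Because no pair's weight depends on its neighbours, nothing in the search prevents crossing pairs, which is why this formulation admits pseudoknots without extra machinery.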

References

[1] S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22:2079–2088, 1994.

[2] J. Gorodkin, L. J. Heyer, and G. D. Stormo. Finding common sequence and structure motifs in a set of RNA sequences. In Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology (ISMB), pages 120–123, 1997.

[3] B. Knudsen and J. Hein. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 15:446–454, 1999.

[4] R. B. Lyngsø, M. Zuker, and C. N. S. Pedersen. Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics, 15(6):440–445, 1999.

[5] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990.

[6] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National Academy of Science of the USA, 77(11):6309–6313, 1980.

[7] R. Nussinov, G. Pieczenik, J. R. Griggs, and D. J. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35:68–82, 1978.

[8] C. H. Papadimitriou. Computational Complexity. Addison-Wesley Publishing Company, 1994.

[9] C. W. A. Pleij. RNA pseudoknots. In R. F. Gesteland and J. F. Atkins, editors, The RNA World. Cold Spring Harbor Laboratory Press, 1993.

[10] E. Rivas and S. R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology, 285:2053–2068, 1999.

[11] Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjölander, R. C. Underwood, and D. Haussler. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22:5112–5120, 1994.

[12] D. Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics, 45:810–825, 1985.

[13] J. E. Tabaska, R. B. Cary, H. N. Gabow, and G. D. Stormo. An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics, 14(8):691–699, 1998.

[14] I. Tinoco, P. N. Borer, B. Dengler, M. D. Levine, O. C. Uhlenbeck, D. M. Crothers, and J. Gralla. Improved estimation of secondary structure in ribonucleic acids. Nature New Biology, 246:40–41, 1973.

[15] I. Tinoco, O. C. Uhlenbeck, and M. D. Levine. Estimation of secondary structure in ribonucleic acids. Nature, 230:362–367, 1971.

[16] D. H. Turner, N. Sugimoto, and S. M. Freier. RNA structure prediction. Annual Review of Biophysics and Biophysical Chemistry, 17:167–192, 1988.

[17] F. H. D. van Batenburg, A. P. Gultyaev, and C. W. A. Pleij. An APL-programmed genetic algorithm for the prediction of RNA secondary structure. Journal of Theoretical Biology, 174(3):269–280, 1995.

[18] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.

[19] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9:133–148, 1981.


Recent BRICS Report Series Publications

RS-00-1 Rune B. Lyngsø and Christian N. S. Pedersen. Pseudoknots in RNA Secondary Structures. January 2000. 15 pp. To appear in Fourth Annual International Conference on Computational Molecular Biology, RECOMB ’00 Proceedings, 2000.


