• Ingen resultater fundet

BRICS Basic Research in Computer Science

N/A
N/A
Info
Hent
Protected

Academic year: 2022

Del "BRICS Basic Research in Computer Science"

Copied!
18
0
0

Indlæser.... (se fuldtekst nu)

Hele teksten

(1)

BRICS R S-00-1 L yngsø & P edersen: P seudoknots in RN A S econdary Structur es

BRICS

Basic Research in Computer Science

Pseudoknots in RNA Secondary Structures

Rune B. Lyngsø

Christian N. S. Pedersen

BRICS Report Series RS-00-1

ISSN 0909-0878 January 2000

(2)

Copyright c 2000, Rune B. Lyngsø & Christian N. S. Pedersen.

BRICS, Department of Computer Science University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Report Series publications.

Copies may be obtained by contacting:

BRICS

Department of Computer Science University of Aarhus

Ny Munkegade, building 540 DK–8000 Aarhus C

Denmark

Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk ftp://ftp.brics.dk

This document in subdirectory RS/00/1/

(3)

Rune B. Lyngsø

Christian N. S.Pedersen

Abstract

RNAmoleculesare sequences ofnucleotidesthat serveas more thanmere

intermediaries between DNAand proteins, e.g. as catalytic molecules. Com-

putational predictionof RNAsecondarystructureis among thefew structure

predictionproblemsthat canbesolved satisfactoryinpolynomialtime. Most

workhasbeendonetopredictstructuresthatdonotcontainpseudoknots. Al-

lowing pseudoknots introducemodellingand computationalproblems. In this

paperwe consider the problem of predicting RNA secondary structurewhen

certain typesof pseudoknots are allowed. We rst present analgorithm that

in time

O(n 5 )

and space

O(n 3 )

predicts the secondary structure of an RNA

sequenceoflength

n

inamodelthatallows certain kindsofpseudoknots. We then prove that the general problem of predicting RNA secondary structure

containing pseudoknotsis NP-completefor alarge classof reasonablemodels

ofpseudoknots.

1 Introduction

An RNA molecule is a sequence of nucleotides that often is just an intermediary

betweenDNAandproteins. SomeRNAmoleculesdohoweverhavevitalimportance,

e.g.intranslationofmRNAtoproteins. ThethreedimensionalstructureofanRNA

moleculeistoalargeextentdeterminedbyinteractionsbetweenpairsofnucleotides,

calledbase pairings. Thesecondarystructure ofanRNAmoleculeistheset ofbase

pairingsinthethreedimensionalstructureofthemolecule. Thesecondarystructure

canthus be used in its own right to look for information, e.g.active sites, oras a

steppingstonetowardspredictionofhigherstructurallevels.

Ifthethreedimensional,ortertiary,structure ofanRNAmoleculeisavailableit

isofcourseeasyto determinethesecondarystructure. Butdeterminingthetertiary

structureisacomplicatedandtimeconsumingtask. Whenthetertiarystructureofan

RNA moleculeis notavailable,the authoritativewayof determining the secondary

structure of an RNA molecule is by comparative modelling. Given a number of

related RNA sequences the common secondary structure is inferred by identifying

compensatorymutations, that is, by identifying pairs of positions where mutations

ofthebase in oneofthe positions is accompaniedbya mutation ofthe basein the

otherpositiontoretaintheirbasepairingcapability. Thedrawbackofthistechnique

is that it requires several related RNA sequences to be available. Moreover, since

BasicResearchInComputerScience(BRICS), Centre oftheDanishNationalResearchFoun- dation, Department of Computer Science, University of Aarhus, Ny Munkegade, 8000 Århus C,

Denmark.E-mail:

{

rlyngsoe,cstorm

}

@brics.dk .

(4)

diculttofully automatecomparativemodelling.

Thus computational methods forpredicting the secondarystructure of an RNA

sequence are in demand. To construct such methods it is necessary to model the

biologicalrealitythat governs structureformation. Inspiredby thelawsof thermo-

dynamics this is often done in terms of energy minimisation. Using a model that

describes how to assign free energies to legal secondary structures, the secondary

structureofanRNA sequenceispredictedasthestructureof leastfreeenergy. The

biologicalrelevanceofthepredictedstructure andthecomputationalresourcessuch

astimeandspacethatareneededtocomputeit,dependentirelyonthechoiceoflegal

structuresandfreeenergies. Mostworkhasbeendevotedtoconstructalgorithmsfor

RNAsecondarystructurepredictionwhenthelegalstructuresarelimitedtosecondary

structuresthatdonotcontainpseudoknots,thatis, donotcontainoverlappingbase

pairs. Nussinovetal.in [7]presentanalgorithmusingasimplefreeenergyfunction

that is minimised when the secondary structure contains the maximum number of

complementary base pairs. The algorithm takes time

O ( n 3 )

for predictingthe sec-

ondarystructureofanRNAsequenceoflength

n

. Amorecomplexmodelforthefree

energyofsecondarystructuresisproposedbyTinocoetal.in[15]. Thismodelstates

thatthe freeenergyofasecondarystructure isthe sumofindependentenergiesfor

each loop in the structure. Based on this model of free energy, Zukerand Stiegler

in [19],and Nussinov and Jacobsenin [6], presentalgorithms that taketime

O ( n 3 )

for predicting the secondarystructure of an RNA sequence of length

n

. Since the

ideasof these algorithms form the basis of the widely used mfold server for RNA

secondarystructure prediction,theyare commonlyreferredto asmfoldalgorithms,

oralgorithmsofthemfoldtype.

Thereasonthat legalstructuresareoftenrequirednottocontainpseudoknotsis

notthat pseudoknots do not occur in real world structures, but rather because of

modellingandcomputationalconsiderations. Itisstillanopenquestionhowtocon-

structa reasonablemodel offree energyfor structures containingpseudoknots that

alsomakesitpossibletoconstructecientstructurepredictionalgorithms. Rivasand

Eddyin [10] presentan algorithmthat in time

O ( n 6 )

andspace

O ( n 4 )

predictsthe

secondarystructureof anRNA sequence oflength

n

in amodelthat allowscertain

kindsof pseudoknots. In this paper we study the problem of predictingRNA sec-

ondarystructure containingpseudoknots further. Insection 2webriey reviewthe

ideasof themfoldalgorithms. Extending onthese ideas,wein section3presentan

algorithmforpredictingRNAsecondarystructurewhencertaintypesofpseudoknots

areallowed. We compare thepresentedalgorithm with the algorithm presented by

Rivasand Eddyin [10]. Insection4weshowthat predictingRNAsecondarystruc-

turescontainingpseudoknots of arbitrary typesis NP-complete for alargeclass of

reasonablefreeenergyfunctions. Finally, in section5wediscuss theimplicationsof

theNP-completenessresult.

2 Terminology

ForanRNAsequence

s

,

|s| = n

,asecondarystructureisaset

S

ofbasepairs

i ·j

with

1 i < j n

,suchthat

∀i ·j, i 0 ·j 0 S : i = i 0 j = j 0

. Eachbasecanthustakepart

inat mostonebase pair. Thebase pairsofasecondarystructuredescribethebase

(5)

i k l j

= min

r,s

 

 

 

 

 

  i r k l j s

,

i k l j

r s

,

i k

l j r s

 

 

 

 

 

 

Figure1: GeneralrecursionschemefortheRivasandEddyRNAsecondarystructure

predictionalgorithm.

pairinginteractionsformedbyhydrogenbondinginacorrespondingtertiarystructure.

ItisusuallyassumedthatRNAsecondarystructuresdonotcontainpseudoknots. Two

basepairsformapseudoknotiftheyareoverlapping,i.e.twobasepairs

i · j, i 0 · j 0 S

formapseudoknotif

i < i 0 < j < j 0

. Thetermpseudoknotisalsousedasashorthand

forotheroverlappingstructuresthanbasepairs,e.g.twohelicesofstackingbasepairs,

whenthebase pairsofthesestructures formpseudoknots.

Thereareofcoursegoodreasonsforintroducingthisrestriction,prominentamong

whichisasimplicationoflegalstructures. Thesimplicationofnotallowingpseudo-

knotsensuresthattwobase pairs

i · j, i 0 · j 0 S

areeithernested,i.e.

i < i 0 < j 0 < j

,

ordisjoint, i.e.

i < j < i 0 < j 0

. In many situations this allows us to rst handle

onebase pairandthentheother(if theyarenested),orhandlethem independently

(if they are disjoint). The pseudoknot restriction is thus crucial in algorithms for

e.g.structure prediction [1,3,6,11,19], partition function calculations [5], compar-

ing secondarystructures [18], and simultaneous alignmentand structure prediction

of RNA sequences [2,12]. Inthe following we will exemplify this by giving a brief

summaryofanalgorithmof themfoldtypeforsecondarystructure prediction. The

summary is also aimed at introducing the terminology we will use in section 3. A

moredetailedsummary canbefoundin e.g.Turneretal.[16].

An mfold algorithm predicts secondary structures by computing minimum (or

close to minimum) energy structures in the model proposed by Tinoco et al. [14]

extendedwith simplifying assumptionsaboutthe nature of the energy function for

multibranched loops. Three arrays,

V ( i, j )

holding the minimum energy of a sec-

ondarystructureon

s [ i .. j ]

withbases

i

and

j

formingabasepair,

WM ( i, j )

holding

theminimum energy ofa structure on

s [ i .. j ]

that is partof a multibranched loop, and

W ( i )

holdingtheminimumenergyofastructureon

s [1 .. i ]

,arecomputedbased

ontherecursions

V ( i, j ) = min

eH ( i, j ) ,

eS ( i, j, i + 1 , j 1) + V ( i + 1 , j 1) , min

i<i 0 <j 0 <j i 0 i + j j 0 > 2

{eL ( i, j, i 0 , j 0 ) + V ( i 0 , j 0 ) },

i+1<k<j min {WM ( i + 1 , k 1) + WM ( k, j 1) + a}

,

(1)

(6)

WM ( i, j ) = min

V ( i, j ) + b, WM ( i, j 1) + c, WM ( i + 1 , j ) + c,

i<k≤j min {WM ( i, k 1) + WM ( k, j ) } ,

(2)

W ( i ) = min

W ( i 1) ,

0≤k<i min {W ( k ) + V ( k + 1 , i ) } .

(3)

Theserecursionsemployenergyfunctions forhairpinloops(

eH

),stackingbasepairs

(

eS

),internalloopsandbulges(

eL

),andmultibranchedloops(

eM ( k, k 0 ) = a + bk 0 + ck

,where

k 0

isthenumberofunpairedbasesand

k

thenumberofhelicesinthemulti-

branched loop). With thecurrentlyused parametersfor theenergy functions these

recursions allow for an

O( |s| 3 )

time algorithm, cf. [4,16], for computing secondary

structuresof minimumenergyforanRNAsequence

s

.

3 Algorithmic Results

TheTinoco model, cf. [14] describeshowto assign energiesto secondarystructures

notcontainingpseudoknots,butdoesnotaddresshowtohandlesecondarystructures

containingpseudoknots. To develop algorithms for predicting secondarystructures

containing pseudoknots, an important step is to decide on a model, i.e. to give a

description of the types of legal secondary structures, and how to assign energies

tothesestructures. Asdevelopinganalgorithmanddecidingonamodelare closely

connectedprocesses,thedescriptionofthemodelisoftenonlyinpartgivenexplicitly.

Oftenthetypesoflegalsecondarystructuresareonlydenedimplicitlythroughthe

constructedalgorithm.

AnexampleofthisisthepseudoknotmodelusedbyRivasandEddyin[10]. This

is,toourknowledge, theonlyrigorous,energybased,polynomialtimealgorithmfor

RNA secondary structure prediction including a class of pseudoknots. In gure 1

webrieysketch theideaoftheRivasandEddyalgorithm. Arraysholding energies

ofoptimalstructuresforthesubsequencefrom

i

through

j

are maintainedsimilarto

equations1to3,butwiththefurtherrestrictionthatthebasesfrom

k

through

l

areyet

unpaired(toallowforfuturepseudoknotinteractions). Thegeneralrecursionscheme

foranentryinoneofthesematricesistominimiseoverallpossiblewaysofsplitting

thesubsequence withan unpaired region into twonew subsequences with unpaired

regions. Thisdenesthelegalstructuresofthemodel. Theenergyparameters,cf.[10,

table 3], used were partly ne tuned by hand and partly obtained by multiplying

similarparametersforunknottedstructuresbyaweightingparameter.

Therequirementsoftime

O( |s| 6 )

and space

O( |s| 4 )

forthis algorithm areobser-

vationsthatfollowdirectlyfromgure1. Thoughpolynomial,these timeandspace

requirementsarerathersteepandin[10]anestimateof130140basesismentioned

asa rough upper bound for the size of sequences for which the algorithm is feasi-

ble. Thoughcomputational poweris everincreasing, applying Moore's law (stating

thatcomputationalpowerdoublesevery18months)stillonlyallowssequencesof300

(7)

i j

k

l

Optimalenergy

= min

i<j<k<l {

Optimalenergyof

i j k l +

Optimalenergyof

j k l i }

Figure 2: A model for a class of pseudoknots. The sequence has been drawn as a

circletohighlightthat oneofthefourpartsofthesequencemightextendacrossthe

sequenceends,hereshownwithazigzaggedline.

basestenyearsfrom nowand of650bases intwentyyears. Nevertheless, theexper-

imentsbasedon thisalgorithm reported in [10] show thefeasibilityof energy-based

predictionsofRNAsecondarystructureswithpseudoknots.

To obtain a faster algorithm, we propose a more restrictedmodel for legal sec-

ondarystructures. Thelegalsecondarystructuresofourmodelarestructureswhere

wecansplitthesequenceintofourparts(oneofwhichmightextendacrosstheends

ofthe sequence) asillustratedin gure2. The splitting into four parts divides the

sequenceinto two pairsof opposing subsequences,illustratedin gure2 aspairsof

blackandgreypartsofthesequence. Eachpairofopposingsubsequencesareallowed

toformanunknottedsecondarystructureandthepseudoknottedsecondarystructure

ariseswhen thesetwosecondarystructuresarecombined.

Tofurtherexplainthetypesofsecondarystructuresallowedinthismodel,consider

apseudoknotoftypeHasillustratedingure3. ApseudoknotoftypeHconsistsof

two overlappinghelices, each closing a hairpinloop, such that some ofthe bases in

thehairpinclosed byone ofthe helices arepart ofthe otherhelix. Asindicated in

gure3,wecansplitapseudoknotof typeH intofour partssuchthat onlybasesin

non-neighbouring,oropposing,parts formbase pairs. Themodel ingure2canbe

seenasageneralisationof pseudoknotsoftypeH where

theoverlappingstructures canbe arbitrary, complexsecondarystructures not containingpseudoknots.

theloopregionsclosedbytheoverlappingstructuresdonotneedto behairpin loops. They can be part of any type of loop as long asthey are consecutive

stretchesofbases.

Themodelingure2thusencompassessecondarystructureswithonepseudoknotof

typeH(oroftypeBortypeI,cf.[9,gure3])amongothers.

As just explained, ourmodel allowsonlyone (albeitverycomplex) pseudoknot,

sointhatrespectourmodelisastepbackwardcomparedtothemodelusedbyRivas

and Eddy. But if we candevelop moreecient algorithms for secondarystructure

predictionin this model, itnds itsjusticationin caseswhereusing theRivasand

(8)

pairings.

Eddy algorithm is infeasible and we only expect, or are content, to nd only one

pseudoknotinteraction. In the rest of this section we will focus on developing an

ecientalgorithmforsecondarystructurepredictioninourmodel.

Astraightforwardalgorithmtosolvethisproblemwouldbetorunthroughallthe

O( |s| 4 )

choicesofsplitsandcomputetheenergyoftheoptimalstructuresofthetwo

pairsof subsequences. This would requiretime

O( |s| 7 )

and space

O( |s| 2 )

. Onecan

howeverobserve,that when wecompute theenergy of theoptimal structure ofthe

subsequencefrombase

i

tobase

l

withthesubsequencefrombase

j

tobase

k

removed,

wealsocomputetheenergyoftheoptimalstructureofthesubsequencefrom base

i 0

tobase

l 0

withthesubsequencefrom base

j

tobase

k

removedforall

i i 0 j

and

k l 0 l

. Hence,byusingtheseintermediateresultsfromthedynamicprogramming algorithm,we canreduce the time requirement to

O( |s| 5 )

by just running through

allthe

O( |s| 2 )

choicesof theremoved subsequence. Unfortunately, wethen haveto storesomeintermediate resultsuntil other results become available. This increases

thespacerequirementto

O( |s| 4 )

. However,amorethoroughinvestigationshowsthat theintermediate resultscomputed with

k 1

astherightendpointof the removed

subsequenceareonlycombinedwithintermediateresultscomputedwith

k

astheleft

endpointoftheremovedsubsequence. Thisallowsustosplitthecomputationinto

n

independentphases,eachrequiringonlyspace

O( |s| 3 )

,thusreducingtheoverallspace

requirementto

O( |s| 3 )

whilemaintainingthe

O( |s| 5 )

timerequirement.

TheformalspecicationofthesketchedalgorithmforpredictingRNA secondary

structurescontainingpseudoknotsisgiveninalgorithm1. Thespecicationisrather

abstract.Itismoreanalgorithmschemathanaready-to-implementalgorithm. More

specically,animplementationwouldrequireseveraldierentarrays,storingenergies

undervariousassumptionsofbasepairingsofankingbases. Inalgorithm1weonly

showhavetomaintainonetypeofarray(

V

). Butthesametechniquecanbeusedfor

maintainingseveraltypesofinterdependentarraysusedinanactualimplementation

ofthealgorithm.

The

O( |s| 5 )

runningtime ofalgorithm 1should makeit feasiblefor longerRNA

sequencesthan theRivasand Eddy algorithm. Forexample, ifweassumethat the

constants hidden by the O notation are similar for the two algorithms, the 130

140basesupperbound fortheRivasandEddyalgorithmimpliesanupperboundof

350375bases for ouralgorithm. This increase mightjustify therestricted model

(9)

doknotsbasedonthemodelillustratedingure2.

/*

V j,k ( i, l )

denotes theenergyof theoptimalstructurefor

s [ i..j ]

concatenatedwith

s [ k..l ]

. */

E =

for

k = 1

to

|s|

do/*Fix oneof the endpointsof the excludedregion */

Allocatememoryforstoringandcalculating

V j,k ( i, l )

and

V k−1,l ( j, i )

for

i < j <

k < l

/*Computetableswith

k

(or

k 1

)asright(orleft)endpointofexcludedregion.

*/

for

j = 1

to

k 1

do

Computetable

V j,k

endfor

for

l = k

to

|s|

do

Computetable

V k−1,l

endfor

/* Combine tables. */

for

1 i < j < k < l ≤ |s|

do

E = min {E, V j,k ( i, l ) + V k−1,l+1 ( j + 1 , i 1) }

endfor

Freeallocatedmemory

endfor

ofallowingonlyonepseudoknot. Ifthisrestrictionistosevere, wecouldextendour

modelbyallowingthesequencetobesplitintosegmentsforeachofwhichtheoptimal

secondarystructureiscalculatedusingthemodelofgure2.Suchanextendedmodel

ismorecomparabletothemodelusedbyRivasandEddyintermsoflegalstructures

(though still more restricted). It is also comparable to the model used by Rivas

and Eddy in allowing secondary structure prediction in time

O( |s| 6 )

. The space

requirementcanstillbelimitedto

O( |s| 3 )

though.

We couldkeepplayingthis game of modifying models and algorithms to obtain

thebest possiblecombinationofafastalgorithm andbroadclassof legalsecondary

structures. But for any class of secondary structures with pseudoknots we should

probably not expect to do better than the requirements of time

O( |s| 3 )

and space

O( |s| 2 )

oftheclassicmfoldalgorithm. Furthermore,inthefollowingsectionwepro- videevidencethatweshouldnotsethopestohighfordevelopingecientalgorithms

handlingsecondarystructureswithgeneraltypesofpseudoknots.

4 Complexity Results

Inthis sectionweprovethat RNA secondarystructure predictionwithpseudoknots

is

NP

-complete in a simple nearest neighbour model, cf. denition 1. This model

might seem too simple, and probably would be if we wanted to base a secondary

structurepredictionalgorithmonit. Butwhen provingcomplexityresults,wewant

todosoinamodelthat isassimpleaspossible. Iftheproblemin thesimplemodel

is

NP

-complete, it will remainsoin anymorecomplexand realisticmodel ifxing

(10)

Denition1 (The Nearest Neighbour Pseudoknot Model) Let

S

be a sec-

ondarystructureon asequence

s ∈ {A, C, G, U }

,with

|s| = n

,that is,

S

isa setof

basepairs

i · j

where

1 i < j n

and

∀i · j, i 0 · j 0 ∈ S : i = i 0 j = j 0

. Theenergy

of

S

isanindependent sum ofenergies ofeach ofthe basepairs in

S

,

E ( S ) = X

i·j∈S

E ( i · j, i + 1 , j 1) ,

where the energy ofabasepair

i · j

only depends on

the basepair itself,that is,the types ofbasesformingthe pair.

the twoneighbouring bases

i + 1

and

j 1

,thatis, thetypesof thesetwobases.

Furthermore, if

i + 1 · j 0 ∈ S

(or

i 0 · j 1 ∈ S

) theenergy candependon

j 0

(or

on

i 0

).

Note that the Nearest Neighbour Pseudoknot Model allowsarbitrarily complex

pseudoknotsasthereisnorestrictionthatbasepairsarenotallowedtooverlap. The

energyofabasepairintheNearestNeighbourPseudoknotModelisallowedtodepend

onnon-neighbouringbases,butonlythroughabasepairingwithaneighbouringbase.

IfwecomparethistotheTinocomodel,cf.[14],theTinocomodelallowstheenergy

of a base pair to depend, not only on the neighbouring bases and the base pairs

they mightparticipate in, but on all bases and base pairs in the loop it closes. If

weconsiderthemodelassumedbythemfoldserver,this ismorerestrictedthanthe

Tinocomodel. Stillitallowstheenergyofabasepairtodependonthetypeofloopit

closes,thesizeoftheloop,andcoaxialstackingofbasepairsinvolvingneighbouring

bases.TheNearestNeighbourPseudoknotModelcanbeseenasafurtherrestriction

ofthis where weonly allowthe energy ofa base pairto depend on stacking eects

withunpaired neighbouringbasesand basepairsinvolvingneighbouringbases. The

valueofthesestackingeectscanhoweverdependonwhethertheinvolvedbasepairs

formahelix, anordinaryloop(a bulgeormultibranched loop),orapseudoknot.

Thus, if we compare the Nearest Neighbour Pseudoknot Model to the energy

modelusedbyRivasandEddy,cf.[10],itshouldbeoflittlesurprisethattheNearest

NeighbourPseudoknotModelis arestrictionof themodel usedbyRivasand Eddy.

TheNearest NeighbourPseudoknot Model canbe obtainedfrom the energy model

used by Rivas and Eddy by xing someof the parameters. Thus an

NP

-hardness

resultforsecondarystructurepredictionintheNearestNeighbourPseudoknotModel

immediatelyimpliesthatsecondarystructurepredictionintheenergymodelusedby

RivasandEddyis

NP

-hard.

Proposition1 The problemofdeterminingwhethertheoptimal secondarystructure

inthe NearestNeighbourPseudoknot Modelhas energylower thansomeenergyvalue

E

is

NP

-complete.

As the problem trivially is in

NP

(guess the optimal secondary structure and

verifyinpolynomialtimethatithasanenergyvaluelowerthan

E

),allweneedtodo

istoprovethat itis

NP

-hard. Wewill dothisbyareductiontothespecialcaseof

3satwhereeachliteraloccursatmosttwotimes,cf.[8,proposition9.3]. Throughout

(11)

theproofofthepropositionwewillallowonlyWatson-Crickbasepairs,i.e.

A

pairing

with

U

and

C

pairingwith

G

. Thiswillbecomeexplicitinthenalspecicationofthe basepairenergyfunction,andisonlyatechnicallimitationtoreducethecomplexity

oftheproof. Beforeprovingproposition1weneedsomebuildingblocks.

Denition2 The

d

digit binary representation of

k

, where

0 k 2 d 1

, over

the alphabet

{A, U}

, is the string

b {A,U} ( k, d )

of length

d

that interpreted as a bi- narynumber with

A

representing

0

and

U

representing

1

has the value

k

. Similarly

b {C,G} ( k, d )

isthe

d

digit binary representation of

k

overthe alphabet

{C, G}

.

The

k

'th distinct

{A, U }

patternusing

d

digit binaryrepresentationsisthestring

A . . . A

| {z }

d+2

U b {A,U} ( k, d ) AUAb {A,U} ( k, d ) U A . . . A | {z }

d+2

.

The

k

'th distinct

{C, G}

pattern using

d

digit binaryrepresentationsisdenedsimi- larly.

Denition3 Forastring

s

the complementarystring

s ¯

isthestring constructedby reversing

s

andreplacing each

A

with a

U

,each

U

withan

A

,each

C

witha

G

,and

each

G

with a

C

.

Theneed forthese distinct patterns isto circumventthe fact that weonly have

fourlettersinthealphabetofRNAsequences. TheywillbeusedtoconstructanRNA

sequencecorrespondingto abooleanformulaonrestricted3satform,such thatthe

energyof an optimalsecondary structure ofthe constructed RNA sequenceimplies

whethertheformulaissatisable. TheconstructedRNAsequencewillconsistoftwo

parts,apartwheretheliteralsaregroupedaccordingtotheclausesandapartwhere

theliteralsaregroupedaccordingto thevariables.

Ifwehadanalphabetofarbitrarysizewecouldusetwosymbolstorepresenteach

occurrence of aliteral, onesymbolin theclauses partand the other symbolin the

literals part. A score of minus one could be assigned for each pairing of two such

symbolswithsomeextrapairsofsymbolsbeingusedtoformstructuresnullifyingthe

benetsofpairingmorethanonesymbolinaclause,orpairingasymbolrepresenting

avariableaswellaspairingasymbolrepresentingthisvariablesnegation.

Withoutanalphabetofarbitrarysizewewillinsteadusedistinct

{C, G}

patterns

andtheir complementary stringsin the clauses and variables parts, respectively, to

representtheliteralsoftheformula. Ahelixformedbetweena

{C, G}

patternandits

complementarystringwillindicate that thecorrespondingliteralis trueand wewill

chooseenergyparametersensuringthatsuchahelixusuallycontributesnegativelyto

thetotalenergy. Thedistinct

{A, U}

patternsandtheircomplementarystringswillbe usedtoformstructuresnullifyingbenetsofhavingmorethanonetrueliteralineach

clause,andofhavingbothaliteralrepresentingavariableandaliteralrepresentingits

negationbeingtrueatthesametime. Thisisensuredbychoosingenergyparameters

suchthathelicesformedbythedistinct

{A, U}

patternsalsocontributenegativelyto

thetotalenergy,exceptifthecasetheyshould nullify occurs. Inthatsituation they

contributezerotothetotalenergy. Theformalspecicationoftheenergyparameters

ispostponedtilltheendofthissection.

(12)

Denition4 Let

C = l 1 ∨l 2 l 3

beabooleandisjunctionofthreeliterals. The clause block

C

of

C

using

d

digit binaryrepresentationsisthe string

|{z}

S 1

|{z}

L 1

|{z} S ¯ 1

|{z}

S 2

|{z}

L 2

|{z}

S 1

|{z} S ¯ 2

|{z}

L 3

|{z}

S 2

,

where the

S i

's are distinct

{A, U }

patterns using

d

digit binary representations for twodierent

k

's, andthe

L i

'saredistinct

{C, G}

patternsusing

d

digitbinary repre-

sentationsfor threedierent

k

's.

The rationalebehind this constructionis that we can form twohelices between

distinct

{A, U }

patterns and their complementary strings within the clause block.

These twohelices will span dierent

L i

's, except for the casewhere the

S 1

and

S 2

anking

L 2

both form helices with their complementary string. In this case, the innermostbase pairofthe

S 1

helixandtheoutermostbasepairofthe

S 2

helix(and

viceversa)willbeneighbouringbasepairsformingpseudoknots.

Furthermore,the

L i

's spannedbysuchahelixwill bescreened. Byscreened,we

mean that at least one of the anking bases of the

L i

pattern cannot form a base

pairwith a base not spanned by the helix without forming a pseudoknotwith the

innermostbasepairofthehelix. The

L i

patternthuscannotformtheintendedhelix

with its complementary string in the variable block, that we will describe shortly,

withoutintroducinga pseudoknotofneighbouringbase pairs. Withoutintroducing

neighbouringpseudoknottedbasepairs,foraclauseblockwecanthusformhelicesof

twoofthedistinct patternsstraightaway,andathirdhelix ifwecanpaironeofthe

L i

patternswithitscomplementarystringinthevariablespart.

Denition5 Let

x

be a variable occurring in a boolean formula where each literal

occursatmost twice. The variable block

V

of

x

using

d

digit binaryrepresentations isthestring

|{z}

S 1

|{z} P ¯ 1

|{z} P ¯ 2

|{z} S ¯ 1

|{z} N ¯ 1

|{z} N ¯ 2

|{z}

S 1

,

where

S 1

isadistinct

{A, U }

pattern forsome

k

,the

P ¯ i

'sarecomplementarystrings tothe distinct

{C, G}

patterns usedfor the atmost twopositive occurrences of

x

(if

x

occurs positive only once, one of the

P ¯

patterns is omitted from

V

) and the

N ¯ i

's

are complementary strings to the distinct

{C, G}

patterns used for the atmost two

negative occurrences of

x

(if

x

occurs negative only once, one of the

N ¯

patterns is

omittedfrom

V

).

Therationalebehindthisconstructionisonceagaintouseahelixformedbyone

oftheoccurrencesof

S 1

and itscomplementarystringto screenthe complementary stringscorrespondingtoeitherthe(atmost)twopositiveoccurrencesof

x

orthe(at

most)twonegativeoccurrencesof

x

. Ifwearetoavoidintroducingneighbouringbase pairsformingapseudoknot,eithernoneofthedistinct

S 1

patternsformahelixwith

thecomplementarystring, thecomplementary stringscorresponding tothe positive

occurrencesof

x

donotform helices,orthecomplementarystringscorrespondingto thenegativeoccurrencesof

x

donotformhelices. Wearenowreadytoconstructthe

RNAsequencerepresentingabooleanformulaonrestricted3satform.

(13)

Denition6 Let

φ

be a boolean formula on conjunctive normal form where each clause has three literals and each literal occurs at most two times. Assume that

φ

consistsof

c

clauses anduses

v

variables. The RNA sequence corresponding to

φ

is

thesequence

s φ = C 1 C 2 . . . C c V 1 V 2 . . . V v ,

where

C i

is the clause block using

d log 2 (3 c + v ) e

digit binary representations corre- sponding to the

i

'th clause of

φ

,

V i

is the variable block using

d log 2 (3 c + v ) e

digit

binary representations corresponding to the

i

'th variable of

φ

, no distinct pattern is

used more than once, and the patterns corresponding to a literal and their comple-

mentarystrings occur inreverseorder.

Thechoice ofnumberofdigitsweusein thebinaryrepresentationsensuresthat

we canchoose at least

max { 3 c, 2 c + v}

dierent valuesfor distinct patterns. Each

clause block uses two distinct

{A, U}

patterns and three distinct

{C, G}

patterns,

whileeachvariableblockusesonedistinct

{A, U }

pattern. Thuswedonotrunoutof

patterns. Wewillusethetermcomplementarypattern forthedeliberateoccurrences

ofthecomplementarystringtoadistinctpattern, thatis, thestringsindicatedbya

barredpatternindenitions 4and5.

So far we have assumed that helices only form between a distinct pattern and

the complementary string designed to form a helix with it. Helices can of course

formbetweenpartsofdistinctpatternsnotdesignedtoformhelicestogether,butthe

followinglemmalimitsthelengthofsuchhelices.

Lemma1 Let

s φ

beanRNAsequenceconstructedfromabooleanformula

φ

accord-

ingtodenition6. In anystructure

S

of

s φ

,any helixofconsecutivelystacking pairs oflengthatleast

4 d + 5

,where

d

isthe numberofdigitsusedfor thebinaryrepresen-

tations,will have at least

2 d + 3

bases at the end of a distinct pattern forming base

pairs with the intendedbasesofthe complementarypattern tothisdistinct pattern.

Proof. Byconstruction any substring of

s φ

of lengthat least

4 d + 5

will containat

least

2 d + 3

bases from one of the ends of a distinct pattern or its complementary pattern. Consideroneofthetwosubstringsformingthehelix. Thiswillbeoflength

atleast

4 d + 5

and thus containat least

2 d + 3

bases from adistinct patternorits

complementary pattern. Assume withoutlossofgeneralitythat itcontainstherst

2 d + 3

basesfromthe

k

'thdistinct

{A, U }

patternusing

d

digitrepresentations,that is,itcontainsthesubstring

A d+2 U b {A,U} ( k, d )

. Byconstruction,theonlyoccurrences of

d + 2

consecutive

U

's preceded byan

A

in

s φ

are at the ends of complementary patterns to distinct

{A, U }

patterns, and thus

A d+2 U b {A,U} ( k, d )

forms base pairs

with

¯ b {A,U} ( k 0 , d ) AU d+2

forsome

k 0

(bytheassumptionthatonlyWatson-Crickbase pairsareallowed). As

b {A,U} ( k, d )

pairswith

¯ b {A,U} ( k 0 , d )

itfollowsthat

k = k 0

.

2

Wehavenowestablishedthatanyhelixofconsiderablelengthwillcontainatleast

partofadesignedpairing. Thenextlemmaestablishesthatthiswillbeallitcontains.

Lemma2 Let

s φ

beanRNAsequenceconstructedfromabooleanformula

φ

according

todenition6using

d

digitbinaryrepresentations. Inanystructure

S

of

s φ

,thereare

nohelicesofmorethan

4 d +9

consecutivelystackingbasepairscontainingonly

A

'sand

U

'sorcontainingonly

C

'sand

G

's. Theonlyhelices oflength

4 d + 9

containingonly

(14)

A

's and

U

's or containing only

C

's and

G

's are helices formed by distinct patterns

andtheircomplementary pattern.

Proof. Bylemma1weknowthatahelixoflength

4 d + 9

willcontainoneoftheends

ofadistinctpattern pairedwithitscomplementarypattern. All wehavetoshowis,

that we cannot extenda helixformed bya distinct pattern and its complementary

patternwithanextrastackingpairofbasesofthesametype.

If the distinct pattern is a

{C, G}

pattern this is straightforward, as it will be in a clause block and thus bordered by an

A

and a

U

, or by two

A

's. Similarly,

the complementary pattern of a distinct

{A, U }

pattern from a variable block will

be bordered by two

G

's. Finally, the complementary pattern to a distinct

{A, U }

patternfromaclauseblockwillbeborderedbyan

A

ononeside,cf.denition4. But

takingthe

S ¯ 1

complementarypattern asexample,this

A

should form anillegal (by

theWatson-Crickbase pairassumption)base pairwith either theleftmost

A

ofthe

precedingclauseblockor therightmost

C

inthe

L 2

patterntoextendthehelix.

2

Proof (ofproposition 1). Asmentionedabovethereductionwillbefrom3satwith

therestrictionthateachliteralappearsatmosttwice. Solet

φ

beavalidformulafor

this restrictionof 3sat with

c

clauses and

v

variables. Inpolynomial time, we can

construct

s φ

accordingto denition6,andthebase pairenergyfunction

E ( X i · Y j , V i+1 , W j−1 ) =

 

 

 

 

 

 

 

 

 

 

1

if

V i+1 · W j−1 ∈ S

andeither

X · Y, V · W ∈ {A · U, U · A}

or

X · Y, V · W ∈ {C · G, G · C}

4 d + 7

if

X · Y ∈ {A · U, U · A, C · G, G · C}

and for

j 0 6∈ {i + 1 , . . . j 1 }

we have

V i+1 · Z j 0 , W j−1 · Z j 0 , Z j 0 · V i+1 ,

Z j 0 · W j−1 6∈ S 4 d + 8

otherwise

where

d

isthenumberofdigitsusedforthebinaryrepresentationsin

s φ

and

S

isthe

structureforwhichtheenergyiscalculated. Thenotation

X i

isusedasashorthand

toindicatethat the

i

'thbaseisoftype

X

.

Weclaimthat theoptimalsecondarystructureof

s φ

withtheaboveenergyfunc-

tionhasenergy

(3 c + v )

ifandonlyif

φ

is satisable. Bytheenergyfunction,the

onlyhelices forwhich thebasepairs combinedyields anegativecontribution tothe

energyofthesecondarystructurearehelicesof atleast

4 d + 9

basepairs,base pairs

thatareeither all

A

'spairingwith

U

'sorall

C

'spairingwith

G

's. Bylemma2,the

only such helices that can be formed are between distinct patterns and their com-

plementarypatterns; these helices will consistof exactly

4 d + 9

base pairsandthus

contribute

1

tothetotalscoreofasecondarystructure,provided thattheinnermost

basepairofthehelixdoesnothaveaneighbouringbasepairthatformsapseudoknot.

Hence,ifadistinct patternis screenedbyahelix,it cannotform ahelixyielding a

negativecontributiontothetotalenergy.

If there is an assignment of truth values to the variables of

φ

satisfying

φ

, we

canconstruct a secondarystructure

S

on

s φ

with energy

(3 c + v )

based on this

assignmentbyformingthefollowingbase pairs.

(15)

Foreachvariableblockformingthehelixofthedistinct

{A, U }

patternandthe

complementary pattern screening the complementary patterns of the literals

thatbecomefalsebytheassignment.

For each clauseblockformingthehelicesbetweenthedistinct

{A, U }

patterns

that leave the distinct

{C, G}

pattern of a literal that becomes true by the

assignmentunscreened.

Forming thehelices betweentheunscreeneddistinct patternsof literals inthe

clausespartandtheircomplementarypatterns(thatare unscreenedastheas-

signmentsatises

φ

,andasthereverseorderrequirementindenition6ensures thetwocomplementary patterns correspondingto the sameliteral nothaving

neighbouringbasepairsforming apseudoknot)in thevariablespart.

Bythe discussion following denition 4, the distinct patterns of aclause block can

form at most three helices, each yielding a contribution of

1

, and each variable

blockintroducesonlyonenewdistinctpattern;hencetheenergyof

S

of

(3 c + v )

is

optimal.

Assume now that

s φ

has an optimal structure

S

of energy

(3 c + v )

. By the

aboveand the discussion following denition 4, we get that each clause block will

contain a distinct pattern corresponding to a literal forming a helix with its un-

screenedcomplementarypattern in the variablespart, andthat the complementary

patternscorrespondingeithertoavariableortoitsnegationwillbescreened. Wecan

thus inferatruthassignmentto thevariablesof

φ

satisfying

φ

from theunscreened

complementarypatternsofliteralsin

S

.

2

Theenergyfunctionspeciedintheproofofproposition1rewardsstackingsome

base pairs, penalises loops by penalising the rst base pair in a helix, and further

penalises neighbouring base pairs that form a pseudoknot. The only two remark-

ableoddities arethedisallowanceofbase pairingsbetween

G

and

U

, andpenalising

stackingan

A, U

basepairwitha

C, G

basepair.

Onecanobservethat wecould allow

G, U

base pairswithoutchanginganything

butinsertinga

C

betweenthetwocomplementarypatternscorrespondingtothesame literal. Asforpenalisingstacking

A, U

basepairswith

C, G

basepairs,thiswaschosen

toeaseestablishingthefactthatnoenergybenetsareobtainedbyextendingahelix

formedbyadistinctpatternanditscomplementarypatternbyfurtherstackingbase

pairs. Aproofwheretheenergyfunctionrewardsstackingofallcombinationsof

A, U

basepairs,

C, G

basepairs and

G, U

base pairscanbeachieved by amoreinvolved

constructionoftheclausespartof

s φ

. However,to limitthecomplexityoftheproof,

wehavechosentopresenttheaboveversion.

5 Discussion

The NP-completeness of the RNA secondary structure prediction problem in the

NearestNeighbour Pseudoknot Model tells us, that any algorithm allowing energy

functions suciently general to be specialised to the energy functions in the Near-

estNeighbourPseudoknotModel,andrunninginworstcasepolynomialtime,would

implyP

=

NP. The questionwhether ornot P is equalto NP is one of the fun-

damental open problems in computer science. Based on strong evidence, the large

(16)

majorityofcomputerscientistsbelievethat P

6 =

NP. The NP-completenessofthe RNAsecondary structureprediction problemin the NearestNeighbourPseudoknot

Modelthushintsthat thereisonlylittlehopeforaworstcasepolynomialtimealgo-

rithmforRNAsecondarystructure predictionintheNearestNeighbourPseudoknot

Model, or models extending it. Moreover, it hints that any algorithm for predict-

ingRNA secondarystructureswith generalpseudoknots mostlikelyhaveto exploit

specicpropertiesofaxedenergyfunction toobtainpolynomialrunningtime .

OneapproachtoobtainapolynomialtimealgorithmforRNAsecondarystructure

predictionwith pseudoknots is to limit the types of legal pseudoknots. This is the

approachtakenbyRivasandEddyin [10]andbyusinsection3. Anotherapproach

istakenbyTabaskaetal.in[13],whereinteractionsbetweenneighbouringbasepairs

areignored,thusreducingtheproblemofRNA secondarystructureprediction(with

pseudoknots) to computea maximal weightedmatching. If we are satisedto nd

notnecessarilythe structures ofleast free energy, then heuristics canbe applied to

searchforstructuresof lowenergy. Forexample,vanBatenburget al.in[17] report

onsuccessfulexperimentswithapplyinggeneticalgorithmstotheproblemofnding

lowenergyRNA secondarystructurescontainingpseudoknots.

References

[1] S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models.

Nucleic Acids Research,22:20792088,1994.

[2] J. Gorodkin, L. J. Heyer, and G. D. Stormo. Finding common sequence and

structuremotifsinasetofRNAsequences.InProceedingsofthe5thInternational

ConferenceonIntelligentSystemsforMolecularBiology (ISMB),pages120123,

1997.

[3] B. Knudsen and J.Hein. RNA secondarystructure predictionusing stochastic

context-free grammars and evolutionary history. Bioinformatics, 15:446454,

1999.

[4] R.B.Lyngsø,M.Zuker,andC.N.S.Pedersen.Fastevaluationofinternalloops

in RNAsecondarystructure prediction. Bioinformatics, 15(6):440445,1999.

[5] J.S.McCaskill. Theequilibriumpartitionfunction andbasepairbindingprob-

abilitiesforRNAsecondarystructure. Biopolymers,29:11051119,1990.

[6] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the secondary

structure ofsingle-stranded RNA. Proceedings of the NationalAcademy of Sci-

enceofthe USA,77(11):63096313,1980.

[7] R.Nussinov,G.Pieczenik,J.R.Griggs,andD.J.Kleitman.Algorithmsforloop

matchings. SIAMJournalonAppliedMathematics,35:6882,1978.

[8] C. H. Papadimitriou. Computational Complexity. Addison-Wesley Publishing

Company,1994.

[9] C.W.A.Pleij. RNApseudoknots.InR.F.GestelandandJ.F.Atkins,editors,

The RNAWorld. ColdSpring HarborLaboratoryPress,1993.

(17)

predictionincludingpseudoknots. Journalof MolecularBiology,285:20532068,

1999.

[11] Y.Sakakibara,M.Brown,R.Hughey,I.S.Mian,K.Sjölander,R.C.Underwood,

andD.Haussler. Stochasticcontext-freegrammarsfortRNAmodeling. Nucleic

AcidsResearch,22:51125120,1994.

[12] D. Sanko. Simultaneous solutionof theRNA folding,alignmentand protose-

quence problems. SIAMJournal onApplied Mathematics,45:810825,1985.

[13] J. E. Tabaska,R. B.Cary, H. N. Gabow, andG. D. Stormo. An RNA folding

method capable of identifying pseudoknots and base triples. Bioinformatics,

14(8):691699,1998.

[14] I. Tinoco, P. N. Borer, B. Dengler, M. D. Levine, O. C. Uhlenbeck, D. M.

Crothers,and J.Gralla. Improvedestimation ofsecondarystructurein ribonu-

cleicacids. NatureNewBiology,246:4041,1973.

[15] I.Tinoco,O.C.Uhlenbeck,andM.D.Levine.Estimationofsecondarystructure

in ribonucleicacids. Nature,230:362367,1971.

[16] D.H.Turner,N.Sugimoto,andS.M.Freier. RNAstructureprediction.Annual

Review ofBiophysics andBiophysical Chemistry,17:167192,1988.

[17] F. H. D. van Batenburg, A. P. Gultyaev, and C. W. A. Pleij. An APL-

programmedgenetic algorithm for the predictionof RNA secondarystructure.

Journalof Theoretical Biology,174(3):269280,1995.

[18] K.ZhangandD.Shasha.Simplefastalgorithmsfortheeditingdistancebetween

treesandrelatedproblems.SIAMJournalonComputing,18(6):12451262,1989.

[19] M.ZukerandP.Stiegler.OptimalcomputerfoldingoflargeRNAsequencesusing

thermodynamics andauxiliaryinformation. Nucleic Acids Research,9:133148,

1981.

(18)

Recent BRICS Report Series Publications

RS-00-1 Rune B. Lyngsø and Christian N. S. Pedersen. Pseudoknots in RNA Secondary Structures. January 2000. 15 pp. To appear in Fourth Annual International Conference on Computational Molecular Biology, RECOMB ’00 Proceedings, 2000.

RS-99-57 Peter D. Mosses. A Modular SOS for ML Concurrency Primi- tives. December 1999. 22 pp.

RS-99-56 Peter D. Mosses. A Modular SOS for Action Notation. Decem- ber 1999. 39 pp. Full version of paper appearing in Mosses and Watt, editors, Second International Workshop on Action Semantics, AS ’99 Proceedings, BRICS Notes Series NS-99-3, 1999, pages 131–142.

RS-99-55 Peter D. Mosses. Logical Specification of Operational Se- mantics. December 1999. 18 pp. Invited paper. Appears in Flum, Rodr´ıguez-Artalejo and Mario, editors, European Asso- ciation for Computer Science Logic: 13th International Work- shop, CSL ’99 Proceedings, LNCS 1683, 1999, pages 32–49.

RS-99-54 Peter D. Mosses. Foundations of Modular SOS. December 1999.

17 pp. Full version of paper appearing in Kutyłowski, Pachol- ski and Wierzbicki, editors, Mathematical Foundations of Com- puter Science: 24th International Symposium, MFCS ’99 Pro- ceedings, LNCS 1672, 1999, pages 70–80.

RS-99-53 Torsten K. Iversen, K˚are J. Kristoffersen, Kim G. Larsen, Morten Laursen, Rune G. Madsen, Steffen K. Mortensen, Paul Pettersson, and Chris B. Thomasen. Model-Checking Real- Time Control Programs — Verifying LEGO Mindstorms Systems Using UPPAAL. December 1999. 9 pp.

RS-99-52 Jesper G. Henriksen, Madhavan Mukund, K. Narayan Kumar, and P. S. Thiagarajan. Towards a Theory of Regular MSC Lan- guages. December 1999.

RS-99-51 Olivier Danvy. Formalizing Implementation Strategies for First- Class Continuations. December 1999. Extended version of an article to appear in Programming Languages and Systems:

Ninth European Symposium on Programming, ESOP ’00 Pro-

ceedings, LNCS, 2000.

Referencer

RELATEREDE DOKUMENTER

of the expressive completeness of this property language with respect to tests. More precisely, we study whether all properties that are testable can

With this relaxation we have been able to define complexity in this model using restricted classes of real numbers and functions.. Interval arithmetic [7], can be used as the

We have presented a wavelet based 3D compression scheme for very large volume data supporting fast random access to individual voxels within the volume. Experiments on the CT data

The diagnostics are used to determine the current state of the herd with more precision, forming the basis for a decision of a control strategy for the disease. After a decision to

Likewise, the existence of the Archives in Denmark inhibited the establishment of an historical society or centralized archives in North America since those who supported the

Most specific to our sample, in 2006, there were about 40% of long-term individuals who after the termination of the subsidised contract in small firms were employed on

This gives a cleaner semantics to the restriction phenomenon, and further- more avoids the state-space explosions mentioned above. According to [28], we can guarantee that the

In the other form of Nash-equilibrium the policy variables are different in the two regions and benefits are highest in the neighbouring region, which can be interpreted as