• Ingen resultater fundet

Time and Space Analysis

5.3 Wildcard Tree Index

5.3.1 Time and Space Analysis

For each of thejwildcards inp, the size ofAis increased by a factor of at mostβ. Hence the search forpbranches to a total of at mostPj

i=0βi =O(βj)locations, each of which requires O(m) time, resulting in query time O(βjm+occ). For β = 1 the query time becomesO(m+j+occ), since nextheavylocations(`) =∅for all`.

We will now show the space required to storeTβk(C0), resulting in Lemma 13.

Lemma 13 For any integer 1 ≤ β < σ, the wildcard tree Tβk(C0) has query time O(βjm+occ)for1< β < σandO(m+j+occ)forβ = 1. The wildcard tree stores

38 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

O(|C0|hk) strings, whereh is an upper bound on the light height of all compressed triesT(S)satisfyingS ⊆suffd(C0)for some integerd.

In the paper by Cole et al. [13] the authors give a proof by induction of the space needed to store a wildcard tree forβ = 2. See Appendix B for an explanation of their proof and a discussion of a potential problem in it. We give a more general proof of the space usage that works for any value ofβ.

Proof We prove that the total number of strings (leaves) inTβi(S), denoted|Tβi(S)|, is at most|S|Pi

j=0hj = O(|S|hi). The proof is by induction on i. The base case i= 0holds, sinceTβ0(S) =T(S)contains|S|=|S|P0

j=0hj strings. For the induction step, assume that|Tβi(S)| ≤ |S|Pi

j=0hj. LetSv = suff2(lightstrings(v))for a vertex v∈T(S). From the construction we have that the number of strings inTβi+1(S)is the number of strings inT(S)plus the number of strings in the wildcard trees joined to the vertices ofT(S). That is,

The string setsSv consist of suffixes of strings inS. Consider a stringx∈S, i.e., a leaf inT(S). The number of times a suffix ofxappears in a setSv is equal to the light depth ofxinT(S). From the construction,Sis a subset ofsuffd(C0)for somed, and hencehis an upper bound on the maximum light depth ofT(S). This establishes that

X

Constructing the wildcard treeTβk(C), forC = suff(t), we obtain a wildcard index with the following properties.

Lemma 14 Lettbe a string of lengthnfrom an alphabet of sizeσ. For2 ≤β < σ there is ak-bounded wildcard index for t usingO nlogkβn

space. The index can report the occurrences of a pattern withmcharacters andj ≤k wildcards in time O βjm+occ

.

Proof The query time follows from Lemma 13. Since Tβk(C) is a compressed trie, and because each edge label is a substring oft, the space needed to storeTβk(C)is upper bounded by the number of strings it contains which by Lemma 13 isO(nhk).

Sinceα=β−1, it follows from Lemma 12 thath= logβnis an upper bound on the light height of all compressed triesT(S) satisfyingS ⊆ suffd(C) for somed, since they contain at mostnvertices. Consequently, the space needed to store the index is O(nlogkβn).

39 5.4 Wildcard Tree Index Using the LCP Data Structure

In this section we prove Theorem 2 by showing how LCP queries can be used to speed up the query time of the wildcard tree index described in the previous section.

The wildcard index of Lemma 14 reduces the branching factor of the suffix tree search fromσtoβ, but still has the drawback that the search for a subpatternpifrom an active location ` ∈ Tβk(C) takesO(|pi|) time. This can be addressed by combining the index with the LCP data structure as in the paper by Cole et al. [13]. In that way, the search for a subpattern can be done in timeO(log logn). The index is obtained by modifying the construction ofTβi(S)such that eachT(S)is added to the LCP data structure prior to joining the (β, i−1)-wildcard trees to the vertices of T(S). For all T(S) except the finalT(S) =Tβ0(S), support for unrooted LCP queries in timeO(log logn)is added using additionalO(|S|log|S|)space. For the finalT(S)we only need support for rooted queries since at this point all wildcards inphave been consumed. Upon receiving the query pattern p=p1∗p2∗. . .∗pj, eachpi is preprocessed in timeO(|pi|)to support LCP queries for any suffix ofpi.

The search for p in the modified index proceeds as described in Algorithm 3. The algorithm is similar to that for the normal wildcard tree index, except now LCP queries are used to search forp0, p1, . . . , pj. After performing a LCP query forpifrom`, the search only continues from the new location`0 if the full subpatternpiwas matched. In that case, the search branches to the heavy children and the root of the wildcard tree joined to`0, since the next character in the search is a wildcard. If the last subpattern was matched, the resulting location is an occurrence ofpint, and all strings inChaving that location as an ancestor are reported.

Algorithm 3Searching a wildcard tree to find the occurrences a patternp=p0∗p1∗. . .∗pj

using the LCP data structure.

A ← {(0,root(Tβk(C0)))}

R ← ∅

whileA 6=∅do (i, `)←extract(A)

`0LCP(pi,trie(`), `) if|`0| − |`|=|pi|then

ifi=jthen R ← R ∪ {`0} else

A ← A ∪ {(i+ 1, `00)|`00∈nextheavylocations(`)∨`00=nextlocation(`0,∗)}

Report descendant leaves of locations inR

5.4.1 Time and Space Analysis

In the search for p a total of at most Pj

i=0βi = O(βj) LCP queries each taking time O(log logn)are performed. Preprocessingp0, p1, . . . , pj takesPj

i=0|pi|=mtime, so the

40 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

query time isO(m+βjlog logn+occ). The space needed to store the index isO(nlogkβn) for Tβk(C) plus the space needed to store the LCP data structure. Adding support for rooted LCP queries requires linear space in the total size of the compressed tries, which isO(nlogkβn). LetT(S0), T(S1), . . . , T(Sq)denote the compressed tries with support for unrooted LCP queries. Since eachSi contains at mostnstrings andPq

i=0|Si|=|Tβk−1(C)|, by Lemma 2, the additional space required to support unrooted LCP queries is

O

q

X

i=0

|Si|log|Si|

!

=O logn

q

X

i=0

|Si|

!

=O

logn|Tβk−1(C0)|

=O nlog(n) logk−1β n ,

which is an upper bound on the total space required to store the wildcard index. This concludes the proof of Theorem 2.

Thek-bounded wildcard index described by Cole et al. [13] is obtained as a special case of Theorem 2.

Corollary 3 (Cole et al.) Let t be a string of length n from an alphabet of size σ.

There is a k-bounded wildcard index for tusing O(nlogkn) space. The index can report the occurrences of a pattern withmcharacters andj ≤k wildcards in time O(m+ 2jlog logn+occ).

6

A k-B OUNDED W ILDCARD I NDEX

WITH O PTIMAL Q UERY T IME

In this chapter we show Theorem 3. In Section 6.1 we first obtain ak-bounded wildcard index similar to Theorem 2 by using Lemma 13. We use this index to construct a black-box reduction for obtaining optimal time indexes. In Section 6.2 the black-box reduction is used on the wildcard index from Theorem 1 to obtain Theorem 3.

6.1 A Black-Box Reduction

Consider thek-bounded wildcard index which ensures a linear query time, obtained by creating the wildcard treeT1k(suff(t))fort. We can show that the space usage for the index depends on the height of the suffix tree fort.

Lemma 15 Let t be a string of lengthn from an alphabet of sizeσ. There is a k-bounded wildcard index fortusingO(nhk)space, wherehis the height of the suffix tree fort. The index can report the occurrences of a pattern withmcharacters andj wildcards in timeO(m+j+occ).

Proof Sincesuff(t)is closed under the suffix operation, the height ofT(suff(t))is an upper bound on the height of all compressed triesT(S)satisfyingS ⊆suffd(suff(t)) for some d. For β = 1, the light height of T(S) is equal to the height of T(S), so h= height(T(suff(t)))can be used as an upper bound of the light height in Lemma 13, and consequently the space needed to storeT1k(suff(t))isO(nhk).

In the worst case the height of the suffix tree is close ton, but combining the index with another wildcard index yields a useful black-box reduction. The idea is to query the first index if the pattern is short, and the second index if the pattern is long.

Lemma 16 LetF ≥mand letGbe independent ofmandj. Given a wildcard index W with query time O(F +G+occ) and space usageO(|W|), there is ak-bounded wildcard indexW0 with query timeO(F+j+occ)and taking spaceO(nmin(G, h)k+

|W|), wherehis the height of the suffix tree fort.

42 AK-BOUNDED WILDCARD INDEX WITH OPTIMAL QUERY TIME

Proof The wildcard indexW0consists ofW and a specialk-bounded wildcard index W00 = T1k(prefG(suff(t))), which is a wildcard tree with β = 1 over the set of all substrings oftof lengthG. Since the maximum string depth inW00isG, this value can be used as an upper bound for the light height in Lemma 13. Consequently, the space required to storeW00isO(nmin(G, h)k)by using Lemma 15 ifG > h, wherehis the height of the suffix tree fort. See Figure 6.1 for an illustration of the construction of W00 =T1k(prefG(suff(t))).

A query onW0results in a query on eitherW orW00. In caseF+j > G, we query W and the query time will beO(F +G+occ) =O(F+j+occ). In caseF +j≤G, we queryW00 with query timeO(m+j+occ) = O(F +j+occ). In any case, the query time ofW0 isO(F+j+occ).

k G

T1k(prefG(C))

T(C)

Figure 6.1:Illustrating how the wildcard treeT1k(prefG(C))forC= suff(t)is obtained by slicing off the top of the compressed trieT(C)with string depth Gand appending k additional layers of wildcard trees recursively. For reference, Figure 3.3(c) shows a concrete example of constructing T(prefG(C))forG= 3.

6.2 Obtaining the Index

Applying Lemma 16 with F = m and G = σklog logn on the unbounded wildcard index from Theorem 1 yields a newk-bounded wildcard index with optimal query time O(F +j+occ) =O(m+j+occ). The index uses space

O(nGk+n) =O(n(σklog logn)k) =O(σk2nlogklogn). This concludes the proof of Theorem 3.

7

V ARIABLE L ENGTH G APS

In this chapter we consider thestring indexing for patterns with variable length gaps problem.

We show that this problem can be solved, using the bounded and unbounded wildcard indexes described in the previous chapters, by only changing the search algorithms.

Section 7.1 introduces the problem, Section 7.2 describes previous work and Section 7.3 gives an overview of our solutions. In Section 7.4 we account for the changes necessary for supporting variable length gaps and analyse the modified search algorithm.

7.1 Introduction

The string indexing for patterns with variable length gaps problem is to build an index for a stringtthat can efficiently report the occurrences of a query patternpof the form

p=p0∗{a1, b1}p1 ∗{a2, b2} . . . ∗{aj, bj}pj .

The query pattern consists of j+ 1 stringsp0, p1, . . . , pj ∈ Σ interleaved by j variable length gaps∗{ai, bi},i= 1, . . . , j, whereai andbi are positive integers such thatai ≤bi. Intuitively, a variable length gap∗{ai, bi}matches an arbitrary string over Σof length betweenai andbi, both inclusive.

Example Consider the stringtand patternpover the alphabetΣ ={a,b,c,d}.

t=acbccbacccddabdaabcdccbccdaa p=b∗{0,4}cc∗{3,5}d

The stringtcontains five occurrences of the query patternpas shown in Figure 7.1.

As shown by the example, different occurrences of the query patternpcan start or end at the same position int, and the same substring intcan contain multiple occurrences ofp.

Hence to completely characterize an occurrence ofpint, we need to report the positions of the individual subpatternsp0, p1, . . . , pj for each full occurrence of the pattern. However, in the following we will restrict our attention to reporting the start and end position of

44 VARIABLE LENGTH GAPS

1

a c2 b3 c4 c5 b6 a7 c8 c9 10c 11d 12d 13a 14b 15d 16a 17a 18b 19c 20d 21c 22c 23b 24c 25c 26d 27a 28a b c c ∗ ∗ ∗ ∗ ∗ d

b ∗ ∗ ∗ ∗ c c ∗ ∗ ∗ ∗ ∗ d b ∗ c c ∗ ∗ ∗ ∗ ∗ d b ∗ ∗ c c ∗ ∗ ∗ ∗ d

b ∗ ∗ c c ∗ ∗ ∗ d

Figure 7.1:The five occurrences ofp=b∗{0,4}cc∗{3,5}dint=acbccbacccddabdaabcdccbccdaa.

each occurrence ofpint. For the above example, we would thus report the pairs(3,11), (3,15),(6,15)and(18,26).

String indexing for patterns with variable length gaps has applications in information retrieval, data mining and computational biology [18, 19, 31, 30, 33]. In particular, the PROSITE data base [23, 7] uses patterns with variable length gaps to identify and classify protein sequences. The problem is a generalization of string indexing for patterns with wildcards, since a wildcard ∗ is equivalent to the variable length gap∗{1,1}. Variable length gaps are also known asbounded wildcards, as a variable length gap∗{ai, bi}can be regarded as a bounded sequence of wildcards.

String indexing for patterns with variable length gaps is equivalent to string indexing for patterns with wildcards, with the addition of allowingoptional wildcardsin the pattern.

An optional wildcard matches any character fromΣor the empty string, i.e., an optional wildcard is equivalent to the variable length gap∗{0,1}. Conversely, we may also consider a variable length gap∗{ai, bi}asaiconsecutive wildcards followed bybi−aiconsecutive optional wildcards.

In the following we letA =Pj

i=1ai andB =Pj

i=1bi denote the sum of the lower and upper bounds on the variable length gaps in p, respectively. HenceA and B −A denote the number of normal and optional wildcards inp, respectively. A wildcard index with support for optional wildcards is called anoptional wildcard index. As for wildcard indexes, we distinguish between bounded and unbounded optional wildcard indexes.

A (k, o)-bounded optional wildcard index supports patterns containing A ≤ k normal wildcards and B −A ≤ o optional wildcards. An unbounded optional wildcard index supports patterns with no restriction on the number of normal and optional wildcards..

7.2 Previous Work

Lam et al. [26] introduced optional wildcards in the pattern and presented a variant of their solution for the string indexing for patterns with wildcards problem. The idea is to determine potential matches and verify complete matches using interval stabbing on the possible positions for the subpatterns. This leads to an unbounded optional wildcard index with query timeO(m+Bjmin0≤i≤jocc(pi, t))and space usageO(n). As beforeocc(pi, t) isΘ(n)in the worst case, so the query time isΘ(Bjn)in the worst case.

45 Besides the result by Lam et al., we have no knowledge of solutions for the string indexing for patterns with variable length gaps problem. Some results are known for thestring matching with variable length gaps problem, where the text string may not be preprocessed in advance. Bille et al. [6] recently presented the best known solution for the matching problem with query timeO((n+m) logj+occ)using spaceO(m+A). We refer to their paper for a review of the previous work on the string matching with variable length gaps problem.

7.3 Our Results

By modifying the search algorithm, we show that our solutions to string indexing for patterns with wildcards also support variable length gaps in the pattern, leading to the following new theorems.

Theorem 4 Let t be a string of length n from an alphabet of size σ. There is an unbounded optional wildcard index fortusingO(n)space. The index can report the occurrences of a pattern withmcharacters,Awildcards andB−Aoptional wildcards in timeO(m+ 2B−AσBlog logn+occ), whereA=Pj

i=1ai andB =Pj i=1bi.

Theorem 5 Lettbe a string of lengthnfrom an alphabet of sizeσ. For 2≤β < σ, there is a(k, o)-bounded optional wildcard index fortusingO nlog(n) logk+o−1β n space. The index can report the occurrences of a pattern withm characters,A≤k wildcards andB−A≤ooptional wildcards in timeO m+ 2B−AβBlog logn+occ

, whereA=Pj

i=1aiandB=Pj i=1bi.

Theorem 6 Let t be a string of length n from an alphabet of size σ. There is a (k, o)-bounded optional wildcard index fortusingO(σ(k+o)2nlogk+ologn)space. The index can report the occurrences of a pattern withmcharacters,A≤kwildcards and B−A ≤o optional wildcards in timeO(2B−A(m+B) +occ), whereA = Pj

i=1ai

andB =Pj i=1bi.

These results completely generalize our previous solutions, since if the query pattern only contains variable length gaps of the form∗{1,1}, the problem reduces to string indexing for patterns with wildcards. In that caseA =B =j and we obtain exactly Theorem 1, Theorem 2 and Theorem 3.

Compared to the only known index for the problem by Lam et al. [26], Theorem 4 gives an index that matches theO(n)space usage, but improves the worst-case query time fromΘ(Bjn)toO m+ 2B−AσBlog logn+occ

, provided thatB ≤logσ√ nj.

7.4 Supporting Variable Length Gaps

As remarked a variable length gap∗{ai, bi}is equivalent toaiwildcards followed bybi−ai

optional wildcards. Hence to support variable length gaps, we only have to describe how the search algorithms must be modified to match an optional wildcard inp. Essentially the

46 VARIABLE LENGTH GAPS

difference between matching a normal wildcard and an optional one is how the search handles the location where the wildcard is matched. For a normal wildcard the location is removed from the set of active locations, but for an optional wildcard the location is kept active. A sequence ofjoptional wildcards is simulated as a sequence ofj wildcards, where all intermediate locations reached are kept active.

As an example, consider the unbounded wildcard index obtained by adding support for unrooted LCP queries on the suffix treeT(suff(t))for the indexed stringt. Algorithm 4 illustrates how to searchT(suff(t))for a patternpcontaining variable length gaps. The algorithm is similar to Algorithm 3, except when a variable length gap ∗{ai+1, bi+1} is matched in a location`0. In this case, the for-loop in the algorithm simulates the variable length gap asai+1 normal wildcards followed bybi+1−ai+1 optional wildcards. At the start of each iteration of the for-loopf wildcards have been simulated. For each wildcard simulated,B0holds the resulting set of locations inT(suff(t))reached, whileBcontains the active locations at the beginning of each iteration forf. When a new iteration forf starts,Bcontains the active locations afterf wildcards were simulated. Ifai+1≤f ≤bi+1, the currently considered locationois kept active, since the next wildcard to simulate is optional.

Extending the algorithm to support searching in wildcard trees is just a matter of changing the way the search branches. Instead of branching on all outgoing edges, it must instead branch to the heavy children and follow the edge labeled∗for each wildcard simulated. We refer to Algorithm 3 for how the branching should be performed.

7.4.1 Reporting Occurrences

To report the substrings intwhere the query pattern occurs, we assume that each leaf`in T(suff(t))has been labeled by the start position,pos(`), of the suffix intit corresponds to. When Algorithm 4 terminates, each location `0 ∈ R corresponds to one or more substrings int, where the query patternpoccurs. To report the start and end position of these substrings, we traverse the subtree rooted at`0and identify the leaves`0, `1, . . . , `r corresponding to suffixes ofthaving`0 as a prefix. The start and end positions of these substrings are then given by

pos(`0),pos(`0) +|`0|

, pos(`1),pos(`1) +|`0|

, . . . , pos(`r),pos(`r) +|`0| .

7.4.2 Analysis of the Modified Search

To analyse the running time of Algorithm 4, we will bound the maximum number of LCP queries performed during the search for the query pattern

p=p0∗{a1, b1}p1 ∗{a2, b2} . . . ∗{aj, bj}pj . We defineAi =Pi

l=1ai andBi=Pi

l=1bi. The number of normal and optional wildcards preceding the subpatternpiinpisAiandBi−Ai, respectively. To bound the number of locations in which a LCP query for the subpatternpi can start, we choose and promote l= 0,1, . . . , Bi−Aiof the preceding optional wildcards to normal wildcards and discard

47 Algorithm 4Searching the unbounded wildcard indexT(suff(t))to find the occurrences a patternp=p0∗{a1, b1}p1∗{a2, b2} . . . ∗{aj, bj}pj using the LCP data structure.

For each location inRreport the corresponding substrings int

the rest. For a specific choice there are exactlyAi+lwildcards precedingpi, and thus the number of locations in which a LCP query forpican start is at mostβAi+l. The termβis an upper bound on the branching factor of the search when consuming a wildcard. For the indexT(suff(t))searched by Algorithm 4, the branching factor isβ =σ, but indexes based on wildcard trees can have a smaller branching factor. There are Bi−Al i

possibilities for choosing theloptional wildcards, so the number of locations in which a LCP query forpi can start is at most Summing over thej+ 1subpatterns, we obtain the following bound for the total number of LCP queries performed during a search for the query patternp.

j

X

i=0

2Bi−AiβBi = O 2B−AβB .

Since the search algorithm performs LCP queries in timeO(log logn)and has to preprocess the pattern in time O(m), the total query time is O(m+ 2B−AβBlog logn+occ). This concludes the proof of Theorem 4 and Theorem 5.

48 VARIABLE LENGTH GAPS

To show Theorem 6, we apply a black-box reduction very similar to Lemma 16, leading to a (k, o)-bounded optional wildcard index, where k and o are parameters chosen in advance. This index consists of the following two optional wildcard indexes. A query is performed on one of these indexes depending on the lengthm+Bof the received query patternp.

1. The unbounded optional wildcard index given by Theorem 4. This index has query timeO(m+ 2B−AσBlog logn+occ)and uses spaceO(n).

2. The(k, o)-bounded optional wildcard index obtained by constructing the wildcard treeT1k+o(prefG(suff(t)))without the LCP data structure, whereG=σk+olog logn.

From Equation 7.1 withβ = 1, the search for the subpatternpican start from at most 2Bi−Ai locations. Searching forpifrom each of these locations takes timeO(|pi|+bi), since the LCP data structure is not used and the tree must be traversed one character at a time. Summing over thej+ 1subpatterns, we obtain the following query time for the index

O

j

X

i=0

2Bi−Ai(|pi|+bi) +occ

!

= O 2B−A(m+B) +occ .

The index is a wildcard tree and by the same argument as for Theorem 3, it can be stored using spaceO(nGk+o).

In case the received query patternphas lengthm+B > Gwe query the first index. It follows that2B−AσBlog logn≤2B−AG <2B−A(m+B), so the query time isO(2B−A(m+ B) +occ). Ifphas lengthm+B ≤Gall occurrences ofpintcan be found by querying the second index in timeO(2B−A(m+B) +occ). The space of the index is

O(n+nGk+o) =O(n(σk+olog logn)k+o) =O(nσ(k+o)2logk+ologn). This concludes the proof of Theorem 6.

8

C ONCLUSION

In this thesis we considered the problem of indexing a string to report the occurrences of a query pattern containing wildcards. The ideal index has size linear in the length of the

In this thesis we considered the problem of indexing a string to report the occurrences of a query pattern containing wildcards. The ideal index has size linear in the length of the