Time and Space Analysis

5.3 Wildcard Tree Index

5.3.1 Time and Space Analysis

For each of thejwildcards inp, the size ofAis increased by a factor of at mostβ. Hence the search forpbranches to a total of at mostPj

i=0βⁱ =O(β^j)locations, each of which requires O(m) time, resulting in query time O(β^jm+occ). For β = 1 the query time becomesO(m+j+occ), since nextheavylocations(`) =∅for all`.

We will now show the space required to storeT_β^k(C⁰), resulting in Lemma 13.

Lemma 13 For any integer 1 ≤ β < σ, the wildcard tree T_β^k(C⁰) has query time O(β^jm+occ)for1< β < σandO(m+j+occ)forβ = 1. The wildcard tree stores

38 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

O(|C⁰|h^k) strings, whereh is an upper bound on the light height of all compressed triesT(S)satisfyingS ⊆suff_d(C⁰)for some integerd.

In the paper by Cole et al. [13] the authors give a proof by induction of the space needed to store a wildcard tree forβ = 2. See Appendix B for an explanation of their proof and a discussion of a potential problem in it. We give a more general proof of the space usage that works for any value ofβ.

Proof We prove that the total number of strings (leaves) inT_βⁱ(S), denoted|T_βⁱ(S)|, is at most|S|P_i

j=0h^j = O(|S|hⁱ). The proof is by induction on i. The base case i= 0holds, sinceT_β⁰(S) =T(S)contains|S|=|S|P0

j=0h^j strings. For the induction step, assume that|T_βⁱ(S)| ≤ |S|Pi

j=0h^j. LetS_v = suff₂(lightstrings(v))for a vertex v∈T(S). From the construction we have that the number of strings inT_βⁱ⁺¹(S)is the number of strings inT(S)plus the number of strings in the wildcard trees joined to the vertices ofT(S). That is,

The string setsS_v consist of suffixes of strings inS. Consider a stringx∈S, i.e., a leaf inT(S). The number of times a suffix ofxappears in a setSv is equal to the light depth ofxinT(S). From the construction,Sis a subset ofsuff_d(C⁰)for somed, and hencehis an upper bound on the maximum light depth ofT(S). This establishes that

Constructing the wildcard treeT_β^k(C), forC = suff(t), we obtain a wildcard index with the following properties.

Lemma 14 Lettbe a string of lengthnfrom an alphabet of sizeσ. For2 ≤β < σ there is ak-bounded wildcard index for t usingO nlog^k_βn

space. The index can report the occurrences of a pattern withmcharacters andj ≤k wildcards in time O β^jm+occ

Proof The query time follows from Lemma 13. Since T_β^k(C) is a compressed trie, and because each edge label is a substring oft, the space needed to storeT_β^k(C)is upper bounded by the number of strings it contains which by Lemma 13 isO(nh^k).

Sinceα=β−1, it follows from Lemma 12 thath= log_βnis an upper bound on the light height of all compressed triesT(S) satisfyingS ⊆ suff_d(C) for somed, since they contain at mostnvertices. Consequently, the space needed to store the index is O(nlog^k_βn).

39 5.4 Wildcard Tree Index Using the LCP Data Structure

In this section we prove Theorem 2 by showing how LCP queries can be used to speed up the query time of the wildcard tree index described in the previous section.

The wildcard index of Lemma 14 reduces the branching factor of the suffix tree search fromσtoβ, but still has the drawback that the search for a subpatternp_ifrom an active location ` ∈ T_β^k(C) takesO(|p_i|) time. This can be addressed by combining the index with the LCP data structure as in the paper by Cole et al. [13]. In that way, the search for a subpattern can be done in timeO(log logn). The index is obtained by modifying the construction ofT_βⁱ(S)such that eachT(S)is added to the LCP data structure prior to joining the (β, i−1)-wildcard trees to the vertices of T(S). For all T(S) except the finalT(S) =T_β⁰(S), support for unrooted LCP queries in timeO(log logn)is added using additionalO(|S|log|S|)space. For the finalT(S)we only need support for rooted queries since at this point all wildcards inphave been consumed. Upon receiving the query pattern p=p1∗p2∗. . .∗pj, eachpi is preprocessed in timeO(|p_i|)to support LCP queries for any suffix ofp_i.

The search for p in the modified index proceeds as described in Algorithm 3. The algorithm is similar to that for the normal wildcard tree index, except now LCP queries are used to search forp₀, p₁, . . . , p_j. After performing a LCP query forp_ifrom`, the search only continues from the new location`⁰ if the full subpatternpiwas matched. In that case, the search branches to the heavy children and the root of the wildcard tree joined to`⁰, since the next character in the search is a wildcard. If the last subpattern was matched, the resulting location is an occurrence ofpint, and all strings inChaving that location as an ancestor are reported.

Algorithm 3Searching a wildcard tree to find the occurrences a patternp=p0∗p1∗. . .∗pj

using the LCP data structure.

A ← {(0,root(T_β^k(C⁰)))}

R ← ∅

whileA 6=∅do (i, `)←extract(A)

`⁰ ← LCP(pi,trie(`), `) if|`⁰| − |`|=|p_i|then

ifi=jthen R ← R ∪ {`⁰} else

A ← A ∪ {(i+ 1, `⁰⁰)|`⁰⁰∈nextheavylocations(`)∨`⁰⁰=nextlocation(`⁰,∗)}

Report descendant leaves of locations inR

5.4.1 Time and Space Analysis

In the search for p a total of at most Pj

i=0βⁱ = O(β^j) LCP queries each taking time O(log logn)are performed. Preprocessingp0, p1, . . . , pj takesPj

i=0|p_i|=mtime, so the

40 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

query time isO(m+β^jlog logn+occ). The space needed to store the index isO(nlog^k_βn) for T_β^k(C) plus the space needed to store the LCP data structure. Adding support for rooted LCP queries requires linear space in the total size of the compressed tries, which isO(nlog^k_βn). LetT(S₀), T(S₁), . . . , T(S_q)denote the compressed tries with support for unrooted LCP queries. Since eachSi contains at mostnstrings andPq

i=0|S_i|=|T_β^k−1(C)|, by Lemma 2, the additional space required to support unrooted LCP queries is

i=0

|S_i|log|S_i|

=O logn

i=0

|S_i|

logn|T_β^k−1(C⁰)|

=O nlog(n) log^k−1_β n ,

which is an upper bound on the total space required to store the wildcard index. This concludes the proof of Theorem 2.

Thek-bounded wildcard index described by Cole et al. [13] is obtained as a special case of Theorem 2.

Corollary 3 (Cole et al.) Let t be a string of length n from an alphabet of size σ.

There is a k-bounded wildcard index for tusing O(nlog^kn) space. The index can report the occurrences of a pattern withmcharacters andj ≤k wildcards in time O(m+ 2^jlog logn+occ).

6

A k-B ^OUNDED W ^ILDCARD I ^NDEX

WITH O PTIMAL Q UERY T IME

In this chapter we show Theorem 3. In Section 6.1 we first obtain ak-bounded wildcard index similar to Theorem 2 by using Lemma 13. We use this index to construct a black-box reduction for obtaining optimal time indexes. In Section 6.2 the black-box reduction is used on the wildcard index from Theorem 1 to obtain Theorem 3.

6.1 A Black-Box Reduction

Consider thek-bounded wildcard index which ensures a linear query time, obtained by creating the wildcard treeT₁^k(suff(t))fort. We can show that the space usage for the index depends on the height of the suffix tree fort.

Lemma 15 Let t be a string of lengthn from an alphabet of sizeσ. There is a k-bounded wildcard index fortusingO(nh^k)space, wherehis the height of the suffix tree fort. The index can report the occurrences of a pattern withmcharacters andj wildcards in timeO(m+j+occ).

Proof Sincesuff(t)is closed under the suffix operation, the height ofT(suff(t))is an upper bound on the height of all compressed triesT(S)satisfyingS ⊆suff_d(suff(t)) for some d. For β = 1, the light height of T(S) is equal to the height of T(S), so h= height(T(suff(t)))can be used as an upper bound of the light height in Lemma 13, and consequently the space needed to storeT₁^k(suff(t))isO(nh^k).

In the worst case the height of the suffix tree is close ton, but combining the index with another wildcard index yields a useful black-box reduction. The idea is to query the first index if the pattern is short, and the second index if the pattern is long.

Lemma 16 LetF ≥mand letGbe independent ofmandj. Given a wildcard index W with query time O(F +G+occ) and space usageO(|W|), there is ak-bounded wildcard indexW⁰ with query timeO(F+j+occ)and taking spaceO(nmin(G, h)^k+

|W|), wherehis the height of the suffix tree fort.

42 AK-BOUNDED WILDCARD INDEX WITH OPTIMAL QUERY TIME

Proof The wildcard indexW⁰consists ofW and a specialk-bounded wildcard index W⁰⁰ = T₁^k(pref_G(suff(t))), which is a wildcard tree with β = 1 over the set of all substrings oftof lengthG. Since the maximum string depth inW⁰⁰isG, this value can be used as an upper bound for the light height in Lemma 13. Consequently, the space required to storeW⁰⁰isO(nmin(G, h)^k)by using Lemma 15 ifG > h, wherehis the height of the suffix tree fort. See Figure 6.1 for an illustration of the construction of W⁰⁰ =T₁^k(pref_G(suff(t))).

A query onW⁰results in a query on eitherW orW⁰⁰. In caseF+j > G, we query W and the query time will beO(F +G+occ) =O(F+j+occ). In caseF +j≤G, we queryW⁰⁰ with query timeO(m+j+occ) = O(F +j+occ). In any case, the query time ofW⁰ isO(F+j+occ).

∗

∗ ∗

∗

∗ ∗

∗

k G

T₁^k(pref_G(C))

T(C)

Figure 6.1:Illustrating how the wildcard treeT₁^k(pref_G(C))forC= suff(t)is obtained by slicing off the top of the compressed trieT(C)with string depth Gand appending k additional layers of wildcard trees recursively. For reference, Figure 3.3(c) shows a concrete example of constructing T(pref_G(C))forG= 3.

6.2 Obtaining the Index

Applying Lemma 16 with F = m and G = σ^klog logn on the unbounded wildcard index from Theorem 1 yields a newk-bounded wildcard index with optimal query time O(F +j+occ) =O(m+j+occ). The index uses space

O(nG^k+n) =O(n(σ^klog logn)^k) =O(σ^k²nlog^klogn). This concludes the proof of Theorem 3.

7

V ^ARIABLE L ^ENGTH G ^APS

In this chapter we consider thestring indexing for patterns with variable length gaps problem.

We show that this problem can be solved, using the bounded and unbounded wildcard indexes described in the previous chapters, by only changing the search algorithms.

Section 7.1 introduces the problem, Section 7.2 describes previous work and Section 7.3 gives an overview of our solutions. In Section 7.4 we account for the changes necessary for supporting variable length gaps and analyse the modified search algorithm.

7.1 Introduction

The string indexing for patterns with variable length gaps problem is to build an index for a stringtthat can efficiently report the occurrences of a query patternpof the form

p=p₀∗{a₁, b₁}p₁ ∗{a₂, b₂} . . . ∗{a_j, b_j}p_j .

The query pattern consists of j+ 1 stringsp0, p1, . . . , pj ∈ Σ^∗ interleaved by j variable length gaps∗{a_i, b_i},i= 1, . . . , j, wherea_i andb_i are positive integers such thata_i ≤b_i. Intuitively, a variable length gap∗{a_i, bi}matches an arbitrary string over Σof length betweenai andbi, both inclusive.

Example Consider the stringtand patternpover the alphabetΣ ={a,b,c,d}.

t=acbccbacccddabdaabcdccbccdaa p=b∗{0,4}cc∗{3,5}d

The stringtcontains five occurrences of the query patternpas shown in Figure 7.1.

As shown by the example, different occurrences of the query patternpcan start or end at the same position int, and the same substring intcan contain multiple occurrences ofp.

Hence to completely characterize an occurrence ofpint, we need to report the positions of the individual subpatternsp₀, p₁, . . . , p_j for each full occurrence of the pattern. However, in the following we will restrict our attention to reporting the start and end position of

44 VARIABLE LENGTH GAPS

a c² b³ c⁴ c⁵ b⁶ a⁷ c⁸ c⁹ ¹⁰c ¹¹d ¹²d ¹³a ¹⁴b ¹⁵d ¹⁶a ¹⁷a ¹⁸b ¹⁹c ²⁰d ²¹c ²²c ²³b ²⁴c ²⁵c ²⁶d ²⁷a ²⁸a b c c ∗ ∗ ∗ ∗ ∗ d

b ∗ ∗ ∗ ∗ c c ∗ ∗ ∗ ∗ ∗ d b ∗ c c ∗ ∗ ∗ ∗ ∗ d b ∗ ∗ c c ∗ ∗ ∗ ∗ d

b ∗ ∗ c c ∗ ∗ ∗ d

Figure 7.1:The five occurrences ofp=b∗{0,4}cc∗{3,5}dint=acbccbacccddabdaabcdccbccdaa.

each occurrence ofpint. For the above example, we would thus report the pairs(3,11), (3,15),(6,15)and(18,26).

String indexing for patterns with variable length gaps has applications in information retrieval, data mining and computational biology [18, 19, 31, 30, 33]. In particular, the PROSITE data base [23, 7] uses patterns with variable length gaps to identify and classify protein sequences. The problem is a generalization of string indexing for patterns with wildcards, since a wildcard ∗ is equivalent to the variable length gap∗{1,1}. Variable length gaps are also known asbounded wildcards, as a variable length gap∗{a_i, b_i}can be regarded as a bounded sequence of wildcards.

String indexing for patterns with variable length gaps is equivalent to string indexing for patterns with wildcards, with the addition of allowingoptional wildcardsin the pattern.

An optional wildcard matches any character fromΣor the empty string, i.e., an optional wildcard is equivalent to the variable length gap∗{0,1}. Conversely, we may also consider a variable length gap∗{a_i, bi}asaiconsecutive wildcards followed bybi−aiconsecutive optional wildcards.

In the following we letA =Pj

i=1ai andB =Pj

i=1bi denote the sum of the lower and upper bounds on the variable length gaps in p, respectively. HenceA and B −A denote the number of normal and optional wildcards inp, respectively. A wildcard index with support for optional wildcards is called anoptional wildcard index. As for wildcard indexes, we distinguish between bounded and unbounded optional wildcard indexes.

A (k, o)-bounded optional wildcard index supports patterns containing A ≤ k normal wildcards and B −A ≤ o optional wildcards. An unbounded optional wildcard index supports patterns with no restriction on the number of normal and optional wildcards..

7.2 Previous Work

Lam et al. [26] introduced optional wildcards in the pattern and presented a variant of their solution for the string indexing for patterns with wildcards problem. The idea is to determine potential matches and verify complete matches using interval stabbing on the possible positions for the subpatterns. This leads to an unbounded optional wildcard index with query timeO(m+Bjmin0≤i≤jocc(p_i, t))and space usageO(n). As beforeocc(p_i, t) isΘ(n)in the worst case, so the query time isΘ(Bjn)in the worst case.

45 Besides the result by Lam et al., we have no knowledge of solutions for the string indexing for patterns with variable length gaps problem. Some results are known for thestring matching with variable length gaps problem, where the text string may not be preprocessed in advance. Bille et al. [6] recently presented the best known solution for the matching problem with query timeO((n+m) logj+occ)using spaceO(m+A). We refer to their paper for a review of the previous work on the string matching with variable length gaps problem.

7.3 Our Results

By modifying the search algorithm, we show that our solutions to string indexing for patterns with wildcards also support variable length gaps in the pattern, leading to the following new theorems.

Theorem 4 Let t be a string of length n from an alphabet of size σ. There is an unbounded optional wildcard index fortusingO(n)space. The index can report the occurrences of a pattern withmcharacters,Awildcards andB−Aoptional wildcards in timeO(m+ 2^B−Aσ^Blog logn+occ), whereA=Pj

i=1ai andB =Pj i=1bi.

Theorem 5 Lettbe a string of lengthnfrom an alphabet of sizeσ. For 2≤β < σ, there is a(k, o)-bounded optional wildcard index fortusingO nlog(n) log^k+o−1_β n space. The index can report the occurrences of a pattern withm characters,A≤k wildcards andB−A≤ooptional wildcards in timeO m+ 2^B−Aβ^Blog logn+occ

, whereA=Pj

i=1aiandB=Pj i=1bi.

Theorem 6 Let t be a string of length n from an alphabet of size σ. There is a (k, o)-bounded optional wildcard index fortusingO(σ^(k+o)²nlog^k+ologn)space. The index can report the occurrences of a pattern withmcharacters,A≤kwildcards and B−A ≤o optional wildcards in timeO(2^B−A(m+B) +occ), whereA = Pj

i=1ai

andB =Pj i=1bi.

These results completely generalize our previous solutions, since if the query pattern only contains variable length gaps of the form∗{1,1}, the problem reduces to string indexing for patterns with wildcards. In that caseA =B =j and we obtain exactly Theorem 1, Theorem 2 and Theorem 3.

Compared to the only known index for the problem by Lam et al. [26], Theorem 4 gives an index that matches theO(n)space usage, but improves the worst-case query time fromΘ(Bjn)toO m+ 2^B−Aσ^Blog logn+occ

, provided thatB ≤log_σ√ nj.

7.4 Supporting Variable Length Gaps

As remarked a variable length gap∗{a_i, bi}is equivalent toaiwildcards followed bybi−ai

optional wildcards. Hence to support variable length gaps, we only have to describe how the search algorithms must be modified to match an optional wildcard inp. Essentially the

46 VARIABLE LENGTH GAPS

difference between matching a normal wildcard and an optional one is how the search handles the location where the wildcard is matched. For a normal wildcard the location is removed from the set of active locations, but for an optional wildcard the location is kept active. A sequence ofjoptional wildcards is simulated as a sequence ofj wildcards, where all intermediate locations reached are kept active.

As an example, consider the unbounded wildcard index obtained by adding support for unrooted LCP queries on the suffix treeT(suff(t))for the indexed stringt. Algorithm 4 illustrates how to searchT(suff(t))for a patternpcontaining variable length gaps. The algorithm is similar to Algorithm 3, except when a variable length gap ∗{a_i+1, b_i+1} is matched in a location`⁰. In this case, the for-loop in the algorithm simulates the variable length gap asai+1 normal wildcards followed bybi+1−ai+1 optional wildcards. At the start of each iteration of the for-loopf wildcards have been simulated. For each wildcard simulated,B⁰holds the resulting set of locations inT(suff(t))reached, whileBcontains the active locations at the beginning of each iteration forf. When a new iteration forf starts,Bcontains the active locations afterf wildcards were simulated. Ifa_i+1≤f ≤b_i+1, the currently considered locationois kept active, since the next wildcard to simulate is optional.

Extending the algorithm to support searching in wildcard trees is just a matter of changing the way the search branches. Instead of branching on all outgoing edges, it must instead branch to the heavy children and follow the edge labeled∗for each wildcard simulated. We refer to Algorithm 3 for how the branching should be performed.

7.4.1 Reporting Occurrences

To report the substrings intwhere the query pattern occurs, we assume that each leaf`in T(suff(t))has been labeled by the start position,pos(`), of the suffix intit corresponds to. When Algorithm 4 terminates, each location `⁰ ∈ R corresponds to one or more substrings int, where the query patternpoccurs. To report the start and end position of these substrings, we traverse the subtree rooted at`⁰and identify the leaves`₀, `₁, . . . , `_r corresponding to suffixes ofthaving`⁰ as a prefix. The start and end positions of these substrings are then given by

pos(`0),pos(`0) +|`⁰|

, pos(`1),pos(`1) +|`⁰|

, . . . , pos(`r),pos(`r) +|`⁰| .

7.4.2 Analysis of the Modified Search

To analyse the running time of Algorithm 4, we will bound the maximum number of LCP queries performed during the search for the query pattern

p=p0∗{a₁, b1}p1 ∗{a₂, b2} . . . ∗{a_j, bj}pj . We defineAi =Pi

l=1ai andBi=Pi

l=1bi. The number of normal and optional wildcards preceding the subpatternpiinpisAiandBi−Ai, respectively. To bound the number of locations in which a LCP query for the subpatternp_i can start, we choose and promote l= 0,1, . . . , B_i−A_iof the preceding optional wildcards to normal wildcards and discard

47 Algorithm 4Searching the unbounded wildcard indexT(suff(t))to find the occurrences a patternp=p0∗{a₁, b1}p1∗{a₂, b2} . . . ∗{a_j, bj}pj using the LCP data structure.

For each location inRreport the corresponding substrings int

the rest. For a specific choice there are exactlyAi+lwildcards precedingpi, and thus the number of locations in which a LCP query forp_ican start is at mostβ^Aⁱ^+l. The termβis an upper bound on the branching factor of the search when consuming a wildcard. For the indexT(suff(t))searched by Algorithm 4, the branching factor isβ =σ, but indexes based on wildcard trees can have a smaller branching factor. There are ^Bⁱ^−A_l ⁱ

possibilities for choosing theloptional wildcards, so the number of locations in which a LCP query forp_i can start is at most Summing over thej+ 1subpatterns, we obtain the following bound for the total number of LCP queries performed during a search for the query patternp.

i=0

2^Bⁱ^−Aⁱβ^Bⁱ = O 2^B−Aβ^B .

Since the search algorithm performs LCP queries in timeO(log logn)and has to preprocess the pattern in time O(m), the total query time is O(m+ 2^B−Aβ^Blog logn+occ). This concludes the proof of Theorem 4 and Theorem 5.

48 VARIABLE LENGTH GAPS

To show Theorem 6, we apply a black-box reduction very similar to Lemma 16, leading to a (k, o)-bounded optional wildcard index, where k and o are parameters chosen in advance. This index consists of the following two optional wildcard indexes. A query is performed on one of these indexes depending on the lengthm+Bof the received query patternp.

1. The unbounded optional wildcard index given by Theorem 4. This index has query timeO(m+ 2^B−Aσ^Blog logn+occ)and uses spaceO(n).

2. The(k, o)-bounded optional wildcard index obtained by constructing the wildcard treeT₁^k+o(pref_G(suff(t)))without the LCP data structure, whereG=σ^k+olog logn.

From Equation 7.1 withβ = 1, the search for the subpatternp_ican start from at most 2^Bⁱ^−Aⁱ locations. Searching forpifrom each of these locations takes timeO(|p_i|+bi), since the LCP data structure is not used and the tree must be traversed one character at a time. Summing over thej+ 1subpatterns, we obtain the following query time for the index

i=0

2^Bⁱ^−Aⁱ(|p_i|+bi) +occ

= O 2^B−A(m+B) +occ .

The index is a wildcard tree and by the same argument as for Theorem 3, it can be stored using spaceO(nG^k+o).

In case the received query patternphas lengthm+B > Gwe query the first index. It follows that2^B−Aσ^Blog logn≤2^B−AG <2^B−A(m+B), so the query time isO(2^B−A(m+ B) +occ). Ifphas lengthm+B ≤Gall occurrences ofpintcan be found by querying the second index in timeO(2^B−A(m+B) +occ). The space of the index is

O(n+nG^k+o) =O(n(σ^k+olog logn)^k+o) =O(nσ^(k+o)²log^k+ologn). This concludes the proof of Theorem 6.

8

C ONCLUSION

In this thesis we considered the problem of indexing a string to report the occurrences of a query pattern containing wildcards. The ideal index has size linear in the length of the

In document String Indexing for Patterns with Wildcards (Sider 43-0)