Wildcard Trees

We introduce the(β, k)-wildcard tree, denotedT_β^k(C⁰), where 1≤ β < σ andk > 0 are chosen integer parameters. This data structure stores a setC⁰ of modified substrings of the

35 indexed texttin a compressed trie. The key property is that the search for a patternpwith at mostkwildcards branches to at mostβ locations inT_β^k(C⁰)when consuming a single wildcard ofp. In particular forβ = 1, the search forpnever branches and the search time becomes linear in the length ofp.

For a vertexv, we define thewildcard heightofvto be the number of wildcards on the path fromvto the root. Intuitively, given a wildcard tree that supportsiwildcards, support for an extra wildcard is added by joining a new tree to each vertexv having wildcard heightiwith an edge labeled∗. This new tree is searched if a wildcard is consumed inv.

More precisely,T_β^k(C⁰)is built recursively as follows.

Construction ofT_βⁱ(S): Produce a heavy(β−1)-tree decomposition ofT(S), then for each internal vertexv∈T(S), join the root ofT_βⁱ⁻¹(suff2(lightstrings(v)) tovby an edge labeled∗. LetT_β⁰(S) =T(S).

T_β^k−1(suff2(lightstrings(v)))

∗

∗ ∗

∗

∗ v

∗

β−1 lightstrings(v)

T_β⁰(C⁰)

T_β¹(C⁰)

T_β^k(C⁰)

Figure 5.2:Abstract illustration of the recursive construction of the wildcard treeT_β^k(C⁰). The final tree consists of compressed tries joined by edges labeled∗and organized intoklayers.

As illustrated by Figure 5.2, the final treeT_β^k(C⁰) can be regarded as a number of com-pressed tries joined by edges labeled∗. Each compressed trieT(S) stores at most|C⁰| selected suffixes of the strings inC⁰, sinceS is a subset ofsuffd(C⁰)for somed. We will use this observation in the following sections.

Since a leaf`in a compressed trieT(S)is obtained as the suffix of a stringx∈C⁰, we assume that`inherits the label ofx in case the strings inC⁰ are labeled. For example, whenC⁰ denotes the suffixes oft, we will label each suffix inC⁰with its start position in t. This immediately provides us with ank-unbounded wildcard index. Figure 5.3 shows some concrete examples of the construction ofT_β^k(C⁰)whenC⁰ is a set of labeled suffixes.

36 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

Figure 5.3:Illustrating the recursive construction ofT_β^k(C)forβ∈ {1,2,3},k= 2andC= suff(bananas$).

The recursion levels 0, 1, 2 in the construction are colored black, blue, red respectively. All edges inT₁²(C)are light, since the construction is based on a heavyα-tree decomposition with α=β−1 = 0. The leafs are labeled with the start position of their corresponding suffix int.

37 5.3 Wildcard Tree Index

Given a setC⁰of substrings oftand a patternp, we can identify the strings ofC⁰having a prefix matchingp by constructingT_β^k(C⁰). SearchingT_β^k(C⁰)is similar to the suffix tree search, except when consuming a wildcard character ofpin an explicit vertexv∈T_β^k(C⁰) with more thanβ children. In that case the search branches to the root of the wildcard tree joined tovand to the first location on theβ−1heavy edges ofv, effectively letting the wildcard match the first character on all edges fromv.

More precisely, the search forpis carried out as described in Algorithm 2.

Algorithm 2Searching a wildcard tree for a patternp=p0∗p1∗. . .∗pj. A ← {root(T_β^k(C⁰))}

R ← ∅

whileA 6=∅do

`←extract(A) if|`|=|p|then R ← R ∪ {`}

else

ifp[|`|+ 1] =∗then

A ← A ∪nextheavylocations(`) A ← A ∪ {nextlocation(`, p[|`|+ 1])}

Report descendant leaves of locations inR

The algorithm maintains a setAof active locations, which initially only contains the root of T_β^k(C⁰). A location `is repeatedly extracted fromA, and the search continues from

`by adding a (possibly empty) set of new locations to A. When matching a character cofpin a location`, the location reached by following the charactercfrom`, denoted nextlocation(`, c), is added toA. Additionally, when matching a wildcard, the set of all heavy child locations of`, denoted nextheavylocations(`), is added toA. The algorithm terminates with a set of locationsR, each corresponding to at least one occurrence ofp.

The strings inC⁰ havingpas a prefix are found by reporting the leaves below the locations inRin timeO(occ).

5.3.1 Time and Space Analysis

For each of thejwildcards inp, the size ofAis increased by a factor of at mostβ. Hence the search forpbranches to a total of at mostPj

i=0βⁱ =O(β^j)locations, each of which requires O(m) time, resulting in query time O(β^jm+occ). For β = 1 the query time becomesO(m+j+occ), since nextheavylocations(`) =∅for all`.

We will now show the space required to storeT_β^k(C⁰), resulting in Lemma 13.

Lemma 13 For any integer 1 ≤ β < σ, the wildcard tree T_β^k(C⁰) has query time O(β^jm+occ)for1< β < σandO(m+j+occ)forβ = 1. The wildcard tree stores

38 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

O(|C⁰|h^k) strings, whereh is an upper bound on the light height of all compressed triesT(S)satisfyingS ⊆suff_d(C⁰)for some integerd.

In the paper by Cole et al. [13] the authors give a proof by induction of the space needed to store a wildcard tree forβ = 2. See Appendix B for an explanation of their proof and a discussion of a potential problem in it. We give a more general proof of the space usage that works for any value ofβ.

Proof We prove that the total number of strings (leaves) inT_βⁱ(S), denoted|T_βⁱ(S)|, is at most|S|P_i

j=0h^j = O(|S|hⁱ). The proof is by induction on i. The base case i= 0holds, sinceT_β⁰(S) =T(S)contains|S|=|S|P0

j=0h^j strings. For the induction step, assume that|T_βⁱ(S)| ≤ |S|Pi

j=0h^j. LetS_v = suff₂(lightstrings(v))for a vertex v∈T(S). From the construction we have that the number of strings inT_βⁱ⁺¹(S)is the number of strings inT(S)plus the number of strings in the wildcard trees joined to the vertices ofT(S). That is,

The string setsS_v consist of suffixes of strings inS. Consider a stringx∈S, i.e., a leaf inT(S). The number of times a suffix ofxappears in a setSv is equal to the light depth ofxinT(S). From the construction,Sis a subset ofsuff_d(C⁰)for somed, and hencehis an upper bound on the maximum light depth ofT(S). This establishes that

Constructing the wildcard treeT_β^k(C), forC = suff(t), we obtain a wildcard index with the following properties.

Lemma 14 Lettbe a string of lengthnfrom an alphabet of sizeσ. For2 ≤β < σ there is ak-bounded wildcard index for t usingO nlog^k_βn

space. The index can report the occurrences of a pattern withmcharacters andj ≤k wildcards in time O β^jm+occ

Proof The query time follows from Lemma 13. Since T_β^k(C) is a compressed trie, and because each edge label is a substring oft, the space needed to storeT_β^k(C)is upper bounded by the number of strings it contains which by Lemma 13 isO(nh^k).

Sinceα=β−1, it follows from Lemma 12 thath= log_βnis an upper bound on the light height of all compressed triesT(S) satisfyingS ⊆ suff_d(C) for somed, since they contain at mostnvertices. Consequently, the space needed to store the index is O(nlog^k_βn).

39 5.4 Wildcard Tree Index Using the LCP Data Structure

In this section we prove Theorem 2 by showing how LCP queries can be used to speed up the query time of the wildcard tree index described in the previous section.

The wildcard index of Lemma 14 reduces the branching factor of the suffix tree search fromσtoβ, but still has the drawback that the search for a subpatternp_ifrom an active location ` ∈ T_β^k(C) takesO(|p_i|) time. This can be addressed by combining the index with the LCP data structure as in the paper by Cole et al. [13]. In that way, the search for a subpattern can be done in timeO(log logn). The index is obtained by modifying the construction ofT_βⁱ(S)such that eachT(S)is added to the LCP data structure prior to joining the (β, i−1)-wildcard trees to the vertices of T(S). For all T(S) except the finalT(S) =T_β⁰(S), support for unrooted LCP queries in timeO(log logn)is added using additionalO(|S|log|S|)space. For the finalT(S)we only need support for rooted queries since at this point all wildcards inphave been consumed. Upon receiving the query pattern p=p1∗p2∗. . .∗pj, eachpi is preprocessed in timeO(|p_i|)to support LCP queries for any suffix ofp_i.

The search for p in the modified index proceeds as described in Algorithm 3. The algorithm is similar to that for the normal wildcard tree index, except now LCP queries are used to search forp₀, p₁, . . . , p_j. After performing a LCP query forp_ifrom`, the search only continues from the new location`⁰ if the full subpatternpiwas matched. In that case, the search branches to the heavy children and the root of the wildcard tree joined to`⁰, since the next character in the search is a wildcard. If the last subpattern was matched, the resulting location is an occurrence ofpint, and all strings inChaving that location as an ancestor are reported.

Algorithm 3Searching a wildcard tree to find the occurrences a patternp=p0∗p1∗. . .∗pj

using the LCP data structure.

A ← {(0,root(T_β^k(C⁰)))}

R ← ∅

whileA 6=∅do (i, `)←extract(A)

`⁰ ← LCP(pi,trie(`), `) if|`⁰| − |`|=|p_i|then

ifi=jthen R ← R ∪ {`⁰} else

A ← A ∪ {(i+ 1, `⁰⁰)|`⁰⁰∈nextheavylocations(`)∨`⁰⁰=nextlocation(`⁰,∗)}

Report descendant leaves of locations inR

5.4.1 Time and Space Analysis

In the search for p a total of at most Pj

i=0βⁱ = O(β^j) LCP queries each taking time O(log logn)are performed. Preprocessingp0, p1, . . . , pj takesPj

i=0|p_i|=mtime, so the

40 A TIME-SPACE TRADE-OFF FORK-BOUNDED WILDCARD INDEXES

query time isO(m+β^jlog logn+occ). The space needed to store the index isO(nlog^k_βn) for T_β^k(C) plus the space needed to store the LCP data structure. Adding support for rooted LCP queries requires linear space in the total size of the compressed tries, which isO(nlog^k_βn). LetT(S₀), T(S₁), . . . , T(S_q)denote the compressed tries with support for unrooted LCP queries. Since eachSi contains at mostnstrings andPq

i=0|S_i|=|T_β^k−1(C)|, by Lemma 2, the additional space required to support unrooted LCP queries is

i=0

|S_i|log|S_i|

=O logn

i=0

|S_i|

logn|T_β^k−1(C⁰)|

=O nlog(n) log^k−1_β n ,

which is an upper bound on the total space required to store the wildcard index. This concludes the proof of Theorem 2.

Thek-bounded wildcard index described by Cole et al. [13] is obtained as a special case of Theorem 2.

Corollary 3 (Cole et al.) Let t be a string of length n from an alphabet of size σ.

There is a k-bounded wildcard index for tusing O(nlog^kn) space. The index can report the occurrences of a pattern withmcharacters andj ≤k wildcards in time O(m+ 2^jlog logn+occ).

6

A k-B ^OUNDED W ^ILDCARD I ^NDEX

WITH O PTIMAL Q UERY T IME

In this chapter we show Theorem 3. In Section 6.1 we first obtain ak-bounded wildcard index similar to Theorem 2 by using Lemma 13. We use this index to construct a black-box reduction for obtaining optimal time indexes. In Section 6.2 the black-box reduction is used on the wildcard index from Theorem 1 to obtain Theorem 3.

6.1 A Black-Box Reduction

Consider thek-bounded wildcard index which ensures a linear query time, obtained by creating the wildcard treeT₁^k(suff(t))fort. We can show that the space usage for the index depends on the height of the suffix tree fort.

Lemma 15 Let t be a string of lengthn from an alphabet of sizeσ. There is a k-bounded wildcard index fortusingO(nh^k)space, wherehis the height of the suffix tree fort. The index can report the occurrences of a pattern withmcharacters andj wildcards in timeO(m+j+occ).

Proof Sincesuff(t)is closed under the suffix operation, the height ofT(suff(t))is an upper bound on the height of all compressed triesT(S)satisfyingS ⊆suff_d(suff(t)) for some d. For β = 1, the light height of T(S) is equal to the height of T(S), so h= height(T(suff(t)))can be used as an upper bound of the light height in Lemma 13, and consequently the space needed to storeT₁^k(suff(t))isO(nh^k).

In the worst case the height of the suffix tree is close ton, but combining the index with another wildcard index yields a useful black-box reduction. The idea is to query the first index if the pattern is short, and the second index if the pattern is long.

Lemma 16 LetF ≥mand letGbe independent ofmandj. Given a wildcard index W with query time O(F +G+occ) and space usageO(|W|), there is ak-bounded wildcard indexW⁰ with query timeO(F+j+occ)and taking spaceO(nmin(G, h)^k+

|W|), wherehis the height of the suffix tree fort.

42 AK-BOUNDED WILDCARD INDEX WITH OPTIMAL QUERY TIME

Proof The wildcard indexW⁰consists ofW and a specialk-bounded wildcard index W⁰⁰ = T₁^k(pref_G(suff(t))), which is a wildcard tree with β = 1 over the set of all substrings oftof lengthG. Since the maximum string depth inW⁰⁰isG, this value can be used as an upper bound for the light height in Lemma 13. Consequently, the space required to storeW⁰⁰isO(nmin(G, h)^k)by using Lemma 15 ifG > h, wherehis the height of the suffix tree fort. See Figure 6.1 for an illustration of the construction of W⁰⁰ =T₁^k(pref_G(suff(t))).

A query onW⁰results in a query on eitherW orW⁰⁰. In caseF+j > G, we query W and the query time will beO(F +G+occ) =O(F+j+occ). In caseF +j≤G, we queryW⁰⁰ with query timeO(m+j+occ) = O(F +j+occ). In any case, the query time ofW⁰ isO(F+j+occ).

∗

∗ ∗

∗

∗ ∗

∗

k G

T₁^k(pref_G(C))

T(C)

Figure 6.1:Illustrating how the wildcard treeT₁^k(pref_G(C))forC= suff(t)is obtained by slicing off the top of the compressed trieT(C)with string depth Gand appending k additional layers of wildcard trees recursively. For reference, Figure 3.3(c) shows a concrete example of constructing T(pref_G(C))forG= 3.

6.2 Obtaining the Index

Applying Lemma 16 with F = m and G = σ^klog logn on the unbounded wildcard index from Theorem 1 yields a newk-bounded wildcard index with optimal query time O(F +j+occ) =O(m+j+occ). The index uses space

O(nG^k+n) =O(n(σ^klog logn)^k) =O(σ^k²nlog^klogn). This concludes the proof of Theorem 3.

7

V ^ARIABLE L ^ENGTH G ^APS

In this chapter we consider thestring indexing for patterns with variable length gaps problem.

We show that this problem can be solved, using the bounded and unbounded wildcard indexes described in the previous chapters, by only changing the search algorithms.

Section 7.1 introduces the problem, Section 7.2 describes previous work and Section 7.3 gives an overview of our solutions. In Section 7.4 we account for the changes necessary for supporting variable length gaps and analyse the modified search algorithm.

7.1 Introduction

The string indexing for patterns with variable length gaps problem is to build an index for a stringtthat can efficiently report the occurrences of a query patternpof the form

p=p₀∗{a₁, b₁}p₁ ∗{a₂, b₂} . . . ∗{a_j, b_j}p_j .

The query pattern consists of j+ 1 stringsp0, p1, . . . , pj ∈ Σ^∗ interleaved by j variable length gaps∗{a_i, b_i},i= 1, . . . , j, wherea_i andb_i are positive integers such thata_i ≤b_i. Intuitively, a variable length gap∗{a_i, bi}matches an arbitrary string over Σof length betweenai andbi, both inclusive.

Example Consider the stringtand patternpover the alphabetΣ ={a,b,c,d}.

t=acbccbacccddabdaabcdccbccdaa p=b∗{0,4}cc∗{3,5}d

The stringtcontains five occurrences of the query patternpas shown in Figure 7.1.

As shown by the example, different occurrences of the query patternpcan start or end at the same position int, and the same substring intcan contain multiple occurrences ofp.

Hence to completely characterize an occurrence ofpint, we need to report the positions of the individual subpatternsp₀, p₁, . . . , p_j for each full occurrence of the pattern. However, in the following we will restrict our attention to reporting the start and end position of

44 VARIABLE LENGTH GAPS

a c² b³ c⁴ c⁵ b⁶ a⁷ c⁸ c⁹ ¹⁰c ¹¹d ¹²d ¹³a ¹⁴b ¹⁵d ¹⁶a ¹⁷a ¹⁸b ¹⁹c ²⁰d ²¹c ²²c ²³b ²⁴c ²⁵c ²⁶d ²⁷a ²⁸a b c c ∗ ∗ ∗ ∗ ∗ d

b ∗ ∗ ∗ ∗ c c ∗ ∗ ∗ ∗ ∗ d b ∗ c c ∗ ∗ ∗ ∗ ∗ d b ∗ ∗ c c ∗ ∗ ∗ ∗ d

b ∗ ∗ c c ∗ ∗ ∗ d

Figure 7.1:The five occurrences ofp=b∗{0,4}cc∗{3,5}dint=acbccbacccddabdaabcdccbccdaa.

each occurrence ofpint. For the above example, we would thus report the pairs(3,11), (3,15),(6,15)and(18,26).

String indexing for patterns with variable length gaps has applications in information retrieval, data mining and computational biology [18, 19, 31, 30, 33]. In particular, the PROSITE data base [23, 7] uses patterns with variable length gaps to identify and classify protein sequences. The problem is a generalization of string indexing for patterns with wildcards, since a wildcard ∗ is equivalent to the variable length gap∗{1,1}. Variable length gaps are also known asbounded wildcards, as a variable length gap∗{a_i, b_i}can be regarded as a bounded sequence of wildcards.

String indexing for patterns with variable length gaps is equivalent to string indexing for patterns with wildcards, with the addition of allowingoptional wildcardsin the pattern.

An optional wildcard matches any character fromΣor the empty string, i.e., an optional wildcard is equivalent to the variable length gap∗{0,1}. Conversely, we may also consider a variable length gap∗{a_i, bi}asaiconsecutive wildcards followed bybi−aiconsecutive optional wildcards.

In the following we letA =Pj

i=1ai andB =Pj

i=1bi denote the sum of the lower and upper bounds on the variable length gaps in p, respectively. HenceA and B −A denote the number of normal and optional wildcards inp, respectively. A wildcard index with support for optional wildcards is called anoptional wildcard index. As for wildcard indexes, we distinguish between bounded and unbounded optional wildcard indexes.

A (k, o)-bounded optional wildcard index supports patterns containing A ≤ k normal wildcards and B −A ≤ o optional wildcards. An unbounded optional wildcard index supports patterns with no restriction on the number of normal and optional wildcards..

7.2 Previous Work

Lam et al. [26] introduced optional wildcards in the pattern and presented a variant of their solution for the string indexing for patterns with wildcards problem. The idea is to determine potential matches and verify complete matches using interval stabbing on the possible positions for the subpatterns. This leads to an unbounded optional wildcard index with query timeO(m+Bjmin0≤i≤jocc(p_i, t))and space usageO(n). As beforeocc(p_i, t) isΘ(n)in the worst case, so the query time isΘ(Bjn)in the worst case.

45 Besides the result by Lam et al., we have no knowledge of solutions for the string indexing for patterns with variable length gaps problem. Some results are known for thestring matching with variable length gaps problem, where the text string may not be preprocessed in advance. Bille et al. [6] recently presented the best known solution for the matching problem with query timeO((n+m) logj+occ)using spaceO(m+A). We refer to their paper for a review of the previous work on the string matching with variable length gaps problem.

7.3 Our Results

By modifying the search algorithm, we show that our solutions to string indexing for patterns with wildcards also support variable length gaps in the pattern, leading to the following new theorems.

Theorem 4 Let t be a string of length n from an alphabet of size σ. There is an unbounded optional wildcard index fortusingO(n)space. The index can report the occurrences of a pattern withmcharacters,Awildcards andB−Aoptional wildcards in timeO(m+ 2^B−Aσ^Blog logn+occ), whereA=Pj

i=1ai andB =Pj i=1bi.

Theorem 5 Lettbe a string of lengthnfrom an alphabet of sizeσ. For 2≤β < σ, there is a(k, o)-bounded optional wildcard index fortusingO nlog(n) log^k+o−1_β n space. The index can report the occurrences of a pattern withm characters,A≤k wildcards andB−A≤ooptional wildcards in timeO m+ 2^B−Aβ^Blog logn+occ

, whereA=Pj

i=1aiandB=Pj i=1bi.

Theorem 6 Let t be a string of length n from an alphabet of size σ. There is a (k, o)-bounded optional wildcard index fortusingO(σ^(k+o)²nlog^k+ologn)space. The index can report the occurrences of a pattern withmcharacters,A≤kwildcards and B−A ≤o optional wildcards in timeO(2^B−A(m+B) +occ), whereA = Pj

i=1ai

andB =Pj i=1bi.

These results completely generalize our previous solutions, since if the query pattern only contains variable length gaps of the form∗{1,1}, the problem reduces to string indexing for patterns with wildcards. In that caseA =B =j and we obtain exactly Theorem 1, Theorem 2 and Theorem 3.

Compared to the only known index for the problem by Lam et al. [26], Theorem 4 gives an index that matches theO(n)space usage, but improves the worst-case query time fromΘ(Bjn)toO m+ 2^B−Aσ^Blog logn+occ

, provided thatB ≤log_σ√ nj.

7.4 Supporting Variable Length Gaps

As remarked a variable length gap∗{a_i, bi}is equivalent toaiwildcards followed bybi−ai

optional wildcards. Hence to support variable length gaps, we only have to describe how the search algorithms must be modified to match an optional wildcard inp. Essentially the

46 VARIABLE LENGTH GAPS

difference between matching a normal wildcard and an optional one is how the search handles the location where the wildcard is matched. For a normal wildcard the location is removed from the set of active locations, but for an optional wildcard the location is kept active. A sequence ofjoptional wildcards is simulated as a sequence ofj wildcards, where all intermediate locations reached are kept active.

As an example, consider the unbounded wildcard index obtained by adding support for unrooted LCP queries on the suffix treeT(suff(t))for the indexed stringt. Algorithm 4 illustrates how to searchT(suff(t))for a patternpcontaining variable length gaps. The algorithm is similar to Algorithm 3, except when a variable length gap ∗{a_i+1, b_i+1} is matched in a location`⁰. In this case, the for-loop in the algorithm simulates the variable length gap asai+1 normal wildcards followed bybi+1−ai+1 optional wildcards. At the start of each iteration of the for-loopf wildcards have been simulated. For each wildcard simulated,B⁰holds the resulting set of locations inT(suff(t))reached, whileBcontains the active locations at the beginning of each iteration forf. When a new iteration forf starts,Bcontains the active locations afterf wildcards were simulated. Ifa_i+1≤f ≤b_i+1, the currently considered locationois kept active, since the next wildcard to simulate is optional.

Extending the algorithm to support searching in wildcard trees is just a matter of changing the way the search branches. Instead of branching on all outgoing edges, it must instead branch to the heavy children and follow the edge labeled∗for each wildcard simulated. We refer to Algorithm 3 for how the branching should be performed.

In document String Indexing for Patterns with Wildcards (Sider 40-0)

6

A k-B OUNDED W ILDCARD I NDEX

WITH O PTIMAL Q UERY T IME

7

V ARIABLE L ENGTH G APS

A k-B ^OUNDED W ^ILDCARD I ^NDEX

V ^ARIABLE L ^ENGTH G ^APS