• Ingen resultater fundet

Substring Range Reporting

N/A
N/A
Info
Hent
Protected

Academic year: 2022

Del "Substring Range Reporting"

Copied!
22
0
0

Indlæser.... (se fuldtekst nu)

Hele teksten

(1)

Substring Range Reporting

Philip Bille Inge Li Gørtz

(2)

Outline

• Problem definition

• Results

• 2D range reporting approach

• New solution

• Applications

• Position-restricted substring searching

• Indexing substrings with intervals

• Indexing substrings with gaps

• Remarks and Open Problems

(3)

Classic String Indexing

• Preprocess string S of length n.

• Report(P): Given pattern P of length m, report all occurrences of P in S.

• Suffix tree + perfect hashing: O(n) space and O(m + occ) query time.

S = senselessness

(4)

Substring Range Reporting

• Preprocess S. Each position in S has an integer label in [0,u].

• Report(P, a,b): Report all occurrences of P whose startpos label is in [a,b]

S = senselessness

1 7 3 2 115 3 2 9 0 5 6

(5)

2DRR and SRR

0

u a b

P

• Build suffix tree T for S + 2DRR data structure over leaves of T and [0,u].

• Suffix i represented by (lex-order(i), label(i)) in 2DRR.

• Report(P, a, b): Search for P in T. Do 2DRR query with interval of leaves and [a,b].

T

(6)

Results

• Also many succinct versions.

• In all our applications u = O(n) => space is O(nlogεn)

• Not based on 2DRR approach: Any 2DRR data structure with O(n logO(1)n) space must use Ω(log log u) time [PT2006].

O(nlog

ε

n) O(m + log log u + occ) [MN2006]

O(nu

ε

) O(m + occ) [CIKR2008]

O(nlog

ε

n + nlog log u) O(m + occ) This paper

Space Time Reference

(7)

The Data Structure

• Split suffix tree into top-tree and bottom-trees at depth log log u

• Two cases depending on if search for P ends in a top-tree or a bottom-tree

log log u

(8)

The Data Structure

• Case 1: m > log log u (search ends in a bottom-tree)

• Use 2DRR as before

• Space: O(nlogεn)

• Time: O(m + log log u + occ) = O(m + occ)

log log u

(9)

The Data Structure

• Case 2: m ≤ log log u (search ends in top-tree)

• For each node v in top-tree store all descendant leaf labels in 1DRR. With [ABR2001] each 1DRR use linear space and optimal query time. Search with [a,b] in 1DRR.

• Time: O(m + occ)

• Space: O(Number of descendant leaves of all nodes in top-tree) = O(nloglogu)

log log u

(10)

The Data Structure

• Combining both cases:

• Total space: O(nlogεn + nlog log u)

• Total time: O(m + occ)

• Basic idea related to filtering search [Cha1986].

(11)

Outline

• Problem definition

• Results

• 2D range reporting approach

• New solution

• Applications

• Position-restricted substring searching

• Indexing substrings with intervals

• Indexing substrings with gaps

• Remarks and Open Problems

(12)

Position-restricted Substring Searching

• Preprocess string S.

• Report(P, a, b): Report occurrences of P that start at position in [a,b]

• Special case of SRR where label(i) = i.

• Note u = n.

S = senselessness

(13)

Results

O(nlog

ε

n) O(m + log log n + occ) [MN2006]

O(n

1+ε

) O(m + occ) [CIKR2008]

O(nlog

ε

n) O(m + occ) This paper

Space Time Reference

(14)

Indexing Substrings with Intervals

• Preprocess string S and set of intervals π in S.

• Report(P, a, b): Report occurrences of P that start at position in [a,b] and within π.

• Reduction to SRR: label(i) = i if i is covered by π and 0 otherwise.

S = senselessness

(15)

Results

O(nlog

2

n) O(m + log log n + occ) [CIKRW2010]

O(n

1+ε

) O(m + occ) [CIKR2008]

O(nlog

ε

n) O(m + occ) This paper

Space Time Reference

(16)

Indexing Substrings with Gaps

• Preprocess string S with parameter d = size of gaps

• Report(P1, P2): Report occurrences of P1d P2

S = senselessness

(17)

Results

O(nlog

ε

n) O(m + log log n + occ) [IL2009]

O(nlog

ε

n) O(m + occ) This paper

Space Time Reference

(18)

2DRR and Gaps

• Build suffix trees TS and TSR for S and SR + 2DRR data structure over pairs of leaves. (TS and TSR stores suffixes and prefixes of S)

• Point in 2DRR if distance between prefix and suffix is d.

• Report(P1, P2): Search for P2 in TS and P1R in TSR. Do 2DRR query with intervals of leaves.

TS

TSR

P2

P1R

d

S[1,i] S[j,n]

(19)

SRR and Gaps

• Replace TS with SRR for S.

• Label of suffix i = y-coordinate from 2DRR data structure.

• Report(P1, P2): Search for P1R in TSR to get range [a,b]. Do SRR query for P2

with [a,b].

TS

TSR

P2

P1R

d

S[1,i] S[j,n]

(20)

SRR and Gaps

• Total space: SRR for S + TSR = O(nlogεn) + O(n) = O(nlogεn)

• Total time: O(m1 + m2 + occ) = O(m + occ)

(21)

Remarks and Open Problems

• Remarks

• Basic idea for SRR extends to the numerous (succinct) variants of 2DRR.

• Open problems

• What other problems does this idea apply to?

• Are the ideas practical?

(22)

The end

Referencer

RELATEREDE DOKUMENTER

Here we study the nonlinear optical response of monocrystalline gold flakes, revealing a polarization dependence in second-harmonic generation from the { 111 } surface that is

A WPPCL file, which specifies the contents of the data model for a given (imaginary) wind power plant, has been created. The WPPCL file is used by the system to initialize the

Keywords: mathematical modelling, 3D imaging, time-of-flight, range images, robot localization, pose estimation, range image segmentation, non-linear dimen- sion reduction, Local

For wildcard indexes having a query time sublinear in the length of the indexed text, it remains an open problem if there is an index where neither the size nor the query time

In an earlier work (Kampp 1959) Denmark has been divided into 7 agro-geographical regions (fig. 1), principally on the basis of parish statistics of the total yield

By means of an open, self-critical and negotiation-based approach, the exhibition phase focused on uncovering and making Casco available with a view to using the space

The relationship between the aggregate CSP and downside idiosyncratic volatility was estimated using the following multiple linear regression model with firm-fixed and

We cannot say this for sure, but we believe that there are indications that our model is at least as good, if not better than doing an optimization where the