• Ingen resultater fundet

The BM methods

In document View of Sharing of Computations (Sider 163-167)

Originally devised by Boyer and Moore [BM77], numerous variants have emerged – some of which are described in e.g. [KMP77] and [PS88]. The basic idea is to do the matching backwards wrt. p: with p = tree (which will be our working example), the initial configuration will be

...x...

^ TREE

If x is an e, we move the pointer one to the left and attempt to match the first e of tree, etc.The various algorithms differ in the way they treat mismatches; the original approach [BM77] is to maintain two tables, δ1 and δ2. δ1 has an entry for each symbol in the alphabet (the size of which will be denoted Σ) which contains the rightmost occurrence (if any) of the symbol in p. δ2 has an entry for each position in p containing information about the rightmost occurrence of the corresponding suffix3 (if any) elsewhere in p2 is thus roughly of size |p|).

We shall consider the following 4 variants of the BM method, ordered after increasing degree of sophistication (the naming used is in no way conventional):

1.The naive Boyer-Moore, to be called BMna.This is a simplification of the one given in [BM77], exploiting δ1 only.

2.The original Boyer-Moore, to be called BMor.This is the one presented in

[BM77], i.e. δ1 as well as δ2 is exploited.

3.The standard Boyer-Moore, to be called BMst.This is the algo-rithm hinted at

[BM77, p.771, below] and the one derived in [PS88]; it works by

“combining” δ1 and δ2 into a single two-dimensional table.

4.Theoptimal Boyer-Moore, to be called BMop.This is the algorithm suggested by Knuth

[p.346][KMP77]; here no information is discarded (hence a larger data structure than δ12 is needed).

3Moreover, this occurrence must be preceded by a different symbol.

We will illustrate the behavior of these algorithms by giving examples where one algorithm behaves in a suboptimal fashion, whereas the “next one” in the ordering behaves optimally.

7.3.1 BMna vs. BMor

Let the situation be .TEE...

^ TREE

i.e. the two e’s have been found to match, but then a mismatch is de-tected.As the symbol currently scanned in the subject string is a t, and as δ1 records that t occurs in position 0 in p (but not elsewhere), BMna will shift the pattern string one position to the right yielding the configuration

.TEE...

^ TREE

However, the fact that the symbols to the right of the mismatch in the subject string are known to be ee actually justifies shifting the pattern string four positions to the right, as done by BMor.

7.3.2 BMor vs. BMst

Let the situation be ..TE..

^ TREE

i.e. the last e has been successfully matched.

1.From δ1 we can infer that it is sound to shift the pattern string two to the right, as t (the symbol scanned in the subject string) occurs in position 0 in p (but not elsewhere).

2.From δ2, which only exploits that the next symbol in the subject string is an e and the current symbol is not an e (but does not exploit that the current symbol is a t), we can infer nothing more than it is sound to shift the pattern string one to the right.

Hence BMor will move the pattern string max(2,1) = 2 positions to the right.This is very conservative, since from knowing that the symbol scanned is a t and the symbol to its right is an e it is possible to shift the pattern string four positions to the right, as done by BMst.

7.3.3 BMst vs. BMop

Let the situation be MOORRE....

^ TREE

By exploiting δ1 (alone) we can shift the pattern string two to the right, and after e has been successfully matched the situation is

MOORRE....

^ TREE

Now BMst is able to move the pattern string one to the right only, since it knows that then r and the first e will match.BMop does better than that, since it remembers that the fourth symbol of the subject string is an r – hence it is possible to shift the pattern string four to the right.

[BM77, p.764] illustrates the behavior of BMor by means of an ex-ample, the pattern string being at-that and the subject string s being WHICH-FINALLY-HALTS.--AT-THAT-POINT

BMor on that example makes 14 references to s.It will be a useful exercise to verify that BMop makes 11 references only, BMst makes 12 and BMna 16.

7.3.4Discussion

Several remarks can be made about the merits of the various Boyer-Moore algorithms:

It will not always be the case that a “more powerful” method is faster, since it may be useful to make a small shift initially in order to make larger shifts later on.As an example of this, consider (as in [KMP77, p.343]) the following situation:

ABABABABABABABABABABABA...

^ AAAAAAACB

δ1 enables a double shift, whereas δ2 enables a single shift only.

However, after this single shift the situation will soon be ABABABABABABABABABABABA...

^ AAAAAAACB

and now δ2 enables a maximal shift.

In [KMP77, p.343] it is shown that BMor makes at most 6|s| ref-erences to s (the constant 6 probably being much too large).The linearity is solely [BM77, p.767] due to the use4 of δ2.If only δ1 is used (i.e. BMna is considered), Θ(|p||s|) references to s may be made (if e.g. p is of form abbbbbb. . . and a large initial segment of s contains b’s only).However, BMna will suffice for most practical purposes, especially if the alphabet is large.

In [BM77] a theoretical analysis of BMor is given, and the theo-retical performance is compared with the actual performance.For large alphabets the theoretical analysis is completely accurate, but for the binary alphabet actual performance is significantly worse than the one predicted by the theoretical model.As explained by the authors [BM77, p.770], this is because the model does not ac-count for the fact that some matches are guaranteed to occur.For instance, when we from the configuration

...BAA...

^ AAABAAAAA

shift the pattern string three to the right the possibility that a mismatch occurs in position 3,4 or 5 is zero – as aaabaaaaa has been aligned in such a way that its substring baa matches the subject string – whereas it in the simplified theoretical model is

4As pointed out in [BM77, p.771] it is necessary in order to get linearity that δ2 also records theprecedingsymbol, cf.footnote 3.

non-zero.For small alphabets, the length of such aligned substrings can be considerable.

BMst is typically only slightly faster than BMor, but uses somewhat more space: for each position in the pattern string, a table with potentially Σ entries has to be created.However, in practice the space required will be O(|p|), since only a few of these entries will differ from the default action, allowing a compact representation of the tables.

The merits of BMop (versus BMst) are briefly discussed in [KMP77, p.346]. It is trivial that BMop makes at most |s| references to s (it never looks at the same position twice, as no information is forgotten).However, only for small alphabets we may expect BMop to be significantly faster than BMst.On the other hand, as the alphabet grows smaller the table of BMop may grow really large, even though it is doubtful whether 2|p| really is the tightest upper bound (for a further discussion of this issue, see [Cho90]).If the alphabet is large, we may expect BMop to use space O(|p|2).

In document View of Sharing of Computations (Sider 163-167)