Boyer-Moore

Is another efficient string pattern matching algorithm that is the standard benchmark from practical string search literature. Here is some insight.

Consider the naive approach that compares fix pattern P with every substring of T of length P|P|

P: word
T: There would have been a time
         word

As we can see w and o have been matched, but u and r have failed. Instead of moving one more alignment as the naive algorithm does, we can show that skipping 2 alignments can guarantee no fault in the general algorithm. This is because u does not exist anywhere in P, so it should be safe just moving past P after u.

Boyer-Moore uses this principles as one of its heuristics for good character matching. The algorithm is able to determine from its character comparisons whether or not it can skip other pointless comparisons.

Rules

  • The search string T is read from left to right, but the character comparisons are made from right to left.

P: word
T: There would have been a time
    <---- word
  • The Bad Character Rule: Upon mismatch of a character (1) if the character is not in the string P move the alignment pasted the mismatched character, or (2) if the character is otherwise in the string, move the alignment to the first character that aligns with the mismatched character (iterating right to left).

// assuming the bad character rule alone

step 1: initial alignment between T and P
       .
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P: CCTTTTGC

step 2: C mismatched with T, so we move P to align with 
        the first C
            .
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P:    CCTTTTGC

step 3: the comparisons begin again, and we find a 
        mismatch with the A and G, where A does not
        exist in P
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P:           CCTTTTGC
  • The Good Suffix Rule: let t=the substring that is currently matched with between T and P. (1) Skip to there is a match with between t and T, or (2) P moves past t.

Putting it Togeher

Both the bad character rule and good suffix rule are used to skip alignments in the search text, that the naive algorithm would have otherwise ignored. Using both rules, Boyer Moore optimizes the amount of skips by taking maximum of both rules.

Time Complexity: O(n)O(n) Space Complexity: O(P)O(\sum *|P|), where \sum defines the alphabet of T.

Implementation Details

In practice we need to be able to immediately know the length to skip in T. To do this, a preprocessing step must be done to produce a 2 dimensional lookup table given P and T, for both rule BC and GS. For example:

Last updated

Was this helpful?