Boyer-Moore
Is another efficient string pattern matching algorithm that is the standard benchmark from practical string search literature. Here is some insight.
Consider the naive approach that compares fix pattern P with every substring of T of length
P: word
T: There would have been a time
word
As we can see w
and o
have been matched, but u
and r
have failed. Instead of moving one more alignment as the naive algorithm does, we can show that skipping 2 alignments can guarantee no fault in the general algorithm. This is because u
does not exist anywhere in P
, so it should be safe just moving past P
after u
.
Boyer-Moore uses this principles as one of its heuristics for good character matching. The algorithm is able to determine from its character comparisons whether or not it can skip other pointless comparisons.
Rules
The search string
T
is read from left to right, but the character comparisons are made from right to left.
P: word
T: There would have been a time
<---- word
The Bad Character Rule: Upon mismatch of a character (1) if the character is not in the string
P
move the alignment pasted the mismatched character, or (2) if the character is otherwise in the string, move the alignment to the first character that aligns with the mismatched character (iterating right to left).
// assuming the bad character rule alone
step 1: initial alignment between T and P
.
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P: CCTTTTGC
step 2: C mismatched with T, so we move P to align with
the first C
.
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P: CCTTTTGC
step 3: the comparisons begin again, and we find a
mismatch with the A and G, where A does not
exist in P
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P: CCTTTTGC
The Good Suffix Rule: let
t
=the substring that is currently matched with betweenT
andP
. (1) Skip to there is a match with betweent
andT
, or (2)P
moves pastt
.
// assuming the bad character rule alone
step 1: initial alignment between T and P
.<->
T: CGTGCCTACTTACTTACTTACTTACGCGAA
P: CTTACTTAC
step 2: there was a mismatch between C and T, and
we were able to move 3 alignments, to the
next occurrence of `TAC`.
.<----->
T: CGTGCCTACTTACTTACTTACTTACGCGAA
P: CTTACTTAC
step 3: Next we find that suffix of P CTTAC matched
with suffix of T CCTAC, so that is where the
next alignment got shifted.
<------->
T: CGTGCCTACTTACTTACTTACTTACGCGAA
P: CTTACTTAC
Putting it Togeher
Both the bad character rule and good suffix rule are used to skip alignments in the search text, that the naive algorithm would have otherwise ignored. Using both rules, Boyer Moore optimizes the amount of skips by taking maximum of both rules.
// assuming both rules into play
step 1: initial alignment between T and P
.
T: GTTATAGCTGATCGCCGGCGTAGCGGCGAA
P: GTAGCGGCG
step 2: The good suffix rule did not apply because
the mismatch length was 1. So we move to
the first occurrence of T.
BC = 6
GS = 0
.<->
T: GTTATAGCTGATCGCCGGCGTAGCGGCGAA
P: GTAGCGGCG
step 3: Now the GCR @ (.) can move 1 space (0 alignments),
while the GSR can move to the next GCG with 2 alignments.
BC = 0
GS = 2
.<---->
T: GTTATAGCTGATCGCGGCGTAGCGGCGAA
P: GTAGCGGCG
step 4: BCR says to move P past C, as C does not exist in
current P (GTA). GSR says that we can move the entire
string past first G, since the substring does not match
any of P.
BC = 2
GS = 7
<------->
T: GTTATAGCTGATCGCGGCGTAGCGGCGAA
P: GTAGCGGCG
Time Complexity: Space Complexity: , where defines the alphabet of T
.
Implementation Details
In practice we need to be able to immediately know the length to skip in T
. To do this, a preprocessing step must be done to produce a 2 dimensional lookup table given P
and T
, for both rule BC and GS. For example:
Lookup Table for the Bad Character Rule.
P: TCGC
T: AATCAATAGC
\sum() = ACGT
P = TCGC
Look up Table:
T C G C
A 0 1 0 3
C 0 - 2 -
G 0 1 - 0
T - 0 1 2
Last updated
Was this helpful?