This post is part of the “Algos in Plain English” post series.
The Knuth-Morris-Pratt (KMP) string search algorithm aims to make searching for a word `w` in a body of text `s` efficient. This algorithm runs in O(n) time, where n is the length of `s` (assuming `s` is at least as long as `w`).
The problem statement that KMP aims to solve is:
> Given a word `w` and a body of text `s`, find the first index in `s` where `w` is found. If `w` is not found, return `-1`.
The naive approach attempts to repeatedly find w in the body of text s and has a polynomial runtime. The naive approach can be written as follows:
```python
def search(w: str, s: str) -> int:
    m, n = len(w), len(s)
    # Go through the indices of s to start checking
    # for w at each start index
    for i in range(n - m + 1):
        found = True
        # Start going through w starting at index i
        for j in range(m):
            # If there's a mismatch, break out early, as
            # w was definitely not found at index i
            if w[j] != s[i + j]:
                found = False
                break
        if found:
            return i
    return -1
```
In the worst case, the above runs in O(m*(n-m+1)). A breakdown of this is:

- O(n-m+1) comes from the outer for-loop
- O(m) comes from the inner for-loop, where `w` is attempted to be located as a substring in `s`

In the worst case, the algorithm fully evaluates the inner loop to completion over and over again.
Examples of worst-case inputs are as follows:
```python
# Example 1: w is not found til the very end of s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXY"

# Example 2: w is repeatedly almost found, but not found in s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXX"
```
The naive approach evaluated above repeats work: after every mismatch, it performs a complete reset and starts matching `w` from scratch at the next index of `s`. This re-examination makes the naive approach inefficient, as it forces the addition of a factor of m to the time complexity.
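This repeated work can be made concrete by counting character comparisons. The sketch below instruments the naive search from above with a comparison counter (the `naive_search_counting` helper and its counter are additions for illustration, not part of the algorithm itself):

```python
from typing import Tuple

def naive_search_counting(w: str, s: str) -> Tuple[int, int]:
    """Naive search that also returns the number of character comparisons."""
    m, n = len(w), len(s)
    comparisons = 0
    for i in range(n - m + 1):
        found = True
        for j in range(m):
            comparisons += 1
            if w[j] != s[i + j]:
                found = False
                break
        if found:
            return i, comparisons
    return -1, comparisons

# Worst-case input: w is repeatedly almost found, but never found
idx, comparisons = naive_search_counting("XXXY", "X" * 18)
# idx is -1; comparisons is 60, which is exactly m * (n - m + 1) = 4 * 15
```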
KMP’s algorithm design is contingent on being able to answer the following question while traversing the body of text `s`, at the moment `w` is deemed not found:

> `w` is not found in `s` based on finding a mismatching character. How much of `w` have we found, based on where we found the mismatching character in `w` and `s`?
To make this more clear, let’s take the following example inputs:
```python
w = "YYYZ"
s = "YYYYZ"
```
If the naive algorithm were used, the window "YYYY" in s would be evaluated and found to not match w. To then find w in s, the naive algorithm would inevitably re-examine the suffix "YYY" of "YYYY". However, a smarter algorithm should be able to recognize that "YYYY"’s suffix "YYY" is a prefix of w, saving it the expense of re-examining those characters in s.
KMP aims to enable this time-savings.
KMP enables string search by performing two steps.

## Step 1: Build `T`

The goal of this step is the following:

> Build a data structure `T` where index `i` describes the length of the longest proper prefix of `w[0..i]` that’s also a suffix.
Of note: a “proper” prefix is just a prefix that excludes the whole word itself (e.g. the proper prefixes of "ABC" are "", "A", and "AB"; "ABC" is excluded).
`T` would look like the following for the string `"YYYY"`:

```python
T = [0, 1, 2, 3]
```
`T` would look like the following for the string `"ZZYZZXZZYZZ"`:

```python
T = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```
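These tables can be sanity-checked directly against the definition. The helper below (`brute_force_T` is a name made up for this sketch) tries every proper prefix of each window `w[0..i]`, from longest to shortest, and keeps the first one that is also a suffix; it is cubic-time and only meant as a check, not as the real pre-processing step:

```python
from typing import List

def brute_force_T(w: str) -> List[int]:
    T = []
    for i in range(len(w)):
        window = w[: i + 1]
        best = 0
        # k is a candidate prefix length; k = i + 1 would be the whole
        # window, which is excluded because the prefix must be "proper"
        for k in range(i, 0, -1):
            if window[:k] == window[-k:]:
                best = k
                break
        T.append(best)
    return T

# Matches the tables above:
# brute_force_T("YYYY")        -> [0, 1, 2, 3]
# brute_force_T("ZZYZZXZZYZZ") -> [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```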
## Step 2: Find `w` in `s`

The goal of this step is the following:

> Find `w` in `s`, using `T` as a way to prevent re-examination of characters in `w` and `s`.
The way this is accomplished is by matching characters in w and s in the same way this is performed in the naive algorithm. When a mismatch is found, T is used to determine how much re-examination can be “skipped”. This means we know exactly how much of the beginning of w we’ve found and we can continue scanning s from where we left off.
Below is sample code for KMP. It is a heavily-commented and tweaked version of some of the code found in the detailed GeeksforGeeks article on KMP.
```python
from typing import List

def compute_T(w: str) -> List[int]:
    # Handle edge-case where w is empty
    if not w:
        return []
    T = [0] * len(w)
    prefix_len = 0
    # Starts at 1 b/c T[0] = 0 always; it's impossible for a
    # proper prefix to be length > 0 if the string is only one
    # character
    i = 1
    while i < len(w):
        # If the current character in the prefix matches the
        # character at the end of the w[0..i] window, extend
        # the candidate prefix length and set T[i] = prefix_len
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        else:
            # "Back-up" in T to attempt to find a match and continue
            # advancing
            if prefix_len > 0:
                prefix_len = T[prefix_len - 1]
            # If prefix_len is 0, just set T[i] = 0 and advance i
            else:
                T[i] = 0
                i += 1
    return T

def kmp(w: str, s: str) -> int:
    # Handle edge-case where w is empty; an empty word trivially
    # matches at index 0
    if not w:
        return 0
    T = compute_T(w)
    # i = position in s being inspected
    # j = position in w being inspected
    i, j = 0, 0
    while i < len(s):
        # Advance the positions in w and s; same as in the
        # naive algorithm
        if w[j] == s[i]:
            i += 1
            j += 1
        # The characters in w and s do not match and some of
        # w has been found
        elif j > 0:
            # "Back-up" in T to grab the length of w that WAS matched
            # so far
            j = T[j - 1]
        else:
            # None of w has been found, so just advance where in s
            # is being considered for character matching
            i += 1
        # Do this last so that we can break out if the string was found;
        # this avoids having to do an extra check after the while-loop
        if j == len(w):
            # No +1 because i will be one ahead of where
            # the substring was found
            return i - j
    # No match was found in the while-loop, so return -1 as a
    # failure indicator
    return -1
```
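One way to gain confidence in an implementation like the one above is to cross-check it against Python's built-in `str.find` on many random inputs. The sketch below repeats compact versions of `compute_T` and `kmp` (mirroring the code above) so that it is self-contained. A two-character alphabet is chosen deliberately, since it produces lots of partial matches and exercises the back-up logic:

```python
import random
from typing import List

def compute_T(w: str) -> List[int]:
    # Pre-processing step, as described above
    T = [0] * len(w)
    prefix_len, i = 0, 1
    while i < len(w):
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        elif prefix_len > 0:
            prefix_len = T[prefix_len - 1]
        else:
            i += 1  # T[i] is already 0
    return T

def kmp(w: str, s: str) -> int:
    # Search step, as described above
    T = compute_T(w)
    i = j = 0
    while i < len(s):
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            j = T[j - 1]
        else:
            i += 1
        if j == len(w):
            return i - j
    return -1

# Cross-check against str.find on random two-character strings
random.seed(0)
for _ in range(1000):
    s = "".join(random.choice("XY") for _ in range(random.randint(1, 30)))
    w = "".join(random.choice("XY") for _ in range(random.randint(1, 5)))
    assert kmp(w, s) == s.find(w)
```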
## How is `T` being used?

The most nuanced part of both the pre-processing algorithm and KMP itself is the part where the algorithm “backs up” and grabs a value from `T` to attempt future matching. This is the main optimization KMP has over the naive approach, and it’s worth understanding.
Let’s take an example:
```python
s = "ABCDABYABCDABD"
w = "ABCDABD"
```
s will match w until the "Y" != "D" mismatch. What KMP will then do is “back-up” and set j = T[j-1]. Why? Because w[0..j-1] represents "ABCDAB" at this point. KMP is grabbing the length value from T that represents what we HAVE matched of w. The length will be 2 at this point, representing the proper prefix and suffix "AB".
KMP will then try to match "Y" to "C", resulting in a failed match:
```
"ABCDABYABCDABD"
    "ABCDABD"
       ^
```
This failure causes KMP to set the value of j to T[j-1], which is going to be 0. This leads to one more comparison, "Y" to "A", resulting in a failed match:
```
"ABCDABYABCDABD"
      "ABCDABD"
       ^
```
KMP then advances the window being considered, since "Y" isn’t going to match with any characters in w. Comparison resumes in this way:
```
"ABCDABYABCDABD"
       "ABCDABD"
        ^
```
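The trace above can be reproduced in code. The `kmp_traced` helper below (a name invented for this sketch) mirrors the search loop from earlier, but also records every back-up as a tuple of `(i, j before back-up, j after back-up)`; the `T` table for `"ABCDABD"` is hard-coded to keep the sketch short:

```python
from typing import List, Tuple

# T for w = "ABCDABD", as produced by the pre-processing step
# described earlier (hard-coded here to keep the sketch short)
T = [0, 0, 0, 0, 1, 2, 0]

def kmp_traced(w: str, s: str,
               T: List[int]) -> Tuple[int, List[Tuple[int, int, int]]]:
    """KMP search that also records every back-up as (i, old j, new j)."""
    i = j = 0
    backups = []
    while i < len(s):
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            backups.append((i, j, T[j - 1]))  # record the back-up
            j = T[j - 1]
        else:
            i += 1
        if j == len(w):
            return i - j, backups
    return -1, backups

idx, backups = kmp_traced("ABCDABD", "ABCDABYABCDABD", T)
# idx is 7; backups is [(6, 6, 2), (6, 2, 0)]: j backs up from 6 to 2
# (keeping the matched prefix "AB"), then from 2 to 0, matching the
# diagrams above
```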
The above logic happens the same way when building T itself: the algorithm uses prior entries of T to build future entries. If you think of the current working prefix as an expanding string being searched for inside w itself, the same reasoning applies to the T-building algorithm. Thinking through this nuance is a good intellectual exercise, and I encourage you to draw up some examples!
The overall KMP algorithm has a time-complexity of O(n), where n = len(s). This assumes s is larger than w. The biggest thing to understand for this is: how can the algorithm have this linear time-complexity if we’re “backing-up” through previous indices sometimes (e.g. the j = T[j - 1] step, when i is not advanced)?
The answer: in order for j to be “backed up”, it must first have grown through successful matches, and every successful match advances i through s. This means j can only “contract” at most O(n) times across the entire algorithm. This leaves us with an overall runtime of O(2n), which is just O(n). The same logic can be used to understand why the building of T also has linear time complexity.
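This bound can also be checked empirically. The sketch below counts how many times the main while-loop body runs (the counter and the `kmp_iterations` name are additions for illustration; `compute_T` mirrors the earlier code) and confirms the count never exceeds 2n on random inputs:

```python
import random
from typing import List

def compute_T(w: str) -> List[int]:
    # Pre-processing step, as described earlier
    T = [0] * len(w)
    prefix_len, i = 0, 1
    while i < len(w):
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        elif prefix_len > 0:
            prefix_len = T[prefix_len - 1]
        else:
            i += 1  # T[i] is already 0
    return T

def kmp_iterations(w: str, s: str) -> int:
    """Count the iterations of KMP's main loop when searching for w in s."""
    T = compute_T(w)
    i = j = 0
    iterations = 0
    while i < len(s):
        iterations += 1
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            j = T[j - 1]
        else:
            i += 1
        if j == len(w):
            break
    return iterations

# Every run stays within the 2n bound, even on adversarial-looking inputs
random.seed(0)
for _ in range(1000):
    s = "".join(random.choice("XY") for _ in range(random.randint(1, 50)))
    w = "".join(random.choice("XY") for _ in range(random.randint(1, 5)))
    assert kmp_iterations(w, s) <= 2 * len(s)
```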
Through a clever pre-computed data-structure T that stores the length of the shared proper prefix and suffix of w[0..i] at index i, KMP enables linear string search time-complexity. Understanding this algorithm is powerful because it builds up re-usable concepts involving prefixes and suffixes of substrings.
If you’d like more practice with this algorithm, check out the classic “strStr()” question.