Joe Antonakakis

Code @ Lumos 🪄 | Ex-Meta

KMP (Knuth-Morris-Pratt) | Algos in Plain English

Posted at — May 28, 2022

This post is part of the “Algos in Plain English” post series.

The Knuth-Morris-Pratt (KMP) string search algorithm aims to make searching for a word w in a body of text s efficient. The algorithm runs in O(n + m), where n is the size of s and m is the size of w.

Problem Statement

The problem statement that KMP aims to solve is:

Given a word w and a body of text s, find the first index in s where w is found. If w is not found, return -1.

The naive approach attempts to match w at every candidate start index in s and has a polynomial worst-case runtime. The naive approach can be written as follows:

def search(w: str, s: str) -> int:
    m, n = len(w), len(s)
    # Go through the indices of s to start checking
    # for w at each start index
    for i in range(n - m + 1):
        found = True
        # Start going through w starting at index i
        for j in range(m):
            # If there's a mismatch, break out early, as
            # w was definitely not found at index i
            if w[j] != s[i + j]:
                found = False
                break
        if found:
            return i
    return -1

In the worst case, the above runs in O(m*(n-m+1)): the outer loop runs n - m + 1 times, and for each start index the inner loop can run all the way (or nearly all the way) to completion, costing m character comparisons.

Examples of worst-case inputs are as follows:

# Example 1: w is not found until the very end of s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXY"
# Example 2: w is repeatedly almost found, but not found in s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXX"
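To see the quadratic behavior concretely, here is a small instrumented variant of the naive search (the comparison counter is added purely for illustration and is not part of the algorithm):

```python
def naive_search_count(w: str, s: str) -> int:
    """Count the character comparisons made by the naive search."""
    m, n = len(w), len(s)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            # Mismatch: abandon this start index entirely
            if w[j] != s[i + j]:
                break
    return comparisons


# Example 2 from above: w is repeatedly almost found, but never found
print(naive_search_count("XXXY", "X" * 18))  # 60
```

Every start index pays for nearly a full scan of w, so the total work is proportional to m*(n-m+1) rather than to n.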

KMP Intuition

The naive approach evaluated above repeats work by performing a complete reset when attempting to find the word w in s. This re-examination makes the naive approach inefficient, as it forces the addition of a factor of m to the time complexity.

KMP’s design hinges on being able to answer the following question while traversing the body of text s, at the moment a mismatching character shows that w has not been found:

Given where the mismatching character occurred in w and s, how much of the beginning of w have we already matched?

To make this more clear, let’s take the following example inputs:

w = "YYYZ"
s = "YYYYZ"

If the naive algorithm were used, it would evaluate "YYYY" in s and find that it does not match w. To then find w in s, the naive algorithm would inevitably re-examine the suffix "YYY" of "YYYY". However, a smarter algorithm should be able to recognize that "YYYY"’s suffix "YYY" is a prefix of w, saving it the expense of re-examining characters in s.

KMP aims to enable this time-savings.

KMP Algorithm

KMP enables string search by performing 2 steps.

Step 1: Pre-Processed T

The goal of this step is the following:

Build a data-structure T where index i describes the length of the longest proper prefix of w[0..i] that’s also a suffix.

Of note: a “proper” prefix is a prefix that excludes the whole word (e.g. the proper prefixes of "ABC" are "", "A", and "AB"; "ABC" itself is excluded).

T would look like the following for the string "YYYY":

T = [0, 1, 2, 3]

T would look like the following for the string "ZZYZZXZZYZZ":

T = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
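These values can be sanity-checked against the definition with a brute-force helper (written here purely for illustration; it is far too slow to replace the real pre-processing step):

```python
from typing import List


def longest_proper_prefix_suffix(x: str) -> int:
    # Try every proper prefix length, longest first
    for length in range(len(x) - 1, 0, -1):
        if x[:length] == x[-length:]:
            return length
    return 0


def brute_force_T(w: str) -> List[int]:
    # T[i] = length of the longest proper prefix of w[0..i] that's also a suffix
    return [longest_proper_prefix_suffix(w[: i + 1]) for i in range(len(w))]


print(brute_force_T("YYYY"))         # [0, 1, 2, 3]
print(brute_force_T("ZZYZZXZZYZZ"))  # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```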

Step 2: String Search w in s

The goal of this step is the following:

Find w in s, using T as a way to prevent re-examination of characters in w and s.

The way this is accomplished is by matching characters in w and s in the same way this is performed in the naive algorithm. When a mismatch is found, T is used to determine how much re-examination can be “skipped”. This means we know exactly how much of the beginning of w we’ve found and we can continue scanning s from where we left off.

KMP Code

Below is sample code for KMP. It is a heavily-commented and tweaked version of some of the code found in the detailed GeeksforGeeks article on KMP.

from typing import List


def compute_T(w: str) -> List[int]:
    # Handle edge-case where w is empty
    if not w:
        return []

    T = [0] * len(w)

    prefix_len = 0
    # Starts at 1 b/c T[0] = 0 always; it's impossible for a
    # proper prefix to be length > 0 if the string is only one
    # character
    i = 1
    while i < len(w):
        # If the current character in the prefix matches the
        # character at the end of the w[0..i] window, extend
        # the candidate prefix length and set T[i] = prefix_len
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        else:
            # "Back-up" in T to attempt to find a match and continue
            # advancing
            if prefix_len > 0:
                prefix_len = T[prefix_len - 1]
            # If prefix_len is 0, just set T[i] = 0 and advance i
            else:
                T[i] = 0
                i += 1

    return T


def kmp(w: str, s: str) -> int:
    # Handle edge-case where w is empty; the empty word is
    # trivially found at index 0
    if not w:
        return 0

    T = compute_T(w)

    # i = position in s being inspected
    # j = position in w being inspected
    i, j = 0, 0
    while i < len(s):
        # Advance the positions in w and s; same as in the
        # naive algorithm
        if w[j] == s[i]:
            i += 1
            j += 1
        # The characters in w and s do not match and some of
        # w has been found
        elif j > 0:
            # "Back-up" in T to grab the length of w that WAS matched
            # so far
            j = T[j - 1]
        else:
            # None of w has been found, so just advance where in s
            # is being considered for character matching
            i += 1

        # Do this last so that we can break out if the string was found;
        # this avoids having to do an extra check after the while-loop
        if j == len(w):
            # No +1 because i will be one ahead of where
            # the substring was found
            return i - j

    # No match was found in the while-loop, so return -1 as a
    # failure indicator
    return -1

Wait, how is T being used?

The most nuanced part of both the pre-processing algorithm and KMP itself is the part where the algorithm “backs-up” and grabs a value from T to attempt future matching. This is the main optimization KMP has over the naive approach, and it’s worth understanding.

Let’s take an example:

s = "ABCDABYABCDABD"
w = "ABCDABD"

s will match w until the "Y" != "D" mismatch. What KMP will then do is “back-up” and set j = T[j-1]. Why? Because w[0..j-1] represents "ABCDAB" at this point. KMP is grabbing the length value from T that represents what we HAVE matched of w. The length will be 2 at this point, representing the proper prefix and suffix "AB".
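For reference, T for this w can be computed by brute force straight from the definition (the one-liner below is just for illustration):

```python
w = "ABCDABD"
# For each i, the largest k <= i such that w[:k] is also a suffix of w[:i + 1]
T = [
    max((k for k in range(i, 0, -1) if w[:k] == w[i - k + 1 : i + 1]), default=0)
    for i in range(len(w))
]
print(T)  # [0, 0, 0, 0, 1, 2, 0]
```

At the mismatch, j = 6, so j becomes T[5] = 2: exactly the length of the "AB" that is still matched.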

KMP will then try to match "Y" to "C", resulting in a failed match:

"ABCDABYABCDABD"
    "ABCDABD"
       ^

This failure causes KMP to set the value of j to T[j-1], which is going to be 0. This leads to one more comparison, "Y" to "A", resulting in a failed match:

"ABCDABYABCDABD"
      "ABCDABD"
       ^

KMP then advances the window being considered, since "Y" isn’t going to match with any characters in w. Comparison resumes in this way:

"ABCDABYABCDABD"
       "ABCDABD"
        ^

The same logic applies to building T itself: the algorithm uses the entries of T computed so far to build future entries. If we consider the current working prefix as an expanding pattern being searched for within w itself, the reasoning above carries over directly to the T-building algorithm. Thinking through this nuance is a good intellectual exercise, and I encourage you to draw up some examples!

Runtime Analysis

The overall KMP algorithm has a time-complexity of O(n + m), where n = len(s) and m = len(w): O(m) to build T, plus O(n) for the search itself. The biggest thing to understand here is: how can the search have linear time-complexity if we’re sometimes “backing-up” through previous indices (e.g. the j = T[j - 1] step, when i is not advanced)?

The answer: in order to “expand” j, we must first make progress via matching (i.e. advancing through s). Since j grows at most once per advance of i, it can only be “contracted” at most O(n) times across the entire algorithm. This leaves an overall runtime of O(2n), which is just O(n). The same logic explains why building T also has linear time complexity.
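This argument can be checked empirically with an instrumented sketch (the step counter is added purely for illustration, and prefix_table is a condensed standalone copy of compute_T so the snippet runs on its own):

```python
from typing import List, Tuple


def prefix_table(w: str) -> List[int]:
    # Condensed copy of compute_T from above
    T = [0] * len(w)
    prefix_len, i = 0, 1
    while i < len(w):
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        elif prefix_len > 0:
            prefix_len = T[prefix_len - 1]
        else:
            T[i] = 0
            i += 1
    return T


def kmp_with_steps(w: str, s: str) -> Tuple[int, int]:
    """Return (match index or -1, number of search-loop iterations)."""
    T = prefix_table(w)
    i = j = steps = 0
    while i < len(s):
        steps += 1
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            j = T[j - 1]
        else:
            i += 1
        if j == len(w):
            return i - j, steps
    return -1, steps


# Lots of near-misses, yet the loop never runs more than 2 * len(s) times
s = "AB" * 500 + "ABC"
idx, steps = kmp_with_steps("ABC", s)
print(idx, steps <= 2 * len(s))  # 1000 True
```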

Conclusion

Through a clever pre-computed data-structure T that stores the length of the shared proper prefix and suffix of w[0..i] at index i, KMP enables linear string search time-complexity. Understanding this algorithm is powerful because it builds up re-usable concepts involving prefixes and suffixes of substrings.
