This post is part of the *“Algos in Plain English”* post series.

The **Knuth-Morris-Pratt (KMP) string search algorithm** aims to make searching for a word `w` in a body of text `s` efficient. This algorithm runs in `O(n)` time, where `n` is the size of `s`.

The problem statement that KMP aims to solve is:

> Given a word `w` and a body of text `s`, find the first index in `s` where `w` is found. If `w` is not found, return `-1`.
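For intuition, Python's built-in `str.find` already follows this exact contract:

```python
# str.find returns the first index where the word occurs, or -1 if it
# is absent -- the same contract the KMP search below implements
print("hello world".find("world"))  # 6
print("hello world".find("xyz"))    # -1
```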

The naive approach attempts to repeatedly find `w` in the body of text `s` and has a polynomial runtime. The naive approach can be written as follows:

```
def search(w: str, s: str) -> int:
    m, n = len(w), len(s)
    # Go through the indices of s to start checking
    # for w at each start index
    for i in range(n - m + 1):
        found = True
        # Start going through w starting at index i
        for j in range(m):
            # If there's a mismatch, break out early, as
            # w was definitely not found at index i
            if w[j] != s[i + j]:
                found = False
                break
        if found:
            return i
    return -1
```

In the worst case, the above runs in `O(m*(n-m+1))`. A breakdown of this is:

- `O(n-m+1)` comes from the outer `for`-loop
- `O(m)` comes from the inner `for`-loop, where `w` is attempted to be located as a substring in `s`

*In the worst case, the algorithm fully evaluates the inner loop over and over again to completion.*

Examples of worst-case inputs are as follows:

```
# Example 1: w is not found until the very end of s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXY"
# Example 2: w is repeatedly almost found, but not found in s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXX"
```
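To make the polynomial cost concrete, here's the naive search from above instrumented to count character comparisons (a sketch for illustration; `search_count` is not part of the original code):

```python
def search_count(w: str, s: str) -> tuple:
    # The naive search from above, instrumented to count
    # every character comparison performed
    m, n = len(w), len(s)
    comparisons = 0
    for i in range(n - m + 1):
        found = True
        for j in range(m):
            comparisons += 1
            if w[j] != s[i + j]:
                found = False
                break
        if found:
            return i, comparisons
    return -1, comparisons

# Example 2 from above: every start index almost matches, so each of
# the n-m+1 = 15 starts costs the full m = 4 comparisons
print(search_count("XXXY", "XXXXXXXXXXXXXXXXXX"))  # (-1, 60)
```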

The naive approach evaluated above repeats work by performing a complete reset when attempting to find the word `w` in `s`. This *re-examination* makes the naive approach inefficient, as it forces the addition of a factor of `m` to the time complexity.

KMP’s algorithm design is contingent on being able to decide the following when traversing the body of text `s`, when `w` is deemed to be not found:

> `w` is not found in `s` based on finding a mismatching character. How much of `w` have we found, based on where we found the mismatching character in `w` and `s`?

To make this more clear, let’s take the following example inputs:

```
w = "YYYZ"
s = "YYYYZ"
```

If the naive algorithm were used, `"YYYY"` would be evaluated in `s` to not match `w`. To find `w` in `s`, the naive algorithm would inevitably re-examine the suffix `"YYY"` of `"YYYY"`. However, a *smarter* algorithm should be able to recognize that `"YYYY"`’s suffix `"YYY"` is a prefix of `w`. *This would save the algorithm the time-expenditure of re-examining characters in s.*

KMP aims to enable this time-savings.

KMP enables string search by performing 2 steps.

### Step 1: Building `T`

The goal of this step is the following:

> Build a data-structure `T` where index `i` describes the length of the longest proper prefix of `w[0..i]` that’s also a suffix.

Of note: a “proper” prefix is just a prefix that excludes the whole word (e.g. the proper prefixes of `"ABC"` are `""`, `"A"`, and `"AB"`, excluding `"ABC"` itself).

`T` would look like the following for the string `"YYYY"`:

```
T = [0, 1, 2, 3]
```

`T` would look like the following for the string `"ZZYZZXZZYZZ"`:

```
T = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```
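As a sanity check on these tables, `T` can also be computed by brute force, directly from the definition (this helper, `compute_T_brute`, is just for illustration and is not part of KMP itself):

```python
from typing import List

def compute_T_brute(w: str) -> List[int]:
    # For each window w[0..i], try every proper prefix length k
    # (longest first) and keep the first one that is also a suffix
    T = []
    for i in range(len(w)):
        window = w[:i + 1]
        best = 0
        for k in range(len(window) - 1, 0, -1):
            if window[:k] == window[-k:]:
                best = k
                break
        T.append(best)
    return T

print(compute_T_brute("YYYY"))         # [0, 1, 2, 3]
print(compute_T_brute("ZZYZZXZZYZZ"))  # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```

This quadratic helper is far too slow for real use, but it is handy for cross-checking a linear-time `T`-building implementation.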

### Step 2: Finding `w` in `s`

The goal of this step is the following:

> Find `w` in `s`, using `T` as a way to prevent re-examination of characters in `w` and `s`.

The way this is accomplished is by matching characters in `w` and `s` in the same way as in the naive algorithm. When a mismatch is found, `T` is used to determine how much re-examination can be “skipped”. *This means we know exactly how much of the beginning of w we’ve found, and we can continue scanning s from where we left off.*

Below is sample code for KMP. This is a heavily-commented and tweaked version of some of the code found in the detailed GeeksforGeeks article on KMP.

```
from typing import List

def compute_T(w: str) -> List[int]:
    # Handle edge-case where w is empty
    if not w:
        return []
    T = [0] * len(w)
    prefix_len = 0
    # Starts at 1 b/c T[0] = 0 always; it's impossible for a
    # proper prefix to have length > 0 if the string is only one
    # character
    i = 1
    while i < len(w):
        # If the current character in the prefix matches the
        # character at the end of the w[0..i] window, extend
        # the candidate prefix length and set T[i] = prefix_len
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        elif prefix_len > 0:
            # "Back-up" in T to attempt to find a match and continue
            # advancing
            prefix_len = T[prefix_len - 1]
        else:
            # If prefix_len is 0, just set T[i] = 0 and advance i
            T[i] = 0
            i += 1
    return T

def kmp(w: str, s: str) -> int:
    # Handle edge-case where w is empty; the empty string is
    # trivially found at index 0
    if not w:
        return 0
    T = compute_T(w)
    # i = position in s being inspected
    # j = position in w being inspected
    i, j = 0, 0
    while i < len(s):
        # Advance the positions in w and s; same as in the
        # naive algorithm
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            # The characters in w and s do not match and some of
            # w has been found: "back-up" in T to grab the length
            # of w that WAS matched so far
            j = T[j - 1]
        else:
            # None of w has been found, so just advance where in s
            # is being considered for character matching
            i += 1
        # Do this last so that we can break out if the string was found;
        # this avoids having to do an extra check after the while-loop
        if j == len(w):
            # No +1 because i will be one ahead of where
            # the substring was found
            return i - j
    # No match was found in the while-loop, so return -1 as a
    # failure indicator
    return -1
```

### How is `T` being used?

The most nuanced part of both the pre-processing algorithm and KMP itself is the part where the algorithm “backs-up” and grabs a value from `T` to attempt future matching. This is the main optimization KMP has over the naive approach, and it’s worth understanding.

Let’s take an example:

```
s = "ABCDABYABCDABD"
w = "ABCDABD"
```

`s` will match `w` until the `"Y" != "D"` mismatch. What KMP will then do is “back-up” and set `j = T[j-1]`. Why? Because `w[0..j-1]` represents `"ABCDAB"` at this point. KMP is grabbing the length value from `T` that represents what we HAVE matched of `w`. The length will be `2` at this point, representing the proper prefix and suffix `"AB"`.

KMP will then try to match `"Y"` to `"C"`, resulting in a failed match:

```
"ABCDABYABCDABD"
    "ABCDABD"
       ^
```

This failure causes KMP to set the value of `j` to `T[j-1]`, which is going to be `0`. This leads to one more comparison, `"Y"` to `"A"`, resulting in a failed match:

```
"ABCDABYABCDABD"
      "ABCDABD"
       ^
```

KMP then advances the window being considered, since `"Y"` isn’t going to match with any characters in `w`. Comparison resumes in this way:

```
"ABCDABYABCDABD"
       "ABCDABD"
        ^
```
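These back-ups can be observed directly. Below is a small standalone sketch of the search loop that records each `(old j, new j)` back-up for this example, with `T` for `"ABCDABD"` hard-coded to the values `compute_T` would produce:

```python
w = "ABCDABD"
s = "ABCDABYABCDABD"
T = [0, 0, 0, 0, 1, 2, 0]  # prefix table for "ABCDABD"

i, j = 0, 0
backups = []  # each entry is (old j, new j) on a mismatch back-up
while i < len(s):
    if w[j] == s[i]:
        i += 1
        j += 1
    elif j > 0:
        backups.append((j, T[j - 1]))
        j = T[j - 1]
    else:
        i += 1
    if j == len(w):
        break

print(backups)  # [(6, 2), (2, 0)] -- the two failed matches above
print(i - j)    # 7 -- where "ABCDABD" starts in s
```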

**The above logic happens the same way when building T itself: the algorithm uses prior entries of T to build future entries.** If we consider the current, working prefix as an expanding string that the algorithm attempts to find in `w`, the above reasoning applies in the same way to the `T`-building algorithm. Thinking through this nuance is a good intellectual exercise, and I encourage you to draw up some examples!

### Time complexity

The overall KMP algorithm has a time-complexity of `O(n)`, where `n = len(s)`. This assumes `s` is larger than `w`. The biggest thing to understand here is: how can the algorithm have this linear time-complexity if we’re sometimes “backing-up” through previous indices (e.g. the `j = T[j - 1]` step, when `i` is not advanced)?

The answer: in order to “expand” `j` to some size, we need to make progress via matching (a.k.a. working through `s`). This means we can only “contract” `j` at most `O(n)` times across the entire algorithm. This leaves us with an overall runtime of `O(2n)`, which is just `O(n)`. **The same logic can be used to understand why the building of T also has linear time complexity.**
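To see the `O(2n)` bound concretely, here's a condensed, self-contained copy of the algorithm with the search loop's iterations counted (a sketch for illustration; `kmp_iterations` is not part of the original code):

```python
def kmp_iterations(w: str, s: str) -> int:
    # Condensed compute_T from above
    T, plen, i = [0] * len(w), 0, 1
    while i < len(w):
        if w[plen] == w[i]:
            plen += 1
            T[i] = plen
            i += 1
        elif plen > 0:
            plen = T[plen - 1]
        else:
            i += 1
    # Search loop, counting every iteration: matches, back-ups,
    # and plain advances of i
    iterations, i, j = 0, 0, 0
    while i < len(s):
        iterations += 1
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            j = T[j - 1]
        else:
            i += 1
        if j == len(w):
            break
    return iterations

s = "ABCDABYABCDABD"
print(kmp_iterations("ABCDABD", s), 2 * len(s))  # 16 28
```

For this input the loop runs 16 times against a `2n = 28` ceiling: 13 matching iterations, 2 back-ups, and 1 plain advance of `i`.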

Through a clever pre-computed data-structure `T` that stores, at index `i`, the length of the shared proper prefix and suffix of `w[0..i]`, KMP enables linear string-search time-complexity. Understanding this algorithm is powerful because it builds up re-usable concepts involving prefixes and suffixes of substrings.

- Both the KMP and naive string search algorithms written in this article were tested by being run through Leetcode’s “Implement
`strStr()`

” question. - GeeksforGeeks’s article on KMP and Wikipedia’s article on KMP were both instrumental in enabling me to understand the algorithm myself. These are great supplemental reads if you want more texture!