Site Search:

K-M-P

Back>

K-M-P is a clever improvement for brute force. When we found mismatch in brute force, we increase i by 1, and set j = 0. This reset throws away many information. We have already known the characters in the haystack till the position of mismatch at i + j, why throwing these away and backup to i + 1 instead? We should turn these wastes to profits.

In theory, before we read in the haystack character at i + j + 1, which we haven't known yet, we should already be able to compute the searching of needle in haystack from i + 1 to i + j, because we already seen them in the last iterate. So we could just increase i to i + j + 1 instead of increase i to i + 1. The j index will be tricky, because it depends on why haystack from i to i + j didn't match needle, it also depends on what's the new character is at i + j + 1. However, the possible combinations are finite, so in theory, given any new haystack character at i + j + 1, we should be able to give the corresponding index of j in needle.

Knuth, Morris and Pratt invested time and effort to go through all the possible combinations and created a [R]x[M] matrix, so that giving any character at i + j + 1 at haystack, the index of j in needle can be returned. This matrix is like a machine, it read in one character at a time from haystack, find the j index of needle from the matrix, continue until j = M (match) or i = N (no match).

This algorithm improves on the brutal force by saving the waste information for profit. The best scenario is still when the needle is at be beginning of the haystack, we found it in M haystack and matrix read. The worst scenario is still when the needle is at the end of the haystack or not in the haystack at all, we found it with N read . However, build the machine costs N + M string character access, and RM space to store the matrix. At average, the cost of finding the needle in the haystack is still 1.1N. It also got an important feature -- the i pointer never decrease -- which is important if we handle a stream data, where we process one character at a time, don't store any portion of the haystack string local, which the brute force must do.

KMP.java