The Knuth-Morris-Pratt (KMP) string-matching algorithm can perform the search in Θ(m + n) operations. Knuth, Morris and Pratt discovered the first linear-time string-matching algorithm by analyzing the naive algorithm.
|Published (Last):||6 June 2011|
In computer science, the Knuth-Morris-Pratt string-searching algorithm (or KMP algorithm) searches for occurrences of a “word” W within a main “text string” S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
This was the first linear-time algorithm for string matching.
The three published it jointly in 1977.
A string-matching algorithm wants to find the starting index m in string S that matches the search word W. The most straightforward algorithm is to look for a character match at successive values of the index m, the position in the string being searched, i.e. S[m]. If the index m reaches the end of the string then there is no match, in which case the search is said to “fail”. At each position m the algorithm first checks for equality of the first character in the word being searched, i.e. whether S[m] = W[0]. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index i, i.e. whether S[m + i] = W[i]. If all successive characters match in W at position m, then a match is found at that position in the search string.
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26² (1 in 676). So if the characters are random, then the expected complexity of searching string S of length k is on the order of k comparisons, or O(k).
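The estimate can be made concrete under the same idealized assumption of independent, uniformly random letters from a 26-letter alphabet. A trial at a given position makes its (j+1)-th comparison only if the first j characters matched, which happens with probability 26^(-j), so the expected number of comparisons per trial position is bounded by a geometric series:

```latex
\mathbb{E}[\text{comparisons per position}]
  \;\le\; \sum_{j=0}^{\infty} 26^{-j}
  \;=\; \frac{1}{1 - \tfrac{1}{26}}
  \;=\; \frac{26}{25}
  \;=\; 1.04
```

Summed over at most k trial positions, this gives about 1.04·k expected comparisons, which is consistent with O(k).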
The expected performance is very good. If S is 1 billion characters and W is 1000 characters, then the string search should complete after about one billion character comparisons.
That expected performance is not guaranteed. If the strings are not random, then checking a trial m may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string S consists of 1 billion characters that are all A, and that the word W is 999 A characters terminating in a final B character. The simple string-matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 billion positions, for 1 trillion character comparisons. If the length of W is n, then the worst-case performance is O(k·n).
The KMP algorithm has a better worst-case performance than the straightforward algorithm.
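The straightforward algorithm and its worst case can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the name `naive_search` and the comparison counter are assumptions introduced here, and the worst-case input is scaled down from the billion-character example so that it runs instantly.

```python
def naive_search(S, W):
    """Return (index of first match or -1, number of character comparisons)."""
    comparisons = 0
    for m in range(len(S) - len(W) + 1):    # each trial position m
        i = 0
        while i < len(W):
            comparisons += 1
            if S[m + i] != W[i]:
                break                        # trial rejected at word index i
            i += 1
        else:
            return m, comparisons            # every character of W matched
    return -1, comparisons


# Scaled-down worst case: 1000 A's searched for 9 A's followed by a B.
S = "A" * 1000
W = "A" * 9 + "B"
print(naive_search(S, W))                    # (-1, 9910)
```

Each of the 991 trial positions costs 10 comparisons before the final B is rejected, which is the k·n behaviour described above.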
KMP spends a little time precomputing a table (on the order of the size of W, O(n)), and then it uses that table to do an efficient search of the string in O(k). The difference is that KMP makes use of previous match information that the straightforward algorithm does not.
In the example above, KMP matched 999 A characters before discovering a mismatch at the 1000th character (position 999). Advancing the trial match position m by one throws away the first A, so KMP knows there are 998 A characters that match W and does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in the precomputed table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase the trial position (variable m) and where it will resume testing (variable i).
Knuth-Morris-Pratt string matching
At any given time, the algorithm is in a state determined by two integers: m, the position within S where the prospective match for W begins, and i, the index of the currently considered character in W.
The algorithm compares successive characters of W to “parallel” characters of S, moving from one to the next by incrementing i if they match. Suppose the first mismatch occurs at S[3]. Rather than beginning to search again at S[1], we note that no ‘A’ occurs between positions 1 and 2 in S; hence, having checked all those characters previously (and knowing they matched the corresponding characters in W), there is no chance of finding the beginning of a match there.
However, just prior to the end of the current partial match, there was that substring “AB” that could be the beginning of a new match, so the algorithm must take this into consideration. Thus the algorithm not only omits previously matched characters of S (the “AB”), but also previously matched characters of W (the prefix “AB”). If the search at the new position then fails immediately, the mismatch, as in the first trial, causes the algorithm to return to the beginning of W and begin searching at the mismatched character position of S.
The above example contains all the elements of the algorithm.
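The walkthrough can be replayed with a short Python sketch. The example strings are an assumption, since the text above does not state them: W = "ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE" are consistent with every fragment mentioned. The helper names `longest_border` and `trace` are hypothetical. The sketch recomputes the reusable prefix by brute force at every mismatch, so it shows the state transitions only, not the linear-time method.

```python
def longest_border(P):
    """Length of the longest proper prefix of P that is also a suffix of P."""
    for length in range(len(P) - 1, 0, -1):
        if P[:length] == P[-length:]:
            return length
    return 0


def trace(S, W):
    """Return (list of (m, i) states after each mismatch, first match or -1).

    Assumes W is non-empty.
    """
    m, i, states = 0, 0, []
    while m + i < len(S):
        if S[m + i] == W[i]:
            i += 1
            if i == len(W):
                return states, m
        elif i == 0:
            m += 1                       # mismatch on the first character
            states.append((m, i))
        else:
            b = longest_border(W[:i])    # reusable part of the partial match
            m, i = m + i - b, b
            states.append((m, i))
    return states, -1


# Assumed example strings (not given in the text above).
states, found = trace("ABC ABCDAB ABCDABCDABDE", "ABCDABD")
print(states)   # [(3, 0), (4, 0), (8, 2), (10, 0), (11, 0), (15, 2)]
print(found)    # 15
```

The states (8, 2) and (15, 2) are the two places where the already matched “AB” is reused instead of being compared again.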
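For the moment, we assume the existence of a “partial match” table T (described below), which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that if a match starting at S[m] fails when comparing S[m + i] to W[i], then the next possible match starts at index m + i – T[i] in S.
This has two implications: first, T[0] = –1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match begins at index m + i – T[i], we need not recheck the first T[i] characters after that, so we continue searching from W[T[i]]. The following is a sample pseudocode implementation of the KMP search algorithm.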
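A minimal Python sketch of such a search, using the state variables m and i from the text, might look as follows. It takes the table T as a parameter, returns -1 when nothing is found (a choice made here for illustration), and assumes a non-empty W. The table in the usage example is the one that the prefix-length definition of T would give for the assumed word "ABCDABD"; it is written out by hand rather than computed.

```python
def kmp_search(S, W, T):
    """Return the index of the first occurrence of W in S, or -1 if none.

    T is the partial match table for W, with T[0] == -1 (assumed given).
    """
    m = 0  # start of the current candidate match in S
    i = 0  # index of the current character in W
    while m + i < len(S):
        if W[i] == S[m + i]:
            if i == len(W) - 1:
                return m                  # whole word matched
            i += 1
        elif T[i] > -1:
            m, i = m + i - T[i], T[i]     # shift and keep T[i] matched characters
        else:
            m, i = m + 1, 0               # mismatch at W[0]: try the next position
    return -1


T = [-1, 0, 0, 0, 0, 1, 2]                # assumed table for "ABCDABD"
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", T))  # 15
```

Note that a table storing additional shortcuts (for example extra -1 entries) would need a different mismatch branch; this sketch only pairs with the plain prefix-length table.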
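Assuming the prior existence of the table T, the matching portion of the Knuth-Morris-Pratt algorithm has complexity O(n), where n is the length of S and the O is big-O notation.
Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the while loop. To bound the number of iterations, observe that after a mismatch the next possible match must begin at a higher index than m, so that T[i] < i. This fact implies that the loop can execute at most 2n times, since at each iteration it executes one of the two branches in the loop. The first branch increases i and does not change m, so the index m + i of the currently examined character of S is increased. The second branch adds i – T[i] to m, and as we have seen, this is always a positive number. Thus the location m of the beginning of the current potential match is increased, while m + i is left unchanged. The loop ends when m + i = n, and m ≤ m + i, so each branch can be taken at most n times. Thus the loop executes at most 2n times, showing that the time complexity of the search algorithm is O(n).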
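The same counting argument can be written compactly with a potential function. The symbol Φ is introduced here only for illustration; m, i, T and n are as in the text, and the separate case for T[i] = –1 assumes the loop handles a mismatch at W[0] by moving to the next position.

```latex
\begin{aligned}
\Phi &= m + (m + i), \qquad 0 \le \Phi < 2n \ \text{while the loop runs, since } m \le m + i < n \\[4pt]
\text{match:}\quad & (m,\, i) \mapsto (m,\, i + 1)
  &&\Rightarrow\ \Delta\Phi = 1 \\
\text{mismatch, } T[i] \ge 0:\quad & (m,\, i) \mapsto (m + i - T[i],\, T[i])
  &&\Rightarrow\ \Delta\Phi = i - T[i] \ge 1 \\
\text{mismatch, } T[i] = -1\ (i = 0):\quad & (m,\, i) \mapsto (m + 1,\, 0)
  &&\Rightarrow\ \Delta\Phi = 2
\end{aligned}
```

Since Φ starts at 0, grows by at least 1 in every iteration, and stays below 2n while the loop runs, the loop body executes at most 2n times.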
Here is another way to think about the runtime. Let us say we begin to match W and S at positions i and p. If W exists as a substring of S at p, then every character of W matches the corresponding character of S starting at p. On a match, i is increased by 1. On a mismatch, the text position stays where it is while i is rolled back to T[i]. The maximum roll-back of i is bounded by i; that is to say, for any failure, we can only roll back as much as we have progressed up to the failure. It follows that the runtime is at most 2n.
The goal of the table is to allow the algorithm not to match any character of S more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we “pre-search” the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
We want to be able to look up, for each position in W, the length of the longest possible initial segment of W leading up to (but not including) that position, other than the full segment starting at W[0] that just failed to match; this is how far we have to backtrack in finding the next match.
Hence T[i] is exactly the length of the longest possible proper initial segment of W which is also a segment of the substring ending at W[i – 1].
We use the convention that the empty string has length 0. We will see that the construction of the table follows much the same pattern as the main search, and is efficient for similar reasons.
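To find T[1], we must discover a proper suffix of “A” which is also a prefix of pattern W. But “A” has no non-empty proper suffix, so T[1] = 0. To find T[2], we note that the substring “AB” has the proper suffix “B”. However, “B” is not a prefix of the pattern W, so T[2] = 0.
Continuing to T[3], we first check the proper suffix of length 1, and as in the previous case it fails. Should we also check longer suffixes? No, we now note that there is a shortcut to checking all suffixes: a suffix of length j + 1 that is also a prefix can exist at this stage only if one of length j was found at the previous stage. Hence we need not consider suffixes of length 2, and T[3] = 0.
We pass to the subsequent W[4], ‘A’. The same logic shows that the longest substring we need consider has length 1, and as in the previous case it fails since “D” is not a prefix of W.
The next character, W[5], which is ‘B’, is handled by similar reasoning. The example above illustrates the general technique for assembling the table with a minimum of fuss.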
The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.
The complexity of the table algorithm is O(k), where k is the length of W. As except for some initialization all the work is done in the while loop, it is sufficient to show that this loop executes in O(k) time, which will be done by simultaneously examining the quantities pos and pos – cnd.
In the first branch, pos – cnd is preserved, as both pos and cnd are incremented simultaneously, but naturally, pos is increased. In the second branch, cnd is replaced by T[cnd], which we saw above is always strictly less than cnd, thus increasing pos – cnd. Since pos ≥ pos – cnd, at each stage either pos or a lower bound for pos increases; because the algorithm terminates once pos = k, it must terminate after at most 2k iterations of the loop. Therefore, the complexity of the table algorithm is O(k).
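A table construction with the variables pos and cnd discussed above might be sketched in Python as follows. This is an illustration under stated assumptions: it implements the plain definition of T (T[0] = -1, and otherwise the length of the longest proper prefix of W that is also a suffix of the part ending just before the current position), it assumes a non-empty W, and it has a third branch that only advances pos when no candidate is left. Variants that store further shortcuts in T are not shown.

```python
def kmp_table(W):
    """Build the partial match table T for a non-empty word W.

    T[0] is -1; for i >= 1, T[i] is the length of the longest proper
    prefix of W[:i] that is also a suffix of W[:i].
    """
    T = [0] * len(W)
    T[0] = -1
    pos = 2   # next table entry to compute (T[1] is always 0)
    cnd = 0   # length of the current candidate prefix
    while pos < len(W):
        if W[pos - 1] == W[cnd]:
            cnd += 1              # candidate extends; pos - cnd is preserved
            T[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = T[cnd]          # fall back to a shorter candidate
        else:
            T[pos] = 0            # no candidate left; only pos advances
            pos += 1
    return T


print(kmp_table("ABCDABD"))       # [-1, 0, 0, 0, 0, 1, 2]
```

In every iteration either pos or pos - cnd grows, which is the quantity pair used in the complexity argument above.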
These complexities are the same, no matter how many repetitive patterns are in W or S.
A real-time version of KMP can be implemented using a separate failure function table for each character in the alphabet. With these tables, only a constant number of operations is executed between the processing of consecutive text characters, which satisfies the real-time computing restriction.
The Booth algorithm uses a modified version of the KMP preprocessing function to find the lexicographically minimal string rotation. The failure function is progressively calculated as the string is rotated.
I learned in 2012 that Yuri Matiyasevich had anticipated the linear-time pattern matching and pattern preprocessing algorithms of this paper, in the special case of a binary alphabet, already in 1969. He presented them as constructions for a Turing machine with a two-dimensional working memory.