String Pattern Matching¶
Find all occurrences of a pattern in a text without regex, implementing the brute-force and Rabin-Karp approaches from scratch.
Problem¶
Given a text string and a pattern, return all starting indices where the pattern appears. No standard library pattern-matching allowed.
Input: text = "hey jude, dont dont be afraid", pattern = "dont"
Output: indices 10 and 15
Approach¶
Two strategies, each with different trade-offs:
-
Brute force: Slide a window of length
macross the text. At each position, compare the substring to the pattern. Simple but O(nm) worst case. -
Rabin-Karp: Compute a hash of the pattern, then compute a rolling hash for each window in the text. Only do a character-by-character comparison when hashes match. Expected O(n) with a good hash function.
Rabin-Karp is the basis for plagiarism detection tools and near-duplicate detection in large document corpora.
Implementation¶
def brute_force_search(text, pattern):
n, m = len(text), len(pattern)
results = []
for i in range(n - m + 1):
if text[i:i + m] == pattern:
results.append(i)
return results
def rabin_karp_search(text, pattern, base=256, mod=101):
n, m = len(text), len(pattern)
if m > n:
return []
h = pow(base, m - 1, mod) # base^(m-1) mod p
p_hash = 0
t_hash = 0
for i in range(m):
p_hash = (p_hash * base + ord(pattern[i])) % mod
t_hash = (t_hash * base + ord(text[i])) % mod
results = []
for i in range(n - m + 1):
if t_hash == p_hash and text[i:i + m] == pattern:
results.append(i)
if i + m < n:
t_hash = (t_hash - ord(text[i]) * h) * base + ord(text[i + m])
t_hash %= mod
return results
text = "hey jude, dont dont be afraid"
pattern = "dont"
assert brute_force_search(text, pattern) == [10, 15]
assert rabin_karp_search(text, pattern) == [10, 15]
Complexity¶
| Method | Time (avg) | Time (worst) | Space |
|---|---|---|---|
| Brute force | O(nm) | O(nm) | O(1) |
| Rabin-Karp | O(n) | O(nm) | O(1) |
where n = text length, m = pattern length.
Takeaway¶
Rabin-Karp's rolling-hash idea extends well beyond string search, it powers shingling in document deduplication and fingerprinting in content-addressable storage.