String Pattern Matching¶

Find all occurrences of a pattern in a text without regex, implementing the brute-force and Rabin-Karp approaches from scratch.

Problem¶

Given a text string and a pattern, return all starting indices where the pattern appears. No standard library pattern-matching allowed.

Input: text = "hey jude, dont dont be afraid", pattern = "dont" Output: indices 10 and 15

Approach¶

Two strategies, each with different trade-offs:

Brute force: Slide a window of length m across the text. At each position, compare the substring to the pattern. Simple but O(nm) worst case.
Rabin-Karp: Compute a hash of the pattern, then compute a rolling hash for each window in the text. Only do a character-by-character comparison when hashes match. Expected O(n) with a good hash function.

Rabin-Karp is the basis for plagiarism detection tools and near-duplicate detection in large document corpora.

Implementation¶

def brute_force_search(text, pattern):
    n, m = len(text), len(pattern)
    results = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            results.append(i)
    return results


def rabin_karp_search(text, pattern, base=256, mod=101):
    n, m = len(text), len(pattern)
    if m > n:
        return []

    h = pow(base, m - 1, mod)  # base^(m-1) mod p
    p_hash = 0
    t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    results = []
    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:
            results.append(i)
        if i + m < n:
            t_hash = (t_hash - ord(text[i]) * h) * base + ord(text[i + m])
            t_hash %= mod
    return results


text = "hey jude, dont dont be afraid"
pattern = "dont"

assert brute_force_search(text, pattern) == [10, 15]
assert rabin_karp_search(text, pattern) == [10, 15]

Complexity¶

Method	Time (avg)	Time (worst)	Space
Brute force	O(nm)	O(nm)	O(1)
Rabin-Karp	O(n)	O(nm)	O(1)

where n = text length, m = pattern length.

Takeaway¶

Rabin-Karp's rolling-hash idea extends well beyond string search, it powers shingling in document deduplication and fingerprinting in content-addressable storage.

Back to Algorithms & Data Structures