Skip to content

Top Phrases from a Large File

Find the most frequent phrases in a file too large to fit in memory, a systems-design-meets-algorithms problem common in log analysis and NLP corpus processing.

Problem

Given a 10 GB file with 50 pipe-delimited phrases per line, find the top 100,000 most frequent phrases. The file cannot be loaded into memory at once.

Approach

The strategy is external sort, or map-reduce in miniature:

  1. Chunk and count. Read the file in ~500 MB chunks. For each chunk, split lines on |, count phrase frequencies in a dictionary, and flush counts to a temporary file.
  2. Merge counts. Read all temp files, aggregating phrase counts into a global dictionary. If even the unique phrases don't fit in memory, use a k-way merge with a min-heap.
  3. Extract top-K. Use a heap to pull the top 100,000 entries from the merged counts.

This is the same pattern behind Hadoop's MapReduce, local aggregation followed by a global reduce.

Implementation

import heapq
from collections import defaultdict

def process_chunk(lines):
    count = defaultdict(int)
    for line in lines:
        for phrase in line.strip().split('|'):
            count[phrase.strip()] += 1
    return count

def write_temp_counts(count, temp_file):
    with open(temp_file, 'w') as f:
        for phrase, cnt in count.items():
            f.write(f"{phrase}|{cnt}\n")

def merge_counts(temp_files):
    global_count = defaultdict(int)
    for path in temp_files:
        with open(path, 'r') as f:
            for line in f:
                phrase, cnt = line.strip().rsplit('|', 1)
                global_count[phrase] += int(cnt)
    return global_count

def find_top_phrases(global_count, top_n):
    return heapq.nlargest(top_n, global_count.items(), key=lambda x: x[1])

# Usage sketch
chunk_size = 500 * 1024 * 1024  # 500 MB
temp_files = []

with open('large_file.txt', 'r') as f:
    while True:
        lines = f.readlines(chunk_size)
        if not lines:
            break
        count = process_chunk(lines)
        temp_file = f"temp_{len(temp_files)}.txt"
        write_temp_counts(count, temp_file)
        temp_files.append(temp_file)

global_count = merge_counts(temp_files)
top_phrases = find_top_phrases(global_count, 100_000)

Complexity

  • Time: O(N log K) where N is total phrases and K is 100,000 (heap extraction)
  • Space: O(U) where U is unique phrases in memory at merge time

Takeaway

This chunk-merge-topK pattern is everywhere in data engineering. Spark does it automatically, but understanding the mechanics helps when you need to debug partitioning and memory issues, which you will.


Back to Algorithms & Data Structures