Top Phrases from a Large File¶
Find the most frequent phrases in a file too large to fit in memory, a systems-design-meets-algorithms problem common in log analysis and NLP corpus processing.
Problem¶
Given a 10 GB file with 50 pipe-delimited phrases per line, find the top 100,000 most frequent phrases. The file cannot be loaded into memory at once.
Approach¶
The strategy is external sort, or map-reduce in miniature:
- Chunk and count. Read the file in ~500 MB chunks. For each chunk, split lines on
|, count phrase frequencies in a dictionary, and flush counts to a temporary file. - Merge counts. Read all temp files, aggregating phrase counts into a global dictionary. If even the unique phrases don't fit in memory, use a k-way merge with a min-heap.
- Extract top-K. Use a heap to pull the top 100,000 entries from the merged counts.
This is the same pattern behind Hadoop's MapReduce, local aggregation followed by a global reduce.
Implementation¶
import heapq
from collections import defaultdict
def process_chunk(lines):
count = defaultdict(int)
for line in lines:
for phrase in line.strip().split('|'):
count[phrase.strip()] += 1
return count
def write_temp_counts(count, temp_file):
with open(temp_file, 'w') as f:
for phrase, cnt in count.items():
f.write(f"{phrase}|{cnt}\n")
def merge_counts(temp_files):
global_count = defaultdict(int)
for path in temp_files:
with open(path, 'r') as f:
for line in f:
phrase, cnt = line.strip().rsplit('|', 1)
global_count[phrase] += int(cnt)
return global_count
def find_top_phrases(global_count, top_n):
return heapq.nlargest(top_n, global_count.items(), key=lambda x: x[1])
# Usage sketch
chunk_size = 500 * 1024 * 1024 # 500 MB
temp_files = []
with open('large_file.txt', 'r') as f:
while True:
lines = f.readlines(chunk_size)
if not lines:
break
count = process_chunk(lines)
temp_file = f"temp_{len(temp_files)}.txt"
write_temp_counts(count, temp_file)
temp_files.append(temp_file)
global_count = merge_counts(temp_files)
top_phrases = find_top_phrases(global_count, 100_000)
Complexity¶
- Time: O(N log K) where N is total phrases and K is 100,000 (heap extraction)
- Space: O(U) where U is unique phrases in memory at merge time
Takeaway¶
This chunk-merge-topK pattern is everywhere in data engineering. Spark does it automatically, but understanding the mechanics helps when you need to debug partitioning and memory issues, which you will.