Single-Pass Map-Side Aggregation¶
Aggregate and summarize school data from a CSV file by state, metro-centric locale, and city -- a lightweight ETL pipeline using only the standard library.
Problem¶
Given a CSV file containing school records with fields like state (LSTATE05), metro-centric locale (MLOCALE), and city (LCITY05), compute:
- Total number of schools
- Count of schools grouped by state
- Count of schools grouped by metro-centric locale
- Count of schools grouped by city
Sample data:
NCESSCH,LEAID,LEANM05,SCHNAM05,LCITY05,LSTATE05,LATCOD,LONCOD,MLOCALE,ULOCALE,status05
010000200277,0100002,ALABAMA YOUTH SERVICES,SEQUOYAH SCHOOL,PINSON,AL,33.674697,-86.627775,3,41,1
010000201402,0100002,ALABAMA YOUTH SERVICES,EUFAULA SCHOOL,CLAYTON,AL,31.882350,-85.256640,7,42,1
...
Approach¶
- Single-pass streaming: Read the CSV with `DictReader` and accumulate counts in three dictionaries -- one per grouping dimension. No need to load the entire dataset into memory first.
- Generic counter helper: A small `_count_occurrences` function handles the increment-or-initialize pattern for any dictionary. In production you would use `collections.Counter`, but this shows the mechanic explicitly.
- Return structured results so the caller decides how to display or serialize them.
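The `collections.Counter` variant mentioned above can be sketched as follows, using the two sample rows from the problem statement as inline data so the snippet runs standalone:

```python
import csv
import io
from collections import Counter

# The two sample rows from the problem statement, inlined for a
# self-contained demo instead of reading from a file on disk.
SAMPLE = """\
NCESSCH,LEAID,LEANM05,SCHNAM05,LCITY05,LSTATE05,LATCOD,LONCOD,MLOCALE,ULOCALE,status05
010000200277,0100002,ALABAMA YOUTH SERVICES,SEQUOYAH SCHOOL,PINSON,AL,33.674697,-86.627775,3,41,1
010000201402,0100002,ALABAMA YOUTH SERVICES,EUFAULA SCHOOL,CLAYTON,AL,31.882350,-85.256640,7,42,1
"""

by_state = Counter()
by_locale = Counter()
by_city = Counter()
total = 0
for row in csv.DictReader(io.StringIO(SAMPLE)):
    total += 1
    by_state[row['LSTATE05']] += 1   # Counter initializes missing keys to 0
    by_locale[row['MLOCALE']] += 1
    by_city[row['LCITY05']] += 1

print(total, dict(by_state))  # 2 {'AL': 2}
```

Same single pass, same three accumulators -- `Counter` just absorbs the increment-or-initialize logic.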
Implementation¶
import csv


def load_data(file_path):
    """Stream the CSV once, accumulating counts per grouping dimension."""
    schools_by_state = {}
    schools_by_metro_locale = {}
    schools_by_city = {}
    total_schools = 0
    with open(file_path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            total_schools += 1
            _count_occurrences(schools_by_state, row['LSTATE05'])
            _count_occurrences(schools_by_metro_locale, row['MLOCALE'])
            _count_occurrences(schools_by_city, row['LCITY05'])
    return total_schools, schools_by_state, schools_by_metro_locale, schools_by_city


def _count_occurrences(dictionary, key):
    """Increment the count for key, initializing it on first sight."""
    if key in dictionary:
        dictionary[key] += 1
    else:
        dictionary[key] = 1


if __name__ == "__main__":
    file_path = 'data.csv'
    total, by_state, by_locale, by_city = load_data(file_path)
    print("Total Schools:", total)
    print("Schools by State:", by_state)
    print("Schools by Metro-centric Locale:", by_locale)
    print("Schools by City:", by_city)
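One way to exercise the loader without a real data.csv is to write the sample rows to a temporary file. The sketch below restates the loader compactly so it runs standalone; the restatement is behaviorally equivalent to the code above, not a replacement for it:

```python
import csv
import os
import tempfile

# Compact restatement of the article's loader so this demo is self-contained.
def load_data(file_path):
    by_state, by_locale, by_city = {}, {}, {}
    total = 0
    with open(file_path, newline='') as f:
        for row in csv.DictReader(f):
            total += 1
            for d, col in ((by_state, 'LSTATE05'),
                           (by_locale, 'MLOCALE'),
                           (by_city, 'LCITY05')):
                d[row[col]] = d.get(row[col], 0) + 1
    return total, by_state, by_locale, by_city

rows = (
    "NCESSCH,LEAID,LEANM05,SCHNAM05,LCITY05,LSTATE05,LATCOD,LONCOD,MLOCALE,ULOCALE,status05\n"
    "010000200277,0100002,ALABAMA YOUTH SERVICES,SEQUOYAH SCHOOL,PINSON,AL,33.674697,-86.627775,3,41,1\n"
    "010000201402,0100002,ALABAMA YOUTH SERVICES,EUFAULA SCHOOL,CLAYTON,AL,31.882350,-85.256640,7,42,1\n"
)

# Write the sample rows to a temp file, run the loader, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write(rows)
    path = tmp.name

try:
    total, by_state, by_locale, by_city = load_data(path)
    print(total, by_state, by_locale)  # 2 {'AL': 2} {'3': 1, '7': 1}
finally:
    os.remove(path)
```

Note that `MLOCALE` values come back as strings (`'3'`, `'7'`): `DictReader` does no type conversion, which is usually what you want for categorical codes.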
Complexity¶
- Time: O(n) -- single pass over all rows
- Space: O(k) where k = number of unique keys across all groupings
Takeaway¶
When your dataset fits in memory, a single-pass dictionary accumulation is the simplest and fastest approach -- no sorting, no shuffling. It's the same idea behind MapReduce's map-side combine, just without the ceremony.
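The map-side-combine analogy can be made concrete: if the file were split into chunks processed independently, each worker's partial counts merge with plain dictionary addition. The chunk counts below are hypothetical, not derived from the sample data:

```python
from collections import Counter

# Hypothetical partial counts produced by two workers, each having done
# its own single-pass accumulation over one chunk of the file.
chunk_a = Counter({'AL': 2, 'GA': 1})
chunk_b = Counter({'AL': 1, 'TX': 3})

# The "reduce" step is just Counter addition -- no sorting, no shuffling.
merged = chunk_a + chunk_b
print(merged)
```

Because the per-chunk accumulators are already aggregated, only k values per chunk cross the merge boundary instead of n raw rows -- the same saving a MapReduce combiner buys.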