Single-Pass Map-Side Aggregation

Aggregate and summarize school data from a CSV file by state, metro-centric locale, and city -- a lightweight ETL pipeline using only the standard library.

Problem

Given a CSV file containing school records with fields like state (LSTATE05), metro-centric locale (MLOCALE), and city (LCITY05), compute:

  1. Total number of schools
  2. Count of schools grouped by state
  3. Count of schools grouped by metro-centric locale
  4. Count of schools grouped by city

Sample data:

NCESSCH,LEAID,LEANM05,SCHNAM05,LCITY05,LSTATE05,LATCOD,LONCOD,MLOCALE,ULOCALE,status05
010000200277,0100002,ALABAMA YOUTH SERVICES,SEQUOYAH SCHOOL,PINSON,AL,33.674697,-86.627775,3,41,1
010000201402,0100002,ALABAMA YOUTH SERVICES,EUFAULA SCHOOL,CLAYTON,AL,31.882350,-85.256640,7,42,1
...

Approach

  1. Single-pass streaming: Read the CSV with DictReader and accumulate counts in three dictionaries -- one per grouping dimension. No need to load the entire dataset into memory first.
  2. Generic counter helper: A small _count_occurrences function handles the increment-or-initialize pattern for any dictionary. In production you would use collections.Counter, but this shows the mechanic explicitly.
  3. Return structured results so the caller decides how to display or serialize them.
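The Counter variant mentioned in step 2 could look like the sketch below. The inline sample rows are invented for illustration (a trimmed-down version of the CSV layout), so only the column names match the real data:

```python
import csv
import io
from collections import Counter

# Made-up sample rows using a subset of the real columns, for illustration only.
SAMPLE = """NCESSCH,LCITY05,LSTATE05,MLOCALE
010000200277,PINSON,AL,3
010000201402,CLAYTON,AL,7
010000300012,SEATTLE,WA,3
"""

by_state = Counter()
by_locale = Counter()
by_city = Counter()
total = 0

# Same single pass as the dict version; Counter handles
# the increment-or-initialize step for us.
for row in csv.DictReader(io.StringIO(SAMPLE)):
    total += 1
    by_state[row['LSTATE05']] += 1
    by_locale[row['MLOCALE']] += 1
    by_city[row['LCITY05']] += 1
```

With this sample, `total` is 3 and `by_state` holds `{'AL': 2, 'WA': 1}` -- the same shape of result the explicit-dictionary version produces.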

Implementation

import csv


def load_data(file_path):
    schools_by_state = {}
    schools_by_metro_locale = {}
    schools_by_city = {}

    with open(file_path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        total_schools = 0

        for row in reader:
            total_schools += 1
            _count_occurrences(schools_by_state, row['LSTATE05'])
            _count_occurrences(schools_by_metro_locale, row['MLOCALE'])
            _count_occurrences(schools_by_city, row['LCITY05'])

    return total_schools, schools_by_state, schools_by_metro_locale, schools_by_city


def _count_occurrences(dictionary, key):
    if key in dictionary:
        dictionary[key] += 1
    else:
        dictionary[key] = 1


if __name__ == "__main__":
    file_path = 'data.csv'
    total, by_state, by_locale, by_city = load_data(file_path)

    print("Total Schools:", total)
    print("Schools by State:", by_state)
    print("Schools by Metro-centric Locale:", by_locale)
    print("Schools by City:", by_city)
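Because load_data returns plain dictionaries, display logic stays with the caller, as step 3 of the approach intends. A minimal sketch of ranking states by count (the counts here are hypothetical, not from the real dataset):

```python
# Hypothetical counts, shaped like the by_state dict load_data returns.
by_state = {'AL': 2, 'TX': 5, 'WA': 3}

# Sort descending by count for display; ties keep insertion order.
ranked = sorted(by_state.items(), key=lambda kv: kv[1], reverse=True)

for state, count in ranked:
    print(f"{state}: {count}")
```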

Complexity

  • Time: O(n) -- single pass over all rows
  • Space: O(k) where k = number of unique keys across all groupings

Takeaway

When the set of unique keys fits in memory -- the dataset itself is streamed row by row -- a single-pass dictionary accumulation is the simplest and fastest approach: no sorting, no shuffling. It's the same idea behind MapReduce's map-side combine, just without the ceremony.
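To make the MapReduce analogy concrete, here is a small sketch of map-side combining: each "mapper" pre-aggregates its own chunk into a Counter, and the "reducer" merges the small partial counts rather than raw rows. The chunks below are invented for illustration:

```python
from collections import Counter

# Illustrative chunks of state codes, as if each came from a separate mapper.
chunk_a = ['AL', 'AL', 'WA']
chunk_b = ['WA', 'TX']

# Map side: each chunk is combined locally before anything is shuffled...
partials = [Counter(chunk) for chunk in (chunk_a, chunk_b)]

# ...reduce side: merging partial counts is cheap because each partial
# is at most the size of the key space, not the size of the data.
merged = sum(partials, Counter())
```

Here `merged` works out to `Counter({'AL': 2, 'WA': 2, 'TX': 1})` -- identical to counting the concatenated chunks in one pass.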
