
Rule-Based String Parser

A lightweight parser that normalizes messy course-registration strings into structured records. It is the same kind of extract-and-canonicalize pipeline you'd build to clean raw text features before they reach a model.

Problem

Given strings like "CS 301 2023 Fall" or "DS:201 2023 f", extract four fields:

Field       Rules
Department  One or more alphabetic characters; abbreviations map to full names
Course      One or more digits
Semester    Full word or abbreviation (f -> Fall)
Year        2 or 4 digits

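The department/course segment and the semester/year segment are always separated by a space, but formatting within each segment varies: a colon may stand in for the space between department and course, semester and year may appear in either order, and names may be abbreviated.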

Approach

  1. Split the input into a department/course segment and a semester/year segment using whitespace heuristics.
  2. Validate the department by looking up abbreviations in a dictionary; pass through if already a full name.
  3. Extract the course number by pulling digits via regex.
  4. Extract the year by finding the numeric token and validating its length (2 or 4 digits).
  5. Resolve the semester by stripping the year from the semester/year segment and matching the remainder against an abbreviation map.

Each step is a small, testable function. This mirrors how you'd build a feature-extraction pipeline: composable transforms with clear contracts.
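
As a hedged illustration of what "small and testable" can look like, the sketch below tests one step in isolation. It defines a standalone year check with the same contract as step 4 and exercises it with plain assertions. Both `check_year` and `raises_value_error` are hypothetical names used only for this example.

```python
def check_year(token):
    """Return the token unchanged if it is a 2- or 4-digit year, else raise."""
    # Hypothetical stand-in for step 4: ASCII digits only, length 2 or 4.
    if not (token.isascii() and token.isdigit()) or len(token) not in (2, 4):
        raise ValueError(f"Invalid year format: {token!r}")
    return token


def raises_value_error(func, arg):
    """Small test helper: True if func(arg) raises ValueError."""
    try:
        func(arg)
    except ValueError:
        return True
    return False


# Valid inputs pass through unchanged.
assert check_year("2023") == "2023"
assert check_year("23") == "23"

# Wrong length or non-digit tokens are rejected.
assert raises_value_error(check_year, "123")
assert raises_value_error(check_year, "20x3")
assert raises_value_error(check_year, "")
```

Because the function takes a string and either returns a string or raises, each rule can be checked without building a full input line.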

Implementation

import re
from pprint import pprint

DEPARTMENTS = {
    "CS": "Computer Science",
    "DS": "Data Science",
    "M": "Math",
    "ML": "Machine Learning",
}

SEMESTER = {"w": "winter", "f": "fall", "s": "spring", "su": "summer"}

DIGIT_PATTERN = re.compile(r"[0-9]+")

INPUT_VALUES = [
    "MATH 201 2021 Spring",
    "CS 301 2023 Fall",
    "ML 101 Fall 2023",
    "DS:201 2023 f",
]


def parse_input(input_string):
    # Treat a colon as a separator so "DS:201" splits like "DS 201".
    parts = input_string.replace(":", " ").split()
    if len(parts) < 3:
        raise ValueError("Input does not match expected format.")
    # The last two tokens are semester and year (in either order);
    # everything before them is the department/course segment.
    return " ".join(parts[:-2]), " ".join(parts[-2:])


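def validate_department(dept):
    # Take the leading alphabetic token, e.g. "DS" from "DS 201".
    match = re.match(r"[A-Za-z]+", dept)
    if not match:
        raise ValueError("No department found.")
    name = match.group()
    # Expand known abbreviations; pass any other name through unchanged.
    return DEPARTMENTS.get(name.upper(), name)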


def validate_year(year):
    if not year.isdigit() or len(year) not in (2, 4):
        raise ValueError("Invalid year format")
    return year


def extract_course_number(course_str):
    match = DIGIT_PATTERN.search(course_str)
    if not match:
        raise ValueError("No course number found.")
    return match.group()


def resolve_semester(sem_year, year):
    semester_part = sem_year.replace(year, "").strip().lower()
    for abbr, full in SEMESTER.items():
        if semester_part in (abbr, full):
            return full.capitalize()
    raise ValueError("Invalid semester format")


def normalize(input_string):
    dept_course, sem_year = parse_input(input_string)
    department = validate_department(dept_course)
    course_number = extract_course_number(dept_course)
    year_match = DIGIT_PATTERN.search(sem_year)
    if not year_match:
        raise ValueError("No year found.")
    year = validate_year(year_match.group())
    semester = resolve_semester(sem_year, year)
    return {
        "Input String": input_string,
        "Department": department,
        "Course": course_number,
        "Semester": semester,
        "Year": year,
    }


if __name__ == "__main__":
    results = [normalize(v.strip()) for v in INPUT_VALUES]
    pprint(results)
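
The `__main__` block above stops at the first string that fails to parse. When cleaning a larger batch it can be more useful to keep going and collect the failures. The sketch below shows one way to do that. `normalize_all` and `toy_parser` are hypothetical names, and the parser is passed in as a parameter so the example runs on its own. Only `ValueError` is caught; any other exception type would still propagate.

```python
def normalize_all(values, parser):
    """Apply parser to each string and return (records, errors).

    parser is any callable that returns a record or raises ValueError.
    """
    records, errors = [], []
    for raw in values:
        try:
            records.append(parser(raw.strip()))
        except ValueError as exc:
            # Keep the offending input next to the reason it failed.
            errors.append((raw, str(exc)))
    return records, errors


def toy_parser(text):
    """Stand-in parser for the demo: expects exactly four tokens."""
    parts = text.split()
    if len(parts) != 4:
        raise ValueError("Input does not match expected format.")
    return {"Input String": text, "Tokens": parts}


records, errors = normalize_all(["CS 301 2023 Fall", "garbage"], toy_parser)
assert len(records) == 1
assert errors == [("garbage", "Input does not match expected format.")]
```

Returning the errors alongside the records keeps bad rows visible instead of silently dropping them.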

Takeaway

Messy string inputs are everywhere: log lines, user-entered metadata, scraped text. Breaking the parser into small, composable functions makes it easy to test each step independently and swap in new rules as formats change.
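
One way to make "swap in new rules" concrete is to pass the rule table into the function instead of hard-coding it. The sketch below is an assumption about how that might look, not part of the implementation above. `match_semester`, the `SEMESTER_RULES` table, and the added `"au"` and `"autumn"` aliases are all hypothetical.

```python
SEMESTER_RULES = {
    "Winter": {"w", "winter"},
    "Fall": {"f", "fall"},
    "Spring": {"s", "spring"},
    "Summer": {"su", "summer"},
}


def match_semester(token, rules=SEMESTER_RULES):
    """Map a semester token to its canonical name using the given rule table."""
    key = token.strip().lower()
    for canonical, aliases in rules.items():
        if key in aliases:
            return canonical
    raise ValueError(f"Invalid semester format: {token!r}")


# A new format shows up: some sources write "Autumn" or "au" for Fall.
# Only the rule table changes; the matching logic stays the same.
extended = {**SEMESTER_RULES, "Fall": SEMESTER_RULES["Fall"] | {"au", "autumn"}}

assert match_semester("f") == "Fall"
assert match_semester("Autumn", extended) == "Fall"
```

The default table is left untouched, so existing callers keep their behavior while new callers opt in to the extended rules.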

