Rule-Based String Parser
A lightweight parser that normalizes messy course-registration strings into structured records. It is the same kind of extract-and-canonicalize pipeline you'd build for cleaning raw text features before they reach a model.
Problem
Given strings like "CS 301 2023 Fall" or "DS:201 2023 f", extract four fields:
| Field | Rules |
|---|---|
| Department | One or more alpha characters; map abbreviations to full names |
| Course | One or more digits |
| Semester | Full word or abbreviation (f -> Fall) |
| Year | 2 or 4 digits |
Department+Course and Semester+Year are always separated by a space, but internal formatting varies (colons, mixed ordering, abbreviations).
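The field rules in the table can be expressed as per-field regular expressions. The sketch below is illustrative only: `FIELD_PATTERNS` and `matches_rule` are hypothetical names that are not part of the implementation, and the patterns check the shape of a single token without mapping abbreviations.

```python
import re

# Hypothetical per-field patterns derived from the rules table (assumption:
# each field arrives as a single, already-separated token).
FIELD_PATTERNS = {
    "Department": re.compile(r"[A-Za-z]+"),       # one or more alpha characters
    "Course": re.compile(r"[0-9]+"),              # one or more digits
    "Semester": re.compile(r"[A-Za-z]+"),         # full word or abbreviation
    "Year": re.compile(r"[0-9]{2}|[0-9]{4}"),     # exactly 2 or 4 digits
}


def matches_rule(field, token):
    """Return True if the whole token satisfies the shape rule for `field`."""
    return FIELD_PATTERNS[field].fullmatch(token) is not None


if __name__ == "__main__":
    for field, token in [("Department", "CS"), ("Course", "301"),
                         ("Semester", "f"), ("Year", "2023"), ("Year", "202")]:
        print(field, repr(token), matches_rule(field, token))
```

Using `fullmatch` rather than `search` means a token such as "202" is rejected as a year instead of partially matching.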
Approach
- Split the input into a department/course segment and a semester/year segment using whitespace heuristics.
- Validate the department by looking up abbreviations in a dictionary; pass through if already a full name.
- Extract the course number by pulling digits via regex.
- Extract the year by finding the numeric token and validating its length (2 or 4 digits).
- Resolve the semester by stripping the year from the semester/year segment and matching the remainder against an abbreviation map.
Each step is a small, testable function. This mirrors how you'd build a feature-extraction pipeline: composable transforms with clear contracts.
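To illustrate what "small, testable" can look like in practice, here is a minimal table-driven check for one step. It uses a standalone copy of the year check so the block runs on its own; `CASES` and `run_cases` are hypothetical helpers introduced for this sketch, not part of the parser.

```python
def validate_year(year):
    """Accept a 2- or 4-digit year string; raise ValueError otherwise."""
    if not year.isdigit() or len(year) not in (2, 4):
        raise ValueError("Invalid year format")
    return year


# (input, expected) pairs; None marks inputs that should be rejected.
CASES = [("2023", "2023"), ("23", "23"), ("202", None), ("20x3", None)]


def run_cases(func, cases):
    """Run func on each input and report whether it behaved as expected."""
    results = []
    for raw, expected in cases:
        try:
            actual = func(raw)
        except ValueError:
            actual = None  # a rejection is recorded as None
        results.append((raw, actual == expected))
    return results


if __name__ == "__main__":
    for raw, passed in run_cases(validate_year, CASES):
        print(f"{raw!r}: {'ok' if passed else 'FAIL'}")
```

The same case-table shape could be reused for the other steps by swapping in a different function and list of pairs.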
Implementation
import re
from pprint import pprint

# Department abbreviations mapped to full names.
DEPARTMENTS = {
    "CS": "Computer Science",
    "DS": "Data Science",
    "M": "Math",
    "ML": "Machine Learning",
}
# Semester abbreviations mapped to lowercase full names.
SEMESTER = {"w": "winter", "f": "fall", "s": "spring", "su": "summer"}
DIGIT_PATTERN = re.compile(r"[0-9]+")

INPUT_VALUES = [
    "MATH 201 2021 Spring",
    "CS 301 2023 Fall",
    "ML 101 Fall 2023",
    "DS:201 2023 f",
]


def parse_input(input_string):
    """Split the input into (department/course, semester/year) segments."""
    # Treat ":" as a separator so "DS:201" splits like "DS 201".
    parts = input_string.replace(":", " ").split()
    if len(parts) >= 3:
        # The last two tokens are the semester and year, in either order.
        return " ".join(parts[:-2]), " ".join(parts[-2:])
    raise ValueError("Input does not match expected format.")


def validate_department(dept):
    """Map a department abbreviation to its full name."""
    match = re.match(r"[A-Za-z]+", dept)
    if not match:
        raise ValueError("No department found.")
    alpha = match.group()
    # Unknown names pass through as the alphabetic part only, so the
    # course number in the segment never leaks into the department.
    return DEPARTMENTS.get(alpha.upper(), alpha)


def validate_year(year):
    """Accept a 2- or 4-digit year string."""
    if not year.isdigit() or len(year) not in (2, 4):
        raise ValueError("Invalid year format")
    return year


def extract_course_number(course_str):
    """Return the first run of digits in the department/course segment."""
    match = DIGIT_PATTERN.search(course_str)
    if not match:
        raise ValueError("No course number found.")
    return match.group()


def resolve_semester(sem_year, year):
    """Strip the year and match the remainder against the semester map."""
    semester_part = sem_year.replace(year, "").strip().lower()
    for abbr, full in SEMESTER.items():
        if semester_part in (abbr, full):
            return full.capitalize()
    raise ValueError("Invalid semester format")


def normalize(input_string):
    """Parse one registration string into a structured record."""
    dept_course, sem_year = parse_input(input_string)
    department = validate_department(dept_course)
    course_number = extract_course_number(dept_course)
    year_match = DIGIT_PATTERN.search(sem_year)
    if not year_match:
        raise ValueError("No year found.")
    year = validate_year(year_match.group())
    semester = resolve_semester(sem_year, year)
    return {
        "Input String": input_string,
        "Department": department,
        "Course": course_number,
        "Semester": semester,
        "Year": year,
    }


if __name__ == "__main__":
    results = [normalize(v.strip()) for v in INPUT_VALUES]
    pprint(results)
Takeaway
Messy string inputs are everywhere: log lines, user-entered metadata, scraped text. Breaking the parser into small, composable functions makes it easy to test each step independently and swap in new rules as formats change.
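One way to make rule swapping concrete is to pass the lookup table in as a parameter, so a new format only needs a new dictionary entry. This is a sketch under assumptions: the `rules` parameter, `EXTENDED_SEMESTER`, and the "aut"/"autumn" entry are hypothetical and do not appear in the implementation above.

```python
SEMESTER = {"w": "winter", "f": "fall", "s": "spring", "su": "summer"}


def resolve_semester(token, rules=SEMESTER):
    """Map an abbreviation or full name to a capitalized semester name."""
    key = token.strip().lower()
    for abbr, full in rules.items():
        if key in (abbr, full):
            return full.capitalize()
    raise ValueError(f"Invalid semester format: {token!r}")


# Hypothetical extension: a new data source writes "aut" or "autumn".
EXTENDED_SEMESTER = {**SEMESTER, "aut": "autumn"}

if __name__ == "__main__":
    print(resolve_semester("f"))
    print(resolve_semester("AUT", EXTENDED_SEMESTER))
```

The function body stays unchanged when the rule set grows, which keeps existing tests valid while new cases are added alongside them.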