Create a program requirements scraper #13

@AJaccP

Scrape Carleton program pages into a structured requirements manifest, so we can later build program templates and decide which non-COMP courses to include.


🧠 Context

This is the program-side counterpart to the course scraper (ticket 08). Its job is to extract, per program, the structured degree requirements: which courses are required, which "choose N credits from this set" groups exist, and how many elective credits of each category are needed. It produces a requirements manifest. Arranging requirements into terms for recommended plans is manual content work and is out of scope here.

This builds on the scraper infrastructure from ticket 09 (the scripts/ folder, cheerio, Node's built-in fetch, and the fixture-based test pattern). Reuse the existing setup; only add a new dependency if one is genuinely needed.

The page to scrape: the Computer Science programs page — https://calendar.carleton.ca/undergrad/undergradprograms/computerscience/ — which lists the CS programs and streams (e.g. "Computer Science B.C.S. Honours", "Computer Science Software Engineering Stream B.C.S. Honours", "Computer Science B.C.S. Major", and several other streams). See the calendar page for the full list.

For the JSON keys, the simplest consistent approach is to lowercase and kebab-case each program name (e.g. "Computer Science Software Engineering Stream B.C.S. Honours" → computer-science-software-engineering-stream-bcs-honours). Any consistent scheme is fine — the exact key format doesn't matter as long as it's stable and unique per program.
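A minimal sketch of the key scheme described above, assuming program names only contain letters, digits, spaces, and punctuation such as periods. The function name toProgramKey is hypothetical, not something that exists in the repo:

```typescript
// Hypothetical helper: turns a program name into a stable kebab-case key.
function toProgramKey(name: string): string {
  return name
    .toLowerCase()
    .replace(/\./g, "") // drop periods so "b.c.s." becomes "bcs"
    .replace(/[^a-z0-9]+/g, "-") // collapse every other non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, ""); // trim hyphens left at either end
}

const key = toProgramKey(
  "Computer Science Software Engineering Stream B.C.S. Honours",
);
// key === "computer-science-software-engineering-stream-bcs-honours"
```

If two programs ever reduce to the same key, the scraper should probably fail loudly rather than overwrite an entry, since the keys need to be unique per program.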

Write it to generalize: other program pages on the calendar share this same structure. We're only scraping CS for now, but structure the code so it can be pointed at another program listing page with minimal change, rather than hard-coding CS-specific assumptions.

Output shape — write scripts/output/programs-requirements.json:

{
  "bcs-general": {
    "url": "https://calendar.carleton.ca/...",
    "requiredCourses": ["COMP 1405", "COMP 1406", "..."],
    "chooseGroups": [
      { "credits": 2.0, "courses": ["COMP 3803", "COMP 4001", "COMP 4801", "COMP 4804"] }
    ],
    "electives": [
      { "category": "Breadth Elective", "credits": 5.0 },
      { "category": "Free Elective", "credits": 4.0 }
    ]
  }
}

This shape is a draft — a starting point, not a strict contract. If slightly different fields or structure make more sense once you see the actual page markup, that's fine; just keep it consistent across programs.
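For reference, the draft shape could be written down as TypeScript types roughly like this. The type names (ChooseGroup, ElectiveRequirement, ProgramRequirements, RequirementsManifest) are assumptions made for illustration; rename or restructure them to match whatever the real markup supports:

```typescript
// Hypothetical type names; the fields mirror the draft JSON above.
interface ChooseGroup {
  credits: number; // credits to pick from the set, e.g. 2.0
  courses: string[]; // course codes in the set
}

interface ElectiveRequirement {
  category: string; // e.g. "Breadth Elective"
  credits: number;
}

interface ProgramRequirements {
  url: string;
  requiredCourses: string[];
  chooseGroups: ChooseGroup[];
  electives: ElectiveRequirement[];
}

// The manifest maps a program key to that program's requirements.
type RequirementsManifest = Record<string, ProgramRequirements>;

// Same values as the draft example; the URL and course list are truncated there too.
const bcsGeneral: ProgramRequirements = {
  url: "https://calendar.carleton.ca/...",
  requiredCourses: ["COMP 1405", "COMP 1406"],
  chooseGroups: [
    { credits: 2.0, courses: ["COMP 3803", "COMP 4001", "COMP 4801", "COMP 4804"] },
  ],
  electives: [
    { category: "Breadth Elective", credits: 5.0 },
    { category: "Free Elective", credits: 4.0 },
  ],
};

const sampleManifest: RequirementsManifest = { "bcs-general": bcsGeneral };
const sampleJson = JSON.stringify(sampleManifest, null, 2);
```

Typing the manifest this way lets pnpm typecheck catch a program entry that is missing a field before anything is written to disk.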


🛠️ Implementation Plan

  1. Create scripts/scrape-programs.ts, run via pnpm run scrape:programs (add the script to package.json).
  2. Use cheerio for parsing and Node's built-in fetch (or cheerio's equivalent) for HTTP; the course scraper is a good example to follow. Do not add dependencies like axios. The existing setup from the course scraper should be reusable here. If a dependency install is blocked by pnpm-workspace.yaml policy, flag it to Jacc.
  3. Inspect the Carleton program pages in your browser to understand how required courses, "choose from" groups, and elective credit lines are marked up. Save a real program page (e.g. BCS General) as a fixture under scripts/fixtures/.
  4. Parse, per program: requiredCourses (flat list of codes), chooseGroups ({ credits, courses[] }), and electives ({ category, credits }).
  5. Write the manifest to scripts/output/programs-requirements.json.
  6. Write tests in scripts/scrape-programs.test.ts that run against the saved fixture (no live network). Assert that a known program is extracted with the right required courses and at least one choose-group (e.g. the game dev stream has a choose group).
  7. Run pnpm typecheck and pnpm lint.
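As a starting point for step 4, the text-level part of parsing can be kept in small pure functions that are easy to test against fixture text. This is only a sketch under stated assumptions: that course codes look like four capital letters, a space (possibly a non-breaking one), and four digits, as in "COMP 1405", and that credit lines contain wording like "2.0 credits". Neither has been checked against the real page markup, and the names extractCourseCodes and parseCredits are hypothetical:

```typescript
// Assumption: codes follow the "COMP 1405" pattern seen in the draft output.
const COURSE_CODE = /\b[A-Z]{4} \d{4}\b/g;

// Returns the unique course codes found in a piece of text, in order of first appearance.
function extractCourseCodes(text: string): string[] {
  // Normalize non-breaking spaces in case the page uses them inside codes.
  const normalized = text.replace(/\u00a0/g, " ");
  return Array.from(new Set(normalized.match(COURSE_CODE) ?? []));
}

// Returns the credit value from wording like "2.0 credits from:", or null if none is found.
// Assumption: the page states credits as "<number> credit(s)".
function parseCredits(text: string): number | null {
  const normalized = text.replace(/\u00a0/g, " ");
  const match = /(\d+(?:\.\d+)?)\s+credits?\b/i.exec(normalized);
  return match ? Number(match[1]) : null;
}

const codes = extractCourseCodes("COMP\u00a01405, COMP 1406 and COMP 1405");
const credits = parseCredits("2.0 credits from:");
```

Keeping these helpers free of cheerio means the cheerio layer only has to select the right elements and pass their text in, which should also make it easier to point the scraper at another program listing page later.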

✅ Acceptance Criteria

  • pnpm run scrape:programs runs and writes scripts/output/programs-requirements.json
  • Output is keyed by program id, and each entry has url, requiredCourses, chooseGroups, and electives in the shape above; if the shape was changed, the updated shape is documented
  • requiredCourses are course code strings; chooseGroups carry credits + a courses list; electives carry category + credits
  • Tests run against a saved HTML fixture (no live network calls) and assert a known program is parsed correctly
  • pnpm typecheck passes
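One way to make the shape criteria checkable in a test is a small runtime check over each manifest entry. The sketch below assumes the draft shape is kept as-is; checkProgramEntry and its helpers are hypothetical names, and the sample entry reuses values from the draft output:

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Hypothetical check: returns a list of problems, empty when the entry matches the draft shape.
function checkProgramEntry(id: string, entry: unknown): string[] {
  if (!isRecord(entry)) return [`${id}: entry is not an object`];
  const problems: string[] = [];

  if (typeof entry.url !== "string") problems.push(`${id}: url should be a string`);
  if (!isStringArray(entry.requiredCourses)) {
    problems.push(`${id}: requiredCourses should be a list of course code strings`);
  }

  const groups = entry.chooseGroups;
  const groupsOk =
    Array.isArray(groups) &&
    groups.every((g) => isRecord(g) && typeof g.credits === "number" && isStringArray(g.courses));
  if (!groupsOk) problems.push(`${id}: chooseGroups should carry credits and a courses list`);

  const electives = entry.electives;
  const electivesOk =
    Array.isArray(electives) &&
    electives.every(
      (e) => isRecord(e) && typeof e.category === "string" && typeof e.credits === "number",
    );
  if (!electivesOk) problems.push(`${id}: electives should carry category and credits`);

  return problems;
}

const okEntry = {
  url: "https://calendar.carleton.ca/...",
  requiredCourses: ["COMP 1405", "COMP 1406"],
  chooseGroups: [{ credits: 2.0, courses: ["COMP 3803", "COMP 4001"] }],
  electives: [{ category: "Free Elective", credits: 4.0 }],
};
const okProblems = checkProgramEntry("bcs-general", okEntry);
const badProblems = checkProgramEntry("broken", { url: 1 });
```

A fixture-based test could run this over every entry the parser produces and assert that no problems are reported.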

Projects

Status
In Progress
