Add BoardGameGeek mirror (port 40015) by hqhq1025 · Pull Request #35 · aiming-lab/WebHarbor

hqhq1025 · 2026-05-26T16:39:02Z

Summary

Adds sites/boardgamegeek/ — a Flask mirror of boardgamegeek.com at the next free port slot (40000+15 = 40015)
Built end-to-end per the .claude/skills/ five-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database)
5,669 games (500 base + 5,169 expansions) and 11,338 real cover images, sourced from api.geekdo.com (no auth required) + Playwright for the rank/browse pages
Paired HF dataset PR shipping instance_seed/boardgamegeek.db (24 MB) + static/images/ (218 MB) — see "Paired HF PR" below
⚠️ One documented limitation about low-end ratings: see "Known limitation: BGG sort=lowest API is Cloudflare-gated" below

Catalog

Entity	Count	Source
Base games	500	BGG `/browse/boardgame` top-500 by rank, scraped via Playwright
Expansions	5,169	Every `boardgameexpansion` link found across the top-500, fetched via `api.geekdo.com/api/geekitem/linkeditems`
Designers / artists	2,880	Real BGG persons linked from the games above
Publishers	1,035	Real BGG publishers linked from the games above
Mechanics / Categories / Families	178 / 80 / (live)	BGG canonical taxonomy
Ratings (incl. 1,639 text reviews)	5,246	Real user ratings via `api.geekdo.com/api/collections?sort=rating`
Forum threads / posts	4,004 / 34,462	Real thread titles from `api.geekdo.com/api/forums/threads`; bodies synthesized from a 10-line pool to fill each thread
GeekLists	16	Hand-curated themed lists (e.g. "Top 50 Heaviest Games of the Last Decade", "Best Cooperative Games", "Hidden Gems Under 5000 Owners")
Users	4,201	250 real BGG reviewer profiles + 3,947 stub users (display-only, no login) + 4 benchmark users
Cover + thumbnail images	11,338	Real images downloaded from `cf.geekdo-images.com`

Routes (55)

/, /_health
/browse/boardgame[/page/N], /hotness, /hot
/boardgame/<oid>[/<slug>], /boardgame/<oid>/<slug>/ratings, /boardgame/<oid>/<slug>/credits, /boardgame/<oid>/<slug>/expansions, /boardgame/<oid>/<slug>/forums
/boardgamecategory[/<cid>[/<slug>]], /boardgamemechanic[/<mid>[/<slug>]], /boardgamedesigner[/<pid>[/<slug>]], /boardgameartist/<pid>/<slug>, /boardgamepublisher[/<pid>[/<slug>]]
/search?q=&type=boardgame|user|geeklist, /geeksearch.php (legacy redirect)
/forums, /forum/<fid>, /thread/<tid>, /thread/<tid>/reply (POST), /forum/<fid>/new (GET/POST)
/geeklists, /geeklist/<lid>, /geeklist/new (GET/POST), /geeklist/<lid>/add (POST)
/user/<username>, /collection/<username>, /plays/<username>, /account (GET/POST)
/rate/<oid> (POST), /collection/save/<oid> (POST), /collection/remove/<oid> (POST), /plays/log/<oid> (POST), /thumb (POST)
/login, /register, /logout, /forgot, /wiki/page/About, /help

Scored token-overlap search (skill rule: never strict-AND); 23-word stop-word list.
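
As a rough illustration of what "scored token-overlap, never strict-AND" can mean, here is a minimal sketch. It is not the mirror's actual search code: the `tokenize` and `search` helpers, the short stop-word set, and the sample titles are all assumptions made for the example, and the real 23-word list is not reproduced here.

```python
import re

# Hypothetical stop-word set for illustration only; the mirror's real
# 23-word list is not reproduced here.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with"}


def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def search(query: str, titles: list[str]) -> list[tuple[str, int]]:
    """Rank titles by the number of query tokens they share.

    A title matches as soon as it shares one token with the query,
    so a multi-word query never requires every word to be present.
    """
    q = tokenize(query)
    scored = [(title, len(q & tokenize(title))) for title in titles]
    hits = [(title, score) for title, score in scored if score > 0]
    # Highest overlap first; ties broken alphabetically for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Sample titles chosen for the example, not read from the seed database.
TITLES = ["Terraforming Mars", "Mars Open: Tabletop Golf", "Ark Nova", "On Mars"]
results = search("the terraforming of mars", TITLES)
```

With a strict-AND search, a query such as "mars nova" would return nothing from these titles; with overlap scoring, every title sharing at least one token is still returned, just with a lower score.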

Benchmark users (per skill convention)

alice_j / bob_c / carol_d / david_k, password TestPass123! — each pre-seeded with 18 owned + 4 wishlist + 1 want-to-buy + 1 want-to-play games, 5 text reviews, 10-22 plays logged, 1 authored GeekList, and 1 opened forum thread. Picks are profile-weighted (alice → heavy euros, bob → coops/dungeon crawlers, carol → 2-player only, david → wargames/COIN).
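
One way profile-weighted picks could be made reproducible is to draw from a per-user seeded RNG, which matters when the seed database has to come out the same on every build. The sketch below is an assumption about the approach, not the code in seed_data.py: `CATALOG`, `PROFILE_TAGS`, `pick_games`, the tags, and the weighting formula are all hypothetical.

```python
import random

# Hypothetical catalog rows (title, tags); real seeding would read the DB.
CATALOG = [
    ("Brass: Birmingham", {"euro", "heavy"}),
    ("Pandemic", {"coop"}),
    ("Gloomhaven", {"coop", "dungeon-crawler"}),
    ("7 Wonders Duel", {"two-player"}),
    ("Twilight Struggle", {"two-player", "wargame"}),
    ("Cuba Libre", {"wargame", "coin"}),
]

# Hypothetical tag preferences mirroring the four benchmark profiles.
PROFILE_TAGS = {
    "alice_j": {"euro", "heavy"},
    "bob_c": {"coop", "dungeon-crawler"},
    "carol_d": {"two-player"},
    "david_k": {"wargame", "coin"},
}


def pick_games(username: str, k: int, seed: int = 0) -> list[str]:
    """Weighted sampling without replacement, deterministic per (seed, user)."""
    rng = random.Random(f"{seed}:{username}")
    pool = list(CATALOG)
    picks = []
    for _ in range(min(k, len(pool))):
        # Every game keeps a base weight of 1; each matching tag adds 5.
        weights = [1 + 5 * len(tags & PROFILE_TAGS[username]) for _, tags in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picks.append(pool.pop(idx)[0])
    return picks


alice_picks = pick_games("alice_j", 3)
```

Because the RNG is seeded from the username, repeated runs give the same picks, while different users still get different collections.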

tasks.jsonl (20 tasks across 8 functional areas)

Area	Tasks
Search & filter	`--1` `--11` `--19`
Browse + sort	`--0` `--2` `--3` `--6` `--7` `--12` `--18`
Detail-page extraction	`--1` `--3` `--10` `--20`
Comparison	`--10`
CRUD: ratings	`--5` `--13`
CRUD: collection / wishlist / plays	`--4` `--6` `--13` `--17`
Forum reply	`--15`
GeekList create	`--9`
Disambiguation	`--13` `--19`

Hard tasks (≥5 agent steps): --6 (mechanic page → 2P filter → rank-1 → wishlist save with priority), --14 (overview → tab nav → sort → extract username+thumbs), --15 (forum nav → filter pinned/locked → thread → reply), --18 (designer index → filter → sort by avg rating → extract first).

Skill-conformance highlights

byte-identical reset (the strict invariant)

$ docker exec wh-bgg5 md5sum /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
$ curl -sX POST http://localhost:8206/reset/boardgamegeek
{"pid":1108,"ready":true,"site":"boardgamegeek"}
$ docker exec wh-bgg5 md5sum /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db

Every seed_*() function is function-level idempotent (returns early when the DB is populated; per-row gates are not enough — empty db.session.commit() calls bump SQLite metadata and break byte-identity, even when zero rows changed).
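
The function-level gate can be sketched as follows. This is a simplified illustration using the standard-library sqlite3 module rather than the mirror's Flask/SQLAlchemy session; `seed_games`, `is_populated`, `md5_of`, the `game` table, and the sample rows are hypothetical names chosen for the example.

```python
import hashlib
import os
import sqlite3
import tempfile


def md5_of(path: str) -> str:
    """Hash the whole database file, as the md5sum check above does."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def is_populated(conn: sqlite3.Connection) -> bool:
    """True when the (hypothetical) game table exists and has rows."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'game'"
    ).fetchone()
    if exists is None:
        return False
    return conn.execute("SELECT COUNT(*) FROM game").fetchone()[0] > 0


def seed_games(db_path: str) -> bool:
    """Seed once; on later calls return before any write or commit."""
    conn = sqlite3.connect(db_path)
    try:
        if is_populated(conn):
            return False  # function-level gate: read-only path, no commit
        conn.execute(
            "CREATE TABLE IF NOT EXISTS game (id INTEGER PRIMARY KEY, name TEXT)"
        )
        conn.executemany(
            "INSERT INTO game (id, name) VALUES (?, ?)",
            [(1, "Wingspan"), (2, "Ark Nova")],  # sample rows for the demo
        )
        conn.commit()
        return True
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "demo.db")
    first = seed_games(db)   # populates the file
    before = md5_of(db)
    second = seed_games(db)  # gated: should not touch the file
    after = md5_of(db)
```

The second call only issues SELECT statements, so the file hash taken before and after it can be compared directly, in the same spirit as the md5sum transcript.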

harden-env Dimension A — answer leaks fixed

Item overview tab: removed "Top Reviews" section — agents must navigate to Ratings & Reviews tab to see individual reviews
Item overview tab-bar: counts removed ("Expansions (N)" → "Expansions", "Forums (N)" → "Forums") — forces tab navigation
Item overview's "See all expansions →" link no longer reveals the total count
BGG--0 changed to ask only for designers (not year), since year is a legitimate browse-table column on real BGG

harden-env Dimension C — catalog breadth

Worker Placement: 86 games, Hand Management: 221, Action Points: 50+, Z-Man Games publisher: 116 titles, Vital Lacerda's ludography: 23 titles (all visible after a ?q= filter on the paginated designers index).

Known limitation: BGG `sort=lowest` API is Cloudflare-gated

The first-pass scrape pulled reviews via api.geekdo.com/api/collections?...&sort=rating (which returns rating-DESC by default — highest-rated first). When we tried to balance the data with sort=lowest to capture genuine 1-5★ ratings (which are real and visible on the live Ratings tab of every BGG game page), every variation we tried returned 0 items from a plain curl:

$ curl -sA 'Mozilla/5.0' 'https://api.geekdo.com/api/collections?...&sort=lowest&require_review=true'
{"items": []}

$ curl -sA 'Mozilla/5.0' 'https://api.geekdo.com/api/collections?...&sort=rating&direction=asc'
{"items": []}

$ curl -sA 'Mozilla/5.0' 'https://api.geekdo.com/api/collections?...&minrating=1&maxrating=5'
{"errors": [...]}

The same URLs work in a real Chromium tab (verified via Playwright page.on('response') — captured api.geekdo.com/api/collections?...&sort=lowest&showcount=50 returning the expected rows). We conclude the BGG WAF returns those rows only when the request carries a fresh CF challenge cookie, which we don't want to bake into the public scraper for both ethical and reproducibility reasons.

Consequences:

The seed has 5,246 ratings of which the lowest is ~7.0, biased high vs. the live BGG distribution.
The bottom of the rating histogram on /boardgame/<oid>/<slug>/ratings therefore looks emptier than it would on the live site.

Mitigations applied:

Task BoardGameGeek--14 was originally framed as "report the lowest-rated review" but rewritten to ask for the Most Helpful (thumbs) review instead — exercises the same Ratings-tab + sort navigation, but the answer (ogzz, 23 thumbs) comes from real high-engagement data we can fetch reliably.
The seed annotates this with a comment in seed_data.py so future maintainers can re-run scrape_low_ratings.py once an auth path is figured out.

The single not-quite-clean answer-leak audit hit also stems from this: Wingspan's seeded "lowest rating" is 7.0 by carol_d (benchmark user). The current BGG--14 task design ignores that field, so it doesn't matter — but a future task that asks "who gave the lowest rating?" would still surface a benchmark user.
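
A check of this kind could be automated along the following lines. This is a hedged sketch, not the audit script used for this PR: `lowest_rating_leaks` and the sample rows are hypothetical, and only the Wingspan / carol_d / 7.0 row comes from the description above.

```python
BENCHMARK_USERS = {"alice_j", "bob_c", "carol_d", "david_k"}


def lowest_rating_leaks(ratings):
    """Return {game: username} for games whose lowest rating is by a benchmark user.

    `ratings` is an iterable of (game, username, rating) tuples.
    On ties, the first row seen keeps the "lowest" slot.
    """
    lowest = {}
    for game, user, value in ratings:
        if game not in lowest or value < lowest[game][1]:
            lowest[game] = (user, value)
    return {game: user for game, (user, _) in lowest.items() if user in BENCHMARK_USERS}


# Illustrative rows; reviewer_x and reviewer_y are made-up usernames.
SAMPLE = [
    ("Wingspan", "carol_d", 7.0),
    ("Wingspan", "reviewer_x", 8.5),
    ("Ark Nova", "reviewer_y", 7.5),
    ("Ark Nova", "alice_j", 9.0),
]
leaks = lowest_rating_leaks(SAMPLE)
```

Running such a check over the seeded ratings before adding a "lowest rating" task would flag any game where the answer would be a benchmark account.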

Hand-walk verification (Playwright, end-to-end, every task)

32/32 checks pass. Each task is walked through a real Chromium against the running container, screenshots saved per step:

search input filled and submitted via the actual form (selector pinned to the login form's ancestor — there are 3 forms on every page: header search, header logout, and content)
forms submitted via input[name=...].locator('xpath=ancestor::form//button[@type="submit"]')
ratings page sort link clicked, top thumb count extracted from the rendered table
expansions tab navigated and 11 expansion rows counted on the rendered page

Paired HF PR

instance_seed/boardgamegeek.db (24 MB, includes all real data) + static/images/ (218 MB, 11,338 real images) shipped as boardgamegeek.tar.gz (200 MB compressed):
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/25

After that HF PR merges, bump .assets-revision to the merge SHA in a follow-up commit.

Test plan

Adds sites/boardgamegeek/, a Flask mirror of boardgamegeek.com built end-to-end per the .claude/skills/ five-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database).

Catalog (real data from api.geekdo.com + 1000+ Playwright-rendered pages):
- 500 base games (BGG top-500 by rank)
- 5,169 expansions (full expansion catalog for every top game)
- 2,880 designers/artists, 1,035 publishers, 178 mechanisms, 80 categories
- 5,246 user ratings (1,639 text reviews from real BGG users)
- 4,004 forum threads + 34,462 posts (real thread titles, generated bodies)
- 16 curated GeekLists (sentence-curated entries per skill convention)
- 4,201 user profiles (250 reviewer profiles + 3,947 from review pulls + 4 benchmark users)
- 11,338 real images (covers + thumbs, scraped from cf.geekdo-images.com)

Routes (55): home / browse / hotness / item (overview, ratings, credits, expansions, forums) / property pages (category/mechanic/designer/artist/publisher) / search (games/users/geeklists) / forums + threads / geeklists (create + add-item) / user profile + collection + plays / rate / collection mutate / login / register / account. Scored token-overlap search per the "never strict-AND" skill rule.

Benchmark users: alice_j / bob_c / carol_d / david_k, password TestPass123!, each pre-seeded with 18 owned + 4 wishlist + 1 want-to-buy + 1 want-to-play games, 5 textual reviews, 10-22 plays, 1 authored GeekList, and 1 opened forum thread.

tasks.jsonl: 20 tasks covering 8 functional areas (search/filter, detail lookup, comparison, CRUD on collection/ratings/wishlist, forum reply, geeklist creation, plays log, disambiguation). Hand-walked via Playwright end-to-end: 32/32 checks pass.

Skill-conformance highlights:
- Byte-identical reset verified (md5 10a5b3d4ae85380d8019bd9ac7cf9e61 unchanged across /reset)
- All seed_*() functions are function-level idempotent (they return early when populated; per-row gates are not enough, because empty commits bump SQLite metadata and break the byte-identity invariant)
- harden-env Dimension A: top-reviews removed from item overview; tab-bar counts removed ("Expansions (N)" → "Expansions") to force tab navigation
- harden-env Dimension C: catalog breadth; top mechanisms have 50+ games, Z-Man Games publisher has 116 games, Lacerda 23 games

Known limitations (documented for the maintainers):
- BGG's /api/collections endpoint silently rejects sort=lowest / direction=asc / minrating filters when called without authenticated CF clearance. Result: scraped reviews skew high (the seed has 1,639 text reviews but the lowest is ~7.0). Tasks that would naturally ask "find the lowest-rated review" were re-framed to use Most-Helpful (thumbs) instead; those reach the same Ratings tab + sort UX without depending on data we can't lawfully obtain via the public API.
- Image discovery uses BGG's first-party cf.geekdo-images.com URLs, which occasionally return blank thumbnails for un-imaged expansions. The fallback SVG placeholder ships at static/icons/cover_placeholder.svg.

Paired HF dataset PR (instance_seed/boardgamegeek.db, 24 MB + static/images/, 218 MB, bundled as boardgamegeek.tar.gz, 200 MB): [link will be added once HF discussion is opened]

After the HF PR merges, bump .assets-revision to the HF merge SHA in a follow-up commit.

hqhq1025 wants to merge 1 commit into aiming-lab:main from hqhq1025:feat/boardgamegeek-mirror
