Add BoardGameGeek mirror (port 40015) #35

Open
hqhq1025 wants to merge 1 commit into aiming-lab:main from hqhq1025:feat/boardgamegeek-mirror

Summary

  • Adds sites/boardgamegeek/ — a Flask mirror of boardgamegeek.com at the next free port slot (40000+15 = 40015)
  • Built end-to-end per the .claude/skills/ five-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database)
  • 5,669 games (500 base + 5,169 expansions) and 11,338 real cover images, sourced from api.geekdo.com (no auth required) + Playwright for the rank/browse pages
  • Paired HF dataset PR shipping instance_seed/boardgamegeek.db (24 MB) + static/images/ (218 MB) — see "Paired HF PR" below
  • ⚠️ One documented limitation about low-end ratings (see "Known limitation: BGG sort=lowest API is Cloudflare-gated" below)

Catalog

| Entity | Count | Source |
| --- | --- | --- |
| Base games | 500 | BGG /browse/boardgame top-500 by rank, scraped via Playwright |
| Expansions | 5,169 | Every boardgameexpansion link found across the top-500, fetched via api.geekdo.com/api/geekitem/linkeditems |
| Designers / artists | 2,880 | Real BGG persons linked from the games above |
| Publishers | 1,035 | Real BGG publishers linked from the games above |
| Mechanics / Categories / Families | 178 / 80 / (live) | BGG canonical taxonomy |
| Ratings (incl. 1,639 text reviews) | 5,246 | Real user ratings via api.geekdo.com/api/collections?sort=rating |
| Forum threads / posts | 4,004 / 34,462 | Real thread titles from api.geekdo.com/api/forums/threads; bodies synthesized from a 10-line pool to fill each thread |
| GeekLists | 16 | Hand-curated themed lists (e.g. "Top 50 Heaviest Games of the Last Decade", "Best Cooperative Games", "Hidden Gems Under 5000 Owners") |
| Users | 4,201 | 250 real BGG reviewer profiles + 3,947 stub users (display-only, no login) + 4 benchmark users |
| Cover + thumbnail images | 11,338 | Real images downloaded from cf.geekdo-images.com |

Routes (55)

  • /, /_health
  • /browse/boardgame[/page/N], /hotness, /hot
  • /boardgame/<oid>[/<slug>], /boardgame/<oid>/<slug>/ratings, /boardgame/<oid>/<slug>/credits, /boardgame/<oid>/<slug>/expansions, /boardgame/<oid>/<slug>/forums
  • /boardgamecategory[/<cid>[/<slug>]], /boardgamemechanic[/<mid>[/<slug>]], /boardgamedesigner[/<pid>[/<slug>]], /boardgameartist/<pid>/<slug>, /boardgamepublisher[/<pid>[/<slug>]]
  • /search?q=&type=boardgame|user|geeklist, /geeksearch.php (legacy redirect)
  • /forums, /forum/<fid>, /thread/<tid>, /thread/<tid>/reply (POST), /forum/<fid>/new (GET/POST)
  • /geeklists, /geeklist/<lid>, /geeklist/new (GET/POST), /geeklist/<lid>/add (POST)
  • /user/<username>, /collection/<username>, /plays/<username>, /account (GET/POST)
  • /rate/<oid> (POST), /collection/save/<oid> (POST), /collection/remove/<oid> (POST), /plays/log/<oid> (POST), /thumb (POST)
  • /login, /register, /logout, /forgot, /wiki/page/About, /help

Scored token-overlap search (skill rule: never strict-AND); 23-word stop-word list.
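A minimal sketch of what scored token-overlap search can look like, for readers unfamiliar with the "never strict-AND" rule. The stop-word set, the function names, and the scoring formula below are illustrative assumptions; the PR's actual 23-word list and ranking code are not shown here.

```python
import re

# Hypothetical stop-word subset; the mirror's real 23-word list is not reproduced here.
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "in", "to"}


def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def score(query: str, title: str) -> float:
    """Fraction of query tokens that also appear in the title (0.0 to 1.0)."""
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(title)) / len(q)


def search(query: str, titles: list[str]) -> list[str]:
    """Return every title sharing at least one token, best match first.

    A strict-AND search would drop partial matches; here they only rank lower.
    """
    hits = [(score(query, t), t) for t in titles]
    hits = [(s, t) for s, t in hits if s > 0]
    hits.sort(key=lambda st: (-st[0], st[1]))
    return [t for _, t in hits]
```

With this scoring, a query for "brass birmingham" still returns Brass: Lancashire, just below the exact match.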

Benchmark users (per skill convention)

alice_j / bob_c / carol_d / david_k, password TestPass123! — each pre-seeded with 18 owned + 4 wishlist + 1 want-to-buy + 1 want-to-play games, 5 text reviews, 10-22 plays logged, 1 authored GeekList, and 1 opened forum thread. Picks are profile-weighted (alice → heavy euros, bob → coops/dungeon crawlers, carol → 2-player only, david → wargames/COIN).

tasks.jsonl (20 tasks across 8 functional areas)

| Area | Tasks |
| --- | --- |
| Search & filter | --1 --11 --19 |
| Browse + sort | --0 --2 --3 --6 --7 --12 --18 |
| Detail-page extraction | --1 --3 --10 --20 |
| Comparison | --10 |
| CRUD: ratings | --5 --13 |
| CRUD: collection / wishlist / plays | --4 --6 --13 --17 |
| Forum reply | --15 |
| GeekList create | --9 |
| Disambiguation | --13 --19 |

Hard tasks (≥5 agent steps): --6 (mechanic page → 2P filter → rank-1 → wishlist save with priority), --14 (overview → tab nav → sort → extract username+thumbs), --15 (forum nav → filter pinned/locked → thread → reply), --18 (designer index → filter → sort by avg rating → extract first).

Skill-conformance highlights

byte-identical reset (the strict invariant)

$ docker exec wh-bgg5 md5sum /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
$ curl -sX POST http://localhost:8206/reset/boardgamegeek
{"pid":1108,"ready":true,"site":"boardgamegeek"}
$ docker exec wh-bgg5 md5sum /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance/boardgamegeek.db
10a5b3d4ae85380d8019bd9ac7cf9e61  /opt/WebSyn/boardgamegeek/instance_seed/boardgamegeek.db
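The same comparison can be scripted rather than eyeballed. This is a sketch only: the helper names are hypothetical, and it assumes the two database files are readable from wherever the script runs (for example inside the container).

```python
import hashlib


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(live_db: str, seed_db: str) -> bool:
    """True when the live database hashes the same as the seed copy."""
    return md5_of(live_db) == md5_of(seed_db)
```

Calling `is_byte_identical()` on the instance/ and instance_seed/ paths before and after a reset mirrors the two md5sum invocations in the transcript.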

Every seed_*() function is function-level idempotent (returns early when the DB is populated; per-row gates are not enough — empty db.session.commit() calls bump SQLite metadata and break byte-identity, even when zero rows changed).
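A function-level gate of the kind described above might look like the following. This sketch uses the standard-library sqlite3 module and a hypothetical `game` table instead of the PR's SQLAlchemy models, so the names are assumptions; the point is only that a populated database is left with no writes and no commit at all.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `game` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE game (id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()
    return conn


def seed_games(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> bool:
    """Insert rows only when the table is empty; return True if anything was written."""
    (count,) = conn.execute("SELECT COUNT(*) FROM game").fetchone()
    if count:
        # Function-level gate: return before any INSERT or commit is issued.
        return False
    conn.executemany("INSERT INTO game (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return True
```

A per-row gate would instead loop over `rows`, skip the ones that exist, and still reach the final commit, which is the case the paragraph above warns about.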

harden-env Dimension A — answer leaks fixed

  • Item overview tab: removed "Top Reviews" section — agents must navigate to Ratings & Reviews tab to see individual reviews
  • Item overview tab-bar: counts removed ("Expansions (N)" → "Expansions", "Forums (N)" → "Forums") — forces tab navigation
  • Item overview's "See all expansions →" link no longer reveals the total count
  • BGG--0 changed to ask only for designers (not year), since year is a legitimate browse-table column on real BGG

harden-env Dimension C — catalog breadth

Worker Placement: 86 games, Hand Management: 221, Action Points: 50+, Z-Man Games publisher: 116 titles, Vital Lacerda's designer credits: 23 titles (all visible after ?q= filter on the designers index, which is paginated).

Known limitation: BGG sort=lowest API is Cloudflare-gated

The first-pass scrape pulled reviews via api.geekdo.com/api/collections?...&sort=rating (which returns rating-DESC by default — highest-rated first). When we tried to balance the data with sort=lowest to capture genuine 1-5★ ratings (which are real and visible on the live Ratings tab of every BGG game page), every variation we tried returned 0 items from a plain curl:

$ curl -sA 'Mozilla/5.0' 'https://api.geekdo.com/api/collections?...&sort=lowest&require_review=true'
{"items": []}
$ curl -sA 'Mozilla/5.0' 'https://api.geekdo.com/api/collections?...&sort=rating&direction=asc'
{"items": []}
$ curl -sA 'Mozilla/5.0' 'https://api.geekdo.com/api/collections?...&minrating=1&maxrating=5'
{"errors": [...]}

The same URLs work in a real Chromium tab (verified via Playwright page.on('response') — captured api.geekdo.com/api/collections?...&sort=lowest&showcount=50 returning the expected rows). We conclude the BGG WAF returns those rows only when the request carries a fresh CF challenge cookie, which we don't want to bake into the public scraper for both ethical and reproducibility reasons.
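For reference, the response-capture check described above can be sketched with Playwright's Python sync API. The helper names and the `page_url` argument are assumptions, the sketch only records matching URLs rather than response bodies, and it presumes the page being loaded issues the collections request by itself.

```python
from urllib.parse import parse_qs, urlparse


def is_collections_call(url: str, sort: str = "lowest") -> bool:
    """True for api.geekdo.com/api/collections requests with the given sort value."""
    parts = urlparse(url)
    if parts.netloc != "api.geekdo.com" or parts.path != "/api/collections":
        return False
    return parse_qs(parts.query).get("sort") == [sort]


def capture_collections_calls(page_url: str) -> list[str]:
    """Load page_url in Chromium and return the matching API URLs the page requested."""
    # Imported lazily so the URL filter above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    seen: list[str] = []

    def on_response(response) -> None:
        if is_collections_call(response.url):
            seen.append(response.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)
        page.goto(page_url, wait_until="networkidle")
        browser.close()
    return seen
```

This observes what a real browser session receives; it does not reuse or export any challenge cookie, in keeping with the decision not to bake one into the scraper.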

Consequences:

  • The seed has 5,246 ratings, the lowest of which is ~7.0, so the distribution is biased high relative to live BGG.
  • The bottom of the rating histogram on /boardgame/<oid>/<slug>/ratings therefore looks emptier than it would on the live site.

Mitigations applied:

  • Task BoardGameGeek--14 was originally framed as "report the lowest-rated review" but rewritten to ask for the Most Helpful (thumbs) review instead — exercises the same Ratings-tab + sort navigation, but the answer (ogzz, 23 thumbs) comes from real high-engagement data we can fetch reliably.
  • The seed annotates this with a comment in seed_data.py so future maintainers can re-run scrape_low_ratings.py once an auth path is figured out.

The single not-quite-clean answer-leak audit hit also stems from this: Wingspan's seeded "lowest rating" is 7.0 by carol_d (benchmark user). The current BGG--14 task design ignores that field, so it doesn't matter — but a future task that asks "who gave the lowest rating?" would still surface a benchmark user.

Hand-walk verification (Playwright, end-to-end, every task)

32/32 checks pass. Each task is walked through a real Chromium against the running container, screenshots saved per step:

  • search input filled and submitted via the actual form (selector pinned to the login form's ancestor — there are 3 forms on every page: header search, header logout, and content)
  • forms submitted via input[name=...].locator('xpath=ancestor::form//button[@type="submit"]')
  • ratings page sort link clicked, top thumb count extracted from the rendered table
  • expansions tab navigated and 11 expansion rows counted on the rendered page

Paired HF PR

instance_seed/boardgamegeek.db (24 MB, includes all real data) + static/images/ (218 MB, 11,338 real images) shipped as boardgamegeek.tar.gz (200 MB compressed):
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/25

After that HF PR merges, bump .assets-revision to the merge SHA in a follow-up commit.

Test plan

  • ./scripts/build.sh succeeds (after ./scripts/fetch_assets.sh boardgamegeek pulls the HF tarball)
  • Container starts and /health reports boardgamegeek alive
  • All 55 boardgamegeek routes return < 500
  • POST /reset/boardgamegeek → md5sum of instance/boardgamegeek.db matches instance_seed/boardgamegeek.db (byte-identical, 10a5b3d4ae85380d8019bd9ac7cf9e61)
  • Container restart preserves md5 byte-identity
  • Playwright handwalk: alice_j login + view owned-collection count (18)
  • Playwright handwalk: bob_c login + rate Brass: Birmingham 9.5 with text review + check it persists on /boardgame/.../ratings
  • Playwright handwalk: carol_d login + wishlist-priority add for highest-rank 2P deckbuilder (Star Realms)
  • Playwright handwalk: david_k login + create new GeekList "My COIN Series Picks"
  • Playwright handwalk: Twilight Struggle expansions page lists 11 expansions (vs. 0 before the extras-scrape pass that added 5,169 expansion games)
  • Playwright handwalk: Wingspan Ratings & Reviews "Most Helpful" sort → top review = ogzz, 23 thumbs
  • Playwright handwalk: Pandemic Legacy: Season 1 Credits page lists 12 publishers
  • Answer-leak audit: 0 critical semantic leaks (2 substring "leaks" verified to be coincidental character-string collisions, e.g. "11" appearing in font-size "11px" / thread ID /thread/111)
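The difference between a coincidental substring hit and a standalone occurrence, as in the "11px" and /thread/111 cases above, can be illustrated with a boundary-aware regex. This is a hypothetical helper, not the audit tool used for this PR, and the boundary rule (no word character or slash before, no word character after) is an assumption.

```python
import re


def standalone_hits(answer: str, page_text: str) -> list[str]:
    """Return short context snippets where `answer` appears as its own token.

    Occurrences glued to other word characters ("11px", "111") or sitting
    inside a URL path segment ("/11") are treated as collisions and skipped.
    """
    pattern = re.compile(rf"(?<![\w/]){re.escape(answer)}(?!\w)")
    return [
        page_text[max(0, m.start() - 20): m.end() + 20]
        for m in pattern.finditer(page_text)
    ]
```

A plain substring search would flag both "font-size: 11px" and "/thread/111" for the answer "11"; this filter reports neither, while still catching "lists 11 expansions".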

Adds sites/boardgamegeek/ — a Flask mirror of boardgamegeek.com built
end-to-end per the .claude/skills/ five-phase pipeline (clone-website
→ design-tasks → evolve-env → harden-env → seed-database).

Catalog (real data from api.geekdo.com + 1000+ Playwright-rendered pages):
- 500 base games (BGG top-500 by rank)
- 5,169 expansions (full expansion catalog for every top game)
- 2,880 designers/artists, 1,035 publishers, 178 mechanisms, 80 categories
- 5,246 user ratings (1,639 text reviews from real BGG users)
- 4,004 forum threads + 34,462 posts (real thread titles, generated bodies)
- 16 curated GeekLists (sentence-curated entries per skill convention)
- 4,201 user profiles (250 real reviewer profiles + 3,947 stub users from
  review pulls + 4 benchmark users)
- 11,338 real images (covers + thumbs, scraped from cf.geekdo-images.com)

Routes (55): home / browse / hotness / item (overview, ratings, credits,
expansions, forums) / property pages (category/mechanic/designer/artist/
publisher) / search (games/users/geeklists) / forums + threads / geeklists
(create + add-item) / user profile + collection + plays / rate / collection
mutate / login / register / account. Scored token-overlap search per the
"never strict-AND" skill rule.

Benchmark users: alice_j / bob_c / carol_d / david_k, password
TestPass123! — each pre-seeded with 18 owned + 4 wishlist + 1 want-to-buy
+ 1 want-to-play games, 5 textual reviews, 10-22 plays, 1 authored
GeekList, and 1 opened forum thread.

tasks.jsonl: 20 tasks covering 8 functional areas (search/filter, detail
lookup, comparison, CRUD on collection/ratings/wishlist, forum reply,
geeklist creation, plays log, disambiguation).  Hand-walked via Playwright
end-to-end — 32/32 checks pass.

Skill-conformance highlights:
- Byte-identical reset verified
  (md5 10a5b3d4ae85380d8019bd9ac7cf9e61 unchanged across /reset)
- All seed_*() functions are function-level idempotent (returns early when
  populated; per-row gates are not enough — empty commits bump SQLite
  metadata and break the byte-identity invariant)
- harden-env Dimension A: top-reviews removed from item overview; tab-bar
  counts removed ("Expansions (N)" → "Expansions") to force tab navigation
- harden-env Dimension C: catalog breadth — top mechanisms have 50+ games,
  Z-Man Games publisher has 116 games, Lacerda 23 games

Known limitations (documented for the maintainers):
- BGG's /api/collections endpoint silently rejects sort=lowest /
  direction=asc / minrating filters when called without authenticated CF
  clearance.  Result: scraped reviews skew high (the seed has 1,639 text
  reviews but the lowest is ~7.0).  Tasks that would naturally ask "find
  the lowest-rated review" were re-framed to use Most-Helpful (thumbs)
  instead — those reach the same Ratings tab + sort UX without depending
  on data we can't lawfully obtain via the public API.
- Image discovery uses BGG's first-party cf.geekdo-images.com URLs which
  occasionally return blank thumbnails for un-imaged expansions.  The
  fallback SVG placeholder ships at static/icons/cover_placeholder.svg.

Paired HF dataset PR (instance_seed/boardgamegeek.db, 24 MB +
static/images/, 218 MB, bundled as boardgamegeek.tar.gz, 200 MB):
[link will be added once HF discussion is opened]

After the HF PR merges, bump .assets-revision to the HF merge SHA in a
follow-up commit.