Add BoardGameGeek mirror (port 40015) #35
Open
hqhq1025 wants to merge 1 commit into
Adds sites/boardgamegeek/ — a Flask mirror of boardgamegeek.com built
end-to-end per the .claude/skills/ five-phase pipeline (clone-website
→ design-tasks → evolve-env → harden-env → seed-database).
Catalog (real data from api.geekdo.com + 1000+ Playwright-rendered pages):
- 500 base games (BGG top-500 by rank)
- 5,169 expansions (full expansion catalog for every top game)
- 2,880 designers/artists, 1,035 publishers, 178 mechanisms, 80 categories
- 5,246 user ratings (1,639 text reviews from real BGG users)
- 4,004 forum threads + 34,462 posts (real thread titles, generated bodies)
- 16 curated GeekLists (sentence-curated entries per skill convention)
- 4,201 real user profiles (250 reviewer profiles + 3,947 from review pulls
+ 4 benchmark users)
- 11,338 real images (covers + thumbs, scraped from cf.geekdo-images.com)
Routes (55): home / browse / hotness / item (overview, ratings, credits,
expansions, forums) / property pages (category/mechanic/designer/artist/
publisher) / search (games/users/geeklists) / forums + threads / geeklists
(create + add-item) / user profile + collection + plays / rate / collection
mutate / login / register / account. Scored token-overlap search per the
"never strict-AND" skill rule.
Benchmark users: alice_j / bob_c / carol_d / david_k, password
TestPass123! — each pre-seeded with 18 owned + 4 wishlist + 1 want-to-buy
+ 1 want-to-play games, 5 textual reviews, 10-22 plays, 1 authored
GeekList, and 1 opened forum thread.
tasks.jsonl: 20 tasks covering 8 functional areas (search/filter, detail
lookup, comparison, CRUD on collection/ratings/wishlist, forum reply,
geeklist creation, plays log, disambiguation). Hand-walked via Playwright
end-to-end — 32/32 checks pass.
Skill-conformance highlights:
- Byte-identical reset verified
(md5 10a5b3d4ae85380d8019bd9ac7cf9e61 unchanged across /reset)
- All seed_*() functions are function-level idempotent (each returns early
when the DB is populated; per-row gates are not enough, because empty
commits bump SQLite metadata and break the byte-identity invariant)
- harden-env Dimension A: top-reviews removed from item overview; tab-bar
counts removed ("Expansions (N)" → "Expansions") to force tab navigation
- harden-env Dimension C: catalog breadth — top mechanisms have 50+ games,
Z-Man Games publisher has 116 games, Lacerda 23 games
Known limitations (documented for the maintainers):
- BGG's /api/collections endpoint silently rejects sort=lowest /
direction=asc / minrating filters when called without authenticated CF
clearance. Result: scraped reviews skew high (the seed has 1,639 text
reviews but the lowest is ~7.0). Tasks that would naturally ask "find
the lowest-rated review" were re-framed to use Most-Helpful (thumbs)
instead — those reach the same Ratings tab + sort UX without depending
on data we can't lawfully obtain via the public API.
- Image discovery uses BGG's first-party cf.geekdo-images.com URLs which
occasionally return blank thumbnails for un-imaged expansions. The
fallback SVG placeholder ships at static/icons/cover_placeholder.svg.
Paired HF dataset PR (instance_seed/boardgamegeek.db, 24 MB +
static/images/, 218 MB, bundled as boardgamegeek.tar.gz, 200 MB):
[link will be added once HF discussion is opened]
After the HF PR merges, bump .assets-revision to the HF merge SHA in a
follow-up commit.
Summary
- Adds sites/boardgamegeek/, a Flask mirror of boardgamegeek.com at the next free port slot (40000 + 15 = 40015)
- Built per the .claude/skills/ five-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database)
- Real data from api.geekdo.com (no auth required) + Playwright for the rank/browse pages
- Paired HF dataset: instance_seed/boardgamegeek.db (24 MB) + static/images/ (218 MB), see "Paired HF PR" below
Catalog
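- 500 base games: /browse/boardgame top-500 by rank, scraped via Playwright
- 5,169 expansions: every boardgameexpansion link found across the top-500, fetched via api.geekdo.com/api/geekitem/linkeditems
- 5,246 user ratings (1,639 text reviews): api.geekdo.com/api/collections?sort=rating
- 4,004 forum threads + 34,462 posts: real thread titles from api.geekdo.com/api/forums/threads; bodies synthesized from a 10-line pool to fill each thread
- 11,338 images (covers + thumbs): cf.geekdo-images.com
Routes (55)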
- /, /_health
- /browse/boardgame[/page/N], /hotness, /hot
- /boardgame/<oid>[/<slug>], /boardgame/<oid>/<slug>/ratings, /boardgame/<oid>/<slug>/credits, /boardgame/<oid>/<slug>/expansions, /boardgame/<oid>/<slug>/forums
- /boardgamecategory[/<cid>[/<slug>]], /boardgamemechanic[/<mid>[/<slug>]], /boardgamedesigner[/<pid>[/<slug>]], /boardgameartist/<pid>/<slug>, /boardgamepublisher[/<pid>[/<slug>]]
- /search?q=&type=boardgame|user|geeklist, /geeksearch.php (legacy redirect)
- /forums, /forum/<fid>, /thread/<tid>, /thread/<tid>/reply (POST), /forum/<fid>/new (GET/POST)
- /geeklists, /geeklist/<lid>, /geeklist/new (GET/POST), /geeklist/<lid>/add (POST)
- /user/<username>, /collection/<username>, /plays/<username>, /account (GET/POST)
- /rate/<oid> (POST), /collection/save/<oid> (POST), /collection/remove/<oid> (POST), /plays/log/<oid> (POST), /thumb (POST)
- /login, /register, /logout, /forgot, /wiki/page/About, /help
Scored token-overlap search (skill rule: never strict-AND); 23-word stop-word list.
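As a rough illustration of what scored token-overlap search can look like, here is a minimal sketch. It is not the code in this PR: the `tokenize` and `search` helpers, the short stop-word set, and the sample titles are all assumptions made for the example (the PR's actual 23-word list is not reproduced here).

```python
import re

# Hypothetical stop-word set; the PR mentions a 23-word list but does not enumerate it.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def search(query, names):
    """Rank names by the number of distinct query tokens they share.

    A name that matches only some of the query tokens is still returned
    (lower in the ranking) rather than filtered out, i.e. no strict AND.
    """
    q_tokens = set(tokenize(query))
    scored = []
    for name in names:
        overlap = len(q_tokens & set(tokenize(name)))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Illustrative titles only, not rows from the seeded catalog.
games = ["Ticket to Ride", "Ticket to Ride: Europe", "Brass: Birmingham", "Ride the Rails"]
results = search("ticket ride europe", games)
```

With a strict-AND match only "Ticket to Ride: Europe" would be returned; with overlap scoring the partial matches remain reachable below it.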
Benchmark users (per skill convention)
alice_j / bob_c / carol_d / david_k (password: TestPass123!). Each is pre-seeded with 18 owned + 4 wishlist + 1 want-to-buy + 1 want-to-play games, 5 text reviews, 10-22 plays logged, 1 authored GeekList, and 1 opened forum thread. Picks are profile-weighted (alice → heavy euros, bob → coops/dungeon crawlers, carol → 2-player only, david → wargames/COIN).
tasks.jsonl (20 tasks across 8 functional areas)
[Table: task IDs grouped by functional area (search/filter, detail lookup, comparison, CRUD on collection/ratings/wishlist, forum reply, geeklist creation, plays log, disambiguation)]
Hard tasks (≥5 agent steps):
- BoardGameGeek--6 (mechanic page → 2P filter → rank-1 → wishlist save with priority)
- BoardGameGeek--14 (overview → tab nav → sort → extract username + thumbs)
- BoardGameGeek--15 (forum nav → filter pinned/locked → thread → reply)
- BoardGameGeek--18 (designer index → filter → sort by avg rating → extract first)
Skill-conformance highlights
Byte-identical reset (the strict invariant)
Every seed_*() function is function-level idempotent: it returns early when the DB is populated. Per-row gates are not enough, because empty db.session.commit() calls bump SQLite metadata and break byte-identity even when zero rows changed.
harden-env Dimension A: answer leaks fixed
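- Top reviews removed from the item overview; agents must open the Ratings & Reviews tab to see individual reviews
- Tab-bar counts removed ("Expansions (N)" → "Expansions") to force tab navigation
harden-env Dimension C: catalog breadth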
Worker Placement: 86 games, Hand Management: 221, Action Points: 50+, Z-Man Games publisher: 116 titles, Vital Lacerda's catalog: 23 titles (all visible after applying the ?q= filter on the designers index, which is paginated).
Known limitation: BGG sort=lowest API is Cloudflare-gated
The first-pass scrape pulled reviews via
api.geekdo.com/api/collections?...&sort=rating (which returns rating-DESC by default, highest-rated first). When we tried to balance the data with sort=lowest to capture genuine 1-5★ ratings (which are real and visible on the live Ratings tab of every BGG game page), every variation we tried returned 0 items from a plain curl.
The same URLs work in a real Chromium tab (verified via Playwright page.on('response'), which captured api.geekdo.com/api/collections?...&sort=lowest&showcount=50 returning the expected rows). We conclude that the BGG WAF returns those rows only when the request carries a fresh CF challenge cookie, which we don't want to bake into the public scraper for both ethical and reproducibility reasons.
Consequences:
- Scraped reviews skew high: the seed has 1,639 text reviews, but the lowest rating is ~7.0.
- /boardgame/<oid>/<slug>/ratings therefore looks emptier than it would on the live site.
Mitigations applied:
- BoardGameGeek--14 was originally framed as "report the lowest-rated review" but was rewritten to ask for the Most Helpful (thumbs) review instead. It exercises the same Ratings-tab + sort navigation, but the answer (ogzz, 23 thumbs) comes from real high-engagement data we can fetch reliably.
- The limitation is documented in seed_data.py so future maintainers can re-run scrape_low_ratings.py once an auth path is figured out.
The single not-quite-clean answer-leak audit hit also stems from this: Wingspan's seeded "lowest rating" is 7.0 by carol_d (benchmark user). The current BGG--14 task design ignores that field, so it doesn't matter, but a future task that asks "who gave the lowest rating?" would still surface a benchmark user.
Hand-walk verification (Playwright, end-to-end, every task)
32/32 checks pass. Each task is walked through a real Chromium against the running container, with screenshots saved per step. Form submit buttons are located via input[name=...].locator('xpath=ancestor::form//button[@type="submit"]').
Paired HF PR
instance_seed/boardgamegeek.db (24 MB, includes all real data) + static/images/ (218 MB, 11,338 real images) shipped as boardgamegeek.tar.gz (200 MB compressed):
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/25
After that HF PR merges, bump .assets-revision to the merge SHA in a follow-up commit.
Test plan
- ./scripts/build.sh succeeds (after ./scripts/fetch_assets.sh boardgamegeek pulls the HF tarball)
- /health reports boardgamegeek alive
- POST /reset/boardgamegeek → md5sum of instance/boardgamegeek.db matches instance_seed/boardgamegeek.db (byte-identical, 10a5b3d4ae85380d8019bd9ac7cf9e61)
- /boardgame/.../ratings → ogzz, 23 thumbs
- /thread/111
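The byte-identity check in the test plan, together with the function-level idempotency rule described under the skill-conformance highlights, can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: it uses the standard-library sqlite3 module rather than the Flask/SQLAlchemy session, and `seed_games`, `md5_of`, the `games` table, and the sample rows are hypothetical names chosen for the example.

```python
import hashlib
import os
import sqlite3
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of a file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def seed_games(db_path):
    """Hypothetical seed function with a function-level idempotency gate.

    Returns True if it wrote rows, False if it returned early.
    """
    conn = sqlite3.connect(db_path)
    try:
        has_table = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'games'"
        ).fetchone()
        if has_table and conn.execute("SELECT COUNT(*) FROM games").fetchone()[0] > 0:
            # Already populated: return before any write or commit is issued.
            return False
        conn.execute("CREATE TABLE IF NOT EXISTS games (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO games (name) VALUES (?)",
            [("Wingspan",), ("Brass: Birmingham",)],  # illustrative rows only
        )
        conn.commit()
        return True
    finally:
        conn.close()


# Seed a throwaway database, then re-run the seed and compare digests.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "demo.db")
first_run = seed_games(db_path)
digest_after_seed = md5_of(db_path)
second_run = seed_games(db_path)
digest_after_reseed = md5_of(db_path)
```

Because the second call only reads before returning, the file bytes are untouched and the two digests match; the same comparison against instance_seed/boardgamegeek.db is what the POST /reset/boardgamegeek item in the test plan verifies.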