Add Discogs mirror site (port 40015) #34

Open: hqhq1025 wants to merge 1 commit into aiming-lab:main from hqhq1025:feat/discogs-mirror

Summary

Adds the 16th WebHarbor mirror: https://discogs.com, the world's largest music release database and marketplace. The catalogue is real: 7,042 releases / 5,031 artists / 3,938 labels / 6,825 master records sourced from the Discogs public API + MusicBrainz, with 323 real album covers fetched from Wikipedia. It is backed by a benchmark community of 29 users with 86,714 ratings, 3,253 reviews, 4,207 collection items, 2,640 marketplace listings, 40 user lists, 43 forum threads and 333 posts, all idempotently seeded.

Registered as the 16th site at port 40015. Image bumps to 16 mirrors total; EXPOSE 40000-40015.

Site features

  • Release / Master / Artist / Label / Genre / Style / Format pages, all routable by either the public Discogs ID (/release/793593) or the internal PK
  • Token-overlap scored search across releases / artists / labels with genre / style / format / year / country facets and 7 sort orders (relevance / collected / wanted / rating / year ↑↓ / title)
  • Marketplace browse with media-condition + genre + sort filters; per-listing comments, grades, currencies, sellers, shipping-from countries
  • User collection with 5 folders (Uncategorized / All / Vinyl / CD / Wishlist Bought) and media + sleeve grade per item
  • Wantlist with min-grade preferences
  • User-curated public lists (CRUD) — create, browse, view, add releases
  • Forums — 10 topical boards (General / Vinyl / Jazz / Electronic / Hip Hop / Marketplace / Database / Crate Diggers / Help / Updates) with threaded replies
  • Rating (1-5 stars) + Review submission with helpful counts
  • Marketplace seller listing form (auth-gated, CSRF-protected)
  • Full auth (Flask-Login + bcrypt + CSRF), register / settings / logout
  • 20 WebVoyager-format tasks in sites/discogs/tasks.jsonl
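The token-overlap search listed above can be pictured with a small sketch. It scores each candidate by the fraction of query tokens it contains and returns hits best-first. The function names (`tokenize`, `overlap_score`, `search`) and the exact scoring formula are assumptions for illustration; the scoring in sites/discogs/app.py may differ.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def overlap_score(query: str, candidate: str) -> float:
    """Fraction of query tokens that also appear in the candidate."""
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(candidate)) / len(q)


def search(query: str, titles: list[str]) -> list[tuple[float, str]]:
    """Return (score, title) pairs with a non-zero score, best first."""
    scored = [(overlap_score(query, t), t) for t in titles]
    hits = [pair for pair in scored if pair[0] > 0]
    # Descending score, then title, so tied results come back in a fixed order.
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]))
```

Breaking ties on the title keeps result order stable between runs, which matters when tasks assert on "first result" positions.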

Seeded rows

  • releases: 7,042 (3,522 from Discogs API + 3,520 from MusicBrainz)
  • artists: 5,031
  • labels: 3,938
  • masters: 6,825
  • ratings: 86,714 (Discogs-skewed distribution, weighted toward 4-5 stars)
  • reviews: 3,253
  • collection_items: 4,207 with media + sleeve grades
  • wantlist_items: 1,374
  • lists: 40 (89 list_items)
  • marketplace listings: 2,640 (prices follow real Discogs distribution with vintage-pressing premium)
  • forum threads: 43, posts: 333

Benchmark users (passwords noted in tasks.jsonl):

| username | email | password | location |
| --- | --- | --- | --- |
| alice_crate | alice@test.com | alice12345 | Brooklyn, USA |
| bob_vinyl | bob@test.com | bob123456 | London, UK |
| carol_jazz | carol@test.com | carol12345 | Tokyo, Japan |
| dave_techno | dave@test.com | dave12345 | Berlin, Germany |

Plus 25 collector-style users with realistic locations and seller status across 5 continents.

Determinism work for byte-identical reset

  • MIRROR_REFERENCE_DATE = datetime(2026, 5, 26) pins every added_at / created_at / posted_at / joined_at so re-seeding from scraped_data/ is bit-for-bit reproducible across machines
  • Community generator uses random.Random(42) for sampling
  • Per-track placeholder generator seeds RNG with release_id * 13 + 7 so two clean machines produce identical tracklists
  • Every seed_*() function gated by count() > 0 early return (catalogue / community / users / taxonomy / forums) — confirmed no-op on populated DB across 3 successive boots
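A minimal sketch of the first three determinism measures, assuming a hypothetical tracklist shape: the per-release seed `release_id * 13 + 7` and the pinned `MIRROR_REFERENCE_DATE` come from this PR, while `placeholder_tracklist`, `added_at`, the track-count range and the duration range are illustrative assumptions rather than the code in seed_data.py.

```python
import random
from datetime import datetime, timedelta

# Pinned reference date, as described in this PR.
MIRROR_REFERENCE_DATE = datetime(2026, 5, 26)


def placeholder_tracklist(release_id: int) -> list[dict]:
    """Build a tracklist that depends only on release_id (hypothetical shape)."""
    rng = random.Random(release_id * 13 + 7)  # per-release seed from the PR
    n_tracks = rng.randint(6, 12)             # track-count range is an assumption
    return [
        {
            "position": i + 1,
            "title": f"Track {i + 1}",
            "duration_s": rng.randint(120, 420),  # duration range is an assumption
        }
        for i in range(n_tracks)
    ]


def added_at(days_ago: int) -> datetime:
    """Derive timestamps from the pinned date instead of datetime.now()."""
    return MIRROR_REFERENCE_DATE - timedelta(days=days_ago)
```

Because each generator owns a private `random.Random` instance and no timestamp reads the wall clock, the output does not depend on call order or on when the seed script runs.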

Paired Hugging Face PR

  • Heavy assets: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/24
  • Tarball: discogs.tar.gz (33.6 MB)
  • Tarball md5: 95787cc4bf21d21e2b715db14d66cd5b
  • DB md5: 3e7373b547034a45800a62c413a9feeb
  • Asset contents: sites/discogs/instance_seed/discogs.db (13 MB seed) + sites/discogs/static/images/release/*.jpg (323 real Wikipedia album covers, 32 MB)
  • .assets-revision is left at revision: main so the HF merge will roll in automatically (same approach as the TED / Phys.org PRs).
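For reviewers who want to check a downloaded tarball against the digest quoted above, a short sketch follows. The helper names `md5_of` and `matches` are hypothetical, and the local filename in the usage note is assumed to match the tarball name given in this PR.

```python
import hashlib
from pathlib import Path

# Tarball digest quoted in this PR description.
EXPECTED_TARBALL_MD5 = "95787cc4bf21d21e2b715db14d66cd5b"


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large tarball need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_md5: str) -> bool:
    """Compare a file's md5 against an expected hex digest, ignoring case."""
    return md5_of(path) == expected_md5.lower()
```

Usage would look like `matches(Path("discogs.tar.gz"), EXPECTED_TARBALL_MD5)`; the same helper can compare the extracted discogs.db against the DB md5 listed above.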

Verification

All checks below were run on this contributor's machine against webharbor:test built from this branch + the HF tarball extracted in place.

./scripts/build.sh

Successfully built ...
Successfully tagged webharbor:test

Container start (alt ports 8801 / 45000-45015)

docker run -d --rm --name wh-discogs-test \
  -p 8801:8101 -p 45000-45015:40000-40015 webharbor:test

All 16 sites came up. Port sweep:

45000:200 45001:200 45002:200 45003:200 45004:200 45005:200 45006:200 45007:200
45008:200 45009:200 45010:200 45011:200 45012:200 45013:200 45014:200 45015:200
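The port sweep above can be reproduced with a few lines of standard-library Python. This is a sketch, not the script used for this PR: `http_status` and `sweep` are hypothetical helpers, and the host and port range assume the alternate mapping from the `docker run` command.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0         # connection refused, timeout, or DNS failure


def sweep(first: int, last: int, host: str = "localhost",
          fetch: Callable[[str], int] = http_status) -> dict[int, int]:
    """Map each port in [first, last] to the status of GET / on that port."""
    return {port: fetch(f"http://{host}:{port}/") for port in range(first, last + 1)}
```

Calling `sweep(45000, 45015)` against the running container should yield 200 for all 16 ports if the sites are up; the `fetch` parameter exists so the function can be exercised without a network.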

POST /reset/discogs — byte-identical reset

$ curl -X POST http://localhost:8801/reset/discogs
{"pid":..,"ready":true,"site":"discogs"}

$ docker exec wh-discogs-test md5sum \
    /opt/WebSyn/discogs/instance/discogs.db \
    /opt/WebSyn/discogs/instance_seed/discogs.db
3e7373b547034a45800a62c413a9feeb  /opt/WebSyn/discogs/instance/discogs.db
3e7373b547034a45800a62c413a9feeb  /opt/WebSyn/discogs/instance_seed/discogs.db

/reset-all parallel reset

$ time curl -s -X POST http://localhost:8801/reset-all
... ok: true, all 16 sites ready ...
elapsed: 1.6s

Idempotency across repeated boot

Booted the app three times against the same seed; every md5 matched 3e7373b547034a45800a62c413a9feeb.
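The reason repeated boots leave the database unchanged is the `count() > 0` gate on each seeder. The sketch below shows that pattern with the standard-library `sqlite3` module; the `artists` table, its columns and the `seed_artists` name are illustrative assumptions, not the schema or functions in seed_data.py.

```python
import sqlite3


def seed_artists(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> bool:
    """Insert rows only into an empty table; return True if anything was written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS artists (id INTEGER PRIMARY KEY, name TEXT)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM artists").fetchone()
    if count > 0:
        return False  # already populated: skip, so repeated boots change nothing
    conn.executemany("INSERT INTO artists (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return True
```

Calling the function three times in a row writes once and then returns `False` twice, which mirrors the three-boot check described above.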

Discogs route smoke (20 routes, all 200)

/                                       200
/release/793593                         200   (Outlandos d'Amour by The Police)
/master/<id>                            200
/artist/<id>?sort=year_asc              200   (Miles Davis discography)
/label/90  (Columbia, 80 releases)      200
/genre/jazz?sort=rating                 200
/style/techno                           200
/format/vinyl                           200
/search?q=Bob+Marley&genre=reggae       200   (23 results)
/explore                                200
/marketplace?sort=price_asc             200
/marketplace?media=Near+Mint            200
/lists  +  /list/<id>  +  /list/new     200
/forum  +  /forum/jazz  +  /thread/<t>  200
/user/<u>  +  /collection +  /wantlist  200
/login  +  /register  +  /settings      200
/sell                                   200   (auth-gated)

Task verification (Playwright + Chromium, 20/20 PASS)

All 20 tasks in sites/discogs/tasks.jsonl were walked end-to-end via real Chromium:

PASS Discogs--0:  Outlandos d'Amour Have count = 23964
PASS Discogs--1:  Jazz top-rated = 'The Time Machine' by DAB trio
PASS Discogs--2:  Marketplace cheapest = $8.01 'Divine / Recall' by Harddrive (seller: garagerocker)
PASS Discogs--3:  Alice adds to Vinyl folder, condition NM
PASS Discogs--4:  Bob adds to wantlist with min_grade VG
PASS Discogs--5:  Miles Davis oldest = 'Plays For Lovers' (1965 US)
PASS Discogs--6:  Carol replies to 'Cartridges: MM vs MC for jazz' thread
PASS Discogs--7:  Columbia label shows 80 releases
PASS Discogs--8:  Dave creates new public list
PASS Discogs--9:  Alice profile: collection=180, wantlist=40
PASS Discogs--10: 'Bob Marley' + Reggae genre = 23 results
PASS Discogs--11: Register craterunner99 + set bio
PASS Discogs--12: Home Most-Collected #3 = 'Hard Promises' by Tom Petty And The Heartbreakers
PASS Discogs--13: Bob lists 'Outlandos d'Amour' for $42.00 (listing visible on release page)
PASS Discogs--14: 'Live-Evil' by Miles Davis has 8 tracklist rows
PASS Discogs--15: Explore→Electronic→Techno first release
PASS Discogs--16: Alice removes 'HOUSE NATION - Aquamarine' from Vinyl folder (persisted)
PASS Discogs--17: Dave's 'Records I Always Bring to the Listening Bar' has 24 items
PASS Discogs--18: Cheapest NM listing = 'Of Mice & Men – Defy' at $8.03 USD
PASS Discogs--19: Dave posts reply to 'Favourite album opener of all time'

Hardening fixes applied during evolve-env / harden-env:

  • /sell route now accepts either the public Discogs ID or the internal PK (matches the IDs visible in /release/<id> URLs)
  • Year-based sort orders use nullslast() so ?sort=year_asc no longer lists NULL-year releases first
  • Marketplace gained media + genre URL filters
  • Release-detail collection/wantlist forms expose folder + media/sleeve/min-grade selectors
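The NULL-ordering fix can be demonstrated without the ORM. SQLite treats NULL as smaller than any other value, so an ascending sort lists undated rows first. The sketch below uses the standard-library `sqlite3` module and an `ORDER BY year IS NULL, year` clause to get the same effect as a nulls-last ordering. The table and titles are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE releases (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO releases VALUES (?, ?)",
    [("Undated", None), ("Newer", 1980), ("Older", 1965)],  # hypothetical rows
)

# Plain ascending sort: SQLite orders NULL before every other value.
naive = [t for (t,) in conn.execute("SELECT title FROM releases ORDER BY year ASC")]

# "year IS NULL" is 0 for dated rows and 1 for NULL rows, so sorting on it first
# moves NULL years to the end while dated rows stay in ascending order.
fixed = [
    t
    for (t,) in conn.execute(
        "SELECT title FROM releases ORDER BY year IS NULL, year ASC"
    )
]
```

Here `naive` starts with the undated row, while `fixed` lists the dated rows in ascending order and the undated row last.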

Files

  • New: sites/discogs/{app.py, seed_data.py, _health.py, requirements.txt, tasks.jsonl, templates/*, static/{css,icons,js}/*}
  • Modified: websyn_start.sh, control_server.py, Dockerfile
  • Heavy assets (instance_seed/discogs.db, static/images/release/) live in HF PR #24 (https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/24), not in git.

🤖 Generated with Claude Code

Adds the 16th WebHarbor mirror: https://discogs.com, the world's
largest music release database and marketplace. Real catalogue of 7,042
releases / 5,031 artists / 3,938 labels / 6,825 master records pulled
from the Discogs public API + MusicBrainz, plus 323 real album covers
from Wikipedia. Backed by a benchmark community of 29 users with seeded
ratings, reviews, collections, wantlists, lists, marketplace listings,
and forum threads.

Registered as the 16th site at port 40015. Image bumps to 16 mirrors
total; EXPOSE 40000-40015.

## Site features
- Release / master / artist / label / genre / style / format pages
- Token-overlap scored search across releases, artists, labels with
  genre / style / format / year / country facets
- Marketplace with media-condition + genre filters, per-listing
  comments, grades, currencies, sellers
- User collection (5 folders: Uncategorized / All / Vinyl / CD /
  Wishlist Bought) with media + sleeve grades per item
- Wantlist with min-grade preferences
- User-curated public lists (CRUD)
- Forums (10 topical boards) with threaded replies
- Rating (1-5) + Review submission with helpful counts
- Auth (Flask-Login + bcrypt + CSRF), register / settings / logout
- 20 WebVoyager-format tasks in sites/discogs/tasks.jsonl

## Data scale
- releases: 7,042 (3,522 from Discogs API + 3,520 from MusicBrainz)
- artists: 5,031, labels: 3,938, masters: 6,825
- ratings: 86,714, reviews: 3,253
- collection_items: 4,207, wantlist_items: 1,374
- lists: 40, listings: 2,640, threads: 43, posts: 333
- benchmark users: alice_crate, bob_vinyl, carol_jazz, dave_techno
  (passwords: alice12345 / bob123456 / carol12345 / dave12345)
- + 25 collector-style users with realistic locations & seller status

## Determinism work
- MIRROR_REFERENCE_DATE = datetime(2026,5,26) pins all date fields so
  re-seeding from scraped_data/ is bit-for-bit reproducible
- random.Random(42) seed for the community generator
- Idempotent gates on every seed_*() function (count() > 0 → early
  return); byte-identical reset verified

## Verification
- Docker build green; all 16 sites return 200
- POST /reset/discogs keeps DB byte-identical to seed
- All 20 tasks pass when walked via Playwright (Chromium)

## Paired Hugging Face assets
- Heavy assets shipped via the ChilleD/WebHarbor HF dataset:
  - sites/discogs/instance_seed/discogs.db (13 MB)
  - sites/discogs/static/images/release/*.jpg (323 covers, 32 MB)
- .assets-revision is left at `revision: main` so the HF merge will
  roll in automatically (same approach as TED / Phys.org PRs).