Add Discogs mirror site (port 40015)#34
Open
hqhq1025 wants to merge 1 commit into
Open
Conversation
Adds the 16th WebHarbor mirror at https://discogs.com — the world's largest music release database and marketplace. Real catalogue of 7,042 releases / 5,031 artists / 3,938 labels / 6,825 master records pulled from the Discogs public API + MusicBrainz, plus 323 real album covers from Wikipedia. Backed by a benchmark community of 29 users with seeded ratings, reviews, collections, wantlists, lists, marketplace listings, and forum threads. Registered as the 16th site at port 40015. Image bumps to 16 mirrors total; EXPOSE 40000-40015. ## Site features - Release / master / artist / label / genre / style / format pages - Token-overlap scored search across releases, artists, labels with genre / style / format / year / country facets - Marketplace with media-condition + genre filters, per-listing comments, grades, currencies, sellers - User collection (5 folders: Uncategorized / All / Vinyl / CD / Wishlist Bought) with media + sleeve grades per item - Wantlist with min-grade preferences - User-curated public lists (CRUD) - Forums (10 topical boards) with threaded replies - Rating (1-5) + Review submission with helpful counts - Auth (Flask-Login + bcrypt + CSRF), register / settings / logout - 20 WebVoyager-format tasks in sites/discogs/tasks.jsonl ## Data scale - releases: 7,042 (3,522 from Discogs API + 3,520 from MusicBrainz) - artists: 5,031, labels: 3,938, masters: 6,825 - ratings: 86,714, reviews: 3,253 - collection_items: 4,207, wantlist_items: 1,374 - lists: 40, listings: 2,640, threads: 43, posts: 333 - benchmark users: alice_crate, bob_vinyl, carol_jazz, dave_techno (passwords: alice12345 / bob123456 / carol12345 / dave12345) - + 25 collector-style users with realistic locations & seller status ## Determinism work - MIRROR_REFERENCE_DATE = datetime(2026,5,26) pins all date fields so re-seeding from scraped_data/ is bit-for-bit reproducible - random.Random(42) seed for the community generator - Idempotent gates on every seed_*() function (count() > 0 → early return); byte-identical reset verified ## Verification - Docker build green; all 16 sites return 200 - POST /reset/discogs keeps DB byte-identical to seed - All 20 tasks pass when walked via Playwright (Chromium) ## Paired Hugging Face assets - Heavy assets shipped via the ChilleD/WebHarbor HF dataset: - sites/discogs/instance_seed/discogs.db (13 MB) - sites/discogs/static/images/release/*.jpg (323 covers, 32 MB) - .assets-revision is left at `revision: main` so the HF merge will roll in automatically (same approach as TED / Phys.org PRs).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds the 16th WebHarbor mirror at https://discogs.com — the world's largest music release database and marketplace. Real catalogue of 7,042 releases / 5,031 artists / 3,938 labels / 6,825 master records sourced from the Discogs public API + MusicBrainz, with 323 real album covers fetched from Wikipedia. Backed by a benchmark community of 29 users with 86,714 ratings, 3,253 reviews, 4,207 collection items, 2,640 marketplace listings, 40 user lists, 43 forum threads and 333 posts — all idempotently seeded.
Registered as the 16th site at port
40015. Image bumps to 16 mirrors total; EXPOSE 40000-40015.Site features
/release/793593) or the internal PKsites/discogs/tasks.jsonlSeeded rows
Benchmark users (passwords noted in
tasks.jsonl):alice_cratealice@test.comalice12345bob_vinylbob@test.combob123456carol_jazzcarol@test.comcarol12345dave_technodave@test.comdave12345Plus 25 collector-style users with realistic locations and seller status across 5 continents.
Determinism work for byte-identical reset
MIRROR_REFERENCE_DATE = datetime(2026, 5, 26)pins everyadded_at/created_at/posted_at/joined_atso re-seeding fromscraped_data/is bit-for-bit reproducible across machinesrandom.Random(42)for samplingrelease_id * 13 + 7so two clean machines produce identical tracklistsseed_*()function gated bycount() > 0early return (catalogue / community / users / taxonomy / forums) — confirmed no-op on populated DB across 3 successive bootsPaired Hugging Face PR
discogs.tar.gz(33.6 MB)95787cc4bf21d21e2b715db14d66cd5b3e7373b547034a45800a62c413a9feebsites/discogs/instance_seed/discogs.db(13 MB seed) +sites/discogs/static/images/release/*.jpg(323 real Wikipedia album covers, 32 MB).assets-revisionis left atrevision: mainso the HF merge will roll in automatically (same approach as the TED / Phys.org PRs).Verification
All checks below were run on this contributor's machine against
webharbor:testbuilt from this branch + the HF tarball extracted in place../scripts/build.shContainer start (alt ports 8801 / 45000-45015)
All 16 sites came up. Port sweep:
POST /reset/discogs— byte-identical reset/reset-allparallel resetIdempotency across repeated boot
Booted the app three times against the same seed; every md5 matched
3e7373b547034a45800a62c413a9feeb.Discogs route smoke (20 routes, all 200)
Task verification (Playwright + Chromium, 20/20 PASS)
All 20 tasks in
sites/discogs/tasks.jsonlwere walked end-to-end via real Chromium:Hardening fixes applied during evolve-env / harden-env:
/sellroute now accepts either the public Discogs ID or the internal PK (matches the IDs visible in/release/<id>URLs)nullslast()so?sort=year_ascdoesn't sink NULL-year releases to the topmedia+genreURL filtersFiles
sites/discogs/{app.py, seed_data.py, _health.py, requirements.txt, tasks.jsonl, templates/*, static/{css,icons,js}/*}websyn_start.sh,control_server.py,Dockerfileinstance_seed/discogs.db,static/images/release/) live in HF PR Add CarMax mirror (port 40015) #24, not in git.🤖 Generated with Claude Code