Add IMDb mirror (port 40015) #33
Open
hqhq1025 wants to merge 1 commit into
Adds sites/imdb/ — a Flask mirror of imdb.com built per the
.claude/skills/ six-phase contributor pipeline (clone-website,
design-tasks, evolve-env, harden-env, seed-database).
Catalog (scraped from imdb.com via Playwright, gitignored in
sites/imdb/scraped_data/; shipped as instance_seed/imdb.db +
static/images/ via HF dataset):
- 392 titles (Top 250 + Top TV 250 + Most Popular + Box Office)
- 3796 persons (cast / crew with real bios + headshots)
- 4280 real posters + headshots
- 19 canonical IMDb genres
Coverage:
- 22 routes: homepage / title detail / fullcredits / reviews /
rate / watchlist toggle / person detail / scored search /
advanced search (genre / year / rating / sort) /
4 charts (Top 250 / Top TV / Most Popular / Box Office) /
genre browse / news / auth (login / register / account)
- scored token-overlap search (not strict-AND), 18 stop words
- 4 benchmark users alice.j / bob.c / carol.d / david.k
@ test.com (password TestPass123!) — pre-seeded with
4-item watchlists, 3-5 ratings, one written review each
- 12 seeded news items + 12 featured reviews (high
helpful-count) across major titles
- 18 benchmark tasks in tasks.jsonl across 6 functional areas:
chart browse, title detail, person filmography, advanced
search, genre browse, user state — including 4 hard tasks
(>= 5 steps) and 2 disambiguation tasks (bob's SF/fantasy
watchlist subset; carol's crime film subset)
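A minimal sketch of what "scored token-overlap search (not strict-AND)" could look like. The stop-word set, the function names, and the tie-breaking rule are assumptions made for illustration; the PR does not list its 18 stop words or its exact scoring formula.

```python
import re

# Hypothetical stop-word list; the PR mentions 18 stop words but does not list them.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def scored_search(query, titles):
    """Rank titles by how many query tokens they share.

    Any overlap qualifies a title, so one unmatched query token does not
    empty the result set the way a strict-AND match would.
    """
    q_tokens = set(tokenize(query))
    if not q_tokens:
        return []
    scored = []
    for title in titles:
        overlap = len(q_tokens & set(tokenize(title)))
        if overlap > 0:
            scored.append((overlap, title))
    # Higher overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]
```

With the sample list `["The Dark Knight", "Knight and Day", "Parasite"]`, the query `"the dark knight rises"` returns `["The Dark Knight", "Knight and Day"]`, whereas a strict-AND match would return nothing because no title contains "rises".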
Data integrity:
- canonical-URL guard skips 65 IMDb-redirected tt_ids
(some unknown tt_ids returned the wrong page)
- garbage filter skips 1854 names that hit IMDb's 403/
error fallback during concurrent scraping
- html.unescape on all string fields; hero h1 preferred
over ld.name (which is sometimes original-language)
- lowercase substring match for box-office data-testid keys
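The integrity checks listed above could be sketched roughly as follows. All function names are hypothetical, and the `data-testid` key in the usage note is an assumed sample rather than something taken from the PR; the scraper's real code may be organized quite differently.

```python
import html
import re

TT_RE = re.compile(r"tt\d+")


def canonical_tt_id(ld_url):
    """Extract the tt_id from a JSON-LD url field, or None if absent."""
    match = TT_RE.search(ld_url or "")
    return match.group(0) if match else None


def is_redirected(requested_tt_id, ld_url):
    """True when the page served does not belong to the requested tt_id."""
    return canonical_tt_id(ld_url) != requested_tt_id


def clean_title(hero_h1, ld_name):
    """Prefer the hero <h1> text, fall back to ld.name, and decode HTML entities."""
    raw = hero_h1 or ld_name or ""
    return html.unescape(raw).strip()


def find_box_office(fields, needle):
    """Return the first value whose data-testid key contains `needle`, ignoring case."""
    needle = needle.lower()
    for key, value in fields.items():
        if needle in key.lower():
            return value
    return None
```

For example, `is_redirected("tt0245429", "https://www.imdb.com/title/tt0054215/")` is `True`, so that scraped file would be skipped, and `find_box_office({"title-boxoffice-grossdomestic": "$1"}, "GrossDomestic")` matches despite the differing case.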
HF asset PR:
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/23
Summary
- sites/imdb/: a Flask mirror of imdb.com that fits WebHarbor's 16th port slot (40000+15 = 40015)
- Built per the .claude/skills/ six-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database)
Catalog
Routes (22)
- /, /_health
- /title/<tt>, /title/<tt>/fullcredits, /title/<tt>/reviews, /title/<tt>/review (POST)
- /title/<tt>/rate (POST), /title/<tt>/watchlist (POST)
- /name/<nm>
- /find?q=&s=all|tt|nm, /search, /search/title?genre=&year_from=&year_to=&rating_min=&sort=
- /chart/top, /chart/toptv, /chart/moviemeter, /chart/boxoffice
- /genre/<slug>
- /list/watchlist, /list/ratings
- /news
- /login, /register, /logout, /account
Benchmark users (per skill convention)
alice.j / bob.c / carol.d / david.k @test.com, password TestPass123!; each pre-seeded with a 4-item watchlist, 3-5 ratings, and one written review.
tasks.jsonl (18 tasks, 6 functional areas × 3)
Data integrity safeguards (added after live audit)
- Canonical-URL guard: skip any page whose ld.url tt_id ≠ filename tt_id (IMDb redirects unknown tt_ids to a random valid page; without this guard, e.g. tt0245429 silently became Psycho)
- html.unescape() on all string fields; hero h1 preferred over ld.name (which is sometimes original-language: Gisaengchung vs Parasite)
- Lowercase substring match for box-office data-testid keys (bo_grossdomestic, bo_cumulativeworldwidegross, ...)
Paired HF PR
instance_seed/imdb.db + static/images/* shipped via: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/23
After that HF PR merges, bump .assets-revision to the merge SHA in a follow-up commit.
Test plan
- ./scripts/build.sh succeeds (after ./scripts/fetch_assets.sh pulls HF assets)
- /health reports all 16 sites alive
- POST /reset/imdb → md5sum of instance/imdb.db matches instance_seed/imdb.db (byte-identical)
- /reset-all finishes under 10s (measured 1.3s on 26-site test image)
- /reviews page
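The byte-identical reset check in the test plan can also be scripted instead of comparing md5sum output by eye. This is only a sketch: the helper names are hypothetical, and it assumes the two database paths from the test plan are readable from wherever the script runs, after POST /reset/imdb has already been issued.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_is_clean(live_db, seed_db):
    """True when the live database is byte-identical to the seed (same MD5)."""
    return md5_of(live_db) == md5_of(seed_db)
```

A call such as `reset_is_clean("instance/imdb.db", "instance_seed/imdb.db")` would then stand in for the manual comparison.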