Add IMDb mirror (port 40015) by hqhq1025 · Pull Request #33 · aiming-lab/WebHarbor

hqhq1025 · 2026-05-26T14:49:05Z

Summary

Adds sites/imdb/ — a Flask mirror of imdb.com that fits WebHarbor's 16th port slot (40000+15 = 40015)
Built end-to-end per the .claude/skills/ six-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database)
392 titles / 3796 persons / 4280 real posters & headshots, scraped from imdb.com with Playwright; assets shipped via the paired HF dataset PR

Catalog

Entity	Count	Source
Titles	392	Top 250 movies + Top 250 TV + Most Popular + Box Office (deduped)
Persons	3796	Cast/crew of all titles, with real bios + headshots
Credits	4852	director/writer/actor/producer/composer
Genres	19	Canonical IMDb genre buckets
Posters + headshots	4280	Real images downloaded from m.media-amazon.com
Featured reviews	12	Curated, attached to high-traffic titles
News items	12	Industry-style headlines linked to titles/people
Box office data	145 movies	Real US / WW / opening / budget figures

Routes (22)

/, /_health
/title/<tt>, /title/<tt>/fullcredits, /title/<tt>/reviews, /title/<tt>/review (POST)
/title/<tt>/rate (POST), /title/<tt>/watchlist (POST)
/name/<nm>
/find?q=&s=all|tt|nm, /search, /search/title?genre=&year_from=&year_to=&rating_min=&sort=
/chart/top, /chart/toptv, /chart/moviemeter, /chart/boxoffice
/genre/<slug>
/list/watchlist, /list/ratings
/news
/login, /register, /logout, /account
Search uses scored token overlap (skill rule: never strict-AND) with an 18-word stop list
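As a rough illustration of what "scored token-overlap, never strict-AND" can mean in practice, here is a minimal sketch. It is not the code in sites/imdb/: the function names, the scoring formula (fraction of query tokens found in the candidate), and the stop-word set are all assumptions, since the PR states only that the list has 18 entries and does not enumerate them.

```python
import re

# Hypothetical stop words; the PR mentions an 18-word list but does not list it.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def score(query, candidate):
    """Fraction of query tokens that also appear in the candidate (assumed formula)."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(tokenize(candidate))) / len(query_tokens)


def search(query, titles, limit=10):
    """Rank titles by overlap score; any title sharing at least one token is kept."""
    hits = [(score(query, t), t) for t in titles]
    hits = [(s, t) for s, t in hits if s > 0]
    # Highest score first, then alphabetical for a stable order.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in hits[:limit]]
```

With this scoring, a query such as "the dark knight returns" still ranks "The Dark Knight" even though "returns" matches nothing, whereas a strict-AND search would return no results.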

Benchmark users (per skill convention)

alice.j / bob.c / carol.d / david.k @ test.com, password TestPass123! — each pre-seeded with a 4-item watchlist, 3-5 ratings, and one written review.

tasks.jsonl (18 tasks, 6 functional areas × 3)

4 hard tasks (≥5 agent steps): hardest-rated Nolan film by cross-clicking each filmography entry, advanced-search + click-through + read multi-field, end-to-end review write, cross-title compare
2 disambiguation tasks: bob's SF/fantasy watchlist subset (GoT, Stranger Things); carol's crime film subset (GoodFellas, Seven, Silence of the Lambs)

Data integrity safeguards (added after live audit)

canonical-URL guard: skip 65 title JSONs where ld.url tt_id ≠ filename tt_id (IMDb redirects unknown tt_ids to a random valid page; without this guard, e.g. tt0245429 silently became Psycho)
garbage filter: skip 1854 name JSONs that hit IMDb's 403/error fallback during concurrent scraping
html.unescape() on all string fields; hero h1 preferred over ld.name (which is sometimes original-language: Gisaengchung vs Parasite)
Lowercase substring match for box-office data-testid keys (bo_grossdomestic, bo_cumulativeworldwidegross, ...)
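The safeguards above could be sketched roughly as follows. This is an illustration under assumptions, not the importer shipped in this PR: the function names, the argument shapes, and the sample data-testid string in the usage note are hypothetical.

```python
import html
import re

TT_RE = re.compile(r"tt\d+")


def is_canonical(filename, ld_url):
    """True when the tt_id in the JSON-LD url matches the tt_id in the filename."""
    name_id = TT_RE.search(filename)
    url_id = TT_RE.search(ld_url or "")
    return bool(name_id and url_id and name_id.group() == url_id.group())


def clean_title(hero_h1, ld_name):
    """Prefer the hero h1, fall back to ld.name, and unescape HTML entities."""
    return html.unescape((hero_h1 or ld_name or "").strip())


def pick_box_office(testid_values, needle):
    """Return the first value whose data-testid contains `needle`, ignoring case."""
    needle = needle.lower()
    for testid, value in testid_values.items():
        if needle in testid.lower():
            return value
    return None
```

For example, `is_canonical("tt0245429.json", "https://www.imdb.com/title/tt0054215/")` is false, so that file would be skipped rather than imported under the wrong title, and `pick_box_office({"Title-BoxOffice-BO_GrossDomestic": "$1"}, "bo_grossdomestic")` matches despite the differing case (the key shown is made up for the example).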

Paired HF PR

instance_seed/imdb.db + static/images/* shipped via:
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/23

After that HF PR merges, bump .assets-revision to the merge SHA in a follow-up commit.

Test plan

./scripts/build.sh succeeds (after ./scripts/fetch_assets.sh pulls HF assets)
Container starts and /health reports all 16 sites alive
All 22 imdb routes return < 500
POST /reset/imdb → md5sum of instance/imdb.db matches instance_seed/imdb.db (byte-identical)
Container restart preserves md5 byte-identity
/reset-all finishes under 10s (measured 1.3s on 26-site test image)
Playwright handwalk: login + write review on Interstellar → review appears on /reviews page
Playwright handwalk: bob disambiguation watchlist has ≥2 SF/fantasy candidates
Playwright handwalk: Nolan filmography → click Dark Knight → extract rating 9.1
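The byte-identity checks in the plan can also be scripted rather than run by hand with md5sum. A minimal sketch, assuming the two database paths named in the plan are readable from wherever the script runs; the helper names are hypothetical:

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large databases are not loaded into memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reset_clean(seed_path, live_path):
    """True when the live database is byte-identical to the seed, judged by MD5."""
    return md5_of(seed_path) == md5_of(live_path)
```

After a POST to /reset/imdb, calling `is_reset_clean("instance_seed/imdb.db", "instance/imdb.db")` should return True if the reset restored the seed exactly.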

Adds sites/imdb/, a Flask mirror of imdb.com built per the .claude/skills/ six-phase contributor pipeline (clone-website, design-tasks, evolve-env, harden-env, seed-database).

Catalog (scraped from imdb.com via Playwright, gitignored in sites/imdb/scraped_data/; shipped as instance_seed/imdb.db + static/images/ via HF dataset):
- 392 titles (Top 250 + Top TV 250 + Most Popular + Box Office)
- 3796 persons (cast / crew with real bios + headshots)
- 4280 real posters + headshots
- 19 canonical IMDb genres

Coverage:
- 22 routes: homepage / title detail / fullcredits / reviews / rate / watchlist toggle / person detail / scored search / advanced search (genre / year / rating / sort) / 4 charts (Top 250 / Top TV / Most Popular / Box Office) / genre browse / news / auth (login / register / account)
- scored token-overlap search (not strict-AND), 18 stop words
- 4 benchmark users alice.j / bob.c / carol.d / david.k @ test.com (password TestPass123!), pre-seeded with 4-item watchlists, 3-5 ratings, one written review each
- 12 seeded news items + 12 featured reviews (high helpful-count) across major titles
- 18 benchmark tasks in tasks.jsonl across 6 functional areas: chart browse, title detail, person filmography, advanced search, genre browse, user state; including 4 hard tasks (>= 5 steps) and 2 disambiguation tasks (bob's SF/fantasy watchlist subset; carol's crime film subset)

Data integrity:
- canonical-URL guard skips 65 IMDb-redirected tt_ids (some unknown tt_ids returned the wrong page)
- garbage filter skips 1854 names that hit IMDb's 403/error fallback during concurrent scraping
- html.unescape on all string fields; hero h1 preferred over ld.name (which is sometimes original-language)
- lowercase substring match for box-office data-testid keys

HF asset PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/23

hqhq1025 wants to merge 1 commit into aiming-lab:main from hqhq1025:feat/imdb-mirror
