Add IMDb mirror (port 40015)#33

Open
hqhq1025 wants to merge 1 commit into
aiming-lab:mainfrom
hqhq1025:feat/imdb-mirror

@hqhq1025

Summary

  • Adds sites/imdb/ — a Flask mirror of imdb.com that fits WebHarbor's 16th port slot (40000+15 = 40015)
  • Built end-to-end per the .claude/skills/ six-phase pipeline (clone-website → design-tasks → evolve-env → harden-env → seed-database)
  • 392 titles / 3796 persons / 4280 real posters & headshots, scraped from imdb.com with Playwright; assets shipped via the paired HF dataset PR
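
The port arithmetic in the first bullet can be sketched as below. This is a minimal illustration, assuming slots are numbered from 1 and the base port is 40000 (inferred from "40000+15 = 40015"); `BASE_PORT` and `port_for_slot` are hypothetical names, not identifiers from the repository.

```python
BASE_PORT = 40000  # assumed base port, inferred from "40000+15 = 40015"


def port_for_slot(slot_number: int) -> int:
    """Map a 1-based site slot to its port (slot 16 -> 40015)."""
    if slot_number < 1:
        raise ValueError("slot numbers start at 1")
    return BASE_PORT + (slot_number - 1)


print(port_for_slot(16))  # 40015
```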

Catalog

| Entity | Count | Source |
| --- | --- | --- |
| Titles | 392 | Top 250 movies + Top 250 TV + Most Popular + Box Office (deduped) |
| Persons | 3796 | Cast/crew of all titles, with real bios + headshots |
| Credits | 4852 | director/writer/actor/producer/composer |
| Genres | 19 | Canonical IMDb genre buckets |
| Posters + headshots | 4280 | Real images downloaded from m.media-amazon.com |
| Featured reviews | 12 | Curated, attached to high-traffic titles |
| News items | 12 | Industry-style headlines linked to titles/people |
| Box office data | 145 movies | Real US / WW / opening / budget figures |

Routes (22)

  • /, /_health
  • /title/<tt>, /title/<tt>/fullcredits, /title/<tt>/reviews, /title/<tt>/review (POST)
  • /title/<tt>/rate (POST), /title/<tt>/watchlist (POST)
  • /name/<nm>
  • /find?q=&s=all|tt|nm, /search, /search/title?genre=&year_from=&year_to=&rating_min=&sort=
  • /chart/top, /chart/toptv, /chart/moviemeter, /chart/boxoffice
  • /genre/<slug>
  • /list/watchlist, /list/ratings
  • /news
  • /login, /register, /logout, /account
  • Scored token-overlap search (skill rule: never strict-AND); 18 stop-word list
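
One way to read "scored token-overlap search (never strict-AND)" is sketched below. It is an illustration rather than the code in sites/imdb/: the stop-word set is a hypothetical subset (the PR's actual 18 words are not listed here), and `tokenize`, `score`, and `search` are assumed names.

```python
import re

# Hypothetical stop words; the real 18-word list is not shown in the PR.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "for"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def score(query, title):
    """Fraction of query tokens that appear in the title (0.0 to 1.0)."""
    q = set(tokenize(query))
    if not q:
        return 0.0
    return len(q & set(tokenize(title))) / len(q)


def search(query, titles, limit=10):
    """Rank titles by overlap; a title qualifies if it matches any query token."""
    scored = [(score(query, t), t) for t in titles]
    hits = [(s, t) for s, t in scored if s > 0]
    hits.sort(key=lambda st: (-st[0], st[1]))  # best score first, then by title
    return [t for _, t in hits[:limit]]


titles = ["The Dark Knight", "The Dark Knight Rises", "Knight and Day", "Parasite"]
print(search("dark knight batman", titles))
# ['The Dark Knight', 'The Dark Knight Rises', 'Knight and Day']
```

A strict-AND matcher would return nothing for this query because no title contains "batman"; scoring by overlap still surfaces the partial matches, which is presumably why the skill rule forbids strict-AND.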

Benchmark users (per skill convention)

alice.j / bob.c / carol.d / david.k @ test.com, password TestPass123! — each pre-seeded with a 4-item watchlist, 3-5 ratings, and one written review.

tasks.jsonl (18 tasks, 6 functional areas × 3)

  • 4 hard tasks (≥5 agent steps): finding the highest-rated Nolan film by clicking through each filmography entry; advanced search followed by click-through and a multi-field read; an end-to-end review write; and a cross-title comparison
  • 2 disambiguation tasks: bob's SF/fantasy watchlist subset (GoT, Stranger Things); carol's crime film subset (GoodFellas, Seven, Silence of the Lambs)

Data integrity safeguards (added after live audit)

  • canonical-URL guard: skip 65 title JSONs where ld.url tt_id ≠ filename tt_id (IMDb redirects unknown tt_ids to a random valid page; without this guard, e.g. tt0245429 silently became Psycho)
  • garbage filter: skip 1854 name JSONs that hit IMDb's 403/error fallback during concurrent scraping
  • html.unescape() on all string fields; hero h1 preferred over ld.name (which is sometimes original-language: Gisaengchung vs Parasite)
  • Lowercase substring match for box-office data-testid keys (bo_grossdomestic, bo_cumulativeworldwidegross, ...)
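
Three of the safeguards above (the canonical-URL guard, entity decoding with the h1 preference, and the lowercase data-testid match) could look roughly like this. This is a sketch under assumptions: the function names, the filename pattern, and the sample data-testid value are hypothetical, and the garbage filter is omitted because its detection criteria are not described in the PR.

```python
import html
import re

TT_RE = re.compile(r"tt\d+")


def canonical_id_matches(filename, ld_url):
    """True only when the tt_id in the JSON-LD URL equals the one in the filename."""
    file_id = TT_RE.search(filename)
    url_id = TT_RE.search(ld_url)
    return bool(file_id and url_id and file_id.group() == url_id.group())


def clean_title(hero_h1, ld_name):
    """Prefer the hero <h1> text over ld.name, then decode HTML entities."""
    raw = hero_h1 or ld_name or ""
    return html.unescape(raw).strip()


def box_office_field(testid, wanted):
    """Case-insensitive substring match on a data-testid value."""
    return wanted.lower() in testid.lower()


# A redirected page: the file was requested as tt0245429, but the JSON-LD URL
# carries a different ID (illustrative value), so the record is skipped.
print(canonical_id_matches("tt0245429.json", "https://www.imdb.com/title/tt0054215/"))  # False
print(clean_title("Parasite", "Gisaengchung"))  # Parasite
print(clean_title(None, "Fast &amp; Furious"))  # Fast & Furious
print(box_office_field("title-BO_GrossDomestic", "bo_grossdomestic"))  # True
```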

Paired HF PR

instance_seed/imdb.db + static/images/* shipped via:
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/23

After that HF PR merges, bump .assets-revision to the merge SHA in a follow-up commit.

Test plan

  • ./scripts/build.sh succeeds (after ./scripts/fetch_assets.sh pulls HF assets)
  • Container starts and /health reports all 16 sites alive
  • All 22 imdb routes return < 500
  • POST /reset/imdb → md5sum of instance/imdb.db matches instance_seed/imdb.db (byte-identical)
  • Container restart preserves md5 byte-identity
  • /reset-all finishes under 10s (measured 1.3s on 26-site test image)
  • Playwright handwalk: login + write review on Interstellar → review appears on /reviews page
  • Playwright handwalk: bob disambiguation watchlist has ≥2 SF/fantasy candidates
  • Playwright handwalk: Nolan filmography → click Dark Knight → extract rating 9.1
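
The byte-identity checks in the test plan amount to comparing MD5 digests of the live and seed databases. A minimal sketch using only the standard library follows; `md5_of` and `is_byte_identical` are hypothetical helper names, and the paths in the usage note are the ones quoted in the test plan.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(live_db, seed_db):
    """True when both files hash to the same MD5 digest."""
    return md5_of(live_db) == md5_of(seed_db)
```

After `POST /reset/imdb` or a container restart, `is_byte_identical("instance/imdb.db", "instance_seed/imdb.db")` would be expected to return `True`.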

Adds sites/imdb/ — a Flask mirror of imdb.com built per the
.claude/skills/ six-phase contributor pipeline (clone-website,
design-tasks, evolve-env, harden-env, seed-database).

Catalog (scraped from imdb.com via Playwright, gitignored in
sites/imdb/scraped_data/; shipped as instance_seed/imdb.db +
static/images/ via HF dataset):
  - 392 titles  (Top 250 + Top TV 250 + Most Popular + Box Office)
  - 3796 persons (cast / crew with real bios + headshots)
  - 4280 real posters + headshots
  - 19 canonical IMDb genres

Coverage:
  - 22 routes: homepage / title detail / fullcredits / reviews /
    rate / watchlist toggle / person detail / scored search /
    advanced search (genre / year / rating / sort) /
    4 charts (Top 250 / Top TV / Most Popular / Box Office) /
    genre browse / news / auth (login / register / account)
  - scored token-overlap search (not strict-AND), 18 stop words
  - 4 benchmark users alice.j / bob.c / carol.d / david.k
    @ test.com (password TestPass123!) — pre-seeded with
    4-item watchlists, 3-5 ratings, one written review each
  - 12 seeded news items + 12 featured reviews (high
    helpful-count) across major titles
  - 18 benchmark tasks in tasks.jsonl across 6 functional areas:
    chart browse, title detail, person filmography, advanced
    search, genre browse, user state — including 4 hard tasks
    (>= 5 steps) and 2 disambiguation tasks (bob's SF/fantasy
    watchlist subset; carol's crime film subset)

Data integrity:
  - canonical-URL guard skips 65 IMDb-redirected tt_ids
    (some unknown tt_ids returned the wrong page)
  - garbage filter skips 1854 names that hit IMDb's 403/
    error fallback during concurrent scraping
  - html.unescape on all string fields; hero h1 preferred
    over ld.name (which is sometimes original-language)
  - lowercase substring match for box-office data-testid keys

HF asset PR:
  https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/23