feat(mega): add new site#2

Open
plzdoo wants to merge 1 commit into aiming-lab:main from plzdoo:main

plzdoo commented May 12, 2026

MEGA WebHarbor PR Notes

Real Site Mirrored

MEGA
https://mega.io/

This PR adds a WebHarbor mirror for MEGA, including encrypted cloud storage, pricing, checkout, account management, Cloud drive, MEGA Pass vault, downloads, help center, support tickets, VPN, business, and S4 object storage workflows.

Seeded Rows

  • plans: 19
  • product_pages: 17
  • users: 4
  • help_articles: 36
  • cloud_items: 74
  • downloads: 19
  • vault_items: 13
  • payment_methods: 8
  • subscription_orders: 4
  • support_tickets: 4

Also included:

  • tasks.jsonl: 18 benchmark tasks
  • static/images: 128 real MEGA assets
  • static/icons: 17 icons
  • templates: 24 HTML templates

Hugging Face Assets PR

https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/1

Asset commit shown on HF PR:

https://huggingface.co/datasets/ChilleD/WebHarbor/commit/8997072c

Reset Verification

Command:

curl -X POST http://localhost:8201/reset/mega

Output:

{"pid":211,"ready":true,"site":"mega"}

Byte-identical DB reset verification:

ded8dd2625488968eec03c13f9ba277f  /opt/WebSyn/mega/instance/mega.db
ded8dd2625488968eec03c13f9ba277f  /opt/WebSyn/mega/instance_seed/mega.db
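
The same comparison can be scripted rather than read off by eye. The following is a minimal sketch, assuming only the two paths shown in the output above; the helper names `md5_of` and `is_byte_identical` are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(live_db, seed_db):
    """True when both files hash to the same MD5 digest."""
    return md5_of(live_db) == md5_of(seed_db)


if __name__ == "__main__":
    # Paths copied from the verification output above.
    live = Path("/opt/WebSyn/mega/instance/mega.db")
    seed = Path("/opt/WebSyn/mega/instance_seed/mega.db")
    if live.exists() and seed.exists():
        print(is_byte_identical(live, seed))
```

Running it right after the `/reset/mega` call would be one way to turn the manual check into a repeatable one.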

Additional Verification

  • ./scripts/build.sh webharbor:dev passed.
  • Container ran with -p 8201:8101 -p 41000-41015:40000-40015.
  • All ports 41000-41015 returned HTTP 200.
  • Container Playwright smoke passed for homepage, pricing, downloads, help, login, and Cloud drive.
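
The port check above could be automated along these lines. This is a sketch only: the helper names are hypothetical, the host is assumed to be `localhost`, and the port range is the host side of the `-p 41000-41015:40000-40015` mapping.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


def failing_ports(ports, host="localhost", probe_fn=probe):
    """Return {port: status} for every port that did not answer 200."""
    failures = {}
    for port in ports:
        status = probe_fn(f"http://{host}:{port}/")
        if status != 200:
            failures[port] = status
    return failures
```

With the container running, `failing_ports(range(41000, 41016))` returning an empty dict would correspond to the "all ports returned HTTP 200" result reported above.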

hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 26, 2026
…real page coverage)

Per-site outcomes (data growth + tasks):
- allrecipes: recipes 674→1316, tasks 45→161 (+5 new pages: meals/ingredients/cuisines/newsletter/sitemap; /register+/login+/recipe-box POST verified)
- amazon: products 743→1506 (Open Library + brand SKUs across 8 categories), tasks 41→155 (+/registry+/sell; /api/cart/update + /order/return real persistence; fixed 9 datetime.utcnow defaults)
- apple: products 157→305 (Watch bands, cases, accessories, legacy lineup), trade_in_values 14→54, tasks 43→158 (+9 static pages + /trade-in/quote JSON endpoint with condition multipliers)
- arxiv: papers 4079→8692 (35 cats × 3 windows, 3.2s throttle), tasks 43→153 (MEGA-style index reorder + VACUUM applied)
- bbc_news: articles 632→1304 (25 BBC sub-feeds), tasks 42→156 (+/privacy+/terms+/cookies+/accessibility)
- booking: property 649→1258, city 85→136, landmark 203→329, tasks 44→150 (+/customer-service+/legal+/careers+/press incl. aliases)
- cambridge_dictionary: words 3321→7021 (+IPA/audio columns), tasks 43→159 (+/word-of-the-day archive+/wordlists×6+/blog+/about+/help)
- coursera: courses 581→1369 (Specializations/Pro-Certs/Guided-Projects/Degrees), tasks 42→160 (+/for-universities+/for-government+/help+/careers+/mobile+/blog)
- espn: articles 501→1033, games 316→526, player_stats 228→508, tasks 44→160 (+/about+/press+/careers+/watch; /favorite POST persists)
- github: repos 1158→2846 (3 star bands), commits 6850→26422, PRs 0→1500, tasks 41→157 (+Solutions/Enterprise/Docs/API/Status/Blog/Contact/Privacy/Terms stubs)
- google_flights: airports 166→365 (OpenFlights full), bookings 12→63, tasks 42→152 (+/about+/help+/privacy+/terms+/trips+ map view)
- google_map: places 6872→16706, cities 219→415, reviews 280→1050, photos 140→440, timeline 75→225, tasks 41→156 (+5 new categories incl. chains, +/contribute+/your-data)
- google_search: search_results 1170→2744, paa 319→817, topics 213→418, tasks 43→176 (real pagination 10/page × 10 tabs)
- huggingface: repos 2276→6941 (datasets +4138), discussions 54→112, tasks 43→158 (+/posts+/solutions+/compete; fixed 12 datetime.utcnow defaults + md5-derived non-randomness)
- wolfram_alpha: computation_results 573→1314, topics 61→116, notebook_entries 84→204, topic_feedback 32→67, tasks 46→156

Determinism: every site now applies the full gotchas.md fix set
(pinned bcrypt #1, alpha-sorted indexes + VACUUM #2, MIRROR_REFERENCE_DATE
for both seed-loop and Column-default datetime.utcnow #3, md5-derived
seeds replacing hash() #2-tail, tie-breaker .id.asc() on top-N #12).
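
To illustrate the top-N tie-breaker item: when several rows share the same sort value, the database may return them in any order unless a unique column is added to the ORDER BY. The sketch below uses plain `sqlite3` instead of SQLAlchemy to stay self-contained; the `item` table and `rating` column are hypothetical.

```python
import sqlite3


def top_n(conn, n):
    """Return the ids of the n highest-rated rows, breaking ties by ascending id."""
    rows = conn.execute(
        "SELECT id FROM item ORDER BY rating DESC, id ASC LIMIT ?", (n,)
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, rating REAL)")
# Rows 2, 3 and 4 tie on rating, so only the id tie-breaker fixes their order.
conn.executemany(
    "INSERT INTO item (id, rating) VALUES (?, ?)",
    [(4, 4.5), (1, 5.0), (3, 4.5), (2, 4.5)],
)
print(top_n(conn, 3))  # [1, 2, 3]
```

In SQLAlchemy terms the equivalent would be appending the primary key's `.asc()` to the existing `order_by`, which is what the `.id.asc()` note refers to.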
hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 26, 2026
Tasks now average 1556/site (R4: 870, R3: 387, R2: 158, R1: 45).

- allrecipes: recipes 5790→8434 (make-ahead/freezer/kid-friendly/global-fusion), tasks 805→1512; skip-link, mobile drawer, AJAX meal-plan, form validation, 4 error templates
- amazon: products 5626→12284, tasks 804→1504; +/departments+/gift-finder+/subscribe-save+/registry; AJAX search-suggest with full keyboard nav, sold-out + 6 alternatives, 30-min cancel window
- apple: products 904→1405, tasks 811→1616; 11 new routes (trade-in/IMEI, AppleCare/coverage, repair/status, apple-card, wallet/add, find-my/locate-airtag, family-sharing/add-member, shipping/promo APIs); sticky CTA, configurator 5-step indicator, toast + focus-trap
- arxiv: papers 47175→78414, tasks 837→1509, +4 columns (paper_version/license/submitter_email_masked/computer_classification); filter chips, sort dropdown, version history modal, /alerts saved-searches, dyslexia-font toggle
- bbc_news: articles 4031→6525 (+BBC Verify 754, Newsround 800, 24 live blogs + 720 updates, 250 quizzes), tasks 803→1509; dark/high-contrast toggles, ARIA-live, video chapters, transcript, reaction bar
- booking: property 7869→11080, tasks 1152→1972, +4 columns (sustainability/payment/languages/neighborhood); <dialog> lightbox, map-cluster Search-this-area, min-stay banner
- cambridge_dictionary: words 18021→26021 (phrasal-verb + idiom dicts), tasks 906→1706; /flashcards (swipe), /settings/accessibility (dyslexia font, color-blind CEFR), playback-speed
- coursera: courses 4378→6597 (GenAI/LLM/Agentic/Quantum/Robotics), partners 278→299 (xAI/AI21/LangChain/IBM Quantum/Boston Dynamics), tasks 818→1649, +3 columns (preview_video/textbook_isbn/workload_hours); hover-preview, continue-learning panel, quiz autosave
- espn: games 2251→3511, play_by_play 1130→2990, podcasts 50→164, tasks 803→1543; scoreboard date-swiper, fantasy keyboard drag, parlay line-shift, mobile bottom-nav, /awards+/recruiting aliases
- github: repos 12000→18200, commits 131437→205000, pulls 3000→7200, tasks 890→1610; file-tree async with role=tree+arrow-keys, code-search <mark>+aria-live, issue templates dropdown, star burst animation
- google_flights: airports 1250→2050, flights 260313→405200, tasks 802→1503; airport autocomplete, calendar drag-select+arrow-keys, seat-map keyboard, mobile bottom-sheet
- google_map: places 138561→200273, tasks 812→1512, +9 Place columns (noise/crowd/mask/indoor_zone/floors/parking/EV) + 3 TransitLine (delay); /transit/realtime+/floors+/qr+/your-data/export+/meetup; 5 layer toggles, swipe carousel
- google_search: search_result 10704→15732, paa 2907→5421, kfact 4094→7447, tasks 894→1714; /api/suggestions w/ history+spelling, voice modal, sticky tabs, infinite scroll toggle
- huggingface: repos 58062→81560 (models 36034/datasets 39010/spaces 6516), tasks 823→1620, +4 columns (license_url/eval_results/hardware/compute_hours); 8 routes (snippet/viewer-row/readme-section/like-animate/thread/autotrain-estimate/paper-impl/billing); color-blind-safe code palette
- wolfram_alpha: computation_results 5194→12277 (4-7 sub-pods each: alt-forms/step-by-step/WL/Python/comparison/SVG plot), topics 354→554, notebook_entries 704→1504, tasks 833→1512; /share?as=image, /widget/builder?source=, assumption tooltip, plot zoom/pan controls

Hidden-bug fixes during R5: 2 more Python hash() non-determinism leaks caught (allrecipes _det_hash + wolfram SVG plot keys) — gotcha #2 reinforced.
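
For context on why `hash()` leaks non-determinism: Python salts the built-in hash of `str` and `bytes` per process unless `PYTHONHASHSEED` is fixed, so any seed derived from it changes between rebuilds. A minimal sketch of the md5-derived replacement follows; the function names are illustrative and not taken from the repository.

```python
import hashlib
import random


def stable_seed(key):
    """Derive a 64-bit integer seed from a string, identical in every process."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def stable_rng(key):
    """Return a random.Random seeded from the key rather than from hash(key)."""
    return random.Random(stable_seed(key))
```

A seed loop could then call something like `stable_rng("allrecipes:recipe:42").randint(1, 5)` and get the same value on every rebuild; the key format here is made up for the example.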
hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 26, 2026
Tasks avg 2628/site (R5: 1556, R4: 870, R3: 387, R2: 158, R1: 45). On track for ×100 by R10.

- allrecipes: recipes 8434→14054 (copycat/budget/holiday/air-fryer + 48 chef), tasks 1512→2550; 'Recipes from {chef}' + 'More Like X' carousels, /__rate-limit-demo /__session-expired-demo /__loading-demo /__server-error-demo
- amazon: products 12284→21117 (long-tail brands + 1031 sold-out + 1000 low-stock), tasks 1504→2934; 4-col compare-with-similar, recently-viewed (session), /compare, /notify-when-back, /session-expired
- apple: products 1405→2017 (224 iPhone 17 cases + 80 Watch bands + …), tasks 1616→2571; inject_r6_context processor (15 URL prefixes), 6 edge routes (notify-arrival/config-check/imei-verify/applecare-eligibility/repair-lookup/gift-card age-verify)
- arxiv: papers 78414→114114 (+16 OAI windows 2020-2025), tasks 1509→2595; /papers/<id>/citing-list, /papers/<id>/authors-also, withdrawn+replaced banners, 404 with nearby suggestions
- bbc_news: articles 6525→10367 (50 reporters × 30 beats + 60 countries × 12 angles), tasks 1509→2713; 6 edge banner codes, 'More from reporter'+'Top story today'+'Related topics' sidebars
- booking: property 11080→16795, tasks 1972→3072; breadcrumb (Home>City>Neighborhood>Property), same-area + you-might-also-like + Compare-3 + Wishlist cross-link
- cambridge_dictionary: words 26021→36021, tasks 1706→2616; 8 new routes (same-root/shared-collocation/antonyms/learner-vs-academic toggle/audio-fallback/cooldown/level-mismatch/word-not-found)
- coursera: courses 6597→10034 (Sustainability/BioTech/FinTech/Cyber/SpaceTech/EdgeAI), tasks 1649→2519; 4 sidebar panels (Specialization-includes/Same-instructor/Next-recommended/Prereq-path), /credit-transfer, /financial-aid/<status>×4
- espn: games 3511→5061, articles 2238→3618, betting_odds 540→1380, tasks 1543→2513; 6 edge banner categories (postponed/injured/fantasy-lock/ESPN+-paywall/bet-region-block/future-protected)
- github: repos 18200→25000, tasks 1610→2510; r6lab org + 14 sentinel repos covering 7 edge modes (archived/protected/conflict/action-failed/codespace-quota/dmca/fork-rate-limit) wired in route layer; /<repo>/network/dependents 'Used by'
- google_flights: airports 2050→3050, flights 405200→869700 (extended to 2027-12-31), tasks 1503→2561; flight_detail breadcrumb 4-level + Other-airlines + Different-dates + Connections-via-X
- google_map: places 200273→290405 (chain_brand backfilled 22606 rows), tasks 1512→2512; 6 edge banners (permanently/temporarily-closed/accessibility/floor-not-mapped/no-route/after-hours)
- google_search: topics 838→1323, search_result 15732→31562, paa 5421→11462, tasks 1714→2736; 6 edge routes (sorry/dmca/disambiguate/trending-region/voice-no-mic/zero-results), Refine + Searches-related blocks
- huggingface: repos 81560→121417 (+10k models/+5k datasets/+17k spaces), tasks 1620→2508; +7 columns (gated/not_for_production/build_status/citing_papers/etc.), 5 routes (/access/logs/papers/arxiv-fallback/fine-tuned/endpoint-quota), 4 lineage cards
- wolfram_alpha: computation_results 12277→18224 (each with 5 new R6 pods), tasks 1512→2503; 6 edge routes (ambiguous/timeout/step-locked/notebook-quota/share-expired/widget-blocked)

R6 caught yet more Python hash() non-determinism: google_search trending.term, wolfram svg-plot key, google_map chain_brand — all switched to md5/sha1. Gotcha #2 reinforced 5+ times across rounds.
hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 27, 2026
…entries / rewritten 668 tasks

Replaces the failed R4/R5/R6/R10 subagent attempt that had four bugs:

1. Entry chain broken: /images, /videos, /scholar/search were 404 because
   no <a href> on index/results pointed to them.
   Fix: added Flask routes /images, /videos, /scholar/search (alias of
   /scholar/results) plus visible <a href> tabs on base.html, index.html,
   search.html, and two new hub templates (r4_image_hub.html,
   r5_video_hub.html) with Tools / Usage-rights / Quality / Duration
   pivot links. 30/30 GUI-chain sample test now hits the answer in
   ~10 HTTP GETs per task.

2. Data was synthetic: R4_CARDS / R5_VIDEOS / R10_* were hardcoded
   strings ("nasa.gov" placeholders, fabricated channels, fake captions).
   Fix: 24 image cards + 24 video cards now seeded from Tavily live
   search hits (unsplash / gettyimages / dezeen / wikimedia / iso.500px
   / motionarray / bigcatphotography / stock.adobe / cntraveler /
   alamy etc. for images; YouTube + Vimeo + NASA+ + ted.com upstream
   URLs with real channel names like 3Blue1Brown / TED-Ed / BBC Earth /
   Berliner Philharmoniker / Stefan Forster / Sebastian Lague / NeurIPS
   Foundation / Pasta Grammar / Lets Get Rusty / EEVblog for videos).
   R6 papers + R10 entities were already real and are preserved.

3. In-memory module dict bypassed byte-id checks: data lived in
   _r4_r10_routes.py global lists, never in SQLite.
   Fix: added 9 SQLAlchemy tables (ImageCard / VideoCard / ScholarPaper
   / ScholarCitation / FeaturedSnippet / PaaBundle / PaaQuestionRow /
   KnowledgePanel / KnowledgePanelFact) populated by seed_r4_r10_tables()
   from _real_data.py at seed time only. All runtime route handlers
   now query the DB. Double-rebuild md5 still matches:
   76eb1cfcc3ee48e27c545e682e0642b9 (instance, instance_seed, /tmp
   first-run copy and second-run copy all four identical).

4. Same-sentence-pattern batches: 887 old tasks like "Aurora / Lavender / Prague"
   shared 100% identical templates.
   Fix: new _build_tasks_quality_r4_r10.py with 5 phrasing variants per
   field, 7 detail-page answer fields per surface (source / dims /
   license / type / alt / source-owner / caption), filter-count tasks,
   plus 3-prompt multi-step chains. Result: 668 tasks across 52 groups,
   max group size 24 (down from old groups of 12-24 identical-sentence
   tasks; nothing exceeds 30). Numeric WebVoyager tasks (3122 of them,
   id 0..N) are preserved unchanged.

Verification:
  - Real data harvested: 24 image cards (real source domains), 24 video
    cards (real upstream URLs), 16 papers, 31 citation edges, 12
    snippets, 10 PAA bundles, 32 PAA rows, 8 knowledge panels, 48 facts.
    Total 205 rows in 9 new DB tables.
  - 30-sample GUI chain test: 30/30 PASS; avg 10 HTTP GETs/task; each
    chain starts at "/" -> tab -> hub -> filter/list -> detail.
  - Byte-id reset: 4 md5s match (rebuild #1, rebuild #2, instance copy,
    instance_seed copy).
  - All r-task groups <= 30 (max 24).
  - No new /api/, /graphql, /healthz routes added.
  - bcrypt hash still pinned, seed users unchanged.
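
The entry-chain bug in item 1 (routes that exist but that no `<a href>` points to) is the kind of thing a link-reachability check can catch. The sketch below is not the 30-sample GUI chain test itself; it assumes the rendered HTML for each route is already available in a dict and that links are plain absolute paths.

```python
from collections import deque
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html):
    """Return all <a href> targets found in an HTML string."""
    parser = _HrefCollector()
    parser.feed(html)
    return parser.hrefs


def unreachable_routes(pages, start="/"):
    """Return routes in `pages` that no chain of <a href> links reaches from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        route = queue.popleft()
        for href in extract_links(pages.get(route, "")):
            if href in pages and href not in seen:
                seen.add(href)
                queue.append(href)
    return sorted(set(pages) - seen)


# Hypothetical page set: /videos exists but nothing links to it.
pages = {
    "/": '<a href="/images">Images</a>',
    "/images": '<a href="/images/1">First card</a>',
    "/images/1": "",
    "/videos": "",
}
print(unreachable_routes(pages))  # ['/videos']
```

An empty result means every known route can be reached by clicking from the start page, which is the property the "/" -> tab -> hub -> filter/list -> detail chains rely on.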
hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 27, 2026
…sks / real compass.com)

Adds Compass-realistic surfaces sourced from real compass.com:

New models (18): Neighborhood, School, ListingSchool, Office, Team,
TeamMember, MarketReport, BlogPost, PriceHistory, AgentReview, AgentAward,
SoldListing, Note, Offer, MortgageScenario, AffordabilityResult,
HomeEvaluation, CMARequest, NewsletterSignup, NeighborhoodAlertSubscription,
MarketReportSubscription. All deterministically seeded.

New templates (37): listing photos/floor-plan/video-tour/price-history/
walkscore/schools/neighborhood; neighborhood index + detail; sold homes
index/city/detail; market reports index/detail; agent reviews/sold/awards;
teams + team detail; offices + office detail; buy hub; sell hub +
evaluation + CMA; blog index + post; mortgage/affordability/closing-cost
calculators; notes_index + offers_index + offer_new + listing_share +
collection_invite; newsletter + simple_landing.

New POST endpoints (15): submit offer, share listing, invite to collection,
add/delete note, neighborhood alert subscribe, market report subscribe,
submit agent review, sell home evaluation, CMA request, save mortgage
scenario, save affordability result, newsletter signup, listing
contact-agent/schedule-tour 307 redirects.

Counts: 70 templates (was 33), 32 POST routes (was 17), 86 total routes
(was 39), 2009 tasks (was 18). Reseed is byte-identical (md5
814a339e8cf9733a10893789121da8cd) via normalize_seed_db_layout() applied
after first fresh seed (harden-env gotcha #2 fix).

Image utilization: full gallery now exposed via /listing/<slug>/photos
(was hero + 3 thumbs only).

All new pages cross-linked from base.html nav, footer, listing_detail,
agent_detail, account nav.
hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 27, 2026
…asks / real berkeley.edu)

Routes 24 -> 69; templates 23 -> 67; POST 5 -> 24; tasks 30 -> 1984.
Image utilization 0% -> 73.7% (76 SVG assets across campus / faculty /
event / fund / library / sport categories, deterministic md5-to-file
mapping). 1984 GUI tasks, 0 API-style, 19 disambiguation, 100% unique.
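
A deterministic md5-to-file mapping could look roughly like this. It is a sketch under assumptions: the key format and the `pick_asset` name are hypothetical, and the actual mapping in the commit may differ.

```python
import hashlib


def pick_asset(key, filenames):
    """Map a key to one filename, stable across runs and processes."""
    if not filenames:
        raise ValueError("filenames must not be empty")
    ordered = sorted(filenames)  # Sort so input order cannot change the result.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return ordered[int(digest, 16) % len(ordered)]
```

Because the choice depends only on the key and the sorted file list, the same record resolves to the same SVG on every rebuild, as long as the set of files does not change.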

New page families mirroring berkeley.edu structure:
- Library: 23 branch hub + branch detail + reservation form
- Athletics: 28 varsity sports hub + team detail + tickets request
- Alumni: 25 chapter hub + chapter detail + directory profile update
- Giving: 15 funds + fund detail + donate form + monthly recurring
- Leadership: 15 profiles + profile detail + diversity hub
- Student services: 12 services hub + service detail
- Financial aid: 10 programs + forms + cost calculator + apply form
- Admissions: undergrad / grad / transfer / international + apply + visit + request-info
- News category hubs (10)
- Event RSVP / signup / suggest, faculty contact, dept meeting, program
  inquiry, research contact, newsletter, contact, careers, history,
  strategic plan, account edit.

Byte-identical reset preserved: pinned User.created_at to SNAPSHOT_DT
(was datetime.utcnow), added index-normalize + VACUUM pattern after
seed (gotchas #2/#3) -> instance_seed/berkeley.db md5 stable across
rebuilds.
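
One plausible shape for the index-normalize + VACUUM step is sketched below with plain `sqlite3`. The function name is hypothetical and this is not the repository's implementation. The intent is that index creation order stops influencing the file layout; whether that alone yields identical bytes depends on the rest of the seed pipeline.

```python
import sqlite3


def normalize_indexes(db_path):
    """Drop and recreate explicit indexes in name order, then VACUUM."""
    conn = sqlite3.connect(db_path)
    try:
        # sql IS NULL marks auto-indexes from PRIMARY KEY/UNIQUE; leave those.
        indexes = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'index' AND sql IS NOT NULL ORDER BY name"
        ).fetchall()
        for name, _ in indexes:
            conn.execute(f'DROP INDEX "{name}"')
        for _, ddl in indexes:
            conn.execute(ddl)
        conn.commit()
        # VACUUM rebuilds the file and must run outside a transaction.
        conn.execute("VACUUM")
        return [name for name, _ in indexes]
    finally:
        conn.close()
```

It would be called once on the freshly seeded database, before the file is copied to `instance_seed/`.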

All 24 POST routes redirect 302 with flash. Tasks pinned to port 40016
(matches site_runner index). Image files + seed DB are gitignored per
repo policy (managed via HF assets dataset).
hqhq1025 referenced this pull request in hqhq1025/WebHarbor May 27, 2026
… / real imdb.com)

Brings the IMDb mirror from baseline (25 routes / 18 templates / 18 tasks /
5 POST) to vanilla parity. New content derives from existing real scraped
imdb.com payloads (keywords, cast characters, box-office, akas) — no external
APIs.

New surfaces (33 new HTML templates wiring 36 GET pages):
  - Title sub-pages: /trivia /quotes /goofs /awards /parents-guide
    /technical-specs /keywords /locations /companies /release-info
    /external-sites /connections /photos /soundtrack /faq /episodes
  - Name sub-pages: /bio /personal-life /awards /quotes /trivia /photos
    /filmography
  - Lists: /lists /list/<id> /lists/new /list/<id>/edit
  - Polls: /polls /poll/<slug>
  - Charts: /chart/popular_tv /chart/lowest_rated
  - Other: /news/<id> /search/name /myaccount/recently-viewed
    /account/edit /account/password

27 POST endpoints (was 5). New POSTs: submit trivia / quote / goof,
report title / name / review, vote helpful on review, flag review,
delete own review, vote on poll, suggest poll option, list CRUD,
follow/unfollow person, mark watched, edit profile, change password,
clear watchlist / ratings / follows.

12 polls + 8 user lists + 30 list items + 2311 trivia + 1924 quotes +
1297 goofs + seed follows + helpful votes — all deterministic-seeded.

Byte-identical reset invariant, with gotchas #1/#2/#3/#12 applied:
fixed-salt pbkdf2 hashes, alpha-sorted index recreation, repacked
title_genre M2M, MIRROR_REFERENCE_DATE everywhere, VACUUM. Two clean
rebuilds produce md5 4ee4fd9c6a687fade6f611fa98609c52.
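
As an illustration of the fixed-salt pbkdf2 item: with a random salt, every rebuild stores different hash bytes, so the seed database can never be byte-identical. A minimal sketch with the standard library follows; the salt value, iteration count, and function name are assumptions for the example, not the values used in the commit.

```python
import hashlib

# Hypothetical fixed salt and iteration count; seed-only values for illustration.
SEED_SALT = b"webharbor-seed-salt"
ITERATIONS = 100_000


def seed_password_hash(password):
    """Return a PBKDF2-HMAC-SHA256 hex digest that is identical on every rebuild."""
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), SEED_SALT, ITERATIONS
    )
    return derived.hex()
```

A fixed salt only makes sense for synthetic seed accounts in a benchmark mirror like this one, where reproducibility matters more than resistance to precomputed tables.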

tasks.jsonl: 2120 tasks across 57 task_types (was 18 tasks). 94% reach
image-bearing surfaces (posters / headshots / photo galleries) — well
above the 40% target. IDs use IMDb--gui_<page>_<NNN>; every task carries
a task_type field.

Note: instance_seed/imdb.db must be re-shipped to the HF dataset and
.assets-revision bumped to the new HF SHA before this lands on main.