Add GOV.UK mirror site (port 40015) #32

Open
lamawmouk wants to merge 1 commit into aiming-lab:main from lamawmouk:feat/gov-uk-mirror

@lamawmouk

TL;DR

Adds a Flask mirror of gov.uk as the 16th WebHarbor site (port 40015), with topic browse, guidance article detail, department directory, announcements, and search. Uses the official MIT-licensed govuk-frontend v6.1.0 for canonical Design System DOM.

Companion HuggingFace PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/22

What's in this PR

sites/gov_uk/:

| File | Purpose |
| --- | --- |
| app.py | 5 SQLAlchemy models, 9 routes |
| seed_data.py | Idempotent seed: 16 topics, 44 subtopics, 15 departments, 73 articles, 20 announcements |
| templates/*.html | base + 9 page templates using canonical govuk-frontend DOM |
| static/{css,js,fonts,icons}/ | Official govuk-frontend v6.1.0 bundle (MIT) |
| tasks.jsonl | Stub; WebVoyager tasks in follow-up |
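
To illustrate what "idempotent seed" means here, a minimal sketch using the standard-library sqlite3 module instead of the SQLAlchemy models in app.py. The table name, column names, and slugs are hypothetical and not taken from seed_data.py; only the two topic titles come from this PR.

```python
import sqlite3

# Hypothetical sample rows; the real seed has 16 topics.
TOPICS = [
    ("money-and-tax", "Money and tax"),
    ("visas-and-immigration", "Visas and immigration"),
]


def seed(conn: sqlite3.Connection) -> None:
    """Insert topics; running it again leaves the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS topic (slug TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    # INSERT OR IGNORE skips rows whose primary key already exists.
    conn.executemany("INSERT OR IGNORE INTO topic (slug, title) VALUES (?, ?)", TOPICS)
    conn.commit()


def topic_count(conn: sqlite3.Connection) -> int:
    """Return the number of rows in the topic table."""
    return conn.execute("SELECT COUNT(*) FROM topic").fetchone()[0]
```

Calling seed twice on the same connection yields the same row count as calling it once, which is the property a reset endpoint relies on.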

Registration (sync per AGENTS.md): gov_uk added to websyn_start.sh and control_server.py, Dockerfile EXPOSE bumped to 40000-40015.

Verification

All checks in AGENTS.md § Pre-PR checks pass: image builds clean, 16/16 sites alive, every gov_uk route returns 200, POST /reset/gov_uk byte-identical pre/post (md5 f6931b6c…), and identical after docker restart.
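
The byte-identical check can be sketched as follows. This is an assumption about how such a check might be scripted, not the procedure from AGENTS.md; the file path and the `action` callable (for example, a function that issues POST /reset/gov_uk) are supplied by the caller.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_after(path: Path, action) -> bool:
    """Run `action` and report whether the file's bytes are identical afterwards."""
    before = md5_of(path)
    action()
    return md5_of(path) == before
```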

Notes

  • Content is synthesized (no upstream copy); OGL v3.0 would permit a direct copy, but synthesizing keeps the seed at 128 KB and deterministic.
  • The only patch to govuk-frontend.min.css is a single sed that rewrites url(/assets/...) to relative paths so they resolve through Flask's /static/.
  • .assets-revision still points at main; will bump to the HF merge SHA after that PR is reviewed.
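
For reviewers who want to reproduce the CSS patch without sed, a Python equivalent might look like the sketch below. It is not the command used in this PR; the `../` prefix assumes the stylesheet lives in static/css/ next to static/fonts/ and static/icons/, which matches the folder layout but is otherwise a guess.

```python
import re

# Matches url(/assets/...) with optional single or double quotes.
_ASSET_URL = re.compile(r"""url\((['"]?)/assets/""")


def relativize_asset_urls(css: str, prefix: str = "../") -> str:
    """Rewrite absolute /assets/ URLs to paths relative to the stylesheet."""
    return _ASSET_URL.sub(lambda m: f"url({m.group(1)}{prefix}", css)
```

URLs that do not start with /assets/ are left untouched, so the rewrite is safe to run more than once.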

Adds a Flask mirror of https://www.gov.uk/ as the 16th WebHarbor site,
running on port 40015.

## What's mirrored

- 16 top-level topics (Money and tax, Visas and immigration, Driving, ...)
- 44 subtopics
- 15 government departments (HMRC, DfE, Home Office, DVLA, NHS England, ...)
  with real ministers / permanent secretaries / employee counts
- 73 guidance articles (Self Assessment, Income Tax, Universal Credit,
  Skilled Worker visa, passport applications, vehicle tax, ...)
- 20 announcements (press releases, news stories, speeches)
- Search across articles / announcements / departments
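
As a rough illustration of searching across the three content types, a sketch using case-insensitive substring matching over titles. The matching strategy, the `Result` type, and the corpus shape are all assumptions; the actual /search route in app.py may query the database differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    kind: str   # "article", "announcement", or "department"
    title: str


def search(query: str, corpus: dict[str, list[str]]) -> list[Result]:
    """Case-insensitive substring match over titles, grouped by kind."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [
        Result(kind, title)
        for kind, titles in corpus.items()
        for title in titles
        if needle in title.lower()
    ]
```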

## Visual fidelity

Uses the official MIT-licensed govuk-frontend v6.1.0 CSS + JS + GDS
Transport font + crown SVG. Templates use the canonical Design System
component DOM (govuk-header, govuk-breadcrumbs, govuk-summary-list,
govuk-pagination, govuk-grid-row, etc.) so an agent's selectors match
the real GOV.UK.

Content licensed under the Open Government Licence v3.0 (synthesized
in the spirit of GOV.UK guidance; no upstream copy embedded).

## Folder layout

Matches the canonical site layout (compare wolfram_alpha, google_search):

  sites/gov_uk/
  |-- _health.py
  |-- app.py
  |-- seed_data.py
  |-- tasks.jsonl
  |-- instance_seed/        (HF-managed)
  |-- static/{css,js,fonts,icons,images,external_cache}/
  `-- templates/

## Wiring

- websyn_start.sh: gov_uk appended to SITES, 15->16 counts
- control_server.py: gov_uk added to SITES
- Dockerfile: EXPOSE 40000-40015
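
The relationship between list position and port can be sketched as below, assuming ports are assigned as base port plus index in SITES. That rule is inferred from "16th site" and "port 40015"; websyn_start.sh and control_server.py may encode it differently.

```python
BASE_PORT = 40000  # first port in the EXPOSE 40000-40015 range


def port_for(site: str, sites: list[str]) -> int:
    """Return the port for `site`, assuming ports follow list order."""
    return BASE_PORT + sites.index(site)
```

Under this assumption, appending gov_uk as the 16th entry is what places it on the last port of the exposed range.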

## Pre-PR verification (passed)

- docker build webharbor:dev clean (5.92 GB)
- 16/16 sites bind in 2s
- All gov_uk routes (/, /browse, /browse/<topic>, /browse/<t>/<s>,
  /guidance/<slug>, /government/organisations[/<dept>],
  /government/announcements, /search, /_health) return 200
- /reset/gov_uk -> {ready: true}, md5 byte-identical pre/post
- Byte-identical after docker restart
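
A route smoke check along these lines could be scripted as in the sketch below. Only the unparameterised paths from the list above are included, since the concrete topic, slug, and department values are not given here; the function names and the injectable `fetch_status` callable are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

# Unparameterised routes from the verification list.
ROUTES = [
    "/",
    "/browse",
    "/government/organisations",
    "/government/announcements",
    "/search",
    "/_health",
]


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def failing_routes(
    base_url: str,
    fetch_status: Callable[[str], int] = http_status,
    routes: Iterable[str] = ROUTES,
) -> list[str]:
    """Return the routes whose status code is not 200."""
    return [r for r in routes if fetch_status(base_url.rstrip("/") + r) != 200]
```

With the container running, `failing_routes("http://localhost:40015")` should return an empty list if every listed route responds with 200.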

## Asset PR

Seed DB (gov_uk.tar.gz, 32 KB) uploaded as HF PR:
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/22

.assets-revision will be bumped to the HF merge SHA once that PR lands.
@lamawmouk force-pushed the feat/gov-uk-mirror branch from 96f4916 to 63b73a5 on May 24, 2026 22:48
@lamawmouk changed the title from "feat(gov_uk): add GOV.UK mirror site (port 40015)" to "Add GOV.UK mirror site (port 40015)" on May 24, 2026
@lamawmouk

@Raibows would you be able to review this when you have a chance? Thanks! 🙏
