fix: BexleyCouncil - adapt to async page loading #2001
InertiaUK wants to merge 1 commit into robbrad:master
Conversation
Bexley's WasteWorks site moved from server-side meta-refresh to client-side JS polling. The old approach of requesting ?page_loading=1 no longer returns bin data. The fix uses a session for cookie persistence, hits the main URL first to trigger server-side computation, then polls ?page_loading=1 with the x-requested-with: fetch header that the site's refresh.js expects. Timeout bumped from 30s to 60s per request, with up to 30 poll attempts at 2s intervals.
📝 Walkthrough

The Bexley Council data parser now uses a persistent `requests.Session` and a polling loop to fetch the asynchronously computed bin data.

Changes: Bexley Council page fetch and date parsing.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Script as BexleyCouncil Parser
    participant Session as requests.Session
    participant Server as Bexley Server
    User->>Script: parse_data(page)
    Script->>Session: Set headers (User-Agent, x-requested-with)
    Session->>Server: GET page (initial request)
    Server-->>Session: Response + cookies/state
    activate Script
    loop Polling (up to 30 iterations)
        Script->>Session: GET page?page_loading=1
        Session->>Server: Fetch with x-requested-with: fetch
        Server-->>Session: HTML response
        Script->>Script: Check for waste-service-name
        alt Found
            Script->>Script: Break polling loop
            deactivate Script
        else Not found
            Script->>Script: Sleep 2s, retry
        end
    end
    Script->>Script: Parse collection dates (handle "collected today")
    Script->>User: Return parsed bin collection data
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

The changes introduce a polling mechanism with state management and session handling alongside conditional date-parsing logic. The review requires understanding the intent of the polling loop, verifying the page-readiness check is robust, and confirming the date-parsing fallback logic handles all expected date formats correctly.
🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #2001   +/- ##
=======================================
  Coverage   86.67%   86.67%
=======================================
  Files           9        9
  Lines        1141     1141
=======================================
  Hits          989      989
  Misses        152      152
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py (1)
73-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't swallow unexpected date formats here.

`print(...)` drops the bin entry and returns partial data, which makes Bexley markup changes easy to miss. This parser should raise with the raw date string instead.

Based on learnings: In `uk_bin_collection/**/*.py`, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors.

Proposed fix

```diff
-            except ValueError as e:
-                print(f"Error parsing date for {bin_type}: {e}")
+            except ValueError as e:
+                raise ValueError(
+                    f"Unexpected Bexley collection date for {bin_type}: {next_collection}"
+                ) from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py` around lines 73 - 90, The code currently swallows date parsing errors (in the try/except around parsing next_collection) and just prints them, which drops the bin entry; instead, replace the print swallow with an explicit raise so failures surface: in the block that uses remove_ordinal_indicator_from_date_string, parsed_date, date_format and data["bins"].append (inside the try/except), remove the print(...) in the except ValueError and raise a new ValueError (or re-raise) that includes the bin_type and the raw next_collection string plus the original exception message so callers see the unexpected format and the raw input that failed.
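As context for the raise-instead-of-print recommendation, here is a minimal, self-contained sketch of ordinal-stripping date parsing that fails loudly. The helper name, regex, and date format are illustrative, not the project's actual `remove_ordinal_indicator_from_date_string`:

```python
import re
from datetime import datetime

def parse_collection_date(raw: str, date_format: str = "%A %d %B %Y") -> datetime:
    """Parse a council collection date like 'Friday 14th June 2024'."""
    # Strip ordinal suffixes: "14th" -> "14", "1st" -> "1", "22nd" -> "22"
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)", r"\1", raw)
    try:
        return datetime.strptime(cleaned, date_format)
    except ValueError as e:
        # Re-raise with the raw string so site markup changes surface loudly
        raise ValueError(f"Unexpected collection date: {raw!r}") from e
```

Raising with the raw input means a Bexley markup change shows up as a clear error with the offending string, rather than as a silently shrinking result set.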
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6a1ce570-1916-4ab3-83ba-b787de63b659
📒 Files selected for processing (1)
uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py
```python
# Initial request triggers server-side schedule computation and sets session cookie
session.get(page, timeout=60)

# Poll for results — server computes async, JS client polls ?page_loading=1
# with x-requested-with header to get the content fragment
response = None
for attempt in range(30):
    response = session.get(
        f"{page}?page_loading=1",
        headers={"x-requested-with": "fetch"},
        timeout=60,
    )
    response.raise_for_status()
    if "waste-service-name" in response.text:
        found = True
        break
    time.sleep(2)

if not found:
    raise ValueError("Bexley WasteWorks page did not return expected content after 30 attempts")
```
Enforce a real 60-second overall deadline.
timeout=60 is per request, so this can block for far longer than 60 seconds once you include the bootstrap request, 30 polls, and sleeps. Track a deadline and cap each request by the remaining budget.
Proposed fix

```diff
-        # Initial request triggers server-side schedule computation and sets session cookie
-        session.get(page, timeout=60)
+        deadline = time.monotonic() + 60
+
+        # Initial request triggers server-side schedule computation and sets session cookie
+        bootstrap = session.get(page, timeout=10)
+        bootstrap.raise_for_status()

         # Poll for results — server computes async, JS client polls ?page_loading=1
         # with x-requested-with header to get the content fragment
         response = None
-        for attempt in range(30):
+        for _ in range(30):
+            remaining = deadline - time.monotonic()
+            if remaining <= 0:
+                raise TimeoutError("Timed out waiting for Bexley waste data to load")
             response = session.get(
                 f"{page}?page_loading=1",
                 headers={"x-requested-with": "fetch"},
-                timeout=60,
+                timeout=min(10, remaining),
             )
             response.raise_for_status()
             if "waste-service-name" in response.text:
                 break
-            time.sleep(2)
+            time.sleep(min(2, max(0, deadline - time.monotonic())))
```

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 36-36: Loop control variable `attempt` not used within loop body. Rename unused `attempt` to `_attempt`. (B007)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py` around lines
30 - 45, The current polling in BexleyCouncil.py uses timeout=60 per request
which can exceed a 60s overall deadline when combined with the initial
session.get, multiple polls and sleeps; change this to track a single deadline
(use time.monotonic()) set to now + 60s before the initial request, compute
remaining = deadline - now before each network call, and pass
timeout=min(remaining, 60) to session.get (for both the initial bootstrap
request and the "{page}?page_loading=1" polls), abort/raise a timeout if
remaining <= 0, and also cap the time.sleep() between attempts to not sleep past
the remaining budget. Ensure identifiers referenced are the existing session.get
calls and the polling loop around response =
session.get(f"{page}?page_loading=1", ...).
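The deadline-budget approach described in this comment can be sketched as a standalone helper. The function and parameter names are hypothetical, and `fetch` is a stand-in for the real `session.get` calls:

```python
import time

def poll_until_ready(fetch, budget=60.0, interval=2.0, max_attempts=30):
    """Poll fetch() until it returns truthy, never exceeding budget seconds overall.

    fetch receives a timeout capped by the remaining budget, so no single
    request can blow past the overall deadline.
    """
    deadline = time.monotonic() + budget
    for _ in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        # Cap the per-request timeout by whatever budget is left
        if fetch(timeout=min(10.0, remaining)):
            return True
        # Never sleep past the remaining budget either
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
    raise TimeoutError("max attempts exhausted")
```

The key design choice is `time.monotonic()`: unlike `time.time()`, it cannot jump backwards or forwards with wall-clock adjustments, so the deadline arithmetic stays correct.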
```python
response = None
for attempt in range(30):
    response = session.get(
        f"{page}?page_loading=1",
        headers={"x-requested-with": "fetch"},
        timeout=60,
    )
    response.raise_for_status()
    if "waste-service-name" in response.text:
        found = True
        break
    time.sleep(2)

if not found:
    raise ValueError("Bexley WasteWorks page did not return expected content after 30 attempts")
```
Raise when the page never leaves the loading state.
If all 30 polls return the loader fragment, this still falls through to BeautifulSoup(response.text) and returns an empty result instead of surfacing a timeout/site change.
Proposed fix

```diff
         response = None
-        for attempt in range(30):
+        for _ in range(30):
             response = session.get(
                 f"{page}?page_loading=1",
                 headers={"x-requested-with": "fetch"},
                 timeout=60,
             )
             response.raise_for_status()
             if "waste-service-name" in response.text:
                 break
             time.sleep(2)
+        else:
+            raise TimeoutError("Timed out waiting for Bexley waste data to load")
```

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 36-36: Loop control variable `attempt` not used within loop body. Rename unused `attempt` to `_attempt`. (B007)
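The `for`/`else` construct used in the proposed fix is easy to misread, so a minimal illustration may help: the `else` branch runs only when the loop finishes without hitting `break`. The function name, fragment list, and marker below are made up for the demo:

```python
def first_ready(fragments, marker="waste-service-name"):
    """Return the first fragment containing marker; raise if none does."""
    for text in fragments:
        if marker in text:
            result = text
            break
    else:
        # Reached only when no fragment triggered the break above
        raise TimeoutError("page never left the loading state")
    return result

pages = ["<div class='loader'>", "<h3 class='waste-service-name'>Refuse</h3>"]
print(first_ready(pages))  # the second fragment
```

This is why the fix is safe without a separate `found` flag: parsing code after the loop is unreachable unless the marker was actually seen.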
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/BexleyCouncil.py` around lines
35 - 45, The polling loop that fetches the council page (using session.get and
inspecting response.text for "waste-service-name") must raise an explicit error
when all 30 attempts still return the loader fragment instead of falling through
to parsing; modify the logic around the for-attempts loop to check after the
loop if "waste-service-name" is not in response.text and raise a clear
RuntimeError (or a project-standard exception) with context (e.g. last
response.status_code and a brief snippet or note about loader/timeout) so
BeautifulSoup(...) is never invoked on a still-loading page; update the code
paths around the variables response, session.get and the subsequent
BeautifulSoup call to use this raised error.
Bexley's WasteWorks site moved from server-side meta-refresh to client-side JS polling (via `refresh.auto.min.js` with `data-every="2"`). The old approach of requesting `?page_loading=1` without a session or the correct headers no longer returns bin data.

Changes:

- Use `requests.Session` for cookie persistence (the server tracks computation state per session)
- Poll `?page_loading=1` with the `x-requested-with: fetch` header, matching what the site's JS does

Tested against multiple UPRNs on the live site.
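The cookie-persistence point can be demonstrated end to end against a throwaway local server. The cookie name, paths, and readiness logic below are invented for the demo and only mimic the described WasteWorks behaviour; the third-party `requests` package is assumed to be installed:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests  # third-party: pip install requests

class FakeWasteWorks(BaseHTTPRequestHandler):
    def do_GET(self):
        if "page_loading=1" in self.path:
            # Serve data only if the cookie from the bootstrap request came back
            has_cookie = "sid=abc123" in (self.headers.get("Cookie") or "")
            body = b"waste-service-name: Refuse" if has_cookie else b"loading"
            self.send_response(200)
        else:
            body = b"bootstrap"
            self.send_response(200)
            self.send_header("Set-Cookie", "sid=abc123")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), FakeWasteWorks)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/waste"

session = requests.Session()
session.get(base, timeout=5)  # bootstrap: server sets sid, Session stores it
resp = session.get(f"{base}?page_loading=1",
                   headers={"x-requested-with": "fetch"}, timeout=5)
plain = requests.get(f"{base}?page_loading=1", timeout=5)  # no session, no cookie
print(resp.text)   # session carried the cookie, so the fragment is returned
print(plain.text)  # cookieless request still sees the loader
server.shutdown()
```

A bare `requests.get` never presents the cookie, which is exactly why the pre-fix scraper stopped receiving bin data.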
Summary by CodeRabbit

- Bug Fixes
- Refactor