reddit: document blanket 403 walls + JSON-via-browser fallback #429

Open
ardasisbot wants to merge 1 commit into browser-use:main from ardasisbot:reddit-json-via-browser
Conversation

@ardasisbot ardasisbot commented Jun 11, 2026

What

Adds a Path 1.5 to the reddit scraping skill: fetching reddit's .json endpoints by navigating them in the user's Chrome and parsing document.body.innerText.

Why

Field-tested 2026-06-11: reddit.com returned 403 for every anonymous .json request from a datacenter IP. Sending a browser User-Agent made no difference, and the response body is an HTML challenge page, not JSON. The existing Path 1 (http_get) only documents 401/429 failure modes, so an agent hitting the blanket 403 wall has no documented recovery. Navigating to the same URLs in the user's real browser session passes cleanly.
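
To make the parsing half of Path 1.5 concrete, here is a minimal sketch of turning the tab's `document.body.innerText` into parsed JSON while refusing an HTML challenge page. It assumes the text has already been read out of the tab; `parse_json_body` and `NotJsonError` are hypothetical names for illustration, not part of the skill or of browser-harness.

```python
import json


class NotJsonError(ValueError):
    """The tab's text is not a JSON document (for example, a challenge page)."""


def parse_json_body(inner_text: str):
    """Parse text taken from document.body.innerText of a .json URL.

    Hypothetical helper: the caller is assumed to have fetched the text already.
    """
    text = inner_text.strip()
    # A JSON response body here is an object or an array, so anything that
    # starts with another character is treated as a block/challenge page.
    if not text or text[0] not in "{[":
        raise NotJsonError(f"expected JSON, got: {text[:80]!r}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise NotJsonError(f"body looked like JSON but did not parse: {exc}") from exc
```

Failing loudly on non-JSON text lets the caller tell a block page apart from an empty result instead of silently returning nothing.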

Also adds:

  • the subreddit search.json endpoint (q supports OR and quoted phrases, t=month etc.) for topic sweeps
  • thread-comments params (?limit=10&sort=top&depth=1&raw_json=1)
  • stale-session recovery (Runtime.evaluate timed out mid-loop → ensure_real_tab() + retry)
  • the single-quote f-string trap inside browser-harness -c '...'

🤖 Generated with Claude Code


Summary by cubic

Documented Reddit’s blanket 403 on anonymous .json requests and added a JSON‑via‑browser fallback (Path 1.5) that uses the user’s Chrome to fetch .json and parse document.body.innerText. This gives the scraping skill a reliable path when http_get is blocked.

  • New Features
    • Path 1.5: navigate .json in a real tab and parse document.body.innerText to bypass CDN 403s.
    • Documented endpoints: subreddit search (/r/<sub>/search.json with q, t, limit, raw_json=1) and thread + top comments (/comments/<id>.json?...).
    • Retry guidance: if the tab goes stale (Runtime.evaluate timed out), call ensure_real_tab() and retry.
    • Shell note: avoid single quotes in f-strings with browser-harness -c; use double quotes or .format().
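
The shell note can be checked without a browser. The sketch below uses Python's `shlex` module to show what a POSIX shell would pass as the `-c` argument. Both command strings are made-up examples, and `script_arg` is a hypothetical helper; nothing here calls browser-harness itself.

```python
import shlex

# Made-up commands for illustration. The first nests single quotes inside the
# single-quoted -c argument; the second uses .format() with double quotes only.
broken = """browser-harness -c 'print(f"{post['title']}")'"""
safe = """browser-harness -c 'print("{}".format(post["title"]))'"""


def script_arg(command: str) -> str:
    """Return the third word of the command as a POSIX shell would split it."""
    return shlex.split(command)[2]


# The inner single quotes close and reopen the shell quoting, so they vanish
# and the script receives post[title] instead of post['title'].
print(script_arg(broken))  # print(f"{post[title]}")
print(script_arg(safe))    # print("{}".format(post["title"]))
```

The broken variant still reaches the interpreter as a single argument, which is why the failure shows up as a confusing `NameError` rather than a shell error.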

Written for commit 39bfbb0. Summary will update on new commits.

…1.5)

Anonymous .json requests can be 403-blocked wholesale at the CDN level
(observed June 2026, datacenter IP, browser UA made no difference).
Document the recovery: navigate .json URLs in the user's Chrome and parse
document.body.innerText. Also adds the subreddit search.json endpoint,
thread-comments params, stale-session retry, and -c shell-quoting trap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="agent-workspace/domain-skills/reddit/scraping.md">

<violation number="1" location="agent-workspace/domain-skills/reddit/scraping.md:51">
P2: Path 1.5 documentation for `/comments/<id>.json` omits `kind: "more"` entries in `data[1]["data"]["children"]`, giving an incorrect data-shape guarantee that could cause KeyErrors in agent-generated code. The existing Path 1 section already correctly documents `kind: "more"` for the same endpoint.</violation>
</file>

Useful JSON endpoints beyond single posts:

- **Subreddit search:** `/r/<sub>/search.json?q=<query>&restrict_sr=on&sort=top&t=month&limit=25&raw_json=1` — `q` supports quoted phrases and `OR` (`q=tax efficient OR "tax loss harvesting"`, URL-encoded). `t` ∈ hour/day/week/month/year/all.
- **Thread + top comments:** `/r/<sub>/comments/<id>.json?limit=10&sort=top&depth=1&raw_json=1` — `data[1]["data"]["children"]` are top-level comments (`body`, `score`, `author`); filter out `stickied`.
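
A hedged sketch of how these two endpoints might be used: building the search URL with proper encoding, and reading top-level comments defensively. The `https://www.reddit.com` base and both function names are assumptions for illustration. The comment reader keeps only `kind == "t1"` children, which also skips the `kind: "more"` placeholders that lack `body`, `score`, and `author`.

```python
from urllib.parse import urlencode

BASE = "https://www.reddit.com"  # assumed host for the endpoints above


def search_url(sub: str, query: str, t: str = "month", limit: int = 25) -> str:
    """Build the subreddit search URL; urlencode handles quotes and spaces."""
    params = {
        "q": query,
        "restrict_sr": "on",
        "sort": "top",
        "t": t,
        "limit": limit,
        "raw_json": 1,
    }
    return f"{BASE}/r/{sub}/search.json?{urlencode(params)}"


def top_level_comments(thread: list) -> list:
    """Extract non-stickied top-level comments from a parsed thread response."""
    comments = []
    for child in thread[1]["data"]["children"]:
        # Only "t1" entries are comments; "more" entries are placeholders.
        if child.get("kind") != "t1":
            continue
        data = child["data"]
        if data.get("stickied"):
            continue
        comments.append(
            {
                "author": data.get("author"),
                "score": data.get("score"),
                "body": data.get("body"),
            }
        )
    return comments
```

Using `.get()` for the fields keeps the extraction from raising if a comment is missing a key.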

@cubic-dev-ai cubic-dev-ai Bot Jun 11, 2026

Contributor

P2: Path 1.5 documentation for /comments/<id>.json omits kind: "more" entries in data[1]["data"]["children"], giving an incorrect data-shape guarantee that could cause KeyErrors in agent-generated code. The existing Path 1 section already correctly documents kind: "more" for the same endpoint.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-workspace/domain-skills/reddit/scraping.md, line 51:

<comment>Path 1.5 documentation for `/comments/<id>.json` omits `kind: "more"` entries in `data[1]["data"]["children"]`, giving an incorrect data-shape guarantee that could cause KeyErrors in agent-generated code. The existing Path 1 section already correctly documents `kind: "more"` for the same endpoint.</comment>

<file context>
@@ -29,6 +29,32 @@ Fails on:
+Useful JSON endpoints beyond single posts:
+
+- **Subreddit search:** `/r/<sub>/search.json?q=<query>&restrict_sr=on&sort=top&t=month&limit=25&raw_json=1` — `q` supports quoted phrases and `OR` (`q=tax efficient OR "tax loss harvesting"`, URL-encoded). `t` ∈ hour/day/week/month/year/all.
+- **Thread + top comments:** `/r/<sub>/comments/<id>.json?limit=10&sort=top&depth=1&raw_json=1` — `data[1]["data"]["children"]` are top-level comments (`body`, `score`, `author`); filter out `stickied`.
+- `raw_json=1` stops Reddit HTML-escaping `&`, `<`, `>` in text fields.
+
</file context>