feat: add SRE incident response agent #541

Open
neelay-aign wants to merge 1 commit into main from feat/sre-incident-response-agent
Conversation

@neelay-aign
Contributor

Summary

  • Add a background SRE agent that triages BetterStack incidents for the Python SDK using the Anthropic Managed Agents API
  • Agent uses GitHub MCP server + web search + built-in tools (zero custom tools) to read workflow run logs, diagnose failures, and create draft fix PRs
  • Orchestrator runs as a GitHub Actions workflow triggered by repository_dispatch (zero separate infrastructure)
  • Includes 15 unit tests for incident filtering and prompt construction

Architecture

BetterStack incident → GitHub repository_dispatch → GH Actions workflow
  → Python orchestrator fetches incident from BetterStack API
  → Creates Managed Agent session on Anthropic infra
  → Agent triages using GitHub MCP + web search + mounted repo
  → Creates draft PR or issue with findings

Files

| File | Purpose |
| --- | --- |
| sre-agent/src/sre_agent/main.py | Orchestrator: fetch incident, filter, create session, stream |
| sre-agent/src/sre_agent/_config.py | Pydantic Settings for env vars |
| sre-agent/src/sre_agent/_setup.py | One-time script to create agent/environment/skill/vault on Anthropic |
| sre-agent/skills/sre-runbook/SKILL.md | Repo-specific triage context for the agent |
| .github/workflows/sre-incident-response.yml | GH Actions workflow (dispatch + manual trigger) |
| sre-agent/tests/test_main.py | 15 unit tests |

Setup steps (post-merge)

1. Run one-time setup to create Anthropic resources

cd sre-agent
ANTHROPIC_API_KEY=<key> SRE_GITHUB_PAT=<fine-grained-pat> uv run python -m sre_agent._setup

The PAT needs contents:write, pull-requests:write, issues:write scopes on this repo. PRs will appear under the PAT owner's GitHub identity.

2. Store output as GitHub Actions secrets

The setup script prints three IDs. Add these as repo secrets:

  • SRE_AGENT_ID
  • SRE_ENVIRONMENT_ID
  • SRE_VAULT_ID

Also add:

  • BETTERSTACK_API_TOKEN — BetterStack API token (Settings → API tokens)

3. Configure BetterStack webhook

Set up a webhook integration in BetterStack that POSTs to:

URL: https://api.github.com/repos/aignostics/python-sdk/dispatches
Headers: Authorization: token <GITHUB_PAT>, Accept: application/vnd.github+json
Body:

{
  "event_type": "betterstack-incident",
  "client_payload": {
    "incident_id": "{{incident_id}}"
  }
}
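
To check the webhook configuration by hand, the same request can be reproduced in Python. This is a minimal sketch using only the standard library; `build_dispatch_request` is a hypothetical helper that is not part of this PR, and it only constructs the request without sending it. The URL, headers, and body mirror the values listed above.

```python
import json
import urllib.request

# URL taken from the webhook configuration above.
GITHUB_DISPATCH_URL = "https://api.github.com/repos/aignostics/python-sdk/dispatches"


def build_dispatch_request(incident_id: str, github_pat: str) -> urllib.request.Request:
    """Build (but do not send) the POST the BetterStack webhook is expected to make."""
    body = {
        "event_type": "betterstack-incident",
        # client_payload must be a JSON object, not a JSON-encoded string.
        "client_payload": {"incident_id": incident_id},
    }
    return urllib.request.Request(
        GITHUB_DISPATCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"token {github_pat}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; on success GitHub answers a repository dispatch with an empty 204 response.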

Testing

Simulated incident (no external deps needed)

gh workflow run sre-incident-response.yml -f simulate=true

Real BetterStack incident

gh workflow run sre-incident-response.yml -f incident_id=949981259 -f simulate=false

Simulated repository_dispatch (mimics BetterStack webhook)

gh api repos/aignostics/python-sdk/dispatches \
  -f event_type=betterstack-incident \
  -f 'client_payload[incident_id]=949981259'

Unit tests

cd sre-agent && uv sync --extra dev && uv run pytest -v

Test plan

  • Run unit tests locally (uv run pytest -v in sre-agent/)
  • Run setup script to create Anthropic resources
  • Test with simulated incident via workflow_dispatch
  • Test with real incident ID via workflow_dispatch
  • Configure BetterStack webhook and test end-to-end

🤖 Generated with Claude Code

Add a background SRE agent that triages BetterStack incidents for the
Python SDK and creates fix PRs via the GitHub MCP server.

Architecture: BetterStack webhook -> GitHub repository_dispatch ->
GH Actions workflow -> Managed Agent session on Anthropic infra.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
neelay-aign added the skip:test:long_running label (Skip long-running tests (≥5min)) on Apr 14, 2026
Copilot AI review requested due to automatic review settings April 14, 2026 12:39
@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

    ]

    if attrs.get("response_content"):
        ctx = json.loads(attrs["response_content"])

Bug: The json.loads() call on response_content is not wrapped in a try-except block, which can cause a crash if the API returns malformed JSON.
Severity: HIGH

Suggested Fix

Wrap the json.loads(attrs["response_content"]) call in a try-except json.JSONDecodeError block. Log the error and gracefully handle the case where response_content cannot be parsed, for example, by proceeding with an empty context ctx = {}.
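
One way the suggested fix could look, as a minimal sketch: `parse_response_content` is a hypothetical helper name, not code from this PR, and falling back to an empty dict for non-object JSON is an extra assumption beyond what the comment asks for.

```python
from __future__ import annotations

import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def parse_response_content(raw: str | None) -> dict[str, Any]:
    """Parse the incident's response_content, falling back to an empty context."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Malformed JSON should not crash the orchestrator; log and continue.
        logger.warning("Could not parse response_content as JSON: %s", exc)
        return {}
    # Assumption: only a JSON object is a usable context; anything else is ignored.
    return parsed if isinstance(parsed, dict) else {}
```

With this helper, the call site would become `ctx = parse_response_content(attrs.get("response_content"))`, and the later `ctx.get("github", {})` lookup keeps working on the empty fallback.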

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: sre-agent/src/sre_agent/main.py#L84

Potential issue: The code at `sre-agent/src/sre_agent/main.py:84` calls
`json.loads(attrs["response_content"])` without any error handling. The
`response_content` is fetched from the external BetterStack API. If this API returns a
non-empty but invalid JSON string due to an API bug, network issue, or other edge case,
the `json.loads()` call will raise an unhandled `json.JSONDecodeError`. This will crash
the orchestrator, preventing it from triaging the incident.


                f"**Run URL**: {gh['run_url']}",
                f"**Workflow**: {gh.get('workflow', 'unknown')}",
                f"**Commit**: {gh.get('sha', 'unknown')}",
                f"**Job**: {gh.get('job', 'unknown')}",

Bug: The code incorrectly looks for the job key within the github sub-dictionary instead of the top-level context, causing the job status to always be 'unknown'.
Severity: MEDIUM

Suggested Fix

Modify the line to extract the job status from the correct location in the context dictionary. Change gh.get('job', 'unknown') to ctx.get('job', {}).get('status', 'unknown') to correctly access the nested status field.
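
A minimal sketch of that lookup, assuming the payload shape described in this comment (a top-level `job` object with a `status` field); the shape has not been checked against a real BetterStack payload, and `extract_job_status` is a hypothetical helper name.

```python
from typing import Any


def extract_job_status(ctx: dict[str, Any]) -> str:
    """Read the job status from the top-level context rather than from ctx['github']."""
    job = ctx.get("job")
    # Guard against a missing key or a non-dict value so the prompt builder never raises.
    if isinstance(job, dict):
        return str(job.get("status", "unknown"))
    return "unknown"
```

The prompt line would then read `f"**Job**: {extract_job_status(ctx)}"`.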

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: sre-agent/src/sre_agent/main.py#L92

Potential issue: The code attempts to extract job status using `gh.get('job',
'unknown')`. However, the `gh` dictionary only contains the `github` sub-dictionary from
the API response. The `job` key actually exists at the top level of the response
context. As a result, the code will always fail to find the job status, and the prompt
sent to the agent will incorrectly state `**Job**: unknown`, even when the status is
available. This deprives the agent of potentially critical diagnostic information.



Copilot AI left a comment

Pull request overview

Adds a new sre-agent/ subproject and a GitHub Actions workflow to automatically triage BetterStack incidents for the Python SDK using Anthropic Managed Agents (with GitHub MCP + web search), including a runbook skill and unit tests for incident filtering/prompt building.

Changes:

  • Introduce a standalone SRE incident-response orchestrator (sre_agent.main) plus one-time Anthropic resource setup script (sre_agent._setup).
  • Add a repo runbook skill (skills/sre-runbook/SKILL.md) and a GitHub Actions workflow to trigger triage via repository_dispatch or manual dispatch.
  • Add unit tests for incident relevance filtering and prompt construction; add a dedicated uv.lock for the subproject.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| .github/workflows/sre-incident-response.yml | New workflow to run the SRE agent on BetterStack dispatch / manual trigger. |
| sre-agent/pyproject.toml | Defines the standalone sre-agent Python project (deps, build, pytest config). |
| sre-agent/uv.lock | Lockfile for the new subproject’s dependency resolution. |
| sre-agent/src/sre_agent/__init__.py | Package initializer for the SRE agent module. |
| sre-agent/src/sre_agent/__main__.py | Enables python -m sre_agent execution entrypoint. |
| sre-agent/src/sre_agent/_config.py | Pydantic settings model for agent IDs, vault/environment IDs, BetterStack token, repo mount target. |
| sre-agent/src/sre_agent/_setup.py | One-time setup script to create Anthropic agent/environment/skill/vault resources. |
| sre-agent/src/sre_agent/main.py | Orchestrator: fetch/simulate incident, filter, build prompt, run Managed Agent session and stream output. |
| sre-agent/skills/sre-runbook/SKILL.md | Runbook/triage guidance provided to the agent as a skill file. |
| sre-agent/tests/test_main.py | Unit tests for is_python_sdk_incident and build_prompt. |

Comment on lines +104 to +117
        print("Using simulated incident for testing.")
        incident = SAMPLE_INCIDENT
    elif incident_id:
        settings = SREAgentSettings()  # type: ignore[call-arg]
        incident = fetch_incident(incident_id, settings.betterstack_api_token.get_secret_value())
    else:
        print("No INCIDENT_ID provided and SIMULATE is not true. Exiting.")
        sys.exit(0)

    if not is_python_sdk_incident(incident):
        print(f"Skipping non-Python-SDK incident: {incident.get('attributes', {}).get('name', 'unknown')}")
        sys.exit(0)

    settings = SREAgentSettings()  # type: ignore[call-arg]
Comment on lines +65 to +70
    req = urllib.request.Request(
        f"https://uptime.betterstack.com/api/v2/incidents/{incident_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"]  # type: ignore[no-any-return]
Comment on lines +83 to +93
    if attrs.get("response_content"):
        ctx = json.loads(attrs["response_content"])
        gh = ctx.get("github", {})
        if gh.get("run_url"):
            parts.extend([
                "\n## Failed GitHub Actions Run",
                f"**Run URL**: {gh['run_url']}",
                f"**Workflow**: {gh.get('workflow', 'unknown')}",
                f"**Commit**: {gh.get('sha', 'unknown')}",
                f"**Job**: {gh.get('job', 'unknown')}",
            ])
Comment on lines +86 to +92
        if gh.get("run_url"):
            parts.extend([
                "\n## Failed GitHub Actions Run",
                f"**Run URL**: {gh['run_url']}",
                f"**Workflow**: {gh.get('workflow', 'unknown')}",
                f"**Commit**: {gh.get('sha', 'unknown')}",
                f"**Job**: {gh.get('job', 'unknown')}",
Comment on lines +27 to +29
      - uses: actions/checkout@v4

      - uses: astral-sh/setup-uv@v6

      - name: Install dependencies
        working-directory: sre-agent
        run: uv sync

### "Scheduled Testing" incidents (staging)
- Cause: Unit, integration, or e2e tests failed against staging.
- Runs every 6 hours via .github/workflows/_scheduled-test-hourly.yml.
@codecov

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

❌ Your project check has failed because the head coverage (63.78%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (1b1b4b6) and HEAD (678bb69).

HEAD has 10 uploads less than BASE

| Flag | BASE (1b1b4b6) | HEAD (678bb69) |
| --- | --- | --- |
| | 11 | 1 |

see 24 files with indirect coverage changes

neelay-aign removed the skip:test:long_running label (Skip long-running tests (≥5min)) on Apr 15, 2026