Skip to content

RFC0055 Identity-Aware Routing#535

Open
rkoster wants to merge 66 commits into
developfrom
feature/app-to-app-mtls-routing
Open

RFC0055 Identity-Aware Routing#535
rkoster wants to merge 66 commits into
developfrom
feature/app-to-app-mtls-routing

Conversation

@rkoster
Copy link
Copy Markdown

@rkoster rkoster commented Mar 5, 2026

RFC0055: Identity-Aware mTLS Routing

Implements Phase 1 (1a + 1b) of RFC0055: App-to-App mTLS Routing.

Tracking: cloudfoundry/community#1481
Acceptance Testing Guide: https://gist.github.com/rkoster/5b252b0edca606f10be2dbdcb81a796f

What This Does

Enables GoRouter to enforce mutual TLS and identity-based authorization on a per-domain basis. Apps calling routes on configured mTLS domains must present a valid Diego instance identity certificate. GoRouter extracts the caller's app/space/org identity and checks it against route policies before forwarding the request.

Phase 1a: mTLS Infrastructure

  • Per-domain TLS configuration via GetConfigForClient callback (SNI-based)
  • Domain-specific client certificate validation against configurable CA
  • Domain-aware XFCC header handling with two formats:
    • raw: base64-encoded full certificate (~1.5KB)
    • envoy: compact Hash=...;Subject="..." format (~250 bytes)
  • SNI/Host mismatch protection (prevents connection reuse attacks across domains)
  • BOSH job properties for router.domains

Phase 1b: Authorization

  • Identity extraction from Diego instance identity certificates (Subject DN OUs + SPIFFE URIs)
  • Pre-selection auth: validates mTLS domain, client cert presence, identity extraction
  • Post-selection auth: enforces route policies (scope and allowed_sources) against selected endpoint
  • Supports authorization at app, space, and org granularity
  • Default deny when no route policies are configured
  • RTR access logs emitted for denied requests (401/403)

Key Design Decisions

  • Two-layer authorization: Pre-selection (before endpoint is chosen) handles domain/cert/identity checks. Post-selection (after load balancer picks a backend) handles scope and route-policy checks against the specific endpoint's tags.
  • Feature is dormant by default: No behavior change unless router.domains is configured in the BOSH manifest and a shared domain with --enforce-access-rules is created.
  • No regression on existing traffic: Non-mTLS domains are completely unaffected.

Testing

  • Unit tests for all new handlers and config validation
  • Integration tests for end-to-end mTLS routing flows
  • BOSH template tests for configuration rendering
  • CI runs go fmt, go vet, staticcheck, ginkgo with --race

Configuration Example

# BOSH manifest (via ops-file)
router:
  domains:
    - name: "*.apps.identity"
      ca_certs: "((diego_instance_identity_ca.certificate))"
      forwarded_client_cert: sanitize_set
      xfcc_format: envoy

Related PRs

Component PR Status
cloud_controller_ng cloudfoundry/cloud_controller_ng#4910 Open
capi-release cloudfoundry/capi-release#625 Open
CLI cloudfoundry/cli#3758 Draft

Merge Ordering

All PRs are independently safe to merge — the feature is dormant without the ops-file and domain creation. No strict ordering required. Recommend merging around the same time once all are approved.

AI Disclosure

This PR was developed with AI assistance. All code has been read and verified manually. Each error path, branch, and edge case has corresponding test coverage.

@rkoster
Copy link
Copy Markdown
Author

rkoster commented Apr 16, 2026

Latest Update: RFC-Compliant Post-Selection Authorization

Implemented breaking change to replace pre-selection authorization with strict post-selection enforcement per RFC lines 475-517.

Key Changes (commit cbf0695)

Architecture:

  • ✅ Composable PostSelectionHandler interface for middleware pipeline
  • ✅ Separation of pre-selection checks (SNI, route lookup, identity) from post-selection authorization
  • ✅ Immediate 403 on authorization failure (non-retriable, per RFC)
  • ✅ Post-selection scope checking with :post-selection suffix in metrics

Implementation:

  • handlers/post_selection_pipeline.go - Infrastructure for composable checks
  • handlers/mtls_scope_auth.go - Org/space boundary enforcement
  • handlers/mtls_access_rules_auth.go - Access rules evaluation (cf:app:, cf:space:, etc.)
  • handlers/mtls_pre_auth.go - Pre-selection checks only
  • handlers/mtls_auth_error.go - Custom error type with Rule/Reason/HTTPStatus

Test Coverage:

  • +44 new tests (14 scope + 17 access rules + 13 pipeline)
  • +4 integration tests for shared route scenarios
  • All 393 tests passing

RFC Compliance

Intermittent 403s - Expected for shared routes across scope boundaries (RFC-compliant)
Error messages - Include "caller org X does not match selected backend org Y"
Strict enforcement - Prevents unauthorized cross-scope access

Breaking Change

⚠️ This replaces the permissive pre-selection authorization entirely. No feature flag provided as this is a security improvement required by the RFC.

Deprecated:

  • handlers/mtls_authorization.go (old implementation with migration notes)
  • route/pool.go EndpointOrgIDs/SpaceIDs methods

Integration Test Results

All integration tests compile successfully. Shared route scenarios validate:

  • Intermittent 403s with scope=space (different spaces in same org)
  • Always succeed with scope=org (same org, different spaces)
  • Always fail with scope=org (different orgs)
  • Per-endpoint access rules with intermittent behavior

Ready for full integration test run and review.

@rkoster
Copy link
Copy Markdown
Author

rkoster commented Apr 16, 2026

Refactoring: AuthError for Future Extensibility

Commit: 4ff64b9

Renamed MtlsAuthError to AuthError to prepare for future authentication methods beyond mTLS, such as SPIFFE JWT tokens.

Changes

  • ✅ Renamed handlers/mtls_auth_error.gohandlers/auth_error.go
  • ✅ Updated struct, constructor functions, and all references
  • ✅ Changed error messages from "mTLS authorization denied" to "authorization denied"
  • ✅ Updated all test files

Benefits

  • 🔮 Future-proof: Ready for SPIFFE JWT token authentication
  • 🏗️ Generic design: Error type not tied to specific auth mechanism
  • 🧩 Reusable: Can be used across different authentication methods
  • Clean: Better naming convention for authorization errors

No functional changes - pure refactoring for extensibility.

@rkoster rkoster force-pushed the feature/app-to-app-mtls-routing branch 3 times, most recently from 1f9b804 to 79271b7 Compare April 17, 2026 12:12
@rkoster rkoster force-pushed the feature/app-to-app-mtls-routing branch from 5cc4170 to b875867 Compare April 20, 2026 09:18
@rkoster
Copy link
Copy Markdown
Author

rkoster commented May 21, 2026

Short update for the people following this PR. After debugging Amelia's environment a bit together, the current working theory is an older version of Diego, without generic route options support is causing the observed behavior.

@ameowlia
Copy link
Copy Markdown
Member

Things work much better with an up-to-date diego :)

@ameowlia
Copy link
Copy Markdown
Member

ameowlia commented May 22, 2026

Issue: Inconsistent Access Log Fields

The new access log fields are conditionally present instead of always appearing with "-" when empty. Compare these logs:

backend.apps.identity - [2026-05-22T20:26:47.534183015Z] "GET / HTTP/2.0" 403 0 9 "-" "curl/7.81.0" "88.0.0.14:49748" "88.0.0.14:61030" x_forwarded_for:"88.0.0.14" x_forwarded_proto:"https" vcap_request_id:"bdc5651a-ba96-4ef0-56c5-094c4dee5ac8" response_time:0.000861 gorouter_time:0.000461 app_id:"8ba2b84a-7d48-45d6-a09b-0d9cd328c5e0" app_index:"0" instance_id:"e677b14a-d066-4040-478b-5a39" x_cf_routererror:"-" tls_sni:"backend.apps.identity"
caller_app:"49c236fa-9b00-48b4-8b6f-f9d74c6a09e3"  <-- missing below
caller_space:"10785795-fdf0-4178-9f74-d56276b8dfa4" <-- missing below
caller_org:"9ebf2ef5-5298-4faa-a554-c3236dfd3c11" <-- missing below
auth:"denied" auth_rule:"route:no_route_policies"  <-- missing below
auth_denied_reason:"route has no route policies configured"   <-- missing below
x_b3_traceid:"bdc5651aba964ef056c5094c4dee5ac8" x_b3_spanid:"56c5094c4dee5ac8" x_b3_parentspanid:"-" b3:"bdc5651aba964ef056c5094c4dee5ac8-56c5094c4dee5ac8"

log-cache.sys.pcf.tasonvsphere.com - [2026-05-22T20:26:47.790646205Z] "GET /api/v1/read/8ba2b84a-7d48-45d6-a09b-0d9cd328c5e0?envelope_types=LOG&start_time=1779481473833525000 HTTP/1.1" 200 0 2023 "-" "cf/8.8.0+c1b66a0.2024-08-21 (go1.22.5; arm64 darwin)" "192.168.115.1:60464" "88.0.0.31:8083" x_forwarded_for:"192.168.115.1" x_forwarded_proto:"https" vcap_request_id:"fa0351f9-2a13-4f7b-4e98-b120e449f99a" response_time:0.003497 gorouter_time:0.000298 app_id:"-" app_index:"-" instance_id:"090e5ff8-bf0f-444d-7c7e-bbf4bd30bb62" x_cf_routererror:"-" tls_sni:"log-cache.sys.pcf.tasonvsphere.com" x_b3_traceid:"fa0351f92a134f7b4e98b120e449f99a" x_b3_spanid:"4e98b120e449f99a" x_b3_parentspanid:"-" b3:"fa0351f92a134f7b4e98b120e449f99a-4e98b120e449f99a"

First log has no caller_* or auth* fields, second has them. Standard pattern is all fields should always present.

@ameowlia
Copy link
Copy Markdown
Member

Issue: fields and names in access logs

here are the new fields

tls_sni:"backend.apps.identity"
caller_app:"49c236fa-9b00-48b4-8b6f-f9d74c6a09e3" 
caller_space:"10785795-fdf0-4178-9f74-d56276b8dfa4" 
caller_org:"9ebf2ef5-5298-4faa-a554-c3236dfd3c11" 
auth:"denied"
auth_rule:"route:no_route_policies"
auth_denied_reason:"route has no route policies configured"

I want to make it clear that these are only for this type of route. I suggest the following renames, but am open to suggestions.

caller_app --> caller_cf_app
caller_space --> caller_cf_space
caller_org --> caller_cf_org
auth_rule --> route_policy (the value for this should be empty when there are no route policies)

I don't know if these two values are needed and I would advocate for removing them
auth --> Can't we tell that it is denied based on the status code?
auth_denied_reason -> isn't the reason always the same? either there are 0 route policies or there are 0 matching route policies?

@ameowlia
Copy link
Copy Markdown
Member

Issue: per request logs in gorouter.stdout.log

these logs are being logged per request in gorouter.stdout.log. This duplicates access log information and creates log volume amplification risk.

{
  "log_level": 1,
  "timestamp": "2026-05-22T20:36:14.559830443Z",
  "message": "mtls-route-policies-denied",
  "source": "vcap.gorouter",
  "data": {
    "route": "backend.apps.identity",
    "caller-app": "0fefb63d-8661-4305-95bd-c20efa21e1d7",
    "reason": "route-policies-deny",
    "endpoint": "88.0.0.14:61030"
  }
}

{
  "log_level": 1,
  "timestamp": "2026-05-22T20:36:14.560190128Z",
  "message": "post-selection-auth-denied",
  "source": "vcap.gorouter",
  "data": {
    "route-endpoint": {
      "ApplicationId": "8ba2b84a-7d48-45d6-a09b-0d9cd328c5e0",
      "Addr": "88.0.0.14:61030",
      "Tags": {
        "app_id": "8ba2b84a-7d48-45d6-a09b-0d9cd328c5e0",
        "app_name": "backend",
        "component": "route-emitter",
        "instance_id": "0",
        "organization_id": "9ebf2ef5-5298-4faa-a554-c3236dfd3c11",
        "organization_name": "o",
        "process_id": "8ba2b84a-7d48-45d6-a09b-0d9cd328c5e0",
        "process_instance_id": "e677b14a-d066-4040-478b-5a39",
        "process_type": "web",
        "source_id": "8ba2b84a-7d48-45d6-a09b-0d9cd328c5e0",
        "space_id": "10785795-fdf0-4178-9f74-d56276b8dfa4",
        "space_name": "s"
      },
      "RouteServiceUrl": "",
      "AZ": "az-0"
    },
    "rule": "route:route_policies",
    "reason": "caller app 0fefb63d-8661-4305-95bd-c20efa21e1d7 not in route_policies",
    "endpoint": "88.0.0.14:61030"
  }
}

I suggest removing these logs (and any others that are per request).

@rkoster
Copy link
Copy Markdown
Author

rkoster commented May 26, 2026

Issue: I was surprised that there was nothing in the routes table to indicate that this is a special type of route. Or even to indicate which policies exist.

Good point. This is addressed by commit c058620 which added ?include=route_policies support on the routes endpoints (GET /v3/routes and GET /v3/routes/:guid). The IncludeRoutePoliciesDecorator batch-loads policies via a single query and returns them in included.route_policies, following the same pattern as the existing include=domain and include=space decorators.

- Rename caller_app/space/org → caller_cf_app/space/org for clarity
- Remove auth, auth_rule, auth_denied_reason fields (not needed)
- Always emit tls_sni and caller_cf_* fields with "-" when empty
- Removes conditional emission that caused inconsistent log output
@rkoster
Copy link
Copy Markdown
Author

rkoster commented May 26, 2026

Issue: Inconsistent Access Log Fields

Issue: fields and names in access logs

Fixed — the identity-aware routing access log fields are now always emitted with "-" when empty (matching the standard pattern for all other log fields). The fields have been renamed: caller_appcaller_cf_app, caller_spacecaller_cf_space, caller_orgcaller_cf_org. The auth, auth_rule, and auth_denied_reason fields have been removed as suggested.

Per-request denial log statements (mtls-route-policies-denied,
mtls-pre-auth-denied, mtls-scope-auth-denied, post-selection-auth-denied)
now log at DEBUG level to avoid log volume amplification in production.

The access log already captures all denial information via caller_cf_*
fields and HTTP status codes. These DEBUG logs remain available for
local debugging when operators enable debug-level logging.
@rkoster
Copy link
Copy Markdown
Author

rkoster commented May 26, 2026

Issue: per request logs in gorouter.stdout.log

Downgraded all 7 per-request denial log statements from INFO to DEBUG level (1c7f1c5). This eliminates the log volume amplification concern in production while keeping them available for operators who enable debug logging during troubleshooting.

The access log already captures denial information via caller_cf_app, caller_cf_space, caller_cf_org fields and HTTP status codes (421/403).

If you'd prefer these statements be removed entirely rather than downgraded, happy to do that instead — let me know.

@ameowlia
Copy link
Copy Markdown
Member

The auth, auth_rule, and auth_denied_reason fields have been removed as suggested.

I didn't say auth_rule should be deleted! I like that one.

I suggested a rename to route_policy. And when there is no matching routing policy I suggested using a "-" instead of "route:no_route_policies"

@ameowlia
Copy link
Copy Markdown
Member

If you'd prefer these statements be removed entirely rather than downgraded, happy to do that instead — let me know.

I vote for removing them entirely since they duplicate the access log.

Comment thread jobs/gorouter/spec
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread jobs/gorouter/templates/gorouter.yml.erb
Comment thread src/code.cloudfoundry.org/gorouter/handlers/identity.go Outdated
Comment thread src/code.cloudfoundry.org/gorouter/handlers/identity.go
Comment thread src/code.cloudfoundry.org/gorouter/proxy/proxy.go Outdated
Comment thread src/code.cloudfoundry.org/gorouter/handlers/identity.go Outdated
return ""
}

return p.endpoints[0].endpoint.ApplicationId
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a chance that routes are stale (by design) and these route policies are out of date.

I think it would be better to store route policies at the pool level instead of on the endpoint object. This would also reduce mutex contention by avoiding the need to acquire the pool lock twice per request.

rkoster added 2 commits May 26, 2026 19:02
…ation tests

- Update router.client_cert_validation description to note that router.domains
  enforce mTLS independently
- Update router.domains description to clarify relationship with
  router.client_cert_validation
- Add rspec tests for all ERB template validation branches: non-array input,
  non-hash entry, missing/empty name, missing/empty ca_certs, invalid
  forwarded_client_cert mode, and invalid xfcc_format value

Addresses PR #535 review threads 1-8.
- Rename identityHandler to cfIdentityHandler / NewCfIdentity to clarify
  it is specific to CF app instance identity certificates (thread 9)
- Guard identity extraction: only run when (1) TLS was used and (2) the
  host is a configured mTLS domain, preventing XFCC header spoofing on
  non-mTLS routes (thread 10)
- Move MtlsPreAuth handler above ClientCert in the proxy chain so a 421
  response skips unnecessary certificate processing (thread 11)
- Use configured xfcc_format from domain config instead of auto-detecting
  format at runtime; reject if format doesn't match (thread 12)

All 386 handler tests and 179 proxy tests passing.
Comment on lines +48 to +50
if !strings.HasSuffix(hostname, suffix) {
return false
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is case sensitive, but URLs are case insensitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

3 participants