reftracker: return a snapshot from AppsForRef / RefsForApp to fix concurrent-map panic #1817

Closed
SAY-5 wants to merge 1775 commits into carvel-dev:develop from SAY-5:fix/reftracker-return-snapshot-1812

SAY-5 commented Apr 20, 2026

Fixes #1812.

AppRefTracker protects its internal maps with a sync.Mutex, but AppsForRef and RefsForApp returned a.refsToApps[refKey] / a.appsToRefs[appKey] directly. The caller (e.g. SecretHandler.enqueueAppsForUpdate) then iterates that returned map without holding the tracker lock:

    apps, err := sch.appRefTracker.AppsForRef(reftracker.NewSecretKey(...))
    ...
    for refKey := range apps {  // concurrent modifier can fire here
        ...
    }

Under the reported production load (1,680+ namespaces, rapid Secret and ConfigMap churn, many reconcile goroutines), a parallel ReconcileRefs or RemoveAppFromAllRefs mutates the same inner map the handler is ranging over, and the Go runtime aborts with:

    fatal error: concurrent map iteration and map write

This is a fatal runtime error that cannot be recovered, so it crashes the kapp-controller pod.

This PR returns a shallow copy of the inner set from both lookup methods, so callers can iterate without holding the tracker lock. The copy is cheap (refs per app is small in practice; the outer app-to-refs map stays unbounded either way), and the concurrent writers keep exclusive ownership of the originals. A small cloneRefKeySet helper keeps the two call sites in sync.

SAY-5 added 30 commits August 9, 2023 11:48
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Add checksums for darwin/arm64

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
…golang-1.20.7

Bump golang from 1.20.5 to 1.20.7

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump dependencies

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…photon-5.0

Bump photon from 4.0 to 5.0

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…actions/golangci/golangci-lint-action-3.6.0

Bump golangci/golangci-lint-action from 3.5.0 to 3.6.0

Signed-off-by: SAY-5 <say.apm35@gmail.com>
signed-off-by: Nanci Lancaster <nancil@vmware.com>

update readme cii badge

signed-off-by: Nanci Lancaster <nancil@vmware.com>

Signed-off-by: SAY-5 <say.apm35@gmail.com>
add cii badge to readme.md

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Ensure that `--build-values` does not affect package output

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumps [helm/kind-action](https://github.com/helm/kind-action) from 1.7.0 to 1.8.0.
- [Release notes](https://github.com/helm/kind-action/releases)
- [Commits](helm/kind-action@v1.7.0...v1.8.0)

---
updated-dependencies:
- dependency-name: helm/kind-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumping go version to 1.21.1

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump dependencies

Signed-off-by: SAY-5 <say.apm35@gmail.com>
During the introduction of defaultNamespace feature, we started using --app-namespace flag from kapp which should be used carefully when using cluster options instead of service account

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
…e not present in kapp controller

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…t-error

Adding a hint when the APP CR installation fails due to ca cert error

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…-trust-ca-certs

Fixing the test case TestConfig_TrustCACerts ( ssl on is removed)

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0.
- [Commits](golang/net@v0.10.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0.
- [Commits](golang/net@v0.10.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
…anges (carvel-dev#1316)

* Make kctrl to exit smoothly on adding the package registry with no changes


* Additional checks added to the test cases


* review comments fixed


---------

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
SAY-5 added 25 commits February 17, 2026 06:39
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump dependencies

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump dependencies

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump deps


Bump cli

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump Carvel Dependencies

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumps [sigstore/cosign-installer](https://github.com/sigstore/cosign-installer) from 3.9.1 to 4.0.0.
- [Release notes](https://github.com/sigstore/cosign-installer/releases)
- [Commits](sigstore/cosign-installer@v3.9.1...v4.0.0)

---
updated-dependencies:
- dependency-name: sigstore/cosign-installer
  dependency-version: 4.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumps [docker/login-action](https://github.com/docker/login-action) from 3.4.0 to 3.6.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](docker/login-action@v3.4.0...v3.6.0)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: 3.6.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bumps [go.opentelemetry.io/otel/sdk](https://github.com/open-telemetry/opentelemetry-go) from 1.32.0 to 1.40.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go@v1.32.0...v1.40.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/sdk
  dependency-version: 1.40.0
  dependency-type: indirect
...

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…actions/docker/login-action-3.6.0

Bump docker/login-action from 3.4.0 to 3.6.0

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…actions/sigstore/cosign-installer-4.0.0

Bump sigstore/cosign-installer from 3.9.1 to 4.0.0

Signed-off-by: SAY-5 <say.apm35@gmail.com>
v4 introduces breaking changes for url scheme for registries

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Bump dependencies

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
…-downgrade

Downgrade helm from v4 to latest v3 patch

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…les/go.opentelemetry.io/otel/sdk-1.40.0

Bump go.opentelemetry.io/otel/sdk from 1.32.0 to 1.40.0

Signed-off-by: SAY-5 <say.apm35@gmail.com>
for cosign v4 changes

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Add --bundle flag for cosign blob sign and verify commands

Signed-off-by: SAY-5 <say.apm35@gmail.com>
* Fix zombie process race condition in sidecar container

Replace custom reapZombies implementation with tini as PID 1 to eliminate
race condition causing "waitid: no child processes" errors during template
operations.

Problem:
- reapZombies() used syscall.Wait4(-1, ...) which reaps ANY child process
- This interfered with normal parent-child process waiting in CmdExec.Run
- Race condition: reapZombies could reap ytt/vendir/imgpkg processes before
  their actual parent (sidecar process) could wait for them
- Result: cmd.Wait() failed with ECHILD ("waitid: no child processes")

Solution:
- Install and use tini as proper PID 1 init system in Dockerfile
- Remove problematic reapZombies function entirely
- tini correctly handles only orphaned processes, not normal children
- Eliminates race condition while maintaining proper zombie cleanup

Changes:
- Dockerfile: Install tini package and set as entrypoint
- sidecarexec.go: Remove reapZombies function and unused imports
- deployment.yml: Add documentation comment about tini configuration

This fixes intermittent failures during PackageRepository reconciliation
and other template-heavy operations under concurrent load.

Fixes: Race condition between zombie reaper and command execution
Made-with: Cursor
Made-with: Cursor

* Fix ytt template comment syntax in deployment.yml

Use ytt-specific comment syntax (#!) instead of regular comments (#)
to avoid template compilation errors.

Made-with: Cursor

---------

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
…1803)

* fix(apiserver): register OpenAPI v2 spec for aggregation


* fix(apiserver): initialize OpenAPI specs before applyTo and enforce v3 version

This commit addresses OpenAPI spec generation issues by ensuring both V2 and V3 configurations are instantiated before calling 'recommendedOptions.ApplyTo(serverConfig)'. This allows the authentication wiring to properly decorate the OpenAPI security definitions. Additionally, it explicitly sets the V3 spec 'Info.Version' to 'v1alpha1' to satisfy strict OpenAPI schema requirements.


---------

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…-dev#1808)

* feat(apiserver): implement APIService caBundle reconciliation


* chore: resolve review comments


* chore: add vendor modules.txt changes


* update golangci-lint version to 2.11


* go mod vendor changes


---------

Signed-off-by: SAY-5 <say.apm35@gmail.com>
…current-map panic

AppRefTracker protects its internal maps with a sync.Mutex, but
AppsForRef and RefsForApp returned a.refsToApps[refKey] / a.appsToRefs[appKey]
directly. The caller (e.g. SecretHandler.enqueueAppsForUpdate) then
iterates that returned map without holding the tracker lock:

    apps, err := sch.appRefTracker.AppsForRef(reftracker.NewSecretKey(...))
    ...
    for refKey := range apps {  // concurrent modifier can fire here
        ...
    }

Under the reported production load (1,680+ namespaces, rapid Secret
and ConfigMap churn, many reconcile goroutines) a parallel
ReconcileRefs or RemoveAppFromAllRefs mutates the very same inner
map the handler is ranging over, and the Go runtime aborts with

    fatal error: concurrent map iteration and map write

crashing the kapp-controller pod (carvel-dev#1812).

Return a shallow copy of the inner set from both lookup methods so
callers can iterate without holding the tracker lock. The copy is
cheap (refs per app is small in practice; the outer app-to-refs map
stays unbounded either way), and the concurrent writers keep
exclusive ownership of the originals. A small cloneRefKeySet helper
keeps the two call sites in sync.

Fixes carvel-dev#1812

Signed-off-by: SAY-5 <say.apm35@gmail.com>
Signed-off-by: SAY-5 <say.apm35@gmail.com>
SAY-5 closed this May 2, 2026
SAY-5 force-pushed the fix/reftracker-return-snapshot-1812 branch from a059980 to 4b13039 May 2, 2026 02:14
github-project-automation (bot) moved this to Closed in Carvel May 2, 2026

Development

Successfully merging this pull request may close these issues.

Race Condition in AppRefTracker: Concurrent Map Iteration and Write