[do not merge] [v1.40] feat: auto-recover host-networked pods when node IP changes by coutinhop · Pull Request #4903 · tigera/operator

coutinhop · 2026-06-08T00:28:31Z

Cherry-pick history

Pick onto release-v1.40: [CORE-12452] feat: auto-recover host-networked pods when node IP changes #4784

Description

Detect Calico's host-networked pods (calico-typha, calico-node, calico-node-windows) whose status.podIPs no longer matches the node's current InternalIP, and delete them so the Deployment / DaemonSet controller recreates them with the correct IP.

This works around an upstream Kubernetes limitation [1] where status.podIPs is immutable for hostNetwork pods once set: when a node's IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), existing hostNetwork pods keep their old IP. The kube EndpointSlice controller reads from status.podIPs, so the calico-typha EndpointSlice ends up advertising stale IPs and Felix times out connecting to Typha. Restarting the container does not help — only deleting and recreating the pod itself causes the kubelet to repopulate status.podIPs from the current node IP.

Implementation lives in the existing Typha autoscaler tick (every 10s, already has a Node informer cache):

- Compare each pod's status.podIPs to its node's status.InternalIP (which the kubelet does update promptly via heartbeat).
- Delete stale pods, paced one per workload-batch per tick. Batch size is read from each workload's existing rolling-update setting: the Typha PDB's maxUnavailable, or the DaemonSet's updateStrategy.rollingUpdate.maxUnavailable. Falls back to 1 if not set or if the resolved value is < 1 (minimum-progress guarantee).
- Order: Typha first; if any Typha was deleted this cycle, skip the calico-node deletions until the next tick to give the new Typha pod a clean window to come up. Linux and Windows DaemonSets are paced independently of each other.
- Skipped entirely on the non-cluster-host autoscaler instance.
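The pacing rules above can be sketched in plain Go. This is a minimal illustration, not the operator's code: resolveBatchSize, firstN and planTick are hypothetical names, and maxUnavailable is modeled as a plain *int even though the real Kubernetes field is an int-or-percentage value.

```go
package main

// resolveBatchSize returns how many stale pods may be deleted in one tick.
// maxUnavailable is nil when the workload does not set the field.
// Assumption: percentages are already resolved to an integer by the caller.
func resolveBatchSize(maxUnavailable *int) int {
	if maxUnavailable == nil || *maxUnavailable < 1 {
		return 1 // minimum-progress guarantee
	}
	return *maxUnavailable
}

// firstN returns at most n elements of s; n is expected to be >= 1.
func firstN(s []string, n int) []string {
	if len(s) <= n {
		return s
	}
	return s[:n]
}

// planTick picks the pods to delete in a single tick. Typha goes first; if
// any Typha pod is selected, calico-node deletions wait for the next tick.
func planTick(staleTypha, staleNodes []string, typhaMaxUnavailable, nodeMaxUnavailable *int) (typha, nodes []string) {
	typha = firstN(staleTypha, resolveBatchSize(typhaMaxUnavailable))
	if len(typha) > 0 {
		return typha, nil
	}
	return nil, firstN(staleNodes, resolveBatchSize(nodeMaxUnavailable))
}
```

With two stale Typha pods and an unset Typha maxUnavailable, one tick selects a single Typha pod and no calico-node pods; the calico-node pods are only considered on a tick where no Typha pod is deleted.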

Tested by ODCN on KubeVirt: 3-node cluster with all node IPs changed, all calico-node and Typha pods recovered automatically without manual intervention.

[1] kubernetes/kubernetes#93897

Jira: CI-1951, CORE-12452

Release Note

Automatically recover Calico pods stranded with stale pod IPs after a node IP change (e.g. KubeVirt node reboot).

For PR author

Tests for change.
If changing pkg/apis/, run make gen-files
If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

Milestone set according to targeted release.
Appropriate labels:
- kind/bug if this is a bugfix.
- kind/enhancement if this is a new feature.
- enterprise if this PR applies to Calico Enterprise only.

Add a new Installation.Spec.StalePodIPRecovery field (Enabled / Disabled, default Enabled) that gates the host-networked stale pod IP detection and deletion logic in the typha autoscaler. When set to Disabled, the entire detection path is skipped each tick.

The default-on choice is consistent with other operator-managed automation (e.g. the typha autoscaler is itself always-on with no toggle), avoids opt-in friction for users who don't know the bug exists, and provides an escape hatch for environments where the detection might interact badly with custom node-IP management.

Implementation notes:
- api/v1: new StalePodIPRecoveryType enum and IsStalePodIPRecoveryEnabled helper, modeled on the existing FIPSMode pattern. nil is treated as Enabled so the default-on behavior is encoded in one place.
- typha_autoscaler.go: new optional func() bool field on the autoscaler consulted at the top of each tick. Wired via the existing option pattern (typhaAutoscalerOptionStalePodIPRecoveryEnabled) so tests can inject true / false / nil. A nil getter is treated as enabled, which keeps existing tests and the non-cluster-host autoscaler path unchanged.
- core_controller.go: the closure reads the Installation named "default" from the manager's cached client at call time so toggles take effect on the next tick (~10s). Failures fall through to enabled, since recovery is the safer default for the kubelet bug we're working around.

Tests:
- 3 new gate tests covering nil getter, true, and false.
- Defensive Maybe() expectations on SetDegraded in the existing stale pod IP detection and maxUnavailable resolution contexts to fix a pre-existing race-condition flakiness exposed by this work.
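A minimal sketch of the default-on gate described above. The type and helper names come from this commit message, but the constant names, the signatures, the treatment of unknown values, and the gateOpen function are assumptions made for illustration.

```go
package main

// StalePodIPRecoveryType mirrors the enum named in the commit message.
type StalePodIPRecoveryType string

const (
	// Constant names are assumed; the commit message only gives the values.
	StalePodIPRecoveryEnabled  StalePodIPRecoveryType = "Enabled"
	StalePodIPRecoveryDisabled StalePodIPRecoveryType = "Disabled"
)

// IsStalePodIPRecoveryEnabled treats nil as Enabled, so the default-on
// behavior is encoded in one place. Assumption: only an explicit Disabled
// turns recovery off.
func IsStalePodIPRecoveryEnabled(v *StalePodIPRecoveryType) bool {
	return v == nil || *v != StalePodIPRecoveryDisabled
}

// gateOpen is a hypothetical stand-in for the check at the top of each
// autoscaler tick: a nil getter is treated as enabled.
func gateOpen(getter func() bool) bool {
	return getter == nil || getter()
}
```

Keeping the nil handling inside the helper means callers never need to default the field themselves, which matches the "encoded in one place" intent stated above.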

When a node's IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), Kubernetes does not update status.podIPs on existing hostNetwork pods; it treats the field as immutable once set (kubernetes/kubernetes#93897). The Typha EndpointSlice ends up advertising the stale IP, Felix can't reach Typha, and calico-node pods stay NotReady until pods are manually deleted.

Add a small controller (pkg/controller/podiprecovery) that watches Nodes with a predicate enqueueing reconciles only when the set of NodeInternalIP addresses actually changes, so routine heartbeats don't trigger any work. On reconcile, it lists operator-managed host-networked pods on the affected node and deletes any whose status.podIPs no longer matches the node's current InternalIP. The owning Deployment / DaemonSet recreates each pod with a fresh sandbox, which the kubelet populates with the correct IP.

Covers calico-typha, calico-node (Linux + Windows), tigera-dpi, l7-log-collector, calico-apiserver, and calico-webhooks. A per-pod spec.hostNetwork check makes the conditional ones (apiserver, webhooks) work naturally without mirroring HostNetworkRequired() logic.

No pacing, as stale-IP pods are non-functional by definition, so deleting them all at once doesn't worsen availability.
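The reconcile-time check can be sketched as follows. Pod, isStale and stalePods are hypothetical stand-ins for the Kubernetes types and the controller's helpers. Two assumptions are made: a pod counts as stale when any reported pod IP is missing from the node's current IPs, and a pod with no IPs yet (still starting up) is not treated as stale, as a later commit in this PR states.

```go
package main

// Pod is a simplified stand-in for the fields the check reads.
type Pod struct {
	Name        string
	HostNetwork bool     // spec.hostNetwork
	PodIPs      []string // status.podIPs
}

// isStale reports whether a host-networked pod carries an IP that the node
// no longer has. Pod-networked pods and pods without IPs are never stale.
func isStale(p Pod, nodeIPs map[string]struct{}) bool {
	if !p.HostNetwork || len(p.PodIPs) == 0 {
		return false
	}
	for _, ip := range p.PodIPs {
		if _, ok := nodeIPs[ip]; !ok {
			return true
		}
	}
	return false
}

// stalePods returns the names of all stale pods on one node. There is no
// pacing here: every stale pod is returned at once.
func stalePods(pods []Pod, nodeInternalIPs []string) []string {
	nodeIPs := make(map[string]struct{}, len(nodeInternalIPs))
	for _, ip := range nodeInternalIPs {
		nodeIPs[ip] = struct{}{}
	}
	var stale []string
	for _, p := range pods {
		if isStale(p, nodeIPs) {
			stale = append(stale, p.Name)
		}
	}
	return stale
}
```

The per-pod HostNetwork test is what lets conditionally host-networked workloads pass through the same check without any workload-specific logic.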

Address review comments on the PodIPRecovery controller.

The big change: stop maintaining a hardcoded list of per-workload label selectors (k8s-app=calico-typha, k8s-app=calico-node, apiserver=true, ...) and instead use a single uniform marker label, operator.tigera.io/hostNetworked=true, applied at render time by every package that produces a hostNetwork pod template. Combined with a spec.nodeName field-indexer registered in cmd/main.go, reconcile is now one server-side List call regardless of how many workloads are in scope, and the controller no longer needs to import per-workload (enterprise) packages. aws-securitygroup-setup is intentionally not labeled: one-shot Job, pod IP isn't user-visible.

Other changes: gate reconcile on the Installation CR via utils.GetInstallationSpec; switch internal helpers to sets.Set[string]; drop the redundant Status.PodIP == "" check (PodIP is PodIPs[0] by contract); shorter delete log line; for _, pod := range pods.

Tests: controller fixtures updated to the new label; added Installation-gate and "unlabeled pod is ignored" tests; the three *DaemonSet/*Deployment override tests in render now expect two labels instead of one.

Mirror upstream `k8s.io/kubernetes/pkg/util/node.GetNodeHostIPs`: the kubelet populates `status.podIPs` for a hostNetwork pod from the node's InternalIPs when present and falls back to ExternalIPs otherwise. The recovery controller's comparison must use the same selection, or it would skip recovery on the (rare) ExternalIP-only node where the kubelet still writes a host IP into `status.podIPs`.

Rename `internalIPSet` → `nodeHostIPSet` and `internalIPChangedPredicate` → `hostIPsChangedPredicate` so call sites describe the kubelet semantics rather than a single address type. Update log fields and doc comments accordingly.

Add tests for both branches:
- ExternalIP-only node → uses ExternalIPs for matching.
- Node with both → ExternalIP is ignored; pod carrying the ExternalIP is deleted as stale (InternalIP wins).
- Predicate fires when ExternalIP changes on an InternalIP-less node.
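A minimal sketch of the InternalIP-then-ExternalIP selection and the change check built on it. NodeAddress is a simplified stand-in for the Kubernetes type, nodeHostIPSet borrows its name from this commit message but its signature is assumed, and hostIPsChanged is a hypothetical plain-function version of the predicate's comparison.

```go
package main

// NodeAddress is a simplified stand-in for a node status address entry.
type NodeAddress struct {
	Type    string // "InternalIP", "ExternalIP", "Hostname", ...
	Address string
}

// nodeHostIPSet returns the node's InternalIPs when any are present and
// falls back to its ExternalIPs otherwise. Other address types are ignored.
func nodeHostIPSet(addrs []NodeAddress) map[string]struct{} {
	internal := map[string]struct{}{}
	external := map[string]struct{}{}
	for _, a := range addrs {
		switch a.Type {
		case "InternalIP":
			internal[a.Address] = struct{}{}
		case "ExternalIP":
			external[a.Address] = struct{}{}
		}
	}
	if len(internal) > 0 {
		return internal
	}
	return external
}

// hostIPsChanged reports whether the selected host IP set differs between
// two versions of a node. Address order does not matter.
func hostIPsChanged(oldAddrs, newAddrs []NodeAddress) bool {
	before, after := nodeHostIPSet(oldAddrs), nodeHostIPSet(newAddrs)
	if len(before) != len(after) {
		return true
	}
	for ip := range before {
		if _, ok := after[ip]; !ok {
			return true
		}
	}
	return false
}
```

Comparing sets rather than raw address lists keeps routine status updates, including reordered addresses, from triggering a reconcile.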

Move HostNetworkedPodLabel out of seven render packages and into setStandardSelectorAndLabels, which every rendered object already flows through on its way to apply. The helper reads podTemplate.Spec.HostNetwork directly off the template it is mutating, so the conditional (label only on hostNetwork pods) is local to where the decision is made; render packages no longer need to know the label exists.

Removes the construction-site labels in typha.go (and the matching delete in the NonClusterHost variant), node.go, windows.go, dpi.go, applicationlayer.go, and the conditional label blocks in apiserver.go and webhooks/render.go. The webhooks DNSPolicy adjustment stays; only the label code was removed from that `if Spec.HostNetwork` branch.

Render-package tests that asserted label presence on freshly-rendered output were testing intermediate state that no longer ships; updated three override-merge assertions accordingly and pushed the host-networked invariant down to four new specs in component_test.go (hostNetwork Deployment / DaemonSet get the label, pod-networked Deployment doesn't, existing labels survive).
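The centralized labeling step can be illustrated with a small sketch. PodTemplate and setHostNetworkedLabel are hypothetical simplifications (the real helper is setStandardSelectorAndLabels and operates on Kubernetes pod templates), and the label key is taken from the earlier commit message; a later commit in this PR fixes the label name, so the exact key shown here is an assumption.

```go
package main

// HostNetworkedPodLabel is the marker label key as written in the earlier
// commit message; the final key may differ.
const HostNetworkedPodLabel = "operator.tigera.io/hostNetworked"

// PodTemplate is a simplified stand-in for a rendered pod template.
type PodTemplate struct {
	Labels      map[string]string
	HostNetwork bool // spec.hostNetwork
}

// setHostNetworkedLabel adds the marker label only to hostNetwork templates
// and leaves any existing labels in place.
func setHostNetworkedLabel(t *PodTemplate) {
	if !t.HostNetwork {
		return
	}
	if t.Labels == nil {
		t.Labels = map[string]string{}
	}
	t.Labels[HostNetworkedPodLabel] = "true"
}
```

Because the decision reads the template's own HostNetwork flag, the three invariants listed above (label on hostNetwork templates, none on pod-networked ones, existing labels preserved) hold without any render package referencing the label.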

coutinhop added 10 commits June 7, 2026 16:44

stop treating unpopulated pod IPs due to them starting up as stale

fb07050

make gen-files gen-versions

756a49e

fix host-networked label name

15bfec6

fixes

36d4153

coutinhop requested a review from a team as a code owner June 8, 2026 00:28

coutinhop added docs-pr-required release-note-required labels Jun 8, 2026

marvin-tigera added this to the v1.40.12 milestone Jun 8, 2026

coutinhop self-assigned this Jun 8, 2026

coutinhop added the hold merge Do not merge label Jun 8, 2026

coutinhop changed the title ~~[v1.40] feat: auto-recover host-networked pods when node IP changes~~ [do not merge] [v1.40] feat: auto-recover host-networked pods when node IP changes Jun 8, 2026

danudey modified the milestones: v1.40.12, v1.40.13 Jun 17, 2026

coutinhop wants to merge 10 commits into tigera:release-v1.40 from coutinhop:auto-pick-of-#4784-upstream-release-v1.40
