[do not merge] [v1.40] feat: auto-recover host-networked pods when node IP changes #4903
Open
coutinhop wants to merge 10 commits into
Detect Calico's host-networked pods (calico-typha, calico-node,
calico-node-windows) whose status.podIPs no longer matches the node's
current InternalIP, and delete them so the Deployment / DaemonSet
controller recreates them with the correct IP.
This works around an upstream Kubernetes limitation [1] where
status.podIPs is immutable for hostNetwork pods once set: when a node's
IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), existing
hostNetwork pods keep their old IP. The kube EndpointSlice controller
reads from status.podIPs, so the calico-typha EndpointSlice ends up
advertising stale IPs and Felix times out connecting to Typha.
Restarting the container does not help — only deleting and recreating
the pod itself causes the kubelet to repopulate status.podIPs from the
current node IP.
Implementation lives in the existing Typha autoscaler tick (every 10s,
already has a Node informer cache):
- Compare each pod's status.podIPs to the InternalIP in its node's
  status.addresses (which the kubelet does update promptly via
  heartbeat).
- Delete stale pods at a pace of one batch per workload per tick.
  Batch size is read from each workload's existing rolling-update
  setting: the Typha PDB's maxUnavailable, or the DaemonSet's
  updateStrategy.rollingUpdate.maxUnavailable. Falls back to 1 if not
  set or if the resolved value is < 1 (minimum-progress guarantee).
- Order: Typha first; if any Typha pod was deleted this cycle, skip the
  calico-node deletions until the next tick to give the new Typha pod
  a clean window to come up. Linux and Windows DaemonSets are paced
  independently of each other.
- Skipped entirely on the non-cluster-host autoscaler instance.
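
The comparison and pacing rules above can be sketched roughly as follows. This is a minimal illustration, not the operator's code: the `pod` struct, `isStale`, `batchSize`, and `staleBatch` are hypothetical stand-ins, `maxUnavailable` is assumed to be already resolved to an integer, and treating a pod with no recorded IPs as "not stale" is an assumption.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the few Pod fields the check needs.
type pod struct {
	name   string
	podIPs []string // mirrors status.podIPs
	nodeIP string   // current InternalIP of the node the pod runs on
}

// isStale reports whether the pod has recorded IPs and none of them equals
// the node's current InternalIP. Assumption: empty inputs count as not stale.
func isStale(p pod) bool {
	if len(p.podIPs) == 0 || p.nodeIP == "" {
		return false
	}
	for _, ip := range p.podIPs {
		if ip == p.nodeIP {
			return false
		}
	}
	return true
}

// batchSize applies the fallback: 1 when maxUnavailable is unset or < 1.
func batchSize(maxUnavailable *int) int {
	if maxUnavailable == nil || *maxUnavailable < 1 {
		return 1
	}
	return *maxUnavailable
}

// staleBatch returns the names of at most limit stale pods, i.e. what a
// single tick would delete for one workload.
func staleBatch(pods []pod, limit int) []string {
	var out []string
	for _, p := range pods {
		if len(out) == limit {
			break
		}
		if isStale(p) {
			out = append(out, p.name)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{"calico-node-a", []string{"10.0.0.5"}, "10.0.0.9"},
		{"calico-node-b", []string{"10.0.0.6"}, "10.0.0.6"},
		{"calico-node-c", []string{"10.0.0.7"}, "10.0.0.8"},
	}
	// No maxUnavailable configured, so only one stale pod is picked per tick.
	fmt.Println(staleBatch(pods, batchSize(nil)))
}
```

With two stale pods and the fallback batch size of 1, the second stale pod would be picked up on the following tick.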
Tested by ODCN on KubeVirt: 3-node cluster with all node IPs changed,
all calico-node and Typha pods recovered automatically without manual
intervention.
[1] kubernetes/kubernetes#93897
Jira: CI-1951, CORE-12452
Add a new Installation.Spec.StalePodIPRecovery field (Enabled /
Disabled, default Enabled) that gates the host-networked stale pod
IP detection and deletion logic in the typha autoscaler. When set
to Disabled, the entire detection path is skipped each tick.
The default-on choice is consistent with other operator-managed
automation (e.g. the typha autoscaler is itself always-on with no
toggle), avoids opt-in friction for users who don't know the bug
exists, and provides an escape hatch for environments where the
detection might interact badly with custom node-IP management.
Implementation notes:
- api/v1: new StalePodIPRecoveryType enum and
  IsStalePodIPRecoveryEnabled helper, modeled on the existing FIPSMode
  pattern. nil is treated as Enabled so the default-on behavior is
  encoded in one place.
- typha_autoscaler.go: new optional func() bool field on the
  autoscaler, consulted at the top of each tick. Wired via the existing
  option pattern (typhaAutoscalerOptionStalePodIPRecoveryEnabled) so
  tests can inject true / false / nil. A nil getter is treated as
  enabled, which keeps existing tests and the non-cluster-host
  autoscaler path unchanged.
- core_controller.go: the closure reads the Installation named
  "default" from the manager's cached client at call time so toggles
  take effect on the next tick (~10s). Failures fall through to
  enabled, since recovery is the safer default for the kubelet bug
  we're working around.
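
The nil-means-enabled behavior described in these notes could look roughly like the sketch below. Only the names StalePodIPRecoveryType and IsStalePodIPRecoveryEnabled come from the notes above; the constant names, the helper's signature, the trimmed-down `InstallationSpec`, and `gateOpen` are assumptions made for illustration.

```go
package main

import "fmt"

// StalePodIPRecoveryType is the Enabled / Disabled enum.
type StalePodIPRecoveryType string

// Constant names are assumed for this sketch.
const (
	StalePodIPRecoveryEnabled  StalePodIPRecoveryType = "Enabled"
	StalePodIPRecoveryDisabled StalePodIPRecoveryType = "Disabled"
)

// InstallationSpec is a stand-in carrying only the new field.
type InstallationSpec struct {
	StalePodIPRecovery *StalePodIPRecoveryType
}

// IsStalePodIPRecoveryEnabled treats a nil spec or nil field as Enabled, so
// the default-on behavior lives in one place. Assumption: only an explicit
// Disabled value turns the feature off.
func IsStalePodIPRecoveryEnabled(spec *InstallationSpec) bool {
	if spec == nil || spec.StalePodIPRecovery == nil {
		return true
	}
	return *spec.StalePodIPRecovery != StalePodIPRecoveryDisabled
}

// gateOpen models the check at the top of each tick: a nil getter counts as
// enabled, otherwise the getter decides.
func gateOpen(getter func() bool) bool {
	if getter == nil {
		return true
	}
	return getter()
}

func main() {
	disabled := StalePodIPRecoveryDisabled
	spec := &InstallationSpec{StalePodIPRecovery: &disabled}

	fmt.Println(gateOpen(nil))
	fmt.Println(gateOpen(func() bool { return IsStalePodIPRecoveryEnabled(spec) }))
}
```

Because the getter is evaluated on every tick, flipping the field takes effect on the next tick without restarting anything.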
Tests:
- 3 new gate tests covering nil getter, true, and false.
- Defensive Maybe() expectations on SetDegraded in the existing stale
  pod IP detection and maxUnavailable resolution contexts to fix
  pre-existing race-condition flakiness exposed by this work.
When a node's IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), Kubernetes does not update status.podIPs on existing hostNetwork pods; it treats the field as immutable once set (kubernetes/kubernetes#93897). The Typha EndpointSlice ends up advertising the stale IP, Felix can't reach Typha, and calico-node pods stay NotReady until pods are manually deleted.
Add a small controller (pkg/controller/podiprecovery) that watches Nodes with a predicate enqueueing reconciles only when the set of NodeInternalIP addresses actually changes, so routine heartbeats don't trigger any work. On reconcile, it lists operator-managed host-networked pods on the affected node and deletes any whose status.podIPs no longer matches the node's current InternalIP. The owning Deployment / DaemonSet recreates each pod with a fresh sandbox, which the kubelet populates with the correct IP.
Covers calico-typha, calico-node (Linux + Windows), tigera-dpi, l7-log-collector, calico-apiserver, and calico-webhooks. A per-pod spec.hostNetwork check makes the conditional ones (apiserver, webhooks) work naturally without mirroring HostNetworkRequired() logic.
No pacing: stale-IP pods are non-functional by definition, so deleting them all at once doesn't worsen availability.
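
The "only when the set of NodeInternalIP addresses actually changes" test can be illustrated with a set comparison between the old and new node objects. The sketch below uses a hypothetical `NodeAddress` stand-in instead of the Kubernetes types, and the function bodies are assumed rather than taken from the controller.

```go
package main

import "fmt"

// NodeAddress is a stand-in for an entry in a Node's status.addresses.
type NodeAddress struct {
	Type    string
	Address string
}

// internalIPSet collects the InternalIP addresses into a set.
func internalIPSet(addrs []NodeAddress) map[string]struct{} {
	set := make(map[string]struct{})
	for _, a := range addrs {
		if a.Type == "InternalIP" {
			set[a.Address] = struct{}{}
		}
	}
	return set
}

// internalIPsChanged is the decision an update predicate would make: true
// only when the two sets differ, regardless of ordering or other fields.
func internalIPsChanged(oldAddrs, newAddrs []NodeAddress) bool {
	oldSet, newSet := internalIPSet(oldAddrs), internalIPSet(newAddrs)
	if len(oldSet) != len(newSet) {
		return true
	}
	for ip := range oldSet {
		if _, ok := newSet[ip]; !ok {
			return true
		}
	}
	return false
}

func main() {
	before := []NodeAddress{{"InternalIP", "10.0.0.5"}, {"Hostname", "node-a"}}
	heartbeat := []NodeAddress{{"Hostname", "node-a"}, {"InternalIP", "10.0.0.5"}}
	newLease := []NodeAddress{{"InternalIP", "10.0.0.9"}, {"Hostname", "node-a"}}

	fmt.Println(internalIPsChanged(before, heartbeat)) // same set: no reconcile
	fmt.Println(internalIPsChanged(before, newLease))  // IP changed: reconcile
}
```

Comparing sets rather than slices means a reordered address list or a changed Hostname entry does not enqueue a reconcile.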
Address review comments on the PodIPRecovery controller.
The big change: stop maintaining a hardcoded list of per-workload label selectors (k8s-app=calico-typha, k8s-app=calico-node, apiserver=true, ...) and instead use a single uniform marker label, operator.tigera.io/hostNetworked=true, applied at render time by every package that produces a hostNetwork pod template. Combined with a spec.nodeName field-indexer registered in cmd/main.go, reconcile is now one server-side List call regardless of how many workloads are in scope, and the controller no longer needs to import per-workload (enterprise) packages. aws-securitygroup-setup is intentionally not labeled: it is a one-shot Job whose pod IP isn't user-visible.
Other changes:
- gate reconcile on the Installation CR via utils.GetInstallationSpec
- switch internal helpers to sets.Set[string]
- drop the redundant Status.PodIP == "" check (PodIP is PodIPs[0] by contract)
- shorter delete log line
- iterate with for _, pod := range pods
Tests:
- controller fixtures updated to the new label
- added Installation-gate and "unlabeled pod is ignored" tests
- the three *DaemonSet/*Deployment override tests in render now expect two labels instead of one
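
To make the selection concrete, the sketch below emulates in memory what a single List scoped by the marker label and by spec.nodeName would match. It is only an illustration: the `Pod` struct and `candidatesOnNode` are hypothetical, and the real controller issues a server-side List rather than filtering a slice.

```go
package main

import "fmt"

// hostNetworkedLabel is the marker label key named in the commit message.
const hostNetworkedLabel = "operator.tigera.io/hostNetworked"

// Pod is a stand-in with only the fields the selection looks at.
type Pod struct {
	Name     string
	NodeName string // mirrors spec.nodeName
	Labels   map[string]string
}

// candidatesOnNode returns the names of pods that carry the marker label
// with value "true" and are scheduled on the given node. Unlabeled pods
// are ignored.
func candidatesOnNode(pods []Pod, nodeName string) []string {
	var out []string
	for _, p := range pods {
		if p.NodeName == nodeName && p.Labels[hostNetworkedLabel] == "true" {
			out = append(out, p.Name)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{"calico-node-x", "node-a", map[string]string{hostNetworkedLabel: "true"}},
		{"calico-typha-y", "node-b", map[string]string{hostNetworkedLabel: "true"}},
		{"unlabeled-z", "node-a", nil},
	}
	fmt.Println(candidatesOnNode(pods, "node-a"))
}
```

Because the marker label is uniform, adding a new host-networked workload requires no change to this selection logic.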
Mirror upstream `k8s.io/kubernetes/pkg/util/node.GetNodeHostIPs`: the kubelet populates `status.podIPs` for a hostNetwork pod from the node's InternalIPs when present and falls back to ExternalIPs otherwise. The recovery controller's comparison must use the same selection, or it would skip recovery on the (rare) ExternalIP-only node where the kubelet still writes a host IP into `status.podIPs`.
Rename `internalIPSet` → `nodeHostIPSet` and `internalIPChangedPredicate` → `hostIPsChangedPredicate` so call sites describe the kubelet semantics rather than a single address type. Update log fields and doc comments accordingly.
Add tests for both branches and for the predicate:
- ExternalIP-only node → uses ExternalIPs for matching.
- Node with both → ExternalIP is ignored; pod carrying the ExternalIP is deleted as stale (InternalIP wins).
- Predicate fires when ExternalIP changes on an InternalIP-less node.
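
The selection rule described here (InternalIPs when present, ExternalIPs otherwise) might be sketched as below. This is a simplified illustration with a hypothetical `NodeAddress` stand-in; it does not reproduce the upstream helper, only the internal-over-external preference stated above.

```go
package main

import "fmt"

// NodeAddress is a stand-in for an entry in a Node's status.addresses.
type NodeAddress struct {
	Type    string
	Address string
}

// nodeHostIPSet returns the node's InternalIPs if it has any, and falls
// back to its ExternalIPs only when no InternalIP is present.
func nodeHostIPSet(addrs []NodeAddress) map[string]struct{} {
	internal := make(map[string]struct{})
	external := make(map[string]struct{})
	for _, a := range addrs {
		switch a.Type {
		case "InternalIP":
			internal[a.Address] = struct{}{}
		case "ExternalIP":
			external[a.Address] = struct{}{}
		}
	}
	if len(internal) > 0 {
		return internal
	}
	return external
}

// has reports whether ip is in the set.
func has(set map[string]struct{}, ip string) bool {
	_, ok := set[ip]
	return ok
}

func main() {
	both := []NodeAddress{{"InternalIP", "10.0.0.5"}, {"ExternalIP", "203.0.113.7"}}
	externalOnly := []NodeAddress{{"ExternalIP", "203.0.113.7"}}

	// With both address types, a pod carrying the ExternalIP does not match.
	fmt.Println(has(nodeHostIPSet(both), "203.0.113.7"))
	// On an ExternalIP-only node, the ExternalIP is what pods are matched on.
	fmt.Println(has(nodeHostIPSet(externalOnly), "203.0.113.7"))
}
```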
Move HostNetworkedPodLabel out of seven render packages and into setStandardSelectorAndLabels, which every rendered object already flows through on its way to apply. The helper reads podTemplate.Spec.HostNetwork directly off the template it is mutating, so the conditional (label only on hostNetwork pods) is local to where the decision is made; render packages no longer need to know the label exists.
Removes the construction-site labels in typha.go (and the matching delete in the NonClusterHost variant), node.go, windows.go, dpi.go, applicationlayer.go, and the conditional label blocks in apiserver.go and webhooks/render.go. The webhooks DNSPolicy adjustment stays; only the label code was removed from that `if Spec.HostNetwork` branch.
Render-package tests that asserted label presence on freshly-rendered output were testing intermediate state that no longer ships; updated three override-merge assertions accordingly and pushed the host-networked invariant down to four new specs in component_test.go (hostNetwork Deployment / DaemonSet get the label, pod-networked Deployment doesn't, existing labels survive).
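
The centralised labeling step can be pictured as a small function applied to every pod template. The sketch below is hypothetical: `PodTemplate` and `labelIfHostNetworked` are stand-ins, not the body of setStandardSelectorAndLabels, and the label value "true" is taken from the marker label described in the earlier commit.

```go
package main

import "fmt"

// HostNetworkedPodLabel is the marker label key.
const HostNetworkedPodLabel = "operator.tigera.io/hostNetworked"

// PodTemplate is a stand-in for the parts of a pod template that matter here.
type PodTemplate struct {
	Labels      map[string]string
	HostNetwork bool // mirrors Spec.HostNetwork
}

// labelIfHostNetworked adds the marker label only when the template is
// host-networked, leaving any existing labels in place.
func labelIfHostNetworked(tpl *PodTemplate) {
	if tpl == nil || !tpl.HostNetwork {
		return
	}
	if tpl.Labels == nil {
		tpl.Labels = make(map[string]string)
	}
	tpl.Labels[HostNetworkedPodLabel] = "true"
}

func main() {
	hostNet := &PodTemplate{Labels: map[string]string{"k8s-app": "calico-node"}, HostNetwork: true}
	podNet := &PodTemplate{Labels: map[string]string{"k8s-app": "example"}}

	labelIfHostNetworked(hostNet)
	labelIfHostNetworked(podNet)

	fmt.Println(len(hostNet.Labels), len(podNet.Labels))
}
```

Reading the hostNetwork flag off the template being mutated keeps the decision in one place, which is the point of the move described above.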
Cherry-pick history
Release Note
For PR author
- make gen-files
- make gen-versions
For PR reviewers
A note for code reviewers - all pull requests must have the following:
- kind/bug if this is a bugfix.
- kind/enhancement if this is a new feature.
- enterprise if this PR applies to Calico Enterprise only.