
[do not merge] [v1.40] feat: auto-recover host-networked pods when node IP changes #4903

Open
coutinhop wants to merge 10 commits into tigera:release-v1.40 from coutinhop:auto-pick-of-#4784-upstream-release-v1.40


@coutinhop

Member

Cherry-pick history

Description

Detect Calico's host-networked pods (calico-typha, calico-node, calico-node-windows) whose status.podIPs no longer matches the node's current InternalIP, and delete them so the Deployment / DaemonSet controller recreates them with the correct IP.

This works around an upstream Kubernetes limitation [1] where status.podIPs is immutable for hostNetwork pods once set: when a node's IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), existing hostNetwork pods keep their old IP. The kube EndpointSlice controller reads from status.podIPs, so the calico-typha EndpointSlice ends up advertising stale IPs and Felix times out connecting to Typha. Restarting the container does not help — only deleting and recreating the pod itself causes the kubelet to repopulate status.podIPs from the current node IP.

Implementation lives in the existing Typha autoscaler tick (every 10s, already has a Node informer cache):

  • Compare each pod's status.podIPs to its node's status.InternalIP (which the kubelet does update promptly via heartbeat).
  • Delete stale pods at a pace of one batch per workload per tick. The batch size comes from each workload's existing rolling-update setting: the Typha PDB's maxUnavailable, or the DaemonSet's updateStrategy.rollingUpdate.maxUnavailable. It falls back to 1 if the setting is absent or resolves to a value < 1 (minimum-progress guarantee).
  • Order: Typha first; if any Typha was deleted this cycle, skip the calico-node deletions until the next tick to give the new Typha pod a clean window to come up. Linux and Windows DaemonSets are paced independently of each other.
  • Skipped entirely on the non-cluster-host autoscaler instance.
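The comparison in the first bullet can be sketched in Go as follows. This is a minimal illustration, not the operator's code: the Pod and Node structs are hypothetical stand-ins for the Kubernetes API types, and treating a pod as current when any recorded IP is still one of the node's InternalIPs is an assumption about what "matches" means here.

```go
package main

import "fmt"

// Pod and Node are hypothetical, simplified stand-ins for the Kubernetes
// API objects; only the fields needed for the comparison are modelled.
type Pod struct {
	Name   string
	PodIPs []string // mirrors status.podIPs
}

type Node struct {
	InternalIPs []string // status.addresses entries of type InternalIP
}

// isStale reports whether none of the pod's recorded IPs is among the
// node's current InternalIPs. Assumption: a pod without IPs, or a node
// without InternalIPs, gives nothing to compare, so the pod is left alone.
func isStale(pod Pod, node Node) bool {
	if len(pod.PodIPs) == 0 || len(node.InternalIPs) == 0 {
		return false
	}
	current := make(map[string]bool, len(node.InternalIPs))
	for _, ip := range node.InternalIPs {
		current[ip] = true
	}
	for _, ip := range pod.PodIPs {
		if current[ip] {
			return false
		}
	}
	return true
}

func main() {
	node := Node{InternalIPs: []string{"10.0.0.7"}}
	pods := []Pod{
		{Name: "calico-node-a", PodIPs: []string{"10.0.0.7"}},
		{Name: "calico-typha-b", PodIPs: []string{"10.0.0.5"}},
	}
	for _, pod := range pods {
		fmt.Printf("%s stale=%v\n", pod.Name, isStale(pod, node))
	}
}
```

In this sketch only the pod still carrying the pre-change address is reported as stale, which is the candidate set the deletion step would then work through.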

Tested by ODCN on KubeVirt: on a 3-node cluster where all node IPs changed, all calico-node and Typha pods recovered automatically without manual intervention.

[1] kubernetes/kubernetes#93897

Jira: CI-1951, CORE-12452

Release Note

Automatically recover Calico pods stranded with stale pod IPs after a node IP change (e.g. KubeVirt node reboot).

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

coutinhop added 10 commits June 7, 2026 16:44
Detect Calico's host-networked pods (calico-typha, calico-node,
calico-node-windows) whose status.podIPs no longer matches the node's
current InternalIP, and delete them so the Deployment / DaemonSet
controller recreates them with the correct IP.

This works around an upstream Kubernetes limitation [1] where
status.podIPs is immutable for hostNetwork pods once set: when a node's
IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), existing
hostNetwork pods keep their old IP. The kube EndpointSlice controller
reads from status.podIPs, so the calico-typha EndpointSlice ends up
advertising stale IPs and Felix times out connecting to Typha.
Restarting the container does not help — only deleting and recreating
the pod itself causes the kubelet to repopulate status.podIPs from the
current node IP.

Implementation lives in the existing Typha autoscaler tick (every 10s,
already has a Node informer cache):

  - Compare each pod's status.podIPs to its node's status.InternalIP
    (which the kubelet does update promptly via heartbeat).
  - Delete stale pods, paced one per workload-batch per tick. Batch
    size is read from each workload's existing rolling-update setting:
    the Typha PDB's maxUnavailable, or the DaemonSet's
    updateStrategy.rollingUpdate.maxUnavailable. Falls back to 1 if not
    set or if the resolved value is < 1 (minimum-progress guarantee).
  - Order: Typha first; if any Typha was deleted this cycle, skip the
    calico-node deletions until the next tick to give the new Typha pod
    a clean window to come up. Linux and Windows DaemonSets are paced
    independently of each other.
  - Skipped entirely on the non-cluster-host autoscaler instance.
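The pacing and ordering rules in the list above can be sketched as a small planning function. This is a hedged illustration only: planTick, nextBatch, batchSize, and the Plan struct are hypothetical names, and maxUnavailable is assumed to have already been resolved to a plain integer (the real setting can also be a percentage).

```go
package main

import "fmt"

// batchSize applies the minimum-progress fallback: any resolved value
// below 1 (including "not set", modelled here as 0) becomes 1.
func batchSize(maxUnavailable int) int {
	if maxUnavailable < 1 {
		return 1
	}
	return maxUnavailable
}

// nextBatch returns at most batchSize(maxUnavailable) stale pod names.
func nextBatch(stale []string, maxUnavailable int) []string {
	n := batchSize(maxUnavailable)
	if n > len(stale) {
		n = len(stale)
	}
	return stale[:n]
}

// Plan lists the pods chosen for deletion in a single tick.
type Plan struct {
	Typha, Linux, Windows []string
}

// planTick handles Typha first. If any Typha pod is chosen, the calico-node
// DaemonSets are skipped until the next tick; otherwise the Linux and
// Windows DaemonSets are each paced by their own batch size.
func planTick(staleTypha, staleLinux, staleWindows []string, typhaMax, linuxMax, windowsMax int) Plan {
	plan := Plan{Typha: nextBatch(staleTypha, typhaMax)}
	if len(plan.Typha) > 0 {
		return plan
	}
	plan.Linux = nextBatch(staleLinux, linuxMax)
	plan.Windows = nextBatch(staleWindows, windowsMax)
	return plan
}

func main() {
	plan := planTick(
		[]string{"typha-1", "typha-2"},
		[]string{"node-1", "node-2"},
		[]string{"win-1"},
		1, 2, 0,
	)
	fmt.Printf("%+v\n", plan)
}
```

With stale Typha pods present, the sketch selects a single Typha batch and leaves both DaemonSets for a later tick, matching the ordering described above.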

Tested by ODCN on KubeVirt: 3-node cluster with all node IPs changed,
all calico-node and Typha pods recovered automatically without manual
intervention.

[1] kubernetes/kubernetes#93897

Jira: CI-1951, CORE-12452

Add a new Installation.Spec.StalePodIPRecovery field (Enabled /
Disabled, default Enabled) that gates the host-networked stale pod
IP detection and deletion logic in the typha autoscaler. When set
to Disabled, the entire detection path is skipped each tick.

The default-on choice is consistent with other operator-managed
automation (e.g. the typha autoscaler is itself always-on with no
toggle), avoids opt-in friction for users who don't know the bug
exists, and provides an escape hatch for environments where the
detection might interact badly with custom node-IP management.

Implementation notes:
  - api/v1: new StalePodIPRecoveryType enum and IsStalePodIPRecoveryEnabled
    helper, modeled on the existing FIPSMode pattern. nil is treated as
    Enabled so the default-on behavior is encoded in one place.
  - typha_autoscaler.go: new optional func() bool field on the autoscaler
    consulted at the top of each tick. Wired via the existing option
    pattern (typhaAutoscalerOptionStalePodIPRecoveryEnabled) so tests can
    inject true / false / nil. A nil getter is treated as enabled, which
    keeps existing tests and the non-cluster-host autoscaler path
    unchanged.
  - core_controller.go: the closure reads the Installation named "default"
    from the manager's cached client at call time so toggles take effect
    on the next tick (~10s). Failures fall through to enabled — recovery
    is the safer default for the kubelet bug we're working around.
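The default-on semantics in these notes can be sketched as below. The type and helper names come from the notes, but their exact signatures are not shown in this PR text, so the ones here are assumptions. The autoscaler struct, gateFromInstallation, and errLookup are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// StalePodIPRecoveryType is sketched as a string enum; the real API type
// may differ in detail.
type StalePodIPRecoveryType string

const (
	StalePodIPRecoveryEnabled  StalePodIPRecoveryType = "Enabled"
	StalePodIPRecoveryDisabled StalePodIPRecoveryType = "Disabled"
)

// InstallationSpec is a hypothetical cut-down spec holding only the field.
type InstallationSpec struct {
	StalePodIPRecovery *StalePodIPRecoveryType
}

// IsStalePodIPRecoveryEnabled treats nil as Enabled, so the default-on
// behavior lives in one place.
func IsStalePodIPRecoveryEnabled(v *StalePodIPRecoveryType) bool {
	return v == nil || *v == StalePodIPRecoveryEnabled
}

// autoscaler holds the optional getter consulted at the top of each tick.
type autoscaler struct {
	stalePodIPRecoveryEnabled func() bool
}

// recoveryEnabled treats a nil getter as enabled.
func (a *autoscaler) recoveryEnabled() bool {
	return a.stalePodIPRecoveryEnabled == nil || a.stalePodIPRecoveryEnabled()
}

// errLookup stands in for any failure to read the Installation.
var errLookup = errors.New("installation lookup failed")

// gateFromInstallation builds a getter that re-reads the spec on every
// call and falls through to enabled when the lookup fails.
func gateFromInstallation(get func() (*InstallationSpec, error)) func() bool {
	return func() bool {
		spec, err := get()
		if err != nil || spec == nil {
			return true
		}
		return IsStalePodIPRecoveryEnabled(spec.StalePodIPRecovery)
	}
}

func main() {
	disabled := StalePodIPRecoveryDisabled
	gate := gateFromInstallation(func() (*InstallationSpec, error) {
		return &InstallationSpec{StalePodIPRecovery: &disabled}, nil
	})
	a := &autoscaler{stalePodIPRecoveryEnabled: gate}
	fmt.Println("recovery enabled:", a.recoveryEnabled())
}
```

Because the getter is evaluated on each call rather than captured once, a change to the field is picked up the next time the tick consults it.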

Tests:
  - 3 new gate tests covering nil getter, true, and false.
  - Defensive Maybe() expectations on SetDegraded in the existing stale
    pod IP detection and maxUnavailable resolution contexts to fix a
    pre-existing race-condition flakiness exposed by this work.

When a node's IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease),
Kubernetes does not update status.podIPs on existing hostNetwork pods
because it treats the field as immutable once set (kubernetes/kubernetes#93897).
The Typha EndpointSlice ends up advertising the stale IP, Felix can't reach
Typha, and calico-node pods stay NotReady until pods are manually deleted.

Add a small controller (pkg/controller/podiprecovery) that watches Nodes
with a predicate enqueueing reconciles only when the set of NodeInternalIP
addresses actually changes, so routine heartbeats don't trigger any work.
On reconcile, it lists operator-managed host-networked pods on the
affected node and deletes any whose status.podIPs no longer matches the
node's current InternalIP. The owning Deployment / DaemonSet recreates
each pod with a fresh sandbox, which the kubelet populates with the
correct IP.
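The "enqueue only when the set of NodeInternalIP addresses changes" idea can be sketched as a pure comparison between the old and new node addresses. This is a simplified illustration: NodeAddress is a hypothetical stand-in for the Kubernetes type, internalIPsChanged is a hypothetical name, and the real predicate is wired into the controller's watch rather than called directly.

```go
package main

import "fmt"

// NodeAddress is a hypothetical stand-in for one status.addresses entry.
type NodeAddress struct {
	Type    string // e.g. "InternalIP", "ExternalIP", "Hostname"
	Address string
}

// internalIPSet collects the InternalIP addresses into a set, so that
// ordering and duplicates do not affect the comparison.
func internalIPSet(addrs []NodeAddress) map[string]struct{} {
	set := make(map[string]struct{})
	for _, a := range addrs {
		if a.Type == "InternalIP" {
			set[a.Address] = struct{}{}
		}
	}
	return set
}

// internalIPsChanged reports whether the InternalIP sets differ. An update
// event would only enqueue a reconcile when this returns true.
func internalIPsChanged(oldAddrs, newAddrs []NodeAddress) bool {
	oldSet, newSet := internalIPSet(oldAddrs), internalIPSet(newAddrs)
	if len(oldSet) != len(newSet) {
		return true
	}
	for ip := range oldSet {
		if _, ok := newSet[ip]; !ok {
			return true
		}
	}
	return false
}

func main() {
	before := []NodeAddress{{"InternalIP", "10.0.0.5"}, {"Hostname", "node-a"}}
	heartbeat := []NodeAddress{{"Hostname", "node-a"}, {"InternalIP", "10.0.0.5"}}
	newLease := []NodeAddress{{"InternalIP", "10.0.0.7"}, {"Hostname", "node-a"}}
	fmt.Println("heartbeat triggers reconcile:", internalIPsChanged(before, heartbeat))
	fmt.Println("new lease triggers reconcile:", internalIPsChanged(before, newLease))
}
```

Comparing sets rather than slices is what keeps reordered or otherwise unchanged heartbeat updates from generating work.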

Covers calico-typha, calico-node (Linux + Windows), tigera-dpi,
l7-log-collector, calico-apiserver, and calico-webhooks. A per-pod
spec.hostNetwork check makes the conditional ones (apiserver, webhooks)
work naturally without mirroring HostNetworkRequired() logic. There is no
pacing: stale-IP pods are non-functional by definition, so deleting them
all at once doesn't worsen availability.

Address review comments on the PodIPRecovery controller. The big
change: stop maintaining a hardcoded list of per-workload label
selectors (k8s-app=calico-typha, k8s-app=calico-node, apiserver=true, ...)
and instead use a single uniform marker label,
operator.tigera.io/hostNetworked=true, applied at render time by every
package that produces a hostNetwork pod template. Combined with a
spec.nodeName field-indexer registered in cmd/main.go, reconcile is now
one server-side List call regardless of how many workloads are in scope,
and the controller no longer needs to import per-workload (enterprise)
packages.
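The selection that the marker label and the spec.nodeName index make possible can be modelled in memory as below. This only illustrates which pods end up in scope. The real controller is described as issuing a single server-side List call instead, and the Pod struct and selectPods function here are hypothetical.

```go
package main

import "fmt"

// hostNetworkedLabel is the marker label named in the commit message.
const hostNetworkedLabel = "operator.tigera.io/hostNetworked"

// Pod is a hypothetical, simplified stand-in for the Kubernetes Pod type.
type Pod struct {
	Name     string
	NodeName string // mirrors spec.nodeName
	Labels   map[string]string
}

// selectPods keeps the pods scheduled on nodeName that carry the marker
// label with value "true". No per-workload selectors are needed.
func selectPods(all []Pod, nodeName string) []Pod {
	var out []Pod
	for _, p := range all {
		if p.NodeName == nodeName && p.Labels[hostNetworkedLabel] == "true" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "calico-node-x", NodeName: "node-a", Labels: map[string]string{hostNetworkedLabel: "true"}},
		{Name: "calico-typha-y", NodeName: "node-b", Labels: map[string]string{hostNetworkedLabel: "true"}},
		{Name: "unlabeled-z", NodeName: "node-a"},
	}
	for _, p := range selectPods(pods, "node-a") {
		fmt.Println("in scope:", p.Name)
	}
}
```

An unlabeled pod on the affected node stays out of scope, which is the behavior the "unlabeled pod is ignored" test in this commit is described as covering.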

aws-securitygroup-setup is intentionally not labeled: one-shot Job, pod
IP isn't user-visible.

Other changes: gate reconcile on the Installation CR via
utils.GetInstallationSpec; switch internal helpers to sets.Set[string];
drop the redundant Status.PodIP == "" check (PodIP is PodIPs[0] by
contract); shorter delete log line; for _, pod := range pods.

Tests: controller fixtures updated to the new label; added
Installation-gate and "unlabeled pod is ignored" tests; the three
*DaemonSet/*Deployment override tests in render now expect two labels
instead of one.

Mirror upstream `k8s.io/kubernetes/pkg/util/node.GetNodeHostIPs`:
the kubelet populates `status.podIPs` for a hostNetwork pod from
the node's InternalIPs when present and falls back to ExternalIPs
otherwise. The recovery controller's comparison must use the same
selection, or it would skip recovery on the (rare) ExternalIP-only
node where the kubelet still writes a host IP into `status.podIPs`.
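The selection rule described here (InternalIPs when present, ExternalIPs otherwise) can be sketched as follows. This follows the commit message's description only and does not reproduce the upstream function. NodeAddress is a hypothetical stand-in type, and the body of nodeHostIPSet is an assumption even though the name appears in this commit.

```go
package main

import "fmt"

// NodeAddress is a hypothetical stand-in for one status.addresses entry.
type NodeAddress struct {
	Type    string // "InternalIP", "ExternalIP", ...
	Address string
}

// nodeHostIPSet returns the node's InternalIPs when it has any, and falls
// back to its ExternalIPs otherwise. When both kinds exist, ExternalIPs
// are ignored.
func nodeHostIPSet(addrs []NodeAddress) map[string]bool {
	internal := make(map[string]bool)
	external := make(map[string]bool)
	for _, a := range addrs {
		switch a.Type {
		case "InternalIP":
			internal[a.Address] = true
		case "ExternalIP":
			external[a.Address] = true
		}
	}
	if len(internal) > 0 {
		return internal
	}
	return external
}

func main() {
	externalOnly := []NodeAddress{{"ExternalIP", "203.0.113.9"}}
	both := []NodeAddress{{"InternalIP", "10.0.0.7"}, {"ExternalIP", "203.0.113.9"}}
	fmt.Println("external-only node:", nodeHostIPSet(externalOnly))
	fmt.Println("node with both:", nodeHostIPSet(both))
}
```

Under this rule a pod whose recorded IP equals the ExternalIP of a node that also has an InternalIP would not match the host IP set.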

Rename `internalIPSet` → `nodeHostIPSet` and
`internalIPChangedPredicate` → `hostIPsChangedPredicate` so call
sites describe the kubelet semantics rather than a single address
type. Update log fields and doc comments accordingly.

Add tests for both branches:
- ExternalIP-only node → uses ExternalIPs for matching.
- Node with both → ExternalIP is ignored; pod carrying the
  ExternalIP is deleted as stale (InternalIP wins).
- Predicate fires when ExternalIP changes on an InternalIP-less
  node.

Move HostNetworkedPodLabel out of seven render packages and into
setStandardSelectorAndLabels, which every rendered object already
flows through on its way to apply. The helper reads
podTemplate.Spec.HostNetwork directly off the template it is
mutating, so the conditional (label only on hostNetwork pods) is
local to where the decision is made; render packages no longer need
to know the label exists.

Removes the construction-site labels in typha.go (and the matching
delete in the NonClusterHost variant), node.go, windows.go,
dpi.go, applicationlayer.go, and the conditional label blocks in
apiserver.go and webhooks/render.go. The webhooks DNSPolicy
adjustment stays — only the label code was removed from that
`if Spec.HostNetwork` branch.

Render-package tests that asserted label presence on
freshly-rendered output were testing intermediate state that no
longer ships; updated three override-merge assertions accordingly
and pushed the host-networked invariant down to four new specs in
component_test.go (hostNetwork Deployment / DaemonSet get the label,
pod-networked Deployment doesn't, existing labels survive).
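The centralized labeling step can be sketched as a helper that inspects the template it is mutating. This is a minimal illustration under assumptions: PodTemplate is a hypothetical cut-down type, and applyHostNetworkedLabel is a hypothetical name for the relevant part of setStandardSelectorAndLabels, whose real body is not shown in this PR text.

```go
package main

import "fmt"

// hostNetworkedLabel is the marker label described in the earlier commit.
const hostNetworkedLabel = "operator.tigera.io/hostNetworked"

// PodTemplate is a hypothetical, simplified pod template.
type PodTemplate struct {
	Labels      map[string]string
	HostNetwork bool // mirrors podTemplate.Spec.HostNetwork
}

// applyHostNetworkedLabel adds the marker label only when the template is
// host-networked, leaving any existing labels in place.
func applyHostNetworkedLabel(t *PodTemplate) {
	if !t.HostNetwork {
		return
	}
	if t.Labels == nil {
		t.Labels = make(map[string]string)
	}
	t.Labels[hostNetworkedLabel] = "true"
}

func main() {
	hostNet := &PodTemplate{HostNetwork: true, Labels: map[string]string{"k8s-app": "calico-node"}}
	podNet := &PodTemplate{HostNetwork: false}
	applyHostNetworkedLabel(hostNet)
	applyHostNetworkedLabel(podNet)
	fmt.Println("hostNetwork template labels:", hostNet.Labels)
	fmt.Println("pod-networked template labels:", podNet.Labels)
}
```

Keeping the condition next to the mutation is what lets render packages stay unaware of the label.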
coutinhop requested a review from a team as a code owner June 8, 2026 00:28
marvin-tigera added this to the v1.40.12 milestone Jun 8, 2026
coutinhop self-assigned this Jun 8, 2026
coutinhop added the hold merge Do not merge label Jun 8, 2026
coutinhop changed the title from "[v1.40] feat: auto-recover host-networked pods when node IP changes" to "[do not merge] [v1.40] feat: auto-recover host-networked pods when node IP changes" Jun 8, 2026
danudey modified the milestones: v1.40.12, v1.40.13 Jun 17, 2026