
iSCSI self-healing does not recover sessions stuck in REOPEN after node reboot #1138

@stratogit

Description


Describe the bug

Trident's iSCSI self-healing (node_server=heal_iscsi) does not recover iSCSI sessions that are stuck in FAILED/REOPEN state after a node reboot. The Linux iSCSI initiator (iscsid) enters an exponential backoff retry loop that never converges, leaving multipath paths degraded indefinitely.

The self-healing loop logs "Skipping iSCSI self-heal cycle; no iSCSI volumes published on the host." on nodes without published volumes, but even on nodes WITH published volumes and active iSCSI sessions, it does not detect or fix sessions stuck in REOPEN state with exponential backoff.

The only manual fix is to reset recovery_tmo to 1 via sysfs (/sys/class/iscsi_session/sessionN/recovery_tmo), which forces the session to tear down and re-establish. This is exactly what the self-healing loop should do when it detects a session has been in REOPEN for an extended period post-reboot.
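
For reference, the stuck sessions and their current recovery timeouts can be inspected before intervening (session ID 4 below is only an example; the bracketed number in the iscsiadm listing is the sysfs session ID):

# Map targets to kernel session IDs (the number in [brackets] is the SID)
sudo iscsiadm -m session

# Inspect the recovery timeout currently applied to a stuck session
cat /sys/class/iscsi_session/session4/recovery_tmo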

Frequency

This issue occurs on roughly 1 in every 10 node reboots, or during cluster upgrades (rolling reboot of all nodes). It is not deterministic: the same node may recover fine on one reboot and get stuck on the next, depending on timing and load during the iSCSI login burst at boot.

Environment

  • Trident version: v26.02.0
  • Trident installation flags used: standard CSI deployment
  • Container runtime: containerd 2.2.3
  • Kubernetes version: v1.36.0
  • OS: SLES 15
  • NetApp backend types: ONTAP AFF, ontap-san driver, iSCSI, LUKS encryption enabled
  • Other: 16 iSCSI data LIFs across 2 ONTAP nodes, multipath with ALUA

To Reproduce

  1. Deploy a cluster with ontap-san backend using iSCSI with multiple data LIFs (16 in our case)
  2. Create pods and PVCs using ontap-san
  3. Perform a rolling upgrade of the cluster (nodes reboot sequentially)
  4. After reboot, some iSCSI sessions fail to re-login and enter FAILED/REOPEN state with exponential backoff
  5. Wait some time — sessions remain stuck in REOPEN, Trident self-healing does not intervene
  6. Multipath shows affected paths as "i/o pending" — I/O works via remaining paths but multipath validation tests fail

Observed on an affected node:

$ sudo iscsiadm -m session -P 1 | grep -B2 "FAILED"
  Current Portal: <portal-ip-1>:3260,1114
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-2>:3260,1111
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-3>:3260,1115
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-4>:3260,1112
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-5>:3260,1116
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN

All 5 failed portals are pingable and accept iSCSI connections from other nodes. The ONTAP backend shows all LIFs up/up and sessions from other initiators are healthy.
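
For example, basic reachability of a stuck portal can be confirmed from the affected node itself (substitute the portal IPs from the listing above; nc assumed available):

ping -c 3 <portal-ip-1>
nc -z -w 5 <portal-ip-1> 3260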

Trident node plugin log during this time:

level=debug msg="iSCSI self-healing is running."
level=debug msg="Skipping iSCSI self-heal cycle; no iSCSI volumes published on the host."

Expected behavior

Trident's iSCSI self-healing should:

  1. Detect iSCSI sessions in FAILED/REOPEN state that have been stuck for longer than a configurable threshold (e.g., 5-10 minutes post-boot)
  2. Validate that the target portal is reachable (ICMP + TCP port 3260) to avoid masking real network issues
  3. If the target is reachable but the session is stuck in backoff, force recovery by setting recovery_tmo=1 via sysfs, then restore the original value
  4. Log the intervention at INFO/WARN level so operators are aware

This would close the gap between iscsid's built-in retry (which can get stuck in exponential backoff) and Trident's self-healing (which currently doesn't address this scenario).
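
A rough shell sketch of the detection-and-recovery flow proposed above (not Trident code; it assumes the standard /sys/class/iscsi_session layout, one connection per session (connection<N>:0), and that nc is available on the node; the post-boot stuck-duration threshold is omitted for brevity):

#!/bin/bash
# For every iSCSI session that is not LOGGED_IN, verify the portal is
# reachable, then force recovery by temporarily dropping recovery_tmo to 1.
for sess in /sys/class/iscsi_session/session*; do
    sid=${sess##*session}
    state=$(cat "$sess/state")                   # LOGGED_IN, FAILED, ...
    [ "$state" = "LOGGED_IN" ] && continue

    conn="/sys/class/iscsi_connection/connection${sid}:0"
    ip=$(cat "$conn/persistent_address")
    port=$(cat "$conn/persistent_port")

    # Only intervene when the portal answers on the iSCSI port, so a real
    # network or LIF outage is not masked.
    if ping -c 1 -W 2 "$ip" >/dev/null && nc -z -w 5 "$ip" "$port"; then
        echo "session $sid ($ip:$port) stuck in $state; forcing recovery"
        orig=$(cat "$sess/recovery_tmo")
        echo 1 | sudo tee "$sess/recovery_tmo" >/dev/null    # tear down + re-login
        sleep 15
        echo "$orig" | sudo tee "$sess/recovery_tmo" >/dev/null
    fi
done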

Workaround

Manual fix from the affected node:

# For each stuck session (e.g., session IDs 4,5,6,7,8):
for s in 4 5 6 7 8; do
  echo 1 | sudo tee /sys/class/iscsi_session/session${s}/recovery_tmo
done
# Sessions immediately tear down and re-login successfully
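
After the reset, the sessions can be confirmed healthy again with:

sudo iscsiadm -m session -P 1 | grep "Session State"

Each entry should report LOGGED_IN once the re-login completes.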

Additional context

  • This issue occurs ~1 in 10 reboots or during cluster upgrades. It is intermittent and race-condition dependent.
  • The issue specifically affects sessions to LIFs on the non-owning ONTAP node. Sessions to the owning node recovered normally after reboot.
  • This suggests a timing/race condition during boot: the node tries to log in to all 16 portals simultaneously, the owning-node LIFs respond faster (optimized paths), and the non-owning-node LIFs (non-optimized ALUA paths) time out during the initial login burst, entering the backoff loop.
  • iscsiadm -m session --rescan does NOT fix this — it only rescans already-logged-in sessions for new LUNs.
  • iscsiadm -m node --login does NOT fix this — it reports "session already present" for the stuck sessions.
  • The only effective fix is the recovery_tmo sysfs reset.
