Describe the bug
Trident's iSCSI self-healing (node_server=heal_iscsi) does not recover iSCSI sessions that are stuck in FAILED/REOPEN state after a node reboot. The Linux iSCSI initiator (iscsid) enters an exponential backoff retry loop that never converges, leaving multipath paths degraded indefinitely.
The self-healing loop logs "Skipping iSCSI self-heal cycle; no iSCSI volumes published on the host." on nodes without published volumes, but even on nodes WITH published volumes and active iSCSI sessions, it does not detect or fix sessions stuck in REOPEN state with exponential backoff.
The only manual fix is to reset recovery_tmo to 1 via sysfs (/sys/class/iscsi_session/sessionN/recovery_tmo), which forces the session to tear down and re-establish. This is exactly what the self-healing loop should do when it detects a session has been in REOPEN for an extended period post-reboot.
Frequency
This issue occurs approximately 1 in every 10 node reboots, or during cluster upgrades (rolling reboot of all nodes). It is not deterministic — the same node may recover fine on one reboot and get stuck on the next, depending on timing and load during the iSCSI login burst at boot.
Environment
- Trident version: v26.02.0
- Trident installation flags used: standard CSI deployment
- Container runtime: containerd 2.2.3
- Kubernetes version: v1.36.0
- OS: SLES 15
- NetApp backend types: ONTAP AFF, ontap-san driver, iSCSI, LUKS encryption enabled
- Other: 16 iSCSI data LIFs across 2 ONTAP nodes, multipath with ALUA
To Reproduce
- Deploy a cluster with an ontap-san backend using iSCSI with multiple data LIFs (16 in our case)
- Create pods and PVCs using ontap-san
- Perform a rolling upgrade of the cluster (nodes reboot sequentially)
- After reboot, some iSCSI sessions fail to re-login and enter FAILED/REOPEN state with exponential backoff
- Wait some time — sessions remain stuck in REOPEN, Trident self-healing does not intervene
- Multipath shows affected paths as "i/o pending" — I/O works via remaining paths but multipath validation tests fail
Observed on an affected node:
$ sudo iscsiadm -m session -P 1 | grep -B2 "FAILED"
Current Portal: <portal-ip-1>:3260,1114
iSCSI Session State: FAILED
Internal iscsid Session State: REOPEN
Current Portal: <portal-ip-2>:3260,1111
iSCSI Session State: FAILED
Internal iscsid Session State: REOPEN
Current Portal: <portal-ip-3>:3260,1115
iSCSI Session State: FAILED
Internal iscsid Session State: REOPEN
Current Portal: <portal-ip-4>:3260,1112
iSCSI Session State: FAILED
Internal iscsid Session State: REOPEN
Current Portal: <portal-ip-5>:3260,1116
iSCSI Session State: FAILED
Internal iscsid Session State: REOPEN
All 5 failed portals are pingable and accept iSCSI connections from other nodes. The ONTAP backend shows all LIFs up/up and sessions from other initiators are healthy.
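The affected portals can be pulled straight out of the state dump above. A small sketch (assumptions: POSIX shell, input piped from `iscsiadm -m session -P 1`; the function name is ours, not an iscsiadm feature):

```shell
# Print the portal of each session iscsid reports as REOPEN, parsed from
# "iscsiadm -m session -P 1" output on stdin. awk remembers the last
# "Current Portal:" line seen and emits it when a REOPEN state line follows.
stuck_portals() {
    awk '/Current Portal:/ { portal = $3 }
         /Internal iscsid Session State: REOPEN/ { print portal }'
}

# Usage on a node:
#   sudo iscsiadm -m session -P 1 | stuck_portals
```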
Trident node plugin log during this time:
level=debug msg="iSCSI self-healing is running."
level=debug msg="Skipping iSCSI self-heal cycle; no iSCSI volumes published on the host."
Expected behavior
Trident's iSCSI self-healing should:
- Detect iSCSI sessions in FAILED/REOPEN state that have been stuck for longer than a configurable threshold (e.g., 5-10 minutes post-boot)
- Validate that the target portal is reachable (ICMP + TCP port 3260) to avoid masking real network issues
- If the target is reachable but the session is stuck in backoff, force recovery by setting recovery_tmo=1 via sysfs, then restore the original value
- Log the intervention at INFO/WARN level so operators are aware
This would close the gap between iscsid's built-in retry (which can get stuck in exponential backoff) and Trident's self-healing (which currently doesn't address this scenario).
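A minimal sketch of the requested heal step, to make the proposal concrete. This is illustrative only, not Trident code; the SYSFS_ROOT and PROBE overrides are hypothetical hooks added so the decision logic can be exercised without root, and `nc -z -w2` is just one common way to probe TCP 3260:

```shell
# Only reset recovery_tmo when the portal still answers on TCP 3260,
# so a genuine network outage is not masked by the heal.
heal_stuck_session() {
    session="$1"                                  # e.g. session4
    portal="$2"                                   # e.g. 192.0.2.10
    root="${SYSFS_ROOT:-/sys/class/iscsi_session}"
    probe="${PROBE:-nc -z -w2}"                   # TCP reachability probe

    if $probe "$portal" 3260 2>/dev/null; then
        # Force immediate teardown/re-login, as in the manual workaround.
        echo 1 > "$root/$session/recovery_tmo"
        echo "reset $session"
    else
        echo "skip $session: portal $portal unreachable"
    fi
}
```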
Workaround
Manual fix from the affected node:
# For each stuck session (e.g., session IDs 4,5,6,7,8):
echo 1 | sudo tee /sys/class/iscsi_session/session4/recovery_tmo
# Sessions immediately tear down and re-login successfully
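When several sessions are stuck at once, the same reset can be looped over every session sysfs reports as FAILED. A sketch (run as root on the affected node; the root argument is only there so the loop can be dry-tested against a fake directory tree):

```shell
# Apply the recovery_tmo reset to every FAILED session under the
# iscsi_session sysfs tree.
reset_stuck_sessions() {
    root="${1:-/sys/class/iscsi_session}"
    for s in "$root"/session*; do
        [ -f "$s/state" ] || continue
        if [ "$(cat "$s/state")" = "FAILED" ]; then
            echo 1 > "$s/recovery_tmo"
            echo "reset ${s##*/}"
        fi
    done
}
```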
Additional context
- This issue occurs ~1 in 10 reboots or during cluster upgrades. It is intermittent and race-condition dependent.
- The issue specifically affects sessions to LIFs on the non-owning ONTAP node. Sessions to the owning node recovered normally after reboot.
- This suggests a timing/race condition during boot: the node tries to login to all 16 portals simultaneously, the owning-node LIFs respond faster (optimized paths), and the non-owning node LIFs (non-optimized ALUA paths) timeout during the initial login burst, entering the backoff loop.
- iscsiadm -m session --rescan does NOT fix this — it only rescans already-logged-in sessions for new LUNs.
- iscsiadm -m node --login does NOT fix this — it reports "session already present" for the stuck sessions.
- The only effective fix is the recovery_tmo sysfs reset.