
iSCSI self-healing does not recover sessions stuck in REOPEN after node reboot #1138

@stratogit

Description


Describe the bug

Trident's iSCSI self-healing (node_server=heal_iscsi) does not recover iSCSI sessions that are stuck in FAILED/REOPEN state after a node reboot. The Linux iSCSI initiator (iscsid) enters an exponential backoff retry loop that never converges, leaving multipath paths degraded indefinitely.

The self-healing loop logs "Skipping iSCSI self-heal cycle; no iSCSI volumes published on the host." on nodes without published volumes, but even on nodes WITH published volumes and active iSCSI sessions, it does not detect or fix sessions stuck in REOPEN state with exponential backoff.

The only manual fix is to reset recovery_tmo to 1 via sysfs (/sys/class/iscsi_session/sessionN/recovery_tmo), which forces the session to tear down and re-establish. This is exactly what the self-healing loop should do when it detects a session has been in REOPEN for an extended period post-reboot.
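
For reference, the stuck sessions and their current recovery timeouts can be inspected before intervening (session ID 4 below is only an example; the bracketed number in the iscsiadm listing is the sysfs session ID):

# Map targets to kernel session IDs (the number in [brackets] is the SID)
sudo iscsiadm -m session

# Inspect the recovery timeout currently applied to a stuck session
cat /sys/class/iscsi_session/session4/recovery_tmo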

Frequency

This issue occurs on roughly 1 in every 10 node reboots, or during cluster upgrades (rolling reboot of all nodes). It is not deterministic: the same node may recover fine on one reboot and get stuck on the next, depending on timing and load during the iSCSI login burst at boot.

Environment

  • Trident version: v26.02.0
  • Trident installation flags used: standard CSI deployment
  • Container runtime: containerd 2.2.3
  • Kubernetes version: v1.36.0
  • OS: SLES 15
  • NetApp backend types: ONTAP AFF, ontap-san driver, iSCSI, LUKS encryption enabled
  • Other: 16 iSCSI data LIFs across 2 ONTAP nodes, multipath with ALUA

To Reproduce

  1. Deploy a cluster with ontap-san backend using iSCSI with multiple data LIFs (16 in our case)
  2. Create pods and PVCs using ontap-san
  3. Perform a rolling upgrade of the cluster (nodes reboot sequentially)
  4. After reboot, some iSCSI sessions fail to re-login and enter FAILED/REOPEN state with exponential backoff
  5. Wait some time — sessions remain stuck in REOPEN, Trident self-healing does not intervene
  6. Multipath shows affected paths as "i/o pending" — I/O works via remaining paths but multipath validation tests fail

Observed on an affected node:

$ sudo iscsiadm -m session -P 1 | grep -B2 "FAILED"
  Current Portal: <portal-ip-1>:3260,1114
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-2>:3260,1111
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-3>:3260,1115
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-4>:3260,1112
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN
  Current Portal: <portal-ip-5>:3260,1116
    iSCSI Session State: FAILED
    Internal iscsid Session State: REOPEN

All 5 failed portals are pingable and accept iSCSI connections from other nodes. The ONTAP backend shows all LIFs up/up and sessions from other initiators are healthy.
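
For example, basic reachability of a stuck portal can be confirmed from the affected node itself (substitute the portal IPs from the listing above; nc assumed available):

ping -c 3 <portal-ip-1>
nc -z -w 5 <portal-ip-1> 3260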

Trident node plugin log during this time:

level=debug msg="iSCSI self-healing is running."
level=debug msg="Skipping iSCSI self-heal cycle; no iSCSI volumes published on the host."

Expected behavior

Trident's iSCSI self-healing should:

  1. Detect iSCSI sessions in FAILED/REOPEN state that have been stuck for longer than a configurable threshold (e.g., 5-10 minutes post-boot)
  2. Validate that the target portal is reachable (ICMP + TCP port 3260) to avoid masking real network issues
  3. If the target is reachable but the session is stuck in backoff, force recovery by setting recovery_tmo=1 via sysfs, then restore the original value
  4. Log the intervention at INFO/WARN level so operators are aware

This would close the gap between iscsid's built-in retry (which can get stuck in exponential backoff) and Trident's self-healing (which currently doesn't address this scenario).
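
A rough shell sketch of the detection-and-recovery flow proposed above (not Trident code; it assumes the standard /sys/class/iscsi_session layout, one connection per session (connection<N>:0), and that nc is available on the node; the post-boot stuck-duration threshold is omitted for brevity):

#!/bin/bash
# For every iSCSI session that is not LOGGED_IN, verify the portal is
# reachable, then force recovery by temporarily dropping recovery_tmo to 1.
for sess in /sys/class/iscsi_session/session*; do
    sid=${sess##*session}
    state=$(cat "$sess/state")                   # LOGGED_IN, FAILED, ...
    [ "$state" = "LOGGED_IN" ] && continue

    conn="/sys/class/iscsi_connection/connection${sid}:0"
    ip=$(cat "$conn/persistent_address")
    port=$(cat "$conn/persistent_port")

    # Only intervene when the portal answers on the iSCSI port, so a real
    # network or LIF outage is not masked.
    if ping -c 1 -W 2 "$ip" >/dev/null && nc -z -w 5 "$ip" "$port"; then
        echo "session $sid ($ip:$port) stuck in $state; forcing recovery"
        orig=$(cat "$sess/recovery_tmo")
        echo 1 | sudo tee "$sess/recovery_tmo" >/dev/null    # tear down + re-login
        sleep 15
        echo "$orig" | sudo tee "$sess/recovery_tmo" >/dev/null
    fi
done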

Workaround

Manual fix from the affected node:

# For each stuck session (e.g., session IDs 4,5,6,7,8):
for s in 4 5 6 7 8; do
  echo 1 | sudo tee /sys/class/iscsi_session/session${s}/recovery_tmo
done
# Sessions immediately tear down and re-login successfully
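
After the reset, the sessions can be confirmed healthy again with:

sudo iscsiadm -m session -P 1 | grep "Session State"

Each entry should report LOGGED_IN once the re-login completes.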

Additional context

  • This issue occurs ~1 in 10 reboots or during cluster upgrades. It is intermittent and race-condition dependent.
  • The issue specifically affects sessions to LIFs on the non-owning ONTAP node. Sessions to the owning node recovered normally after reboot.
  • This suggests a timing/race condition during boot: the node tries to log in to all 16 portals simultaneously, the owning-node LIFs respond faster (optimized paths), and the non-owning-node LIFs (non-optimized ALUA paths) time out during the initial login burst, entering the backoff loop.
  • iscsiadm -m session --rescan does NOT fix this — it only rescans already-logged-in sessions for new LUNs.
  • iscsiadm -m node --login does NOT fix this — it reports "session already present" for the stuck sessions.
  • The only effective fix is the recovery_tmo sysfs reset.
