fix(dpuagent): probe firmware reset requirements in boot-ID reboot path #59

Open
tsorya wants to merge 1 commit into NVIDIA:public-main from tsorya:igal/fix-boot-id-probe-reset

tsorya commented Jun 13, 2026

When RebootMethodDiscovery is false (boot-ID path), getRebootMethodBootID always returned SystemLevelReset on first boot. This triggers shutdown -h on the DPU, which halts Linux but does not power off the ARM SoC. If a firmware update (ATF/UEFI) happened during boot, the ARM enters a protected state where mlxfwreset fails (error 274), leaving the DPU in a zombie state that only a host power cycle can fix.

Add probeResetRequirements(), which runs a best-effort mlxfwreset status check on each MST device before defaulting to SystemLevelReset. If any device reports reset_needed together with a power-cycle signal, return PowerCycle instead. On any probe failure (no MFT, no MST devices, parse error), fall back to SystemLevelReset, preserving the existing behavior.
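
The decision logic described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `resetStatus` struct, the `queryFunc` type, the `errProbeFailed` sentinel, and the function signature are all assumptions made for the example, and the real per-device query is replaced by an injected function. The sketch also assumes that a failure on any single device ends the probe immediately.

```go
package main

import "errors"

// RebootMethod covers the two outcomes named in the description.
type RebootMethod string

const (
	SystemLevelReset RebootMethod = "SystemLevelReset"
	PowerCycle       RebootMethod = "PowerCycle"
)

// errProbeFailed is a hypothetical sentinel for "the probe could not run"
// (for example: MFT not installed, or the output could not be parsed).
var errProbeFailed = errors.New("reset status probe failed")

// resetStatus is a hypothetical, simplified view of one device's status.
type resetStatus struct {
	ResetNeeded        bool // the device reports reset_needed
	PowerCycleRequired bool // the device also signals that a power cycle is required
}

// queryFunc stands in for the real per-device status query.
type queryFunc func(device string) (resetStatus, error)

// probeResetRequirements returns PowerCycle only when a device positively
// reports that it needs one. Every other outcome, including an empty device
// list or a failed query, falls back to SystemLevelReset.
func probeResetRequirements(devices []string, query queryFunc) RebootMethod {
	for _, dev := range devices {
		st, err := query(dev)
		if err != nil {
			// Best effort: assume a failed probe means "keep the old behavior".
			return SystemLevelReset
		}
		if st.ResetNeeded && st.PowerCycleRequired {
			return PowerCycle
		}
	}
	return SystemLevelReset
}
```

Injecting the query as a function keeps the fallback rules testable without MFT installed: a caller can pass a stub that returns `errProbeFailed` and check that the result is still `SystemLevelReset`.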

Extract queryDeviceResetStatus() as a shared helper used by both probeResetRequirements and getRebootMethodDeviceQuery to avoid duplicating the mlxfwreset invocation and JSON parsing logic.

Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>