fix(smp): x86_64 SMP boot hang — gale spin_unlock_valid rejects the (0,0,0) early-boot owner#46

Open
avrabe wants to merge 1 commit into main from fix/x86-smp-spinlock-validate-boot-hang

avrabe commented Jun 3, 2026

Contributor

What

The qemu_x86_64 SMP tests (smp_semaphore/smp_mutex/smp_threads) hung at boot with no console output — the long-standing continue-on-error "Gale shim interaction with x86_64 SMP init" issue. This fixes the hang.

Root cause (reproduced locally + gdb backtrace)

During early x86_64 CPU init (z_x86_cpu_init → z_loapic_enable → virt_region_init → bitarray set_clear_region), CPU 0 takes a spinlock while _current == NULL. The encoded owner is (cpu_id | (uintptr_t)_current) == (0 | 0) == 0.

On unlock, stock z_spin_unlock_valid treats that as a valid match (tcpu 0 == 0 → true). But the gale shim had an extra if (thread_cpu == 0) return false;, so it failed __ASSERT(z_spin_unlock_valid(l)) in k_spin_unlock. The assert path (assert_print → vprintk) itself takes spinlocks whose validation also fails → recursive assertion failure → stack recursion → silent hang before the console is up.
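
The failure chain above can be sketched as a small host-side model. This is not Zephyr or gale code: every `model_*` name and `MODEL_MAX_DEPTH` are hypothetical, and the depth cap exists only so the model terminates, whereas the real recursion ran until the stack was exhausted. It assumes the assert path takes exactly one more lock per level, again with owner word 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap so the model terminates; the real failure has no such cap. */
#define MODEL_MAX_DEPTH 8

static int model_deepest;

/* Unlock check for a caller on CPU 0 with _current == NULL (owner word 0). */
static bool model_unlock_valid(uintptr_t thread_cpu, bool extra_zero_check)
{
    if (extra_zero_check && thread_cpu == 0) {
        return false; /* the early return the shim used to have */
    }
    return thread_cpu == ((uintptr_t)0 | (uintptr_t)0);
}

static void model_spin_unlock(bool extra_zero_check, int depth);

/* Stand-in for the assert path: printing takes and releases its own lock. */
static void model_assert_print(bool extra_zero_check, int depth)
{
    model_spin_unlock(extra_zero_check, depth + 1);
}

static void model_spin_unlock(bool extra_zero_check, int depth)
{
    if (depth > model_deepest) {
        model_deepest = depth;
    }
    if (depth >= MODEL_MAX_DEPTH) {
        return; /* model only: stop instead of overflowing the stack */
    }
    if (!model_unlock_valid(0, extra_zero_check)) {
        model_assert_print(extra_zero_check, depth);
    }
}

/* Returns the deepest nesting level reached by one early-boot unlock. */
static int model_run(bool extra_zero_check)
{
    model_deepest = 0;
    model_spin_unlock(extra_zero_check, 0);
    return model_deepest;
}

static void model_self_check(void)
{
    assert(model_run(true) == MODEL_MAX_DEPTH); /* re-enters until the cap */
    assert(model_run(false) == 0);              /* no assertion, no nesting */
}
```

With the extra zero check enabled, every nested unlock fails validation and re-enters the assert path. Without it, the first unlock validates and nothing nests.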

Fix

Drop the bogus early-return. gale_spin_unlock_valid(0, 0, 0) already returns the correct result (0 == (0|0) → valid), matching stock. The verified Rust was already correct — the bug was C-shim-only.
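
For reviewers, a minimal standalone sketch of the two predicates as described in this PR. The `sketch_*` names are hypothetical and the signatures are simplified; the real shim and the verified Rust function are not reproduced here. It only illustrates that removing the early return makes the (0, 0, 0) case agree with the stock comparison while leaving non-zero owners unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Owner word as described in this PR: CPU id OR'ed with the thread pointer. */
static uintptr_t sketch_encode_owner(uint32_t cpu_id, uintptr_t current)
{
    return (uintptr_t)cpu_id | current;
}

/* Stock-style check: valid iff the stored owner equals the caller's encoding. */
static bool sketch_unlock_valid_stock(uintptr_t thread_cpu, uint32_t cpu_id,
                                      uintptr_t current)
{
    return thread_cpu == sketch_encode_owner(cpu_id, current);
}

/* Shim behaviour before the fix: owner word 0 is rejected up front. */
static bool sketch_unlock_valid_old_shim(uintptr_t thread_cpu, uint32_t cpu_id,
                                         uintptr_t current)
{
    if (thread_cpu == 0) {
        return false;
    }
    return thread_cpu == sketch_encode_owner(cpu_id, current);
}

static void sketch_check(void)
{
    /* Early boot: CPU 0, _current == NULL, so the owner word is 0. */
    assert(sketch_unlock_valid_stock(0, 0, 0));
    assert(!sketch_unlock_valid_old_shim(0, 0, 0));

    /* Non-zero owner (illustrative address): both variants agree. */
    uintptr_t owner = sketch_encode_owner(1, 0x1000);
    assert(sketch_unlock_valid_stock(owner, 1, 0x1000));
    assert(sketch_unlock_valid_old_shim(owner, 1, 0x1000));

    /* Unlock from a different CPU is still rejected. */
    assert(!sketch_unlock_valid_stock(owner, 0, 0x1000));
}
```

The two variants differ only when the stored owner word is 0, which is exactly the early-boot case on CPU 0.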

Verification (local qemu_x86_64 SMP, Zephyr SDK x86_64-zephyr-elf)

  • smp_semaphore: was HANG → now PROJECT EXECUTION SUCCESSFUL (61/0).
  • All three suites now get past the boot hang.

Follow-up (separate, newly-exposed)

With the hang gone, smp_threads shows 9 thread-lifecycle subtest failures (stock: 0) and smp_mutex hangs inside mutex_api — distinct gale-SMP-correctness issues the boot hang was masking. Will file separately.

🤖 Generated with Claude Code

…0) owner

The qemu_x86_64 SMP tests (smp_semaphore/mutex/threads) hung at boot with no
console output and were long marked "tracked for investigation" / continue-on-error.
Root-caused by local repro (qemu_x86_64, Zephyr SDK x86_64-zephyr-elf) + gdb backtrace:

During early x86_64 CPU init (z_x86_cpu_init -> z_loapic_enable -> virt_region_init
-> bitarray set_clear_region), CPU 0 takes a spinlock while _current == NULL. The
encoded owner is (cpu_id | (uintptr_t)_current) == (0 | 0) == 0. On unlock, stock
z_spin_unlock_valid treats that as a valid match (tcpu 0 == 0 -> true). But the gale
shim had an extra `if (thread_cpu == 0) return false;`, so it failed the
__ASSERT(z_spin_unlock_valid(l)) in k_spin_unlock. The assert path (assert_print ->
vprintk) itself takes spinlocks whose validation ALSO fails -> recursive assertion
failure -> stack recursion -> silent hang before the console comes up.

Fix: drop the bogus early-return. gale_spin_unlock_valid already returns the correct
result for (thread_cpu=0, cpu=0, current=0) — `0 == (0|0)` -> valid — matching stock.

Verified locally on qemu_x86_64 SMP: smp_semaphore now PROJECT EXECUTION SUCCESSFUL
(61/0), and all three suites get past the boot hang (were hung before). The verified
Rust (spin_unlock_valid) was already correct; the bug was C-shim-only.

NOTE: with the hang gone, smp_threads shows 9 thread-lifecycle subtest failures and
smp_mutex hangs inside mutex_api — distinct gale-SMP-correctness issues the boot hang
was masking, filed separately for follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov Bot commented Jun 3, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
