fix(smp): x86_64 SMP boot hang — gale spin_unlock_valid rejects the (0,0,0) early-boot owner#46
Open
avrabe wants to merge 1 commit into
…0) owner

The qemu_x86_64 SMP tests (smp_semaphore/mutex/threads) hung at boot with no console output (long "tracked for investigation" / continue-on-error).

Root-caused by local repro (qemu_x86_64, Zephyr SDK x86_64-zephyr-elf) + gdb backtrace:

During early x86_64 CPU init (z_x86_cpu_init -> z_loapic_enable -> virt_region_init -> bitarray set_clear_region), CPU 0 takes a spinlock while _current == NULL. The encoded owner is (cpu_id | (uintptr_t)_current) == (0 | 0) == 0. On unlock, stock z_spin_unlock_valid treats that as a valid match (tcpu 0 == 0 -> true). But the gale shim had an extra `if (thread_cpu == 0) return false;`, so it failed the __ASSERT(z_spin_unlock_valid(l)) in k_spin_unlock. The assert path (assert_print -> vprintk) itself takes spinlocks whose validation ALSO fails -> recursive assertion failure -> stack recursion -> silent hang before the console comes up.

Fix: drop the bogus early-return. gale_spin_unlock_valid already returns the correct result for (thread_cpu=0, cpu=0, current=0): `0 == (0|0)` -> valid, matching stock.

Verified locally on qemu_x86_64 SMP: smp_semaphore now PROJECT EXECUTION SUCCESSFUL (61/0), and all three suites get past the boot hang (were hung before). The verified Rust (spin_unlock_valid) was already correct; the bug was C-shim-only.

NOTE: with the hang gone, smp_threads shows 9 thread-lifecycle subtest failures and smp_mutex hangs inside mutex_api. These are distinct gale-SMP-correctness issues the boot hang was masking, filed separately for follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
What
The `qemu_x86_64` SMP tests (`smp_semaphore`/`smp_mutex`/`smp_threads`) hung at boot with no console output: the long-standing `continue-on-error` "Gale shim interaction with x86_64 SMP init" issue. This fixes the hang.

Root cause (reproduced locally + gdb backtrace)
During early x86_64 CPU init (`z_x86_cpu_init → z_loapic_enable → virt_region_init → bitarray set_clear_region`), CPU 0 takes a spinlock while `_current == NULL`. The encoded owner is `cpu_id | (uintptr_t)_current == 0 | 0 == 0`.

On unlock, stock `z_spin_unlock_valid` treats that as a valid match (`tcpu 0 == 0` → true). But the gale shim had an extra `if (thread_cpu == 0) return false;`, so it failed `__ASSERT(z_spin_unlock_valid(l))` in `k_spin_unlock`. The assert path (`assert_print → vprintk`) itself takes spinlocks whose validation also fails → recursive assertion failure → stack recursion → silent hang before the console is up.

Fix
Drop the bogus early-return. `gale_spin_unlock_valid(0, 0, 0)` already returns the correct result (`0 == (0|0)` → valid), matching stock. The verified Rust was already correct; the bug was C-shim-only.

Verification (local qemu_x86_64 SMP, Zephyr SDK x86_64-zephyr-elf)
`smp_semaphore`: was HANG → now PROJECT EXECUTION SUCCESSFUL (61/0).

Follow-up (separate, newly-exposed)
With the hang gone, `smp_threads` shows 9 thread-lifecycle subtest failures (stock: 0) and `smp_mutex` hangs inside `mutex_api`: distinct gale-SMP-correctness issues the boot hang was masking. Will file separately.

🤖 Generated with Claude Code
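
For reviewers who want to see the root cause in isolation, the early-boot case described above can be sketched as a standalone C snippet. This is not the shim's actual source: the function names (`encode_owner`, `unlock_valid_with_early_return`, `unlock_valid`, `check_early_boot_case`) and the three-argument shape are hypothetical, chosen to mirror the `(thread_cpu, cpu, current)` triple mentioned in the description. It only models the owner comparison, not the real lock state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Owner word as described in the PR: CPU id OR'ed with the current thread
 * pointer. Hypothetical helper, not a Zephyr or gale API. */
uintptr_t encode_owner(uintptr_t cpu_id, uintptr_t current)
{
	return cpu_id | current;
}

/* Assumed reconstruction of the pre-fix shim check: an owner word of 0 is
 * rejected outright, before the comparison is made. */
bool unlock_valid_with_early_return(uintptr_t thread_cpu, uintptr_t cpu_id,
				    uintptr_t current)
{
	if (thread_cpu == 0) {
		return false;
	}
	return thread_cpu == encode_owner(cpu_id, current);
}

/* Assumed post-fix check: plain comparison, so 0 == (0 | 0) is a match. */
bool unlock_valid(uintptr_t thread_cpu, uintptr_t cpu_id, uintptr_t current)
{
	return thread_cpu == encode_owner(cpu_id, current);
}

/* Early boot on CPU 0 with _current == NULL: the owner word is 0. */
void check_early_boot_case(void)
{
	uintptr_t owner = encode_owner(0, 0);

	assert(owner == 0);
	/* The early return rejects the legitimate (0, 0, 0) owner... */
	assert(!unlock_valid_with_early_return(owner, 0, 0));
	/* ...while the plain comparison accepts it. */
	assert(unlock_valid(owner, 0, 0));
}
```

Calling `check_early_boot_case()` from a test harness exercises both variants. A mismatched owner (for example, a lock taken on CPU 1 and released on CPU 0) is still rejected by the plain comparison, so dropping the early return does not weaken the check for non-zero owners.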