Feat/arm64 #76
Open
perbu wants to merge 16 commits into
Use KVM_GET_ONE_REG/KVM_SET_ONE_REG for register access since ARM64 KVM does not implement KVM_GET_REGS/KVM_SET_REGS. Add an MMIO-based stop ABI (guest write to ARM64_STOP_MMIO_ADDR cleanly halts the VM), set the VM IPA size at KVM_CREATE_VM, guard the empty-binary constructor path, and add the missing vMemory snapshot stubs needed to link on ARM64. Includes an arm64_minimal unit test exercising the full run/stop loop.
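For readers unfamiliar with the one-reg interface, here is a minimal sketch of how an ARM64 core register ID is composed. The constants are restated from the arm64 KVM uapi headers so the block compiles without them; `CoreRegs`, `core_reg_id`, `xreg_id` and `pc_id` are illustrative names, not identifiers from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values restated from <linux/kvm.h> and the arm64 <asm/kvm.h>.
constexpr uint64_t REG_ARM64    = 0x6000000000000000ULL; // KVM_REG_ARM64
constexpr uint64_t REG_SIZE_U64 = 0x0030000000000000ULL; // KVM_REG_SIZE_U64
constexpr uint64_t REG_ARM_CORE = 0x0010ULL << 16;       // KVM_REG_ARM_CORE

// Leading part of the arm64 `struct kvm_regs` (same layout as user_pt_regs).
// Illustrative mirror only; real code would use the kernel's struct.
struct CoreRegs {
    uint64_t regs[31]; // x0..x30
    uint64_t sp;       // SP_EL0
    uint64_t pc;
    uint64_t pstate;
};

// A core register is indexed by its byte offset counted in 32-bit words.
constexpr uint64_t core_reg_id(size_t byte_offset) {
    return REG_ARM64 | REG_SIZE_U64 | REG_ARM_CORE
         | (byte_offset / sizeof(uint32_t));
}

// ID of general-purpose register x<n>.
inline uint64_t xreg_id(unsigned n) {
    assert(n < 31);
    return core_reg_id(offsetof(CoreRegs, regs) + n * sizeof(uint64_t));
}

// ID of the program counter.
inline uint64_t pc_id() {
    return core_reg_id(offsetof(CoreRegs, pc));
}
```

The resulting ID goes into the `id` field of `struct kvm_one_reg`, with `addr` pointing at a host-side `uint64_t`, and the struct is passed to `ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg)` or `KVM_SET_ONE_REG`.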
Introduce a shared lib/tinykvm/paging.hpp declaring the common paging API (foreach_page, writable_page_at, readable_page_at, etc.) so both architectures present the same interface. The amd64 header now just includes it, and the arm64 backend gains a full implementation of the page-table walk, copy-on-write, hugepage splitting, and protection handling. This removes the TINYKVM_ARCH_AMD64 special-casing in machine_utils.cpp and memory.cpp, letting copy_to_guest, memzero, mmap_backed_area, and the writable/readable page helpers go through the same code path on both arches. New paging_default_usermode_flags() and paging_address_mask() abstract the arch-specific PTE flag/mask details.
Re-running a fork after reset_to misbehaved (silently dropped writes, or crashed with an MMIO exit at a garbage physical address). The fork's page tables were never the problem; two independent bugs were:
1. Deferred MMIO PC increment. KVM advances the guest PC past a stop-MMIO store only on the next KVM_RUN entry, against whatever PC userspace has loaded by then - so any set-PC + re-run after a stop skipped its first instruction. Complete the MMIO eagerly with an immediate_exit dummy KVM_RUN at the stop and syscall-stop exits; KVM then commits the pending increment against the old PC. Exit PC now points past the stop store, matching amd64 semantics.
2. Stale stage-1 guest TLB. A host-side TTBR0_EL1 write invalidates nothing (unlike CR3 via KVM_SET_SREGS on x86), so after reset_to rebuilds the page tables the vCPU could keep translating through recycled bank pages. Today this is masked by the kernel's tlbi vmalle1is side effect of MADV_DONTNEED in banks.reset(), but don't rely on it: add a TLBI stub in the vectors page (TLB_FLUSH_ADDR) and run it for one VM entry after every page-table switch. The stub only clobbers PC/PSTATE/x9, so just those are saved and restored (~6 us per reset).
Adds a regression test that re-runs a fork across four reset_to cycles. Fast reset measures ~33 us vs ~120 us for a fresh fork per iteration.
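The deferred PC increment is easier to see in a toy model. The sketch below is not KVM code: `ToyVcpu` and its methods are hypothetical, and it only assumes what the commit message states, namely that the pending increment is applied at the next run entry to whatever PC is loaded by then.

```cpp
#include <cassert>
#include <cstdint>

// Toy model, not KVM: the increment for a trapped MMIO store is applied
// at the next run entry, to whatever PC is loaded at that time.
struct ToyVcpu {
    uint64_t pc = 0;
    bool mmio_pending = false;

    // Guest store hits the stop address: exit to host, PC still at the store.
    void trap_on_store() { mmio_pending = true; }

    // Stand-in for a KVM_RUN with immediate_exit set: commits pending
    // state without executing any guest instruction.
    void complete_mmio() {
        if (mmio_pending) {
            pc += 4; // AArch64 instructions are 4 bytes
            mmio_pending = false;
        }
    }

    // Returns the address of the first instruction actually executed.
    uint64_t run() {
        complete_mmio();
        return pc;
    }
};

// Host loads a new PC while the increment is still pending.
inline uint64_t rerun_without_fix(uint64_t stop_pc, uint64_t new_pc) {
    ToyVcpu v;
    v.pc = stop_pc;
    v.trap_on_store();
    v.pc = new_pc;
    return v.run(); // increment lands on new_pc
}

// Host completes the MMIO eagerly, then loads the new PC.
inline uint64_t rerun_with_fix(uint64_t stop_pc, uint64_t new_pc) {
    ToyVcpu v;
    v.pc = stop_pc;
    v.trap_on_store();
    v.complete_mmio(); // increment lands on stop_pc
    v.pc = new_pc;
    return v.run();
}
```

In this model, re-running from `0x2000` without the fix starts at `0x2004`, which is the skipped-first-instruction symptom; with the eager completion it starts at `0x2000`.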
Implements the write-prefetch optimization (Option A). A warmup fork's write working set, harvested via get_accessed_pages(), can now be replayed into every subsequent fork or reset_to with prefetch_pages(), which pre-CoWs each page with the same flags/dirty semantics as the write-fault path. This eliminates the per-page write-fault VM exits: ~32 us reset + ~3.5 us prefetch + ~101 us run vs ~843 us unprefetched on the mixed 256R/256W microbenchmark (~6x). Block-sized entries in the harvested set are walked at page granularity, splitting the block and CoWing each page beneath it. Note that prefetching marks pages accessed, so re-harvesting from a prefetched fork never shrinks the set; harvest once from a clean warmup fork. New unit test runs the full warmup -> harvest -> prefetch -> run pipeline and asserts zero page faults on both the fresh-fork and reset_to paths.
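The warmup, harvest, prefetch, run pipeline can be sketched with a toy copy-on-write model. `ToyFork`, `harvest` and `prefetch` are hypothetical stand-ins for a fork, get_accessed_pages() and prefetch_pages(); the real functions operate on page tables, not on a set of addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy copy-on-write fork: a page becomes private on its first write.
struct ToyFork {
    std::set<uint64_t> private_pages; // pages already copied for this fork
    unsigned write_faults = 0;

    // First write to a shared page faults and copies it.
    void write(uint64_t page) {
        if (private_pages.insert(page).second)
            ++write_faults;
    }

    // Hypothetical stand-in for get_accessed_pages().
    std::vector<uint64_t> harvest() const {
        return std::vector<uint64_t>(private_pages.begin(), private_pages.end());
    }

    // Hypothetical stand-in for prefetch_pages(): copy up front, no fault.
    void prefetch(const std::vector<uint64_t>& pages) {
        private_pages.insert(pages.begin(), pages.end());
    }
};

// Replays a sequence of page writes and returns the total fault count.
inline unsigned run_workload(ToyFork& f, const std::vector<uint64_t>& writes) {
    for (uint64_t page : writes)
        f.write(page);
    return f.write_faults;
}
```

The model also shows why the set never shrinks on re-harvest: `prefetch` marks pages as private whether or not the workload touches them, so a harvest taken from a prefetched fork returns at least the prefetched set.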
The shared ELF loader, setup_linux and syscall table were already arch-clean on ARM64; the missing piece was usermode. Guest RAM had no EL0 access bits, so a loaded program could not execute at all -- and it must run at EL0, not EL1, because an EL1 guest could rewrite its own stage-1 tables and strip the read-only bits that protect the master VM's memory under copy-on-write.
- Map guest RAM user-RWX (AP_USER, UXN clear, explicit PXN). A new L3 table covers the first 2 MB: the vectors page is user-read-only so it stays EL1-executable (user-writable would force PXN), the page tables and vCPU table are EL1-only, and pages below the vectors are left unmapped to catch null dereferences. The MMIO trap block is user-accessible so EL0 guests reach the stop/syscall MMIO directly.
- SCTLR_EL1 gains SPAN (keep PAN off in the vectors, which store to the now user-accessible trap pages), DZE/UCT/UCI (EL0 dc zva, CTR_EL0 and cache maintenance for glibc string routines) and nTWI/nTWE.
- Default pstate is now EL0t (0x3c0) in setup_registers, setup_call and setup_clone; raw-guest demo/bench/tests move from EL1h to EL0.
- setup_linux masks HWCAP_SVE, HWCAP_CPUID and HWCAP2_SME: the vCPU is created without those features, and HWCAP_CPUID invites EL0 ID-register reads (kernel-emulated on real Linux) that our vectors treat as fatal. This is what glibc ifunc resolvers key on.
- Fix handle_exception reading ELR_EL1 as a sysreg; KVM exposes it as a core register, so the diagnostic path itself threw ENOENT and masked the real guest fault.
- paging_default_usermode_flags no longer requires UXN as a verify flag; user pages are executable now.
New arm64_elf unit tests run real statically linked glibc binaries end to end: argv/env passing, write() to the printer, heap allocation and string routines, vmcall into an ELF function, and forked vmcalls isolated by CoW across reset_to. Bench numbers are unchanged.
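The permission rules behind the user-RWX and user-read-only mappings can be written out as a small decoder. This is a sketch of the architectural stage-1 rules for the EL1&0 regime, assuming SCTLR_EL1.WXN is clear; the constant names other than AP_USER and the `decode` helper are illustrative, not taken from the tree.

```cpp
#include <cassert>
#include <cstdint>

// AArch64 stage-1 page descriptor permission bits.
constexpr uint64_t PTE_AP_USER   = 1ULL << 6;  // AP[1]: EL0 data access allowed
constexpr uint64_t PTE_AP_RDONLY = 1ULL << 7;  // AP[2]: read-only at all levels
constexpr uint64_t PTE_PXN       = 1ULL << 53; // privileged execute-never
constexpr uint64_t PTE_UXN       = 1ULL << 54; // unprivileged execute-never

// PSTATE values used at vCPU setup: DAIF masked (bits 9:6) plus mode bits.
constexpr uint64_t PSTATE_EL0T = 0x3c0; // M[3:0] = 0b0000
constexpr uint64_t PSTATE_EL1H = 0x3c5; // M[3:0] = 0b0101

struct Access {
    bool el0_read;
    bool el0_write;
    bool el0_exec;
    bool el1_exec;
};

// Decodes the permissions a descriptor grants, assuming WXN is off.
constexpr Access decode(uint64_t pte) {
    const bool user = (pte & PTE_AP_USER) != 0;
    const bool rdonly = (pte & PTE_AP_RDONLY) != 0;
    const bool user_writable = user && !rdonly;
    return Access{
        user,
        user_writable,
        (pte & PTE_UXN) == 0,
        // A page writable from EL0 is never executable at EL1.
        (pte & PTE_PXN) == 0 && !user_writable,
    };
}

// Guest RAM as described above: user-RWX with explicit PXN.
constexpr uint64_t GUEST_RAM_FLAGS = PTE_AP_USER | PTE_PXN;
// Vectors page: user-read-only, so EL1 may still execute from it.
constexpr uint64_t VECTORS_FLAGS = PTE_AP_USER | PTE_AP_RDONLY;
```

The last rule in `decode` is the reason the vectors page cannot simply be made user-writable: the hardware would then treat it as PXN regardless of the descriptor's PXN bit.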
A minimal end-to-end smoke test for the ARM64 backend: loads a tiny hand-assembled guest, runs it at EL0, and verifies the result via the STOP MMIO address. Built only when TINYKVM_ARCH is ARM64.
Quantifies per-fork costs in the warm-fork-from-master model and empirically settles whether read-access tracking could help fork prefetch. Measures fork/reset, write-set prefetch, and page-fault costs across read-only, write-only, mixed, and prefetched configs; confirms read faults are structurally zero under CoW.
Load ld-linux-aarch64.so.1 as the machine binary with the real program as argv, the same scheme the amd64 ELF tests use. Three fixes:
- Skip pre-relocation of ET_DYN binaries on ARM64. A glibc ET_DYN entered at its own entry point (ld.so, static-PIE) self-relocates, and modern aarch64 ld.so carries DT_RELR. RELR entries are "*addr += base" -- not idempotent -- so pre-applying them here double-relocated ld.so's init_array/cpu_list pointers and crashed the guest. RELATIVE rela entries are absolute writes the guest redoes anyway. amd64 behavior is unchanged.
- Fix the signal-table off-by-one in the arm64 Signals::get stub (at(sig) -> at(sig-1), matching the amd64 implementation), and make rt_sigaction return -EINVAL for signals above 64 instead of letting the array bounds check throw out of the host: CPython sweeps signals 1..65 (glibc _NSIG) at startup.
New tests in arm64_elf.cpp: a PIE dynamic guest, a non-PIE dynamic guest (this gcc's default; needs heap_address_hint above the fixed 0x400000 link address so ld.so's MAP_FIXED doesn't collide with the mmap arena), and a real python3 -c guest run end-to-end. The test codebuilder gains a dynamic (non -static) build mode.
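The difference between the two relocation kinds is small enough to demonstrate directly. The helpers below are illustrative and operate on a single 64-bit slot; they are not the loader's code.

```cpp
#include <cassert>
#include <cstdint>

// RELR-style relocation: the slot holds a link-time address and the load
// base is added in place, so the result depends on the slot's prior value.
inline void apply_relr(uint64_t& slot, uint64_t base) {
    slot += base;
}

// RELA R_AARCH64_RELATIVE: the slot is overwritten with base + addend,
// so the result does not depend on the slot's prior value.
inline void apply_rela_relative(uint64_t& slot, uint64_t base, uint64_t addend) {
    slot = base + addend;
}

// Applies the RELR relocation `times` times to a slot and returns it.
inline uint64_t relocate_relr(uint64_t slot, uint64_t base, int times) {
    for (int i = 0; i < times; ++i)
        apply_relr(slot, base);
    return slot;
}

// Applies the RELA relocation `times` times to a slot and returns it.
inline uint64_t relocate_rela(uint64_t slot, uint64_t base, uint64_t addend, int times) {
    for (int i = 0; i < times; ++i)
        apply_rela_relative(slot, base, addend);
    return slot;
}
```

With a link-time value of `0x1230` and a base of `0x400000`, applying RELR once gives `0x401230`, while applying it twice (host pre-relocation followed by ld.so's self-relocation) gives `0x801230`. The RELA form yields `0x401230` no matter how often it runs.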
mmap_backed_files failed on ARM64 for two physical-address reasons:
- MMAP_PHYS_BASE (256GB) exceeds the 36-bit stage-2 IPA that Apple-Silicon KVM hosts support, so KVM_SET_USER_MEMORY_REGION rejected the slot outright. The base is now 32GB on ARM64, mirroring how the bank arena was already moved down for the same reason.
- TCR_EL1.IPS was left at its reset value (32-bit / 4GB physical), so even an installable slot would have faulted every stage-1 walk that produced an output above 4GB. It worked until now only because all guest-physical memory (RAM, banks at 2GB) sat below 4GB. IPS is now programmed from KVM_CAP_ARM_VM_IPA_SIZE, captured at Machine::init.
The Python guest test now runs with mmap_backed_files enabled and asserts that libpython (5.9MB, above the 4MB threshold) was served by a file-backed memory region rather than the preadv fallback.
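Both constraints reduce to simple arithmetic on address widths. The sketch below uses the architectural TCR_EL1.IPS encodings; the helper names are illustrative, and rounding down to the nearest encodable size is an assumption about a safe policy, not necessarily what the PR does.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t GiB = 1ULL << 30;
constexpr unsigned TCR_IPS_SHIFT = 32; // IPS occupies TCR_EL1 bits [34:32]

// Largest architectural IPS encoding that does not exceed pa_bits.
// Encoding 6 (52 bits) additionally requires FEAT_LPA on the host.
constexpr unsigned ips_encoding(unsigned pa_bits) {
    if (pa_bits >= 52) return 6;
    if (pa_bits >= 48) return 5;
    if (pa_bits >= 44) return 4;
    if (pa_bits >= 42) return 3;
    if (pa_bits >= 40) return 2;
    if (pa_bits >= 36) return 1;
    return 0; // 32 bits, 4GB: the reset-value case described above
}

// Returns tcr with its IPS field replaced for the given address width.
constexpr uint64_t with_ips(uint64_t tcr, unsigned pa_bits) {
    return (tcr & ~(7ULL << TCR_IPS_SHIFT))
         | (static_cast<uint64_t>(ips_encoding(pa_bits)) << TCR_IPS_SHIFT);
}

// True if [base, base + size) lies inside an ipa_bits-wide address space.
constexpr bool fits_in_ipa(uint64_t base, uint64_t size, unsigned ipa_bits) {
    return base + size <= (1ULL << ipa_bits);
}
```

A 36-bit IPA spans 64GB, so a slot based at 256GB cannot fit, while one at 32GB can as long as it ends at or below 64GB.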
The arm64 tgkill handler forwarded every nonzero signal to Signals::enter, which is a not-implemented stub, so a guest calling abort() (raise -> tgkill) threw "Guest signals are not implemented on ARM64" out of machine.run() -- an infrastructure failure mode for an ordinary guest crash. Match the kernel's default dispositions instead: signal 0 is an existence probe, default-ignored signals (SIGCHLD, SIGCONT, SIGURG, SIGWINCH) are dropped, and everything else stops the VM with the conventional 128+sig exit status, readable via return_value(). Handler entry remains unimplemented (and is now unreachable from tgkill); guests with registered handlers also terminate, which is the intended sandbox policy until a real workload demands delivery.
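The resulting policy fits in a few lines. This is a sketch under the assumption that the guest uses the standard Linux aarch64 signal numbers; `on_tgkill` and `TgkillResult` are hypothetical names, not the handler in the tree.

```cpp
#include <cassert>

// Linux aarch64 signal numbers, restated so the sketch is portable.
constexpr int kSigChld  = 17;
constexpr int kSigCont  = 18;
constexpr int kSigUrg   = 23;
constexpr int kSigWinch = 28;

struct TgkillResult {
    bool stop_vm;    // true if the VM should be stopped
    int exit_status; // meaningful only when stop_vm is true
};

// Mirrors the kernel's default dispositions for a guest without handlers.
inline TgkillResult on_tgkill(int sig) {
    if (sig == 0)
        return {false, 0}; // existence probe: nothing is delivered
    switch (sig) {
    case kSigChld:
    case kSigCont:
    case kSigUrg:
    case kSigWinch:
        return {false, 0}; // default-ignored signals are dropped
    default:
        return {true, 128 + sig}; // conventional shell-style exit status
    }
}
```

A guest calling abort() raises signal 6, so the VM stops with status 134, the same value a shell would report for a process killed by SIGABRT.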
Adding real pthread test coverage on ARM64 (tests/unit/arm64_threads.cpp: create/join + shared memory, and a 4-thread mutex-contended counter) surfaced three bugs the threading engine had been hiding:
- vcpu.cpp: get/set_arm64_regs read SP_EL1 (vector scratch) instead of SP_EL0 when parked at EL1h, so cloned threads ran on the parent's stack. Read ELR_EL1 for the user PC and SP_EL0 for the user SP at EL1h.
- system_calls.cpp: prlimit64 had new/old limit pointers swapped, leaving the caller's RLIMIT_STACK buffer uninitialised; glibc then tried to mmap a multi-GB thread stack.
- machine_utils.cpp: Machine::memzero hardcoded amd64's dirty bit (bit 6), which is AP[1] on arm64, so a large PROT_NONE mmap reservation walked off the end of guest RAM. Added paging_dirty_bit() per arch.
create/join and mutex-contended counters now pass end to end. A condvar producer/consumer test is present but [!shouldfail] (futex wake is not address-aware yet). TODO.md updated with port state and the remaining gaps.
The cooperative thread scheduler resumed m_suspended.front() on every FUTEX_WAKE regardless of which address a thread waited on, so a condvar signal could land on the wrong waiter and both threads would park (consumer in cond_wait, main in join). The condvar producer/consumer test deadlocked. Make wakeups address-aware: each Thread tracks the futex address it blocked on (futex_addr; 0 = runnable). FUTEX_WAIT blocks on its address and the scheduler only resumes runnable threads; FUTEX_WAKE marks up to `val` address-matching waiters runnable and hands control to the first of them via the new MultiThreading::switch_to(). Handing off matters because the scheduler never preempts: a producer that did not yield would run to completion before a signalled consumer ran once. Thread exit now also wakes the clear_tid/join futex so pthread_join resumes. next_runnable() is a peek and switch_to() suspends before removing the target, so the scheduler stays consistent if push_back throws. set_to_and_suspend_others() resets futex_addr so a snapshot-restored thread starts runnable and re-checks its predicate on resume. Mirrored in the shared amd64 scheduler (linux/threads.cpp). The condvar test is no longer tagged [!shouldfail]; arm64_minimal (52), arm64_elf (17) and arm64_threads (3) all pass.
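A toy version of the address-aware wake makes the fix concrete. Only the `futex_addr` convention (0 means runnable) comes from the commit message; `ToyScheduler` and its methods are hypothetical and leave out suspension, register state and switch_to().

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct ToyThread {
    int tid;
    uint64_t futex_addr; // address the thread is blocked on; 0 = runnable
};

struct ToyScheduler {
    std::vector<ToyThread> threads;

    void add(int tid) { threads.push_back(ToyThread{tid, 0}); }

    ToyThread& find(int tid) {
        for (auto& t : threads)
            if (t.tid == tid) return t;
        throw std::out_of_range("unknown tid");
    }

    // FUTEX_WAIT: record the address the thread blocks on.
    void futex_wait(int tid, uint64_t addr) { find(tid).futex_addr = addr; }

    // FUTEX_WAKE: mark up to `val` waiters on `addr` runnable and return
    // the tid of the first one (the hand-off target), or -1 if none matched.
    int futex_wake(uint64_t addr, int val) {
        int first = -1;
        for (auto& t : threads) {
            if (val <= 0) break;
            if (addr != 0 && t.futex_addr == addr) {
                t.futex_addr = 0;
                --val;
                if (first < 0) first = t.tid;
            }
        }
        return first;
    }

    bool runnable(int tid) { return find(tid).futex_addr == 0; }
};
```

With the old front-of-queue behaviour, a wake on one address could resume a thread blocked on another; here a waiter on a different address stays blocked until its own address is signalled.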
ARM64 port done with Claude Fable / Opus as well as with some help from OpenAI Codex 5.5