Feat/arm64 by perbu · Pull Request #76 · varnish/tinykvm

perbu · 2026-06-13T08:14:02Z

ARM64 port done with Claude Fable / Opus as well as with some help from OpenAI Codex 5.5

Use KVM_GET_ONE_REG/KVM_SET_ONE_REG for register access since ARM64 KVM does not implement KVM_GET_REGS/KVM_SET_REGS. Add an MMIO-based stop ABI (guest write to ARM64_STOP_MMIO_ADDR cleanly halts the VM), set the VM IPA size at KVM_CREATE_VM, guard the empty-binary constructor path, and add the missing vMemory snapshot stubs needed to link on ARM64. Includes an arm64_minimal unit test exercising the full run/stop loop.
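As a rough illustration of what per-register access involves, the sketch below builds the 64-bit register IDs that KVM_GET_ONE_REG/KVM_SET_ONE_REG expect for arm64 core registers. The constant values mirror the Linux arm64 UAPI headers but are repeated locally so the snippet compiles on any host; the helper names (`core_reg_id`, `reg_id_x`, `CoreRegs`) are illustrative and not taken from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Values mirror KVM_REG_ARM64, KVM_REG_SIZE_U64 and KVM_REG_ARM_CORE from the
// Linux arm64 UAPI headers; redefined here so the sketch is host-independent.
constexpr uint64_t REG_ARM64    = 0x6000000000000000ULL;
constexpr uint64_t REG_SIZE_U64 = 0x0030000000000000ULL;
constexpr uint64_t REG_ARM_CORE = 0x0010ULL << 16;

// Start of the arm64 struct kvm_regs (the user_pt_regs part).
struct CoreRegs {
    uint64_t regs[31];
    uint64_t sp;
    uint64_t pc;
    uint64_t pstate;
};

// Core register IDs encode the field's offset in 32-bit units.
constexpr uint64_t core_reg_id(size_t byte_offset) {
    return REG_ARM64 | REG_SIZE_U64 | REG_ARM_CORE
         | (byte_offset / sizeof(uint32_t));
}

// ID of general-purpose register Xn (hypothetical helper).
constexpr uint64_t reg_id_x(unsigned n) {
    return core_reg_id(n * sizeof(uint64_t));
}

constexpr uint64_t REG_ID_PC     = core_reg_id(offsetof(CoreRegs, pc));
constexpr uint64_t REG_ID_PSTATE = core_reg_id(offsetof(CoreRegs, pstate));

static_assert(REG_ID_PC == 0x6030000000100040ULL, "PC register ID");
```

On a real host the ID goes into the `id` field of a `struct kvm_one_reg`, with `addr` pointing at a 64-bit buffer, and the struct is passed to `ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg)`.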

Introduce a shared lib/tinykvm/paging.hpp declaring the common paging API (foreach_page, writable_page_at, readable_page_at, etc.) so both architectures present the same interface. The amd64 header now just includes it, and the arm64 backend gains a full implementation of the page-table walk, copy-on-write, hugepage splitting, and protection handling. This removes the TINYKVM_ARCH_AMD64 special-casing in machine_utils.cpp and memory.cpp, letting copy_to_guest, memzero, mmap_backed_area, and the writable/readable page helpers go through the same code path on both arches. New paging_default_usermode_flags() and paging_address_mask() abstract the arch-specific PTE flag/mask details.
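To make the page-table walk concrete, here is a minimal sketch of how a virtual address splits into table indices, assuming a 4 KB granule with a 48-bit address space and four levels. That assumption is consistent with the L3 table covering 2 MB mentioned elsewhere in this PR, but it is not confirmed by this paragraph; `WalkIndices` and `split_va` are hypothetical names, not the paging.hpp API.

```cpp
#include <cstdint>

// Indices into each table level plus the byte offset within the 4 KB page.
struct WalkIndices {
    unsigned l0, l1, l2, l3;
    uint64_t offset;
};

// Each level consumes 9 bits of the address (512 entries per table).
constexpr WalkIndices split_va(uint64_t va) {
    return {
        unsigned((va >> 39) & 0x1FF), // L0: 512 GB per entry
        unsigned((va >> 30) & 0x1FF), // L1: 1 GB per entry
        unsigned((va >> 21) & 0x1FF), // L2: 2 MB per entry (block-capable)
        unsigned((va >> 12) & 0x1FF), // L3: 4 KB per entry
        va & 0xFFF,
    };
}

static_assert(split_va(0x400000).l2 == 2, "4 MB lands in the third L2 slot");
```

With 4-level paging, x86-64 uses the same 9-bits-per-level split, which is one reason a shared walk interface across both architectures is plausible.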

Re-running a fork after reset_to misbehaved (silently dropped writes, or crashed with an MMIO exit at a garbage physical address). The fork's page tables were never the problem; two independent bugs were:

1. Deferred MMIO PC increment. KVM advances the guest PC past a stop-MMIO store only on the next KVM_RUN entry, against whatever PC userspace has loaded by then, so any set-PC + re-run after a stop skipped its first instruction. Complete the MMIO eagerly with an immediate_exit dummy KVM_RUN at the stop and syscall-stop exits; KVM then commits the pending increment against the old PC. Exit PC now points past the stop store, matching amd64 semantics.
2. Stale stage-1 guest TLB. A host-side TTBR0_EL1 write invalidates nothing (unlike CR3 via KVM_SET_SREGS on x86), so after reset_to rebuilds the page tables the vCPU could keep translating through recycled bank pages. Today this is masked by the kernel's tlbi vmalle1is side effect of MADV_DONTNEED in banks.reset(), but don't rely on it: add a TLBI stub in the vectors page (TLB_FLUSH_ADDR) and run it for one VM entry after every page-table switch. The stub only clobbers PC/PSTATE/x9, so just those are saved and restored (~6 us per reset).

Adds a regression test that re-runs a fork across four reset_to cycles. Fast reset measures ~33 us vs ~120 us for a fresh fork per iteration.
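The deferred PC increment is easier to see in a toy model. The sketch below is not KVM code: `ToyVcpu` is a hypothetical stand-in that only mimics the one behaviour described above, namely that the pending MMIO completion (a 4-byte PC advance) is applied on the next run entry to whatever PC is loaded at that moment.

```cpp
#include <cstdint>

// Toy model of a vCPU whose stop-MMIO store is completed lazily.
struct ToyVcpu {
    uint64_t pc = 0;
    bool mmio_pending = false;
    uint64_t first_executed = 0; // PC of the first instruction run after entry

    // The guest stored to the stop address and exited to the host.
    void stop_store_exit() { mmio_pending = true; }

    // Run entry commits the pending completion before anything else.
    // With immediate_exit the entry returns without running guest code.
    void run(bool immediate_exit = false) {
        if (mmio_pending) { pc += 4; mmio_pending = false; }
        if (immediate_exit) return;
        first_executed = pc;
    }
};

// Buggy order: load the new PC while the completion is still pending.
inline uint64_t rerun_lazy(uint64_t stop_pc, uint64_t entry) {
    ToyVcpu v;
    v.pc = stop_pc;
    v.stop_store_exit();
    v.pc = entry;
    v.run();
    return v.first_executed; // entry + 4: first instruction skipped
}

// Fixed order: a dummy immediate-exit run commits the increment first.
inline uint64_t rerun_eager(uint64_t stop_pc, uint64_t entry) {
    ToyVcpu v;
    v.pc = stop_pc;
    v.stop_store_exit();
    v.run(true);  // PC becomes stop_pc + 4, past the stop store
    v.pc = entry;
    v.run();
    return v.first_executed; // entry
}
```

In the real API the dummy entry corresponds to setting `immediate_exit` in the shared `kvm_run` structure before calling KVM_RUN; the model only captures the ordering, not the ioctl details.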

Implements the write-prefetch optimization (Option A). A warmup fork's write working set, harvested via get_accessed_pages(), can now be replayed into every subsequent fork or reset_to with prefetch_pages(), which pre-CoWs each page with the same flags/dirty semantics as the write-fault path. This eliminates the per-page write-fault VM exits: ~32 us reset + ~3.5 us prefetch + ~101 us run vs ~843 us unprefetched on the mixed 256R/256W microbenchmark (~6x). Block-sized entries in the harvested set are walked at page granularity, splitting the block and CoWing each page beneath it. Note that prefetching marks pages accessed, so re-harvesting from a prefetched fork never shrinks the set; harvest once from a clean warmup fork. New unit test runs the full warmup -> harvest -> prefetch -> run pipeline and asserts zero page faults on both the fresh-fork and reset_to paths.
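The page-granularity walk over block-sized entries can be sketched as follows. This is a simplified illustration under assumed types: `Entry` and `expand_to_pages` are hypothetical, and the real prefetch_pages() also splits the block mapping and copies each page, which this snippet leaves out.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical harvested entry: start address plus mapping size
// (4 KB for a page, 2 MB for a block).
struct Entry {
    uint64_t addr;
    uint64_t size;
};

constexpr uint64_t PAGE_SIZE_4K = 4096;

// Expand every entry into the 4 KB page addresses beneath it, so that a
// block-sized entry is replayed page by page.
inline std::vector<uint64_t> expand_to_pages(const std::vector<Entry>& set) {
    std::vector<uint64_t> pages;
    for (const Entry& e : set) {
        if (e.size == 0) continue;
        const uint64_t begin = e.addr & ~(PAGE_SIZE_4K - 1);
        const uint64_t end = e.addr + e.size;
        for (uint64_t a = begin; a < end; a += PAGE_SIZE_4K)
            pages.push_back(a);
    }
    return pages;
}
```

A single 2 MB block entry expands to 512 pages. For the quoted timings, 32 + 3.5 + 101 = 136.5 us against 843 us works out to roughly 6.2x.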

The shared ELF loader, setup_linux and syscall table were already arch-clean on ARM64; the missing piece was usermode. Guest RAM had no EL0 access bits, so a loaded program could not execute at all -- and it must run at EL0, not EL1, because an EL1 guest could rewrite its own stage-1 tables and strip the read-only bits that protect the master VM's memory under copy-on-write.

- Map guest RAM user-RWX (AP_USER, UXN clear, explicit PXN). A new L3 table covers the first 2 MB: the vectors page is user-read-only so it stays EL1-executable (user-writable would force PXN), the page tables and vCPU table are EL1-only, and pages below the vectors are left unmapped to catch null dereferences. The MMIO trap block is user-accessible so EL0 guests reach the stop/syscall MMIO directly.
- SCTLR_EL1 gains SPAN (keep PAN off in the vectors, which store to the now user-accessible trap pages), DZE/UCT/UCI (EL0 dc zva, CTR_EL0 and cache maintenance for glibc string routines) and nTWI/nTWE.
- Default pstate is now EL0t (0x3c0) in setup_registers, setup_call and setup_clone; raw-guest demo/bench/tests move from EL1h to EL0.
- setup_linux masks HWCAP_SVE, HWCAP_CPUID and HWCAP2_SME: the vCPU is created without those features, and HWCAP_CPUID invites EL0 ID-register reads (kernel-emulated on real Linux) that our vectors treat as fatal. These HWCAP bits are what glibc ifunc resolvers key on.
- Fix handle_exception reading ELR_EL1 as a sysreg; KVM exposes it as a core register, so the diagnostic path itself threw ENOENT and masked the real guest fault.
- paging_default_usermode_flags no longer requires UXN as a verify flag; user pages are executable now.

New arm64_elf unit tests run real statically linked glibc binaries end to end: argv/env passing, write() to the printer, heap allocation and string routines, vmcall into an ELF function, and forked vmcalls isolated by CoW across reset_to. Bench numbers are unchanged.
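For readers unfamiliar with the PSTATE encoding, the sketch below shows how the EL0t value 0x3c0 quoted above decomposes into the AArch64 mode field and the DAIF mask bits. The constant names are illustrative, and the 0x3c5 value for EL1h with the same masks is an assumption about what the raw guests used before, not something the PR states.

```cpp
#include <cstdint>

// AArch64 mode field M[3:0]: bits [3:2] hold the exception level,
// bit 0 selects the stack pointer (t = SP_EL0, h = SP_ELx).
constexpr uint64_t MODE_EL0t = 0x0;
constexpr uint64_t MODE_EL1t = 0x4;
constexpr uint64_t MODE_EL1h = 0x5;

// Exception mask bits: FIQ, IRQ, SError (A) and Debug.
constexpr uint64_t PSR_F = 1ULL << 6;
constexpr uint64_t PSR_I = 1ULL << 7;
constexpr uint64_t PSR_A = 1ULL << 8;
constexpr uint64_t PSR_D = 1ULL << 9;
constexpr uint64_t DAIF_MASKED = PSR_D | PSR_A | PSR_I | PSR_F;

constexpr uint64_t PSTATE_EL0t = DAIF_MASKED | MODE_EL0t;
constexpr uint64_t PSTATE_EL1h = DAIF_MASKED | MODE_EL1h; // assumed old default

// Extract the exception level from a PSTATE/SPSR value.
constexpr unsigned exception_level(uint64_t pstate) {
    return unsigned((pstate >> 2) & 0x3);
}

static_assert(PSTATE_EL0t == 0x3c0, "EL0t with DAIF masked");
```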

A minimal end-to-end smoke test for the ARM64 backend: loads a tiny hand-assembled guest, runs it at EL0, and verifies the result via the STOP MMIO address. Built only when TINYKVM_ARCH is ARM64.

Quantifies per-fork costs in the warm-fork-from-master model and empirically settles whether read-access tracking could help fork prefetch. Measures fork/reset, write-set prefetch, and page-fault costs across read-only, write-only, mixed, and prefetched configs; confirms read faults are structurally zero under CoW.

Load ld-linux-aarch64.so.1 as the machine binary with the real program as argv, the same scheme the amd64 ELF tests use. Three fixes:

- Skip pre-relocation of ET_DYN binaries on ARM64. A glibc ET_DYN entered at its own entry point (ld.so, static-PIE) self-relocates, and modern aarch64 ld.so carries DT_RELR. RELR entries are "*addr += base" -- not idempotent -- so pre-applying them here double-relocated ld.so's init_array/cpu_list pointers and crashed the guest. RELATIVE rela entries are absolute writes the guest redoes anyway. amd64 behavior is unchanged.
- Fix the signal-table off-by-one in the arm64 Signals::get stub (at(sig) -> at(sig-1), matching the amd64 implementation).
- Make rt_sigaction return -EINVAL for signals above 64 instead of letting the array bounds check throw out of the host: CPython sweeps signals 1..65 (glibc _NSIG) at startup.

New tests in arm64_elf.cpp: a PIE dynamic guest, a non-PIE dynamic guest (this gcc's default; needs heap_address_hint above the fixed 0x400000 link address so ld.so's MAP_FIXED doesn't collide with the mmap arena), and a real python3 -c guest run end-to-end. The test codebuilder gains a dynamic (non -static) build mode.
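The difference between the two relocation styles can be shown with a small model. This is a simplified sketch: the in-memory image is a plain vector of 64-bit slots, the RELR bitmap encoding is ignored, and all names (`apply_relr`, `apply_rela_relative`, `Rela`) are hypothetical rather than taken from the loader.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// RELR-style relocation: add the load base to whatever the slot holds.
inline void apply_relr(std::vector<uint64_t>& image,
                       const std::vector<size_t>& slots, uint64_t base) {
    for (size_t i : slots)
        image[i] += base;
}

// RELA relative relocation: overwrite the slot with base + addend.
struct Rela {
    size_t slot;
    uint64_t addend;
};

inline void apply_rela_relative(std::vector<uint64_t>& image,
                                const std::vector<Rela>& relas, uint64_t base) {
    for (const Rela& r : relas)
        image[r.slot] = base + r.addend;
}
```

Applying `apply_relr` twice with base 0x400000 to a slot holding 0x100 yields 0x800100, a pointer shifted by the base twice, while applying `apply_rela_relative` twice still yields 0x400100. That is why a loader pass followed by the guest's own self-relocation is harmless for RELA but corrupts RELR-relocated pointers.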

mmap_backed_files failed on ARM64 for two physical-address reasons:

- MMAP_PHYS_BASE (256GB) exceeds the 36-bit stage-2 IPA that Apple-Silicon KVM hosts support, so KVM_SET_USER_MEMORY_REGION rejected the slot outright. The base is now 32GB on ARM64, mirroring how the bank arena was already moved down for the same reason.
- TCR_EL1.IPS was left at its reset value (32-bit / 4GB physical), so even an installable slot would have faulted every stage-1 walk that produced an output above 4GB. It worked until now only because all guest-physical memory (RAM, banks at 2GB) sat below 4GB. IPS is now programmed from KVM_CAP_ARM_VM_IPA_SIZE, captured at Machine::init.

The Python guest test now runs with mmap_backed_files enabled and asserts that libpython (5.9MB, above the 4MB threshold) was served by a file-backed memory region rather than the preadv fallback.
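The arithmetic behind both failures can be checked with a short sketch. The IPS encodings follow the architectural TCR_EL1 field layout; how the PR actually derives the field from KVM_CAP_ARM_VM_IPA_SIZE is not shown in the text, so `tcr_ips_field` is one plausible mapping under that assumption, and all names are illustrative.

```cpp
#include <cstdint>

constexpr uint64_t GiB = 1ULL << 30;

// True if a guest-physical address is representable with the given IPA width.
constexpr bool fits_in_ipa(uint64_t phys_addr, unsigned ipa_bits) {
    return phys_addr < (1ULL << ipa_bits);
}

// TCR_EL1.IPS lives in bits [34:32]; encodings 0..5 select
// 32, 36, 40, 42, 44 and 48 bits of physical address.
constexpr unsigned TCR_IPS_SHIFT = 32;

constexpr uint64_t tcr_ips_field(unsigned pa_bits) {
    return pa_bits <= 32 ? 0
         : pa_bits <= 36 ? 1
         : pa_bits <= 40 ? 2
         : pa_bits <= 42 ? 3
         : pa_bits <= 44 ? 4
         : 5;
}

// 256 GB is 2^38 and does not fit in a 36-bit IPA (64 GB); 32 GB does.
static_assert(!fits_in_ipa(256 * GiB, 36), "old base is out of range");
static_assert(fits_in_ipa(32 * GiB, 36), "new base is in range");
```

With the reset encoding of 0, any stage-1 output at or above 4 GB is out of range, which matches the observation that the bug stayed hidden while all guest-physical memory sat below 4GB.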

The arm64 tgkill handler forwarded every nonzero signal to Signals::enter, which is a not-implemented stub, so a guest calling abort() (raise -> tgkill) threw "Guest signals are not implemented on ARM64" out of machine.run() -- an infrastructure failure mode for an ordinary guest crash. Match the kernel's default dispositions instead: signal 0 is an existence probe, default-ignored signals (SIGCHLD, SIGCONT, SIGURG, SIGWINCH) are dropped, and everything else stops the VM with the conventional 128+sig exit status, readable via return_value(). Handler entry remains unimplemented (and is now unreachable from tgkill); guests with registered handlers also terminate, which is the intended sandbox policy until a real workload demands delivery.
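The disposition policy described above can be sketched as a small pure function. This is a hedged illustration, not the handler from the PR: `Disposition` and `tgkill_disposition` are hypothetical names, and the real handler also has to stop the VM and record the status for return_value().

```cpp
#include <signal.h>

// Outcome of a guest tgkill: either nothing happens, or the VM stops
// with the given exit status.
struct Disposition {
    bool stop;
    int status;
};

inline Disposition tgkill_disposition(int sig) {
    switch (sig) {
    case 0:        // existence probe: no signal is delivered
        return {false, 0};
    case SIGCHLD:  // signals whose default action is to be ignored
    case SIGCONT:
    case SIGURG:
    case SIGWINCH:
        return {false, 0};
    default:       // everything else terminates with the shell convention
        return {true, 128 + sig};
    }
}
```

Under this policy a guest abort() on Linux, which raises SIGABRT (6), ends the run with status 134 instead of an exception escaping machine.run().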

Adding real pthread test coverage on ARM64 (tests/unit/arm64_threads.cpp: create/join + shared memory, and a 4-thread mutex-contended counter) surfaced three bugs the threading engine had been hiding:

- vcpu.cpp: get/set_arm64_regs read SP_EL1 (vector scratch) instead of SP_EL0 when parked at EL1h, so cloned threads ran on the parent's stack. Read ELR_EL1 for the user PC and SP_EL0 for the user SP at EL1h.
- system_calls.cpp: prlimit64 had new/old limit pointers swapped, leaving the caller's RLIMIT_STACK buffer uninitialised; glibc then tried to mmap a multi-GB thread stack.
- machine_utils.cpp: Machine::memzero hardcoded amd64's dirty bit (bit 6), which is AP[1] on arm64, so a large PROT_NONE mmap reservation walked off the end of guest RAM. Added paging_dirty_bit() per arch.

create/join and mutex-contended counters now pass end to end. A condvar producer/consumer test is present but tagged [!shouldfail] (futex wake is not address-aware yet). TODO.md updated with port state and the remaining gaps.

The cooperative thread scheduler resumed m_suspended.front() on every FUTEX_WAKE regardless of which address a thread waited on, so a condvar signal could land on the wrong waiter and both threads would park (consumer in cond_wait, main in join). The condvar producer/consumer test deadlocked.

Make wakeups address-aware: each Thread tracks the futex address it blocked on (futex_addr; 0 = runnable). FUTEX_WAIT blocks on its address and the scheduler only resumes runnable threads; FUTEX_WAKE marks up to `val` address-matching waiters runnable and hands control to the first of them via the new MultiThreading::switch_to(). Handing off matters because the scheduler never preempts: a producer that did not yield would run to completion before a signalled consumer ran once. Thread exit now also wakes the clear_tid/join futex so pthread_join resumes.

next_runnable() is a peek and switch_to() suspends before removing the target, so the scheduler stays consistent if push_back throws. set_to_and_suspend_others() resets futex_addr so a snapshot-restored thread starts runnable and re-checks its predicate on resume. Mirrored in the shared amd64 scheduler (linux/threads.cpp). The condvar test is no longer tagged [!shouldfail]; arm64_minimal (52), arm64_elf (17) and arm64_threads (3) all pass.
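The address-aware wakeup described above can be modelled in a few lines. This is a minimal sketch under assumed data structures: the `Thread` and `WakeResult` structs and the free functions are hypothetical simplifications, not the MultiThreading class from the PR, and the actual context switch is reduced to returning the thread id to hand off to.

```cpp
#include <cstdint>
#include <deque>

// A suspended thread; futex_addr == 0 means runnable, as in the text.
struct Thread {
    int tid;
    uint64_t futex_addr;
};

struct WakeResult {
    int woken;     // number of waiters marked runnable
    int switch_to; // tid to hand control to, or -1 if nobody matched
};

// FUTEX_WAKE: mark up to `val` waiters on `addr` runnable and report the
// first of them so the caller can yield to it.
inline WakeResult futex_wake(std::deque<Thread>& suspended,
                             uint64_t addr, int val) {
    WakeResult r = {0, -1};
    if (addr == 0) return r; // 0 is reserved for "runnable"
    for (Thread& t : suspended) {
        if (r.woken >= val) break;
        if (t.futex_addr != addr) continue;
        t.futex_addr = 0;
        if (r.switch_to < 0) r.switch_to = t.tid;
        ++r.woken;
    }
    return r;
}

// Peek at the next runnable thread without removing it; -1 if all blocked.
inline int next_runnable(const std::deque<Thread>& suspended) {
    for (const Thread& t : suspended)
        if (t.futex_addr == 0) return t.tid;
    return -1;
}
```

With two waiters parked on different addresses, a wake on the second address leaves the first thread blocked and hands off to the second, which is the case the old front-of-queue policy got wrong.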

perbu added 16 commits June 13, 2026 10:12

Add initial ARM64 backend

2e151cf

Implement ARM64 syscall trap path

5db73d3

Implement ARM64 CoW and snapshot foundations

f643b66

Ignore build-* output directories

365123c

Add ARM64 demo program

31214f0


perbu marked this pull request as ready for review June 13, 2026 12:57

perbu wants to merge 16 commits into master from feat/arm64
