Feat/arm64 #76

Open
perbu wants to merge 16 commits into master from feat/arm64

perbu (Collaborator) commented Jun 13, 2026

ARM64 port done with Claude Fable / Opus as well as with some help from OpenAI Codex 5.5

perbu added 16 commits June 13, 2026 10:12

Use KVM_GET_ONE_REG/KVM_SET_ONE_REG for register access since ARM64 KVM
does not implement KVM_GET_REGS/KVM_SET_REGS. Add an MMIO-based stop ABI
(guest write to ARM64_STOP_MMIO_ADDR cleanly halts the VM), set the VM
IPA size at KVM_CREATE_VM, guard the empty-binary constructor path, and
add the missing vMemory snapshot stubs needed to link on ARM64.

Includes an arm64_minimal unit test exercising the full run/stop loop.
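
As a rough illustration of the per-register access described above: KVM_GET_ONE_REG and KVM_SET_ONE_REG address each register by a 64-bit ID. The sketch below rebuilds that ID encoding for ARM64 core registers from constants copied out of the Linux arm64 KVM uapi headers, so it compiles on any host. The names `UserPtRegs`, `core_reg_id` and `reg_x` are hypothetical helpers for this example, not TinyKVM's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values mirrored from the Linux arm64 KVM uapi headers (assumption: they
// are repeated here only so the sketch builds without those headers).
constexpr uint64_t REG_ARM64    = 0x6000000000000000ULL; // KVM_REG_ARM64
constexpr uint64_t REG_SIZE_U64 = 0x0030000000000000ULL; // KVM_REG_SIZE_U64
constexpr uint64_t REG_ARM_CORE = 0x0010ULL << 16;       // KVM_REG_ARM_CORE

// Leading part of the arm64 core register file: x0..x30, SP, PC, PSTATE.
struct UserPtRegs {
    uint64_t regs[31];
    uint64_t sp;
    uint64_t pc;
    uint64_t pstate;
};

// A core register ID carries the register's offset in 32-bit units.
constexpr uint64_t core_reg_id(size_t byte_offset) {
    return REG_ARM64 | REG_SIZE_U64 | REG_ARM_CORE |
           (byte_offset / sizeof(uint32_t));
}

// ID of general-purpose register xN (N in 0..30).
constexpr uint64_t reg_x(unsigned n) {
    return core_reg_id(n * sizeof(uint64_t));
}

constexpr uint64_t REG_SP     = core_reg_id(offsetof(UserPtRegs, sp));
constexpr uint64_t REG_PC     = core_reg_id(offsetof(UserPtRegs, pc));
constexpr uint64_t REG_PSTATE = core_reg_id(offsetof(UserPtRegs, pstate));
```

Each ID would then go into a `struct kvm_one_reg` (fields `id` and `addr`) passed to `ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg)`, one call per register.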

Introduce a shared lib/tinykvm/paging.hpp declaring the common paging
API (foreach_page, writable_page_at, readable_page_at, etc.) so both
architectures present the same interface. The amd64 header now just
includes it, and the arm64 backend gains a full implementation of the
page-table walk, copy-on-write, hugepage splitting, and protection
handling.

This removes the TINYKVM_ARCH_AMD64 special-casing in machine_utils.cpp
and memory.cpp, letting copy_to_guest, memzero, mmap_backed_area, and the
writable/readable page helpers go through the same code path on both
arches. New paging_default_usermode_flags() and paging_address_mask()
abstract the arch-specific PTE flag/mask details.

Re-running a fork after reset_to misbehaved (silently dropped writes, or
crashed with an MMIO exit at a garbage physical address). The fork's page
tables were never the problem; two independent bugs were:

1. Deferred MMIO PC increment. KVM advances the guest PC past a stop-MMIO
   store only on the next KVM_RUN entry, against whatever PC userspace has
   loaded by then - so any set-PC + re-run after a stop skipped its first
   instruction. Complete the MMIO eagerly with an immediate_exit dummy
   KVM_RUN at the stop and syscall-stop exits; KVM then commits the pending
   increment against the old PC. Exit PC now points past the stop store,
   matching amd64 semantics.

2. Stale stage-1 guest TLB. A host-side TTBR0_EL1 write invalidates nothing
   (unlike CR3 via KVM_SET_SREGS on x86), so after reset_to rebuilds the
   page tables the vCPU could keep translating through recycled bank pages.
   Today this is masked by the kernel's tlbi vmalle1is side effect of
   MADV_DONTNEED in banks.reset(), but don't rely on it: add a TLBI stub in
   the vectors page (TLB_FLUSH_ADDR) and run it for one VM entry after
   every page-table switch. The stub only clobbers PC/PSTATE/x9, so just
   those are saved and restored (~6 us per reset).
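
The deferred PC increment in item 1 can be pictured with a toy model. This is not KVM code: `ToyVcpu` and both helper functions are hypothetical, and the only behaviour modelled is that a pending MMIO completion is committed on the next entry against whatever PC is loaded at that moment. The `immediate_exit` flag stands in for the field of the same name in `struct kvm_run`.

```cpp
#include <cassert>
#include <cstdint>

// Toy vCPU: a stop-MMIO store leaves a pending "PC += 4" that is only
// committed on the next entry, against the PC loaded at that time.
struct ToyVcpu {
    uint64_t pc;
    bool mmio_pending;

    // The guest executed the stop store at the current PC and exited.
    void exit_on_stop_store() { mmio_pending = true; }

    // Entering commits the pending completion first. With immediate_exit
    // set, the entry returns before any guest instruction executes.
    void enter(bool immediate_exit) {
        if (mmio_pending) {
            pc += 4;
            mmio_pending = false;
        }
        if (immediate_exit) return;
        // (the guest would start executing at pc here)
    }
};

// Without the fix: the increment lands on the freshly loaded PC.
inline uint64_t first_pc_without_fix(uint64_t stop_pc, uint64_t new_pc) {
    ToyVcpu v{stop_pc, false};
    v.exit_on_stop_store();
    v.pc = new_pc;      // userspace loads a new PC
    v.enter(false);     // pending increment skips the first instruction
    return v.pc;
}

// With the fix: a dummy immediate_exit entry commits against the old PC.
inline uint64_t first_pc_with_fix(uint64_t stop_pc, uint64_t new_pc) {
    ToyVcpu v{stop_pc, false};
    v.exit_on_stop_store();
    v.enter(true);      // complete the MMIO eagerly
    v.pc = new_pc;
    v.enter(false);
    return v.pc;
}
```

In this model, re-running at 0x2000 after a stop at 0x1000 starts at 0x2004 without the eager completion and at 0x2000 with it.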

Adds a regression test that re-runs a fork across four reset_to cycles.
Fast reset measures ~33 us vs ~120 us for a fresh fork per iteration.

Implements the write-prefetch optimization (Option A). A warmup fork's
write working set, harvested via get_accessed_pages(), can now be
replayed into every subsequent fork or reset_to with prefetch_pages(),
which pre-CoWs each page with the same flags/dirty semantics as the
write-fault path. This eliminates the per-page write-fault VM exits:
~32 us reset + ~3.5 us prefetch + ~101 us run vs ~843 us unprefetched
on the mixed 256R/256W microbenchmark (~6x).

Block-sized entries in the harvested set are walked at page
granularity, splitting the block and CoWing each page beneath it.
Note that prefetching marks pages accessed, so re-harvesting from a
prefetched fork never shrinks the set; harvest once from a clean
warmup fork.

New unit test runs the full warmup -> harvest -> prefetch -> run
pipeline and asserts zero page faults on both the fresh-fork and
reset_to paths.
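
The warmup -> harvest -> prefetch -> run pipeline can be sketched with a toy copy-on-write model. `ToyFork` is hypothetical; it borrows the names get_accessed_pages() and prefetch_pages() from this commit, but the signatures and the set-based bookkeeping are assumptions made for illustration, not the real implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy fork: pages are shared with the master until the first write.
struct ToyFork {
    std::set<uint64_t> private_pages;  // pages already copied for this fork
    std::set<uint64_t> accessed;       // pages marked accessed in this fork
    unsigned write_faults = 0;

    // Copy one page; used by both the fault path and the prefetch path,
    // so prefetched pages end up marked accessed as well.
    void cow(uint64_t page) {
        private_pages.insert(page);
        accessed.insert(page);
    }

    // A guest write faults only if the page has not been copied yet.
    void guest_write(uint64_t page) {
        if (private_pages.count(page) == 0) {
            ++write_faults;
            cow(page);
        }
    }

    // Harvest the write working set.
    std::vector<uint64_t> get_accessed_pages() const {
        return std::vector<uint64_t>(accessed.begin(), accessed.end());
    }

    // Replay a harvested set so later writes take no faults.
    void prefetch_pages(const std::vector<uint64_t>& pages) {
        for (uint64_t page : pages) cow(page);
    }
};
```

Because the prefetch path marks pages accessed, harvesting from a prefetched fork returns at least the replayed set, which is why the set should be taken once from a clean warmup fork.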

The shared ELF loader, setup_linux and syscall table were already
arch-clean on ARM64; the missing piece was usermode. Guest RAM had no
EL0 access bits, so a loaded program could not execute at all -- and it
must run at EL0, not EL1, because an EL1 guest could rewrite its own
stage-1 tables and strip the read-only bits that protect the master
VM's memory under copy-on-write.

- Map guest RAM user-RWX (AP_USER, UXN clear, explicit PXN). A new L3
  table covers the first 2 MB: the vectors page is user-read-only so it
  stays EL1-executable (user-writable would force PXN), the page tables
  and vCPU table are EL1-only, and pages below the vectors are left
  unmapped to catch null dereferences. The MMIO trap block is
  user-accessible so EL0 guests reach the stop/syscall MMIO directly.
- SCTLR_EL1 gains SPAN (keep PAN off in the vectors, which store to the
  now user-accessible trap pages), DZE/UCT/UCI (EL0 dc zva, CTR_EL0 and
  cache maintenance for glibc string routines) and nTWI/nTWE.
- Default pstate is now EL0t (0x3c0) in setup_registers, setup_call and
  setup_clone; raw-guest demo/bench/tests move from EL1h to EL0.
- setup_linux masks HWCAP_SVE, HWCAP_CPUID and HWCAP2_SME: the vCPU is
  created without those features, and HWCAP_CPUID invites EL0
  ID-register reads (kernel-emulated on real Linux) that our vectors
  treat as fatal. These HWCAP bits are what glibc ifunc resolvers key on.
- Fix handle_exception reading ELR_EL1 as a sysreg; KVM exposes it as a
  core register, so the diagnostic path itself threw ENOENT and masked
  the real guest fault.
- paging_default_usermode_flags no longer requires UXN as a verify
  flag; user pages are executable now.
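
The permission reasoning in the list above can be sketched with the stage-1 descriptor bits from the Arm architecture (AP[1] at bit 6, AP[2] at bit 7, PXN at bit 53, UXN at bit 54). The constant and function names below are hypothetical, and the two flag sets only encode what the commit message states: guest RAM is user-accessible with explicit PXN and UXN clear, and the vectors page is user-read-only with PXN clear.

```cpp
#include <cassert>
#include <cstdint>

// Stage-1 page descriptor permission bits (AArch64 VMSA).
constexpr uint64_t PTE_AP_USER   = 1ULL << 6;   // AP[1]: accessible from EL0
constexpr uint64_t PTE_AP_RDONLY = 1ULL << 7;   // AP[2]: read-only
constexpr uint64_t PTE_PXN       = 1ULL << 53;  // privileged execute-never
constexpr uint64_t PTE_UXN       = 1ULL << 54;  // unprivileged execute-never

// Hypothetical flag sets following the description above.
constexpr uint64_t GUEST_RAM_FLAGS = PTE_AP_USER | PTE_PXN;        // user RWX
constexpr uint64_t VECTORS_FLAGS   = PTE_AP_USER | PTE_AP_RDONLY;  // user RO

constexpr bool el0_can_write(uint64_t pte) {
    return (pte & PTE_AP_USER) != 0 && (pte & PTE_AP_RDONLY) == 0;
}

// A page writable from EL0 is never executable at EL1, whatever PXN says.
constexpr bool el1_can_execute(uint64_t pte) {
    return (pte & PTE_PXN) == 0 && !el0_can_write(pte);
}

// PSTATE: D, A, I and F masked (bits 9..6) plus the mode field M[3:0].
constexpr uint64_t PSTATE_DAIF      = 0x3c0;
constexpr uint64_t PSTATE_MODE_EL0T = 0x0;
constexpr uint64_t PSTATE_MODE_EL1H = 0x5;
constexpr uint64_t DEFAULT_PSTATE   = PSTATE_DAIF | PSTATE_MODE_EL0T;
```

Under these definitions the vectors page stays EL1-executable only because it is read-only from EL0, which matches the reason given for not making it user-writable.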

New arm64_elf unit tests run real statically linked glibc binaries end
to end: argv/env passing, write() to the printer, heap allocation and
string routines, vmcall into an ELF function, and forked vmcalls
isolated by CoW across reset_to. Bench numbers are unchanged.

A minimal end-to-end smoke test for the ARM64 backend: loads a tiny
hand-assembled guest, runs it at EL0, and verifies the result via the
STOP MMIO address. Built only when TINYKVM_ARCH is ARM64.

Quantifies per-fork costs in the warm-fork-from-master model and
empirically settles whether read-access tracking could help fork
prefetch. Measures fork/reset, write-set prefetch, and page-fault
costs across read-only, write-only, mixed, and prefetched configs;
confirms read faults are structurally zero under CoW.

Load ld-linux-aarch64.so.1 as the machine binary with the real program
as argv, the same scheme the amd64 ELF tests use. Three fixes:

- Skip pre-relocation of ET_DYN binaries on ARM64. A glibc ET_DYN
  entered at its own entry point (ld.so, static-PIE) self-relocates,
  and modern aarch64 ld.so carries DT_RELR. RELR entries are
  "*addr += base" -- not idempotent -- so pre-applying them here
  double-relocated ld.so's init_array/cpu_list pointers and crashed
  the guest. RELATIVE rela entries are absolute writes the guest
  redoes anyway. amd64 behavior is unchanged.

- Fix the signal-table off-by-one in the arm64 Signals::get stub
  (at(sig) -> at(sig-1), matching the amd64 implementation), and make
  rt_sigaction return -EINVAL for signals above 64 instead of letting
  the array bounds check throw out of the host: CPython sweeps
  signals 1..65 (glibc _NSIG) at startup.
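
The RELR point in the first fix above fits in a few lines. The sketch treats an image as a vector of 64-bit words and applies each relocation style twice, as happens when the host pre-relocates and the guest's ld.so then relocates itself. The function names are hypothetical and the RELR bitmap encoding is left out; only the "add the base" versus "overwrite" behaviour is modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// RELR-style relocation: add the load base to the word already in memory.
inline void apply_relr(std::vector<uint64_t>& image, size_t index,
                       uint64_t base) {
    image[index] += base;
}

// RELA-style RELATIVE relocation: overwrite the word with base + addend.
inline void apply_rela_relative(std::vector<uint64_t>& image, size_t index,
                                uint64_t base, uint64_t addend) {
    image[index] = base + addend;
}

// Apply RELR twice, as with host pre-relocation plus guest self-relocation.
inline uint64_t relr_applied_twice(uint64_t link_value, uint64_t base) {
    std::vector<uint64_t> image{link_value};
    apply_relr(image, 0, base);
    apply_relr(image, 0, base);
    return image[0];
}

// Apply a RELATIVE rela entry twice; the second write repeats the first.
inline uint64_t rela_applied_twice(uint64_t addend, uint64_t base) {
    std::vector<uint64_t> image{0};
    apply_rela_relative(image, 0, base, addend);
    apply_rela_relative(image, 0, base, addend);
    return image[0];
}
```

With a link-time value of 0x1000 and a base of 0x400000, the doubled RELR pass yields 0x801000 instead of 0x401000, while the doubled RELATIVE pass still yields 0x401000.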

New tests in arm64_elf.cpp: a PIE dynamic guest, a non-PIE dynamic
guest (this gcc's default; needs heap_address_hint above the fixed
0x400000 link address so ld.so's MAP_FIXED doesn't collide with the
mmap arena), and a real python3 -c guest run end-to-end. The test
codebuilder gains a dynamic (non -static) build mode.

mmap_backed_files failed on ARM64 for two physical-address reasons:

- MMAP_PHYS_BASE (256GB) exceeds the 36-bit stage-2 IPA that
  Apple-Silicon KVM hosts support, so KVM_SET_USER_MEMORY_REGION
  rejected the slot outright. The base is now 32GB on ARM64, mirroring
  how the bank arena was already moved down for the same reason.

- TCR_EL1.IPS was left at its reset value (32-bit / 4GB physical), so
  even an installable slot would have faulted every stage-1 walk that
  produced an output above 4GB. It worked until now only because all
  guest-physical memory (RAM, banks at 2GB) sat below 4GB. IPS is now
  programmed from KVM_CAP_ARM_VM_IPA_SIZE, captured at Machine::init.
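
Both address-size limits above come down to simple arithmetic. The sketch below checks whether a guest-physical base fits in a given IPA width and maps a physical address width to the TCR_EL1.IPS encoding (bits [34:32], where 0 means 32 bits and 1 means 36 bits). The helper names are hypothetical; in the real code the width would come from KVM_CAP_ARM_VM_IPA_SIZE rather than a literal.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t GiB = 1ULL << 30;

// True if a guest-physical address is addressable with ipa_bits of IPA.
constexpr bool fits_ipa(uint64_t guest_phys, unsigned ipa_bits) {
    return guest_phys < (1ULL << ipa_bits);
}

// TCR_EL1.IPS encoding for a physical address width in bits.
constexpr unsigned tcr_ips_field(unsigned pa_bits) {
    return pa_bits <= 32 ? 0u
         : pa_bits <= 36 ? 1u
         : pa_bits <= 40 ? 2u
         : pa_bits <= 42 ? 3u
         : pa_bits <= 44 ? 4u
         : pa_bits <= 48 ? 5u
         : 6u;
}

constexpr unsigned TCR_IPS_SHIFT = 32;

// Replace the IPS field of a TCR_EL1 value, leaving other bits untouched.
constexpr uint64_t tcr_with_ips(uint64_t tcr, unsigned pa_bits) {
    return (tcr & ~(7ULL << TCR_IPS_SHIFT)) |
           (static_cast<uint64_t>(tcr_ips_field(pa_bits)) << TCR_IPS_SHIFT);
}
```

A 256GB base is 2^38 and falls outside a 36-bit IPA space (64GB), while 32GB is 2^35 and fits; an IPS field of 0 caps stage-1 outputs at 4GB, which is the second failure described above.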

The Python guest test now runs with mmap_backed_files enabled and
asserts that libpython (5.9MB, above the 4MB threshold) was served by
a file-backed memory region rather than the preadv fallback.

The arm64 tgkill handler forwarded every nonzero signal to
Signals::enter, which is a not-implemented stub, so a guest calling
abort() (raise -> tgkill) threw "Guest signals are not implemented on
ARM64" out of machine.run() -- an infrastructure failure mode for an
ordinary guest crash.

Match the kernel's default dispositions instead: signal 0 is an
existence probe, default-ignored signals (SIGCHLD, SIGCONT, SIGURG,
SIGWINCH) are dropped, and everything else stops the VM with the
conventional 128+sig exit status, readable via return_value().
Handler entry remains unimplemented (and is now unreachable from
tgkill); guests with registered handlers also terminate, which is the
intended sandbox policy until a real workload demands delivery.
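
A minimal sketch of the disposition policy described above, assuming a Linux host for the signal numbers. `Action`, `Disposition` and `tgkill_disposition` are hypothetical names; the real handler would stop the machine and set its return value rather than return a struct.

```cpp
#include <cassert>
#include <signal.h>

// What the tgkill handler does with a given signal number.
enum class Action { Probe, Ignore, Stop };

struct Disposition {
    Action action;
    int exit_status;  // only meaningful when action == Action::Stop
};

// Mirror the default dispositions listed in the commit message.
inline Disposition tgkill_disposition(int sig) {
    if (sig == 0) {
        return {Action::Probe, 0};  // existence probe, nothing delivered
    }
    switch (sig) {
    case SIGCHLD:
    case SIGCONT:
    case SIGURG:
    case SIGWINCH:
        return {Action::Ignore, 0};      // default-ignored: drop silently
    default:
        return {Action::Stop, 128 + sig};  // conventional shell exit status
    }
}
```

Under this policy a guest calling abort() stops the VM with status 128 + SIGABRT, which is 134 on Linux.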

Adding real pthread test coverage on ARM64 (tests/unit/arm64_threads.cpp:
create/join + shared memory, and a 4-thread mutex-contended counter)
surfaced three bugs the threading engine had been hiding:

- vcpu.cpp: get/set_arm64_regs read SP_EL1 (vector scratch) instead of
  SP_EL0 when parked at EL1h, so cloned threads ran on the parent's
  stack. Read ELR_EL1 for the user PC and SP_EL0 for the user SP at EL1h.
- system_calls.cpp: prlimit64 had new/old limit pointers swapped, leaving
  the caller's RLIMIT_STACK buffer uninitialised; glibc then tried to mmap
  a multi-GB thread stack.
- machine_utils.cpp: Machine::memzero hardcoded amd64's dirty bit (bit 6),
  which is AP[1] on arm64, so a large PROT_NONE mmap reservation walked
  off the end of guest RAM. Added paging_dirty_bit() per arch.

create/join and mutex-contended counters now pass end to end. A condvar
producer/consumer test is present but tagged [!shouldfail] (futex wake is not
address-aware yet). TODO.md updated with port state and the remaining gaps.

The cooperative thread scheduler resumed m_suspended.front() on every
FUTEX_WAKE regardless of which address a thread waited on, so a condvar
signal could land on the wrong waiter and both threads would park
(consumer in cond_wait, main in join). The condvar producer/consumer
test deadlocked.

Make wakeups address-aware: each Thread tracks the futex address it
blocked on (futex_addr; 0 = runnable). FUTEX_WAIT blocks on its address
and the scheduler only resumes runnable threads; FUTEX_WAKE marks up to
`val` address-matching waiters runnable and hands control to the first
of them via the new MultiThreading::switch_to(). Handing off matters
because the scheduler never preempts: a producer that did not yield
would run to completion before a signalled consumer ran once. Thread
exit now also wakes the clear_tid/join futex so pthread_join resumes.

next_runnable() is a peek and switch_to() suspends before removing the
target, so the scheduler stays consistent if push_back throws.
set_to_and_suspend_others() resets futex_addr so a snapshot-restored
thread starts runnable and re-checks its predicate on resume.
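
The difference between the old and new wake behaviour can be sketched with a toy scheduler. `ToyThread` and `ToyScheduler` are hypothetical and much simpler than the real MultiThreading class: there is no context switching, only the bookkeeping that decides which waiter a FUTEX_WAKE selects. As in the commit, a futex_addr of 0 means runnable.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct ToyThread {
    int tid;
    uint64_t futex_addr;  // address the thread is blocked on; 0 = runnable
};

struct ToyScheduler {
    std::deque<ToyThread> suspended;

    // FUTEX_WAIT: park the thread on the given address.
    void futex_wait(int tid, uint64_t addr) {
        suspended.push_back(ToyThread{tid, addr});
    }

    // Old behaviour: resume whoever is at the front, whatever it waits on.
    int wake_front() const {
        return suspended.empty() ? -1 : suspended.front().tid;
    }

    // New behaviour: mark up to `val` waiters on `addr` runnable and return
    // the first of them as the hand-off target, or -1 if none matched.
    int futex_wake(uint64_t addr, int val) {
        int first = -1;
        for (ToyThread& t : suspended) {
            if (val <= 0) break;
            if (t.futex_addr == addr) {
                t.futex_addr = 0;
                if (first < 0) first = t.tid;
                --val;
            }
        }
        return first;
    }
};
```

With thread 1 parked on one address and thread 2 on another, a wake on the second address selects thread 2 here, whereas the front-of-queue policy would have resumed thread 1.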

Mirrored in the shared amd64 scheduler (linux/threads.cpp). The condvar
test is no longer tagged [!shouldfail]; arm64_minimal (52), arm64_elf
(17) and arm64_threads (3) all pass.

perbu marked this pull request as ready for review June 13, 2026 12:57