-
Notifications
You must be signed in to change notification settings - Fork 74
libmicrokit: add API for resuming x86 VCPU #431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -649,6 +649,43 @@ To find the full list of possible faults that could occur and details regarding | |||||
| kind of fault, please see the 'Faults' section of the | ||||||
| [seL4 reference manual](https://sel4.systems/Info/Docs/seL4-manual-latest.pdf). | ||||||
|
|
||||||
| ### x86 VCPU fault | ||||||
| Please see the 'VMX BASIC EXIT REASONS' section of the | ||||||
| [Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4] | ||||||
| (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html) for a list of possible VM Exit reasons. | ||||||
|
|
||||||
| These message registers contain data relating to the VM Exit: | ||||||
| - `SEL4_VMENTER_CALL_EIP_MR`: Instruction Pointer, | ||||||
| - `SEL4_VMENTER_CALL_CONTROL_PPC_MR`: Primary Processor Based VM Execution Controls, | ||||||
| - `SEL4_VMENTER_CALL_INTERRUPT_INFO_MR`: VM Entry Interruption-Information`, | ||||||
| - `SEL4_VMENTER_FAULT_REASON_MR`: VM Exit reason, | ||||||
| - `SEL4_VMENTER_FAULT_QUALIFICATION_MR`: VM Exit qualification, | ||||||
| - `SEL4_VMENTER_FAULT_INSTRUCTION_LEN_MR`: Length of instruction that caused the VM Exit, | ||||||
| - `SEL4_VMENTER_FAULT_GUEST_PHYSICAL_MR`: Guest Physical Address of the VM Exit, | ||||||
| - `SEL4_VMENTER_FAULT_RFLAGS_MR`: Guest FLAGS register, | ||||||
| - `SEL4_VMENTER_FAULT_GUEST_INT_MR`: Guest interruptability, | ||||||
| - `SEL4_VMENTER_FAULT_CR3_MR`: Guest CR3. | ||||||
|
|
||||||
| Some of these message registers may not contain valid data depending on the VM Exit reason, | ||||||
| please consult the Intel SDM for more details. | ||||||
|
|
||||||
| These message registers contain the guest general purpose registers at the time of VM Exit: | ||||||
| - `SEL4_VMENTER_FAULT_EAX` | ||||||
| - `SEL4_VMENTER_FAULT_EBX` | ||||||
| - `SEL4_VMENTER_FAULT_ECX` | ||||||
| - `SEL4_VMENTER_FAULT_EDX` | ||||||
| - `SEL4_VMENTER_FAULT_ESI` | ||||||
| - `SEL4_VMENTER_FAULT_EDI` | ||||||
| - `SEL4_VMENTER_FAULT_EBP` | ||||||
| - `SEL4_VMENTER_FAULT_R8` | ||||||
| - `SEL4_VMENTER_FAULT_R9` | ||||||
| - `SEL4_VMENTER_FAULT_R10` | ||||||
| - `SEL4_VMENTER_FAULT_R11` | ||||||
| - `SEL4_VMENTER_FAULT_R12` | ||||||
| - `SEL4_VMENTER_FAULT_R13` | ||||||
| - `SEL4_VMENTER_FAULT_R14` | ||||||
| - `SEL4_VMENTER_FAULT_R15` | ||||||
|
|
||||||
| ## `microkit_msginfo microkit_ppcall(microkit_channel ch, microkit_msginfo msginfo)` | ||||||
|
|
||||||
| Performs a call to a protected procedure in a different PD. | ||||||
|
|
@@ -826,6 +863,20 @@ virtual CPU with ID `vcpu`. | |||||
| Write the registers of a given virtual CPU with ID `vcpu`. The `regs` argument is the pointer to | ||||||
| the struct of registers `seL4_VCPUContext` that are written from. | ||||||
|
|
||||||
| ## `void microkit_vcpu_x86_on(void)` | ||||||
|
|
||||||
| Allow the PD to switch to guest execution mode with the bound vCPU every time `init()`, `notified()` or `fault()` return. | ||||||
|
|
||||||
| The caller is responsible for initialising the VCPU's instruction pointer, Primary Processor-Based VM-Execution Controls, | ||||||
| VM-Entry Interruption-Information Field and other architectural states to the corresponding VMCS fields via | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
This sentence doesn't really make sense. "responsible for initialising the other architectural states to the corresponding VMCS fields". Why not just say "The caller should initialise the VMCS fields per (the SDM/seL4))". Do they all initialise to zero by default? If so, should mention that. |
||||||
| `microkit_vcpu_x86_write_vmcs()`. For more details, please see the 'Virtualisation' section of the | ||||||
| [seL4 reference manual](https://sel4.systems/Info/Docs/seL4-manual-latest.pdf). For architectural details, | ||||||
| please consult the [Intel SDM](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html). | ||||||
|
|
||||||
| ## `void microkit_vcpu_x86_off(void)` | ||||||
|
|
||||||
| Stop the PD from switching to guest execution mode when `init()`, `notified()` or `fault()` return. | ||||||
|
|
||||||
| ## `seL4_CPtr microkit_cspace_root_slot_to_cptr(seL4_Word slot)` {#libmicrokit_cspace_root_slot_to_cptr} | ||||||
|
|
||||||
| Converts the slot identifier of the `<cspace>`'s capability element into an | ||||||
|
|
@@ -948,6 +999,7 @@ The `virtual_machine` element has the following attributes: | |||||
| Additionally, it supports the following child elements: | ||||||
|
|
||||||
| * `vcpu`: (one or more) Describes the virtual CPU that will be tied to the virtual machine. | ||||||
| * On x86-64, there is a limit of one VCPU per PD. | ||||||
| * `map`: (zero or more) Describes mapping of memory regions into the virtual machine. | ||||||
|
|
||||||
| The `vcpu` element has the following attributes: | ||||||
|
|
@@ -1027,6 +1079,7 @@ The `end` element has the following attributes: | |||||
| * `id`: Channel identifier in the context of the named protection domain. Must be at least 0 and less than 63. | ||||||
| * `pp`: (optional) Indicates that the protection domain for this end can perform a protected procedure call to the other end; defaults to false. | ||||||
| Protected procedure calls can only be to PDs of strictly higher priority. | ||||||
| On x86-64, PDs with virtual machines cannot receive protected procedure calls. | ||||||
|
dreamliner787-9 marked this conversation as resolved.
|
||||||
| * `notify`: (optional) Indicates that the protection domain for this end can send a notification to the other end; defaults to true. | ||||||
| * `setvar_id`: (optional) Specifies a symbol in the program image. This symbol will be rewritten with the channel identifier. | ||||||
|
|
||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -10,6 +10,9 @@ | |
| #pragma once | ||
|
|
||
| #include <sel4/sel4.h> | ||
| #ifdef CONFIG_VTX | ||
| #include <sel4/arch/vmenter.h> | ||
| #endif /* CONFIG_VTX */ | ||
|
|
||
| typedef unsigned int microkit_channel; | ||
| typedef unsigned int microkit_child; | ||
|
|
@@ -47,6 +50,15 @@ extern char microkit_name[MICROKIT_PD_NAME_LENGTH]; | |
| extern seL4_Bool microkit_have_signal; | ||
| extern seL4_CPtr microkit_signal_cap; | ||
| extern seL4_MessageInfo_t microkit_signal_msg; | ||
| #if defined(CONFIG_VTX) | ||
| struct microkit_x86_vcpu_state { | ||
| seL4_Bool do_resume; | ||
| seL4_Word rip; | ||
| seL4_Word prim_proc_ctl; | ||
| seL4_Word irq_info; | ||
| }; | ||
| extern struct microkit_x86_vcpu_state microkit_x86_vcpu_state; | ||
| #endif | ||
|
|
||
| /* Symbols for error checking libmicrokit API calls. Patched by the Microkit tool | ||
| * to set bits corresponding to valid channels for this PD. */ | ||
|
|
@@ -190,13 +202,8 @@ static inline void microkit_vcpu_restart(microkit_child vcpu, seL4_Word entry_po | |
| { | ||
| seL4_Error err; | ||
| seL4_UserContext ctxt = {0}; | ||
| #if defined(CONFIG_ARCH_AARCH64) | ||
| ctxt.pc = entry_point; | ||
| #elif defined(CONFIG_ARCH_X86_64) | ||
| ctxt.rip = entry_point; | ||
| #else | ||
| #error "unknown architecture for 'microkit_vcpu_restart'" | ||
| #endif | ||
|
|
||
|
Indanz marked this conversation as resolved.
|
||
| err = seL4_TCB_WriteRegisters( | ||
| BASE_VM_TCB_CAP + vcpu, | ||
| seL4_True, | ||
|
|
@@ -220,9 +227,7 @@ static inline void microkit_vcpu_stop(microkit_child vcpu) | |
| microkit_internal_crash(err); | ||
| } | ||
| } | ||
| #endif | ||
|
|
||
| #if defined(CONFIG_ARM_HYPERVISOR_SUPPORT) | ||
| static inline void microkit_vcpu_arm_inject_irq(microkit_child vcpu, seL4_Uint16 irq, seL4_Uint8 priority, | ||
| seL4_Uint8 group, seL4_Uint8 index) | ||
| { | ||
|
|
@@ -265,7 +270,7 @@ static inline void microkit_vcpu_arm_write_reg(microkit_child vcpu, seL4_Word re | |
| microkit_internal_crash(err); | ||
| } | ||
| } | ||
| #endif | ||
| #endif /* CONFIG_ARM_HYPERVISOR_SUPPORT */ | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These changes to the arm stuff and the ifdefs; these should be part of a separate commit. Feel free to merge the manual changes and API changes and rust changes into one commit (they are related, and standing alone doesn't really make snse), but these can be standalone. |
||
|
|
||
| #if defined(CONFIG_ALLOW_SMC_CALLS) | ||
| static inline void microkit_arm_smc_call(seL4_ARM_SMCContext *args, seL4_ARM_SMCContext *response) | ||
|
|
@@ -277,7 +282,7 @@ static inline void microkit_arm_smc_call(seL4_ARM_SMCContext *args, seL4_ARM_SMC | |
| microkit_internal_crash(err); | ||
| } | ||
| } | ||
| #endif | ||
| #endif /* CONFIG_ALLOW_SMC_CALLS */ | ||
|
|
||
| #if defined(CONFIG_ARCH_X86_64) | ||
| static inline void microkit_x86_ioport_write_8(microkit_ioport ioport_id, seL4_Word port_addr, seL4_Word data) | ||
|
|
@@ -393,28 +398,85 @@ static inline seL4_Uint32 microkit_x86_ioport_read_32(microkit_ioport ioport_id, | |
|
|
||
| return ret.result; | ||
| } | ||
| #endif | ||
|
|
||
| #if defined(CONFIG_ARCH_X86_64) && defined(CONFIG_VTX) | ||
| #if defined(CONFIG_VTX) | ||
| /* Architecturally defined identifiers for a x86 VCPU's VMCS fields, | ||
| * see seL4 source: `include/arch/x86/arch/object/vcpu.h` | ||
| * or Intel® 64 and IA-32 Architectures Software Developer’s Manual | ||
| * Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4 | ||
| * Order Number: 325462-084US June 2024 | ||
| * "Table B-8. Encodings for 32-Bit Control Fields (0100_00xx_xxxx_xxx0B)" and | ||
| * "Table B-14. Encodings for Natural-Width Guest-State Fields (0110_10xx_xxxx_xxx0B) (Contd.)". | ||
| * */ | ||
| #define VMX_GUEST_RIP 0x0000681E | ||
| #define VMX_CONTROL_PRIMARY_PROCESSOR_CONTROLS 0x00004002 | ||
| #define VMX_CONTROL_ENTRY_INTERRUPTION_INFO 0x00004016 | ||
|
|
||
| /* Serve the guest's: | ||
| * - RIP, | ||
| * - Primary Processor Based VM Execution Controls, and | ||
| * - VM Entry Interruption-Information | ||
| * from local variables that were written to by `microkit_vcpu_x86_write_vmcs()`. | ||
| * The kernel will service reading from other fields. | ||
|
Comment on lines
+415
to
+420
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "Serve" is a bit of a weird verb here. I'd say "Read". Also, I'd start with "Read a guest VMCS field" as the primary thing, then mention just a comment that certain values are served from local variables? (That is after all, an implementation detail.) |
||
| */ | ||
| static inline seL4_Word microkit_vcpu_x86_read_vmcs(microkit_child vcpu, seL4_Word field) | ||
| { | ||
| seL4_X86_VCPU_ReadVMCS_t ret; | ||
| ret = seL4_X86_VCPU_ReadVMCS(BASE_VCPU_CAP + vcpu, field); | ||
| if (ret.error != seL4_NoError) { | ||
| microkit_dbg_puts("microkit_x86_read_vmcs: error reading data\n"); | ||
| microkit_internal_crash(ret.error); | ||
| } | ||
|
|
||
| return ret.value; | ||
| } | ||
|
|
||
| seL4_Word value; | ||
|
|
||
| /* Assumes that a PD would only have access to 1 VCPU object. */ | ||
| switch (field) { | ||
| case VMX_GUEST_RIP: | ||
| value = microkit_x86_vcpu_state.rip; | ||
| break; | ||
| case VMX_CONTROL_PRIMARY_PROCESSOR_CONTROLS: | ||
| value = microkit_x86_vcpu_state.prim_proc_ctl; | ||
| break; | ||
| case VMX_CONTROL_ENTRY_INTERRUPTION_INFO: | ||
| value = microkit_x86_vcpu_state.irq_info; | ||
| break; | ||
| default: { | ||
| seL4_X86_VCPU_ReadVMCS_t ret = seL4_X86_VCPU_ReadVMCS(BASE_VCPU_CAP + vcpu, field); | ||
| if (ret.error != seL4_NoError) { | ||
| microkit_dbg_puts("microkit_x86_read_vmcs: error reading data\n"); | ||
| microkit_internal_crash(ret.error); | ||
|
Indanz marked this conversation as resolved.
|
||
| } | ||
| value = ret.value; | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| return value; | ||
| } | ||
|
|
||
| /* These fields will be written to local variables rather than the actual VMCS: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ditto here. This is mostly talking about implementation details, which should belong inside the function. Describe that it writes a VMCS field. Reading "microkit.h" it's somewhat obvious that the comment immediately above about the reference in seL4/the intel SDM, but if you use an IDE or we ever added doxygen compiling them the comments about what is "field" would be lost, so I'd recommend adding it to these two function's docs. |
||
| * - RIP, | ||
| * - Primary Processor Based VM Execution Controls, and | ||
| * - VM Entry Interruption-Information. | ||
| * Because they will be passed to the kernel on `seL4_VMEnter()`, | ||
| * then the kernel will write them to the VMCS for us. | ||
| * Writing other VMCS fields will go to the real VMCS immeidately via a syscall. | ||
| */ | ||
| static inline void microkit_vcpu_x86_write_vmcs(microkit_child vcpu, seL4_Word field, seL4_Word value) | ||
| { | ||
| seL4_X86_VCPU_WriteVMCS_t ret; | ||
| ret = seL4_X86_VCPU_WriteVMCS(BASE_VCPU_CAP + vcpu, field, value); | ||
| if (ret.error != seL4_NoError) { | ||
| microkit_dbg_puts("microkit_x86_write_vmcs: error writing data\n"); | ||
| microkit_internal_crash(ret.error); | ||
| /* Assumes that a PD would only have access to 1 VCPU object. */ | ||
| switch (field) { | ||
| case VMX_GUEST_RIP: | ||
| microkit_x86_vcpu_state.rip = value; | ||
| break; | ||
| case VMX_CONTROL_PRIMARY_PROCESSOR_CONTROLS: | ||
| microkit_x86_vcpu_state.prim_proc_ctl = value; | ||
| break; | ||
| case VMX_CONTROL_ENTRY_INTERRUPTION_INFO: | ||
| microkit_x86_vcpu_state.irq_info = value; | ||
| break; | ||
| default: { | ||
| seL4_X86_VCPU_WriteVMCS_t ret = seL4_X86_VCPU_WriteVMCS(BASE_VCPU_CAP + vcpu, field, value); | ||
| if (ret.error != seL4_NoError) { | ||
| microkit_dbg_puts("microkit_x86_write_vmcs: error writing data\n"); | ||
| microkit_internal_crash(ret.error); | ||
| } | ||
|
Comment on lines
+474
to
+477
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No. This clearly can fail if the user makes a mistake (such as an invalid field, or if seL4 doesn't allow writing to that field), so it should actually return seL4_Error, not void. |
||
| break; | ||
| } | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -479,7 +541,20 @@ static inline void microkit_vcpu_x86_write_regs(microkit_child vcpu, seL4_VCPUCo | |
| } | ||
| } | ||
|
|
||
| #endif | ||
| static inline void microkit_vcpu_x86_on(void) | ||
| { | ||
| /* On x86, a TCB can only have one bound VCPU at any given time. | ||
| * So we don't take a `microkit_child vcpu` here. */ | ||
|
Comment on lines
+546
to
+547
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't that mean that you want to take that argument away from
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's a good point, the reason why I haven't done that is because I don't know how VM with multiple VCPUs will look like yet, because there are multiple approaches you can take. You could allow one VMM PD access to multiple VCPU caps to configure them and/or handle faults, i.e. you will have an implicit "big VMM lock". This is currently how ARM VM are modelled in Microkit. For x86, we will need additional trampoline threads that resume the VCPUs. The downside is that you have more context switching, but you would be able to replicate the ARM model. In this case we would need to pass Or, you could adopt a "multi VMM" approach where each PD only have access 1 VCPU. Then the user is responsible for creating 1 PD per VCPU, handle IPIs with explicit channels between the PDs and synchronise access to any global VMM state themselves. This model would incur less context switching and there isn't a "big lock" like the previous one. But there are more complications: you would need to ensure that all the VMM PDs have access to the same resources (memory regions, interrupts, etc) for fault handling, and channels between the VMM PDs for every possible combinations of IPIs ( Arguably the "multi VMM" approach is already possible on both ARM and x86 in Microkit. But at this point I'm not sure whether supporting the "big VMM lock" model on x86 is appropriate so I'm keeping the option open. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What you say argues against dropping the parameter for microkit_vcpu_x86_on/off too though. I strongly advise against some trampoline construction that shuffles all VCPU handling to one VMM task: The reason to have multiple VCPUs is to use multiple cores, so they all will be on different cores. If you handle all VCPUs with one VMM task, all operations become cross-core with huge overhead: First the fault/trap needs to be propagated to the core the VMM is on, and then any reply needs to be funnelled back. The added value of handling multiple, independent virtual machines with one VMM is approximately zero. This means that also on Arm, you do not want to use the model with one central VMM handling multiple VCPUs, even though it's simpler.
Why? A VMM doesn't need access to the virtual machine's memory, interrupts etc.
Why do you need IPIs? Why would one VMM care about what others do? If you need some kind of central control, then have a global VMM manager which communicates with all the VMMs handling VCPUs. Then you need only n channels and have only one decision point.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I agree on the cross-core overhead. I wouldn't recommend it for anything production grade. The reason I brought it up is outlined below.
If you mean handling multiple independent VMs, I agree. However, if you mean handling multiple VCPUs of a single VM with one VMM, I do see value in it from a prototyping perspective. We currently have the single-VMM + multi-VCPU model working on our ARM VMM, and it was a non-invasive evolution from the single-VCPU code. While x86 would require a trampoline approach, some Microkit changes for multiple VCPUs in one PD, and incur a performance hit. It offers a much simpler development path to a working prototype. It allows the user to get a basic multi-core VM booting first, and then mature their implementation into a performant multi-VMM design later. That said, I see the argument for keeping the x86 VCPU API aligned with the current kernel implementation. If you feel the prototyping benefit isn't worth accommodating, I'm happy to remove the
On x86, the VMM needs access to the VM's memory primarily for fault handling. For example, consider an EPT VM Exit when the guest accesses an emulated I/O APIC. The architecture tells you the faulting guest physical address and whether it was a read or a write, but unlike ARM, it does not provide the read/write width or the target CPU registers. To emulate the operation, the VMM must take the instruction pointer, walk the guest page tables, fetch the faulting instruction from guest RAM, and decode it. If multiple VMMs are managing different VCPUs for a single VM, all of them need access to guest RAM to do this. Additionally, if the VMM is emulating a storage or network device (e.g., VirtIO), it naturally needs access to guest RAM to process DMA requests. Regarding interrupts, this is necessary for pass-through devices. Since mainline seL4 does not yet support posted interrupts, the physical interrupt must be routed to the VMM so the injection process can be emulated in software.
The guest OS will send IPIs between any pair of VCPUs. If a single VM is split across multiple VMMs, they must coordinate to inject these IPIs into the target VCPUs. Connecting the VMMs with an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is Microkit production grade or a prototype toolkit? That's the choice that you need to make, and be very clear about what your design goals are.
Wonderful, that explains where virtio came from.
That's why I hate virtio, because it forces write access.
The seL4 manual doesn't explain how IRQ injection works on x86, but I assume it's by manipulating VMCS and the return message for
Fair enough, that's a good reason.
You don't need full-blown channels for just IPIs though, you could limit it to one shared memory page and But anyway, IPIs are just passing info through, it's pretty stateless. I was talking more about VMMs needing to communicate between them for internal reasons. |
||
| microkit_x86_vcpu_state.do_resume = seL4_True; | ||
| } | ||
|
|
||
| static inline void microkit_vcpu_x86_off(void) | ||
| { | ||
| microkit_x86_vcpu_state.do_resume = seL4_False; | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Might as well call this "is_on" given that's what the API is called. Also, we can use |
||
| } | ||
|
|
||
| #endif /* CONFIG_VTX */ | ||
| #endif /* CONFIG_ARCH_X86_64 */ | ||
|
|
||
| static inline void microkit_deferred_notify(microkit_channel ch) | ||
| { | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts?
(and similar for vcpu off).
I don't think we need to discuss implementation details in the documentation, only what it means.