-
Notifications
You must be signed in to change notification settings - Fork 2k
[SH] add userfault support #5261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/secret-hiding
Are you sure you want to change the base?
[SH] add userfault support #5261
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/secret-hiding #5261 +/- ##
=========================================================
- Coverage 82.52% 81.67% -0.85%
=========================================================
Files 250 250
Lines 27386 27792 +406
=========================================================
+ Hits 22599 22698 +99
- Misses 4787 5094 +307
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
286efbe
to
4e10e54
Compare
b6185cb
to
60abeb9
Compare
d5e7aa8
to
40101cd
Compare
This is needed because if guest_memfd is used to back guest memory, vCPU fault notifications are delivered via the UFFD UDS socket. Signed-off-by: Nikita Kalyazin <[email protected]>
82f3312
to
a242c6f
Compare
Ok((0, 3)) => { | ||
self.secret_free = true; | ||
} | ||
Ok((n, _)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n
is the number of bytes read, so this should be (_, n)
instead
Err(e) => { | ||
if e.errno() == libc::EAGAIN { | ||
return None; | ||
} | ||
panic!("Read error: {}", e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Err(e) => { | |
if e.errno() == libc::EAGAIN { | |
return None; | |
} | |
panic!("Read error: {}", e); | |
} | |
Err(e) if e.errno() == libc::EAGAIN => return None, | |
Err(e) => panic!() |
} | ||
} | ||
|
||
fn recv_json<T: serde::de::DeserializeOwned + serde::Serialize>(&mut self) -> Option<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think you can drop +serde::Serialize
here? compiled without it for me at least
match self.recv_mappings() { | ||
Some(mappings) => Some(UffdMsgFromFirecracker::Mappings(mappings)), | ||
None => None, // EOF or error | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: .map()
here and below
if bytes_read > 0 { | ||
self.current_pos += bytes_read; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for if
match stream.next() { | ||
Some(Ok(value)) => { | ||
let consumed = stream.byte_offset(); | ||
self.buffer.copy_within(consumed..self.current_pos, 0); | ||
self.current_pos -= consumed; | ||
Some(value) | ||
} | ||
Some(Err(e)) => panic!( | ||
"Failed to deserialize JSON message: {}. Error: {}", | ||
String::from_utf8_lossy(&self.buffer[..self.current_pos]), | ||
e | ||
), | ||
None => None, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match stream.next() { | |
Some(Ok(value)) => { | |
let consumed = stream.byte_offset(); | |
self.buffer.copy_within(consumed..self.current_pos, 0); | |
self.current_pos -= consumed; | |
Some(value) | |
} | |
Some(Err(e)) => panic!( | |
"Failed to deserialize JSON message: {}. Error: {}", | |
String::from_utf8_lossy(&self.buffer[..self.current_pos]), | |
e | |
), | |
None => None, | |
} | |
match stream.next()? { | |
Ok(value) => { | |
let consumed = stream.byte_offset(); | |
self.buffer.copy_within(consumed..self.current_pos, 0); | |
self.current_pos -= consumed; | |
Some(value) | |
} | |
Err(e) => panic!( | |
"Failed to deserialize JSON message: {}. Error: {}", | |
String::from_utf8_lossy(&self.buffer[..self.current_pos]), | |
e | |
) | |
} |
UffdMsgFromFirecracker::Mappings(mappings) => { | ||
let (guest_memfd, userfault_bitmap_memfd) = | ||
if uffd_msg_iter.secret_free { | ||
( | ||
Some(unsafe { | ||
File::from_raw_fd(uffd_msg_iter.fds[1]) | ||
}), | ||
Some(unsafe { | ||
File::from_raw_fd(uffd_msg_iter.fds[2]) | ||
}), | ||
) | ||
} else { | ||
(None, None) | ||
}; | ||
|
||
let uffd_handler = UffdHandler::from_mappings( | ||
mappings, | ||
unsafe { File::from_raw_fd(uffd_msg_iter.fds[0]) }, | ||
guest_memfd, | ||
userfault_bitmap_memfd, | ||
self.backing_memory, | ||
self.backing_memory_size, | ||
); | ||
|
||
let fd = uffd_handler.uffd.as_raw_fd(); | ||
if uffd_handler.guest_memfd.is_some() { | ||
self.main_handler_fd = Some(fd); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this entire mappings retrival logic out of the event loop and back into the uffd handler construction path. Since we now read the FDs from the socket without reading any json, the iterator doesn't really need to know about fds at all. They can just be read before we even construct the iterator. As for the mappings, we can set <UffdMsgIterator as Iterator>::Item
to be FaultRequest
, and simply directly call recv_mappings()
from the uffd handler construction code
It is used by Secret-Free-enabled UFFD handlers to disable vCPU fault notifications from the kernel. Signed-off-by: Nikita Kalyazin <[email protected]>
Accept receiving 3 fds instead of 1, where fds[1] is guest_memfd and fds[2] is userfault bitmap memfd. Also handle the FaultRequest message over the UDS socket by calling a new callback in the Runtime and sending a FaultReply. TODO: add cab/sob from Patrick Signed-off-by: Nikita Kalyazin <[email protected]>
There are two ways a UFFD handler receives a fault notification if Secret Fredom is enabled (which is inferred from 3 fds sent by Firecracker instead of 1): - a VMM- or KVM-triggered fault is delivered via a minor UFFD fault event. The handler is supposed to respond to it via memcpying the content of the page (if the page hasn't already been populated) followed by a UFFDIO_CONTINUE call. - a vCPU-triggered fault is delievered via a FaultRequest message on the UDS socket. The handler is supposed to reply with a pwrite64 call on the guest_memfd to populate the page followed by a FaultReply message on the UDS socket. In both cases, the handler also needs to clear the bit in the userfault bitmap at the corresponding offset in order to stop further fault notifications for the same page. UFFD handlers use the userfault bitmap for two purposes: - communicate to the kernel whether a fault at the corresponding guest_memfd offset will cause a VM exit - keep track of pages that have already been populated in order to avoid overwriting the content of the page that is already initialised. Signed-off-by: Nikita Kalyazin <[email protected]>
These are used for communication of page faults between Firecracker and a UFFD handler. Signed-off-by: Nikita Kalyazin <[email protected]>
If configured, userfault bitmap is registered with KVM and controls whether KVM will exit to userspace on a fault of the corresponding page. We are going to allocate the bitmap in a memfd in Firecracker, set bits for all pages to request notifications for vCPU faults and send it to the UFFD handler to delegate clearing the bits as pages get populated. Since the KVM userfault patches are still in review, set_user_memory_region2 is not aware of the userfault flag and the userfault bitmap address in its input structure. Define it in Firecracker code temporarily. Signed-off-by: Nikita Kalyazin <[email protected]>
This is needed to instruct the kernel to exit to userspace when a vCPU fault occurs and the corresponding bit in the userfault bitmap is set. The userfault bitmap is allocated in a memfd by Firecracker and sent to the UFFD handler. This also sends 3 fds to the UFFD handler in the handshake: - UFFD (original) - guest_memfd: for the handler to be able to populate guest memory - userfault bitmap memfd: for the handler to be able to disable exits to userspace for the pages that have already been populated Signed-off-by: Nikita Kalyazin <[email protected]>
These will be used to communicate vCPU faults between vCPUs and the VM if secret freedom is enabled. Signed-off-by: Nikita Kalyazin <[email protected]>
This is because vCPUs reason in GPAs while the secret-free UFFD protocol is guest_memfd-offset-based. TODO: add cab/sob from Patrick Signed-off-by: Nikita Kalyazin <[email protected]>
It contains two parts: - external: between the VMM thread and the UFFD handler - internal: between vCPUs and the VMM thread An outline of the workflow: - When a vCPU fault occurs, vCPU exits to userspace - The vCPU thread sends a message to the VMM thread via the userfault channel - The VMM thread forwards the message to the UFFD handler via the UDS socket - The UFFD hnadler populates the page, clears the corresponding bit in the userfault bitmap and sends a reply to Firecracker - The VMM thread receives the reply and forwards it to the vCPU via the userfault channel - The vCPU resumes execution Signed-off-by: Nikita Kalyazin <[email protected]>
This is required by Secret Freedom to implement the userfault protocol: vCPUs read notification of fault handling completions from the userfault channel. Signed-off-by: Nikita Kalyazin <[email protected]>
kvmclock is currently not supported by Secret Freedom and calling kvmclock_ctrl will always fail. Signed-off-by: Nikita Kalyazin <[email protected]>
In a regular VM, we mmap the memory snapshot file and supply the address in the KVM memory slot. In Secret Free VMs, we provide guest_memfd in the memory slot instead. There is no way we can restore a Secret Free VM from a file, unless we prepopulate the guest_memfd with the file content, which is inefficient and is not practically useful. Signed-off-by: Nikita Kalyazin <[email protected]>
It is not supported by Secret Freedom. Signed-off-by: Nikita Kalyazin <[email protected]>
This includes both functional and performance tests. Signed-off-by: Nikita Kalyazin <[email protected]>
Do not add a balloon device to a Secret Free VM as it is not currently supported. Signed-off-by: Nikita Kalyazin <[email protected]>
When taking a snapshot from a Secret Free VM, we create a bounce buffer to be able to pass it to the host kernel to store in a file. Exclude it from the memory monitor calculation. Signed-off-by: Nikita Kalyazin <[email protected]>
This is because the error type has changed due the implementation of snapshot restore support for Secret Free VMs. Signed-off-by: Nikita Kalyazin <[email protected]>
ioctl_iow_nr!( | ||
KVM_SET_USER_MEMORY_REGION2, | ||
KVMIO, | ||
0x49, | ||
kvm_userspace_memory_region2 | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can reuse this constant from kvm, only the size of the struct influences the ioctl number, but that is unchanged here
let addr = match userfault_bitmap_memfd { | ||
Some(file) => { | ||
// SAFETY: the arguments to mmap cannot cause any memory unsafety in the rust sense | ||
let addr = unsafe { | ||
libc::mmap( | ||
std::ptr::null_mut(), | ||
usize::try_from(file.metadata().expect("Failed to get metadata").len()) | ||
.expect("userfault bitmap file size is too large"), | ||
libc::PROT_WRITE, | ||
libc::MAP_SHARED, | ||
file.as_raw_fd(), | ||
0, | ||
) | ||
}; | ||
|
||
if addr == libc::MAP_FAILED { | ||
panic!( | ||
"Failed to mmap userfault bitmap file: {}", | ||
std::io::Error::last_os_error() | ||
); | ||
} | ||
|
||
Some(addr as u64) | ||
} | ||
None => None, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if this function took a pointer to the beginning of the userfault bitmap region, and then in the for loop below we do the pointer arithmetic to get the correct offset into it for the specific guest memory region.
But argh then this entire thing needs to be unsafe (well, register_memory_region
already has to be given that it takes a u64 that we're happily interpreting as a pointer and passing to the kernel for doing reads and writes on). Maybe it can take a slice? We create the memfd, and before we pass it off to KVM for doing unholy things with it, having Rust slices to it should be fine.
.set_user_memory_region2(memory_region) | ||
.map_err(VmError::SetUserMemoryRegion)?; | ||
} | ||
self.set_user_memory_region2(memory_region)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's some asserts in the fallback branch which we probably wanna update to include the userfault bitmap being zero
if vm_resources.machine_config.huge_pages.is_hugetlbfs() { | ||
return Err(BuildMicrovmFromSnapshotErrorGuestMemoryError::File( | ||
GuestMemoryFromFileError::HugetlbfsSnapshot, | ||
) | ||
.into()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably need one of these checks for secret hiding for now, until guest_memfd support hugepages? :o
#[cfg(target_arch = "x86_64")] | ||
vmm.vm.set_memory_private().map_err(VmmError::Vm)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this to its own commit so its easier to drop it once its not needed anymore?
// TODO remove these when the UFFD crate supports minor faults for guest_memfd | ||
const UFFDIO_REGISTER_MODE_MINOR: u64 = 1 << 2; | ||
const UFFDIO_REGISTER: u64 = 0xc020aa00; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can drop it once the UFFD create supports minor mode in general, right? This is not specific to guest_memfd if I'm understanding correctly, so theoretically we could already open a PR over there to add support for this, right?
same for uffdio_continue actually
// We prevent Rust from closing the guest_memfd file descriptor | ||
// so the UFFD handler can used it to populate guest memory. | ||
forget(guest_memfd_copy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we only send over the raw FD number, we can just hoist guest_memfd.as_raw_fd()
to the beginning of the function to store the fd number itself for sending across, before we pass ownership of the fd to create_memory
. Then there's no need for clone()/forget()
fn create_userfault_channels( | ||
&self, | ||
secret_free: bool, | ||
) -> Result<(Option<UserfaultChannel>, Option<UserfaultChannel>), std::io::Error> { | ||
if secret_free { | ||
let (receiver_vcpu_to_vm, sender_vcpu_to_vm) = pipe2(libc::O_NONBLOCK)?; | ||
let (receiver_vm_to_vcpu, sender_vm_to_vcpu) = pipe2(0)?; | ||
Ok(( | ||
Some(UserfaultChannel { | ||
sender: sender_vcpu_to_vm, | ||
receiver: receiver_vm_to_vcpu, | ||
}), | ||
Some(UserfaultChannel { | ||
sender: sender_vm_to_vcpu, | ||
receiver: receiver_vcpu_to_vm, | ||
}), | ||
)) | ||
} else { | ||
Ok((None, None)) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, so I haven't completely digested this part yet, but two questions:
- Why do we need a pipe to send stuff from the vmm thread to the vcpu thread? The vcpu is already paused here, so why can't we reuse the existing rust-channel that we use for sending the resume command?
- Why pipes in the first place? We only need a single eventfd for the vcpu to kick the vmm thread (could even be the same eventfd for all vcpus). The data can then also be sent via a rust-channel (which iirc are completely in userspace)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for co-author / SoB from me, I just did a one line bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't be required if we use rust channels instead of pipes :p
response = microvm.api.vm_config.get() | ||
config = response.json() | ||
|
||
return "balloon" in config and config["balloon"] is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config.get('balloon') is not None
?
Changes
Implement userfault support in Secret Freedom. The goal of this change is to be able to resume Secret-Free VMs via UFFD.
Major changes:
write
s to the guest_memfd to populate guest pages and clears bits in the userfault bitmap (memfd) to stop KVM from sending vCPU fault notificationsReason
This is needed to be able to restore snapshots where the VM was backed by guest_memfd.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md
.PR Checklist
tools/devtool checkstyle
to verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
[ ] I have updated any relevant documentation (both in code and in the docs)in the PR.
[ ] I have mentioned all user-facing changes inCHANGELOG.md
.[ ] If a specific issue led to this PR, this PR closes the issue.[ ] When making API changes, I have followed theRunbook for Firecracker API changes.
[ ] I have tested all new and changed functionalities in unit tests and/orintegration tests.
[ ] I have linked an issue to every newTODO
.rust-vmm
.