[SH] add userfault support #5261

Open
wants to merge 18 commits into
base: feature/secret-hiding
from

kalyazin
Contributor

@kalyazin kalyazin commented Jun 13, 2025

Changes

Implement userfault support in Secret Freedom. The goal of this change is to be able to resume Secret-Free VMs via UFFD.

Major changes:

  • Firecracker sends the guest_memfd and the userfault bitmap memfd to the UFFD handler. The UFFD handler writes to the guest_memfd to populate guest pages and clears bits in the userfault bitmap (memfd) to stop KVM from sending vCPU fault notifications.
  • vCPU faults on guest_memfd cause VM exits. Once a vCPU exits to userspace on a fault, it sends a fault request to the VMM thread via a pipe, and the VMM thread forwards it to the UFFD handler.
  • Firecracker- and KVM-triggered faults are delivered to the UFFD handler via minor UFFD notifications, and the UFFD handler unblocks the faulting process via UFFDIO_CONTINUE.

Reason

This is needed to be able to restore snapshots where the VM was backed by guest_memfd.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • [ ] I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • [ ] I have mentioned all user-facing changes in CHANGELOG.md.
  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • [ ] When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • [ ] I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • [ ] I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 28.14371% with 360 lines in your changes missing coverage. Please review.

Project coverage is 81.67%. Comparing base (00ac2f3) to head (1888316).

Files with missing lines        Patch %   Lines
src/vmm/src/lib.rs              12.75%    130 Missing ⚠️
src/vmm/src/vstate/vm.rs        40.42%    84 Missing ⚠️
src/vmm/src/persist.rs          21.79%    61 Missing ⚠️
src/vmm/src/builder.rs          46.37%    37 Missing ⚠️
src/vmm/src/vstate/vcpu.rs      35.55%    29 Missing ⚠️
src/vmm/src/vstate/memory.rs    0.00%     19 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                    @@
##           feature/secret-hiding    #5261      +/-   ##
=========================================================
- Coverage                  82.52%   81.67%   -0.85%     
=========================================================
  Files                        250      250              
  Lines                      27386    27792     +406     
=========================================================
+ Hits                       22599    22698      +99     
- Misses                      4787     5094     +307     
Flag Coverage Δ
5.10-c5n.metal 81.84% <23.35%> (-1.07%) ⬇️
5.10-m5n.metal 81.84% <23.35%> (-1.06%) ⬇️
5.10-m6a.metal 81.00% <23.35%> (-1.09%) ⬇️
5.10-m6g.metal 77.62% <19.87%> (-1.08%) ⬇️
5.10-m6i.metal 81.84% <23.35%> (-1.06%) ⬇️
5.10-m7a.metal-48xl 80.99% <23.35%> (-1.10%) ⬇️
5.10-m7g.metal 77.62% <19.87%> (-1.08%) ⬇️
5.10-m7i.metal-24xl 81.80% <23.35%> (-1.06%) ⬇️
5.10-m7i.metal-48xl 81.79% <23.35%> (-1.07%) ⬇️
5.10-m8g.metal-24xl 77.61% <19.87%> (-1.08%) ⬇️
5.10-m8g.metal-48xl 77.61% <19.87%> (-1.08%) ⬇️
6.1-c5n.metal 81.89% <23.35%> (-1.07%) ⬇️
6.1-m5n.metal 81.88% <23.35%> (-1.08%) ⬇️
6.1-m6a.metal 81.05% <23.35%> (-1.10%) ⬇️
6.1-m6g.metal 77.62% <19.87%> (-1.08%) ⬇️
6.1-m6i.metal 81.87% <23.35%> (-1.08%) ⬇️
6.1-m7a.metal-48xl 81.03% <23.35%> (-1.10%) ⬇️
6.1-m7g.metal 77.62% <19.87%> (-1.08%) ⬇️
6.1-m7i.metal-24xl 81.89% <23.35%> (-1.07%) ⬇️
6.1-m7i.metal-48xl 81.90% <23.35%> (-1.06%) ⬇️
6.1-m8g.metal-24xl 77.61% <19.87%> (-1.08%) ⬇️
6.1-m8g.metal-48xl 77.62% <19.87%> (-1.08%) ⬇️
6.14-c5n.metal 81.94% <28.14%> (-0.98%) ⬇️
6.14-m5n.metal 81.93% <28.14%> (-0.99%) ⬇️
6.14-m6a.metal 81.10% <28.14%> (-1.01%) ⬇️
6.14-m6g.metal 77.66% <24.69%> (-1.00%) ⬇️
6.14-m6i.metal 81.93% <28.14%> (-0.98%) ⬇️
6.14-m7a.metal-48xl 81.09% <28.14%> (-1.01%) ⬇️
6.14-m7g.metal 77.67% <24.69%> (-1.00%) ⬇️
6.14-m7i.metal-24xl 81.95% <28.14%> (-0.98%) ⬇️
6.14-m7i.metal-48xl 81.95% <28.14%> (-0.98%) ⬇️
6.14-m8g.metal-24xl 77.66% <24.69%> (-1.00%) ⬇️
6.14-m8g.metal-48xl 77.67% <24.69%> (-0.99%) ⬇️

@kalyazin kalyazin force-pushed the sh_uf branch 2 times, most recently from 286efbe to 4e10e54 Compare June 13, 2025 19:55
@kalyazin kalyazin marked this pull request as ready for review June 16, 2025 07:26
@kalyazin kalyazin changed the title [WIP][SH] add userfault support to UFFD handlers [SH] add userfault support to UFFD handlers Jun 16, 2025
@kalyazin kalyazin force-pushed the sh_uf branch 3 times, most recently from b6185cb to 60abeb9 Compare June 17, 2025 10:42
@kalyazin kalyazin force-pushed the sh_uf branch 4 times, most recently from d5e7aa8 to 40101cd Compare June 19, 2025 11:41
@kalyazin kalyazin mentioned this pull request Jun 19, 2025
10 tasks
@kalyazin kalyazin changed the title [SH] add userfault support to UFFD handlers [SH] add userfault support Jun 19, 2025
@kalyazin kalyazin self-assigned this Jun 19, 2025
JackThomson2
JackThomson2 previously approved these changes Jun 19, 2025
This is needed because if guest_memfd is used to back guest memory, vCPU
fault notifications are delivered via the UFFD UDS socket.

Signed-off-by: Nikita Kalyazin <[email protected]>
Ok((0, 3)) => {
self.secret_free = true;
}
Ok((n, _)) => {
Contributor

n is the number of bytes read, so this should be (_, n) instead

Comment on lines +360 to +501
Err(e) => {
if e.errno() == libc::EAGAIN {
return None;
}
panic!("Read error: {}", e);
}
Contributor

nit:

Suggested change
Err(e) => {
if e.errno() == libc::EAGAIN {
return None;
}
panic!("Read error: {}", e);
}
Err(e) if e.errno() == libc::EAGAIN => return None,
Err(e) => panic!("Read error: {}", e),

}
}

fn recv_json<T: serde::de::DeserializeOwned + serde::Serialize>(&mut self) -> Option<T> {
Contributor

Think you can drop +serde::Serialize here? compiled without it for me at least

Comment on lines +370 to +509
match self.recv_mappings() {
Some(mappings) => Some(UffdMsgFromFirecracker::Mappings(mappings)),
None => None, // EOF or error
}
Contributor

nit: .map() here and below
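
A minimal sketch of what the `.map()` nit amounts to. The `Mapping` struct and its fields are hypothetical stand-ins, not the PR's actual types; only the shape of the `Option` handling matters here.

```rust
// Hypothetical stand-ins for the PR's types (field names are assumptions).
#[derive(Debug, PartialEq)]
struct Mapping {
    offset: u64,
    size: usize,
}

#[derive(Debug, PartialEq)]
enum UffdMsgFromFirecracker {
    Mappings(Vec<Mapping>),
}

// Verbose form, equivalent to the snippet quoted in the review.
fn wrap_with_match(mappings: Option<Vec<Mapping>>) -> Option<UffdMsgFromFirecracker> {
    match mappings {
        Some(mappings) => Some(UffdMsgFromFirecracker::Mappings(mappings)),
        None => None, // EOF or error
    }
}

// Compact form: a tuple variant is itself a function, so it can be handed to `map`.
fn wrap_with_map(mappings: Option<Vec<Mapping>>) -> Option<UffdMsgFromFirecracker> {
    mappings.map(UffdMsgFromFirecracker::Mappings)
}
```

Both functions return the same value for every input, so the change would be purely stylistic.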

Comment on lines +534 to +536
if bytes_read > 0 {
self.current_pos += bytes_read;
}
Contributor

no need for if

Comment on lines +415 to +564
match stream.next() {
Some(Ok(value)) => {
let consumed = stream.byte_offset();
self.buffer.copy_within(consumed..self.current_pos, 0);
self.current_pos -= consumed;
Some(value)
}
Some(Err(e)) => panic!(
"Failed to deserialize JSON message: {}. Error: {}",
String::from_utf8_lossy(&self.buffer[..self.current_pos]),
e
),
None => None,
}
Contributor

Suggested change
match stream.next() {
Some(Ok(value)) => {
let consumed = stream.byte_offset();
self.buffer.copy_within(consumed..self.current_pos, 0);
self.current_pos -= consumed;
Some(value)
}
Some(Err(e)) => panic!(
"Failed to deserialize JSON message: {}. Error: {}",
String::from_utf8_lossy(&self.buffer[..self.current_pos]),
e
),
None => None,
}
match stream.next()? {
Ok(value) => {
let consumed = stream.byte_offset();
self.buffer.copy_within(consumed..self.current_pos, 0);
self.current_pos -= consumed;
Some(value)
}
Err(e) => panic!(
"Failed to deserialize JSON message: {}. Error: {}",
String::from_utf8_lossy(&self.buffer[..self.current_pos]),
e
)
}
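
The effect of `stream.next()?` in the suggestion can be shown in isolation. This is a simplified sketch: the `u32`/`String` item types and the `next_value` helper are assumptions made for illustration, and the buffer bookkeeping from the PR is left out.

```rust
// Hypothetical helper: pull the next item from an iterator of `Result`s.
// `?` on the `Option` returns `None` early when the stream is exhausted,
// so the match only needs the `Ok` and `Err` arms.
fn next_value<I>(stream: &mut I) -> Option<u32>
where
    I: Iterator<Item = Result<u32, String>>,
{
    match stream.next()? {
        Ok(value) => Some(value),
        Err(e) => panic!("Failed to deserialize message: {}", e),
    }
}
```

This only compiles because the enclosing function itself returns `Option`, which is also the case for `recv_json` as quoted above.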

Comment on lines 571 to 734
UffdMsgFromFirecracker::Mappings(mappings) => {
let (guest_memfd, userfault_bitmap_memfd) =
if uffd_msg_iter.secret_free {
(
Some(unsafe {
File::from_raw_fd(uffd_msg_iter.fds[1])
}),
Some(unsafe {
File::from_raw_fd(uffd_msg_iter.fds[2])
}),
)
} else {
(None, None)
};

let uffd_handler = UffdHandler::from_mappings(
mappings,
unsafe { File::from_raw_fd(uffd_msg_iter.fds[0]) },
guest_memfd,
userfault_bitmap_memfd,
self.backing_memory,
self.backing_memory_size,
);

let fd = uffd_handler.uffd.as_raw_fd();
if uffd_handler.guest_memfd.is_some() {
self.main_handler_fd = Some(fd);
}
Contributor

I would move this entire mappings retrieval logic out of the event loop and back into the uffd handler construction path. Since we now read the FDs from the socket without reading any JSON, the iterator doesn't really need to know about fds at all. They can just be read before we even construct the iterator. As for the mappings, we can set <UffdMsgIterator as Iterator>::Item to be FaultRequest, and simply call recv_mappings() directly from the uffd handler construction code

kalyazin added 2 commits June 23, 2025 10:58
It is used by Secret-Free-enabled UFFD handlers to disable vCPU fault
notifications from the kernel.

Signed-off-by: Nikita Kalyazin <[email protected]>
Accept receiving 3 fds instead of 1, where fds[1] is guest_memfd and
fds[2] is userfault bitmap memfd.

Also handle the FaultRequest message over the UDS socket by calling a
new callback in the Runtime and sending a FaultReply.

TODO: add cab/sob from Patrick

Signed-off-by: Nikita Kalyazin <[email protected]>
kalyazin added 15 commits June 23, 2025 11:00
There are two ways a UFFD handler receives a fault notification if
Secret Freedom is enabled (which is inferred from 3 fds sent by
Firecracker instead of 1):
 - a VMM- or KVM-triggered fault is delivered via a minor UFFD fault
   event.  The handler is supposed to respond to it via memcpying the
   content of the page (if the page hasn't already been populated)
   followed by a UFFDIO_CONTINUE call.
 - a vCPU-triggered fault is delivered via a FaultRequest message on
   the UDS socket.  The handler is supposed to reply with a pwrite64
   call on the guest_memfd to populate the page followed by a FaultReply
   message on the UDS socket.

In both cases, the handler also needs to clear the bit in the userfault
bitmap at the corresponding offset in order to stop further fault
notifications for the same page.

UFFD handlers use the userfault bitmap for two purposes:
 - communicate to the kernel whether a fault at the corresponding
   guest_memfd offset will cause a VM exit
 - keep track of pages that have already been populated in order to
   avoid overwriting the content of the page that is already
   initialised.

Signed-off-by: Nikita Kalyazin <[email protected]>
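
The two roles of the userfault bitmap described in the commit message above can be sketched as follows. This is an illustration only: a `Vec<AtomicU64>` stands in for the mmap'ed memfd, and the 4 KiB page size, the bit order within a word, and every name below are assumptions rather than details taken from the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PAGE_SIZE: u64 = 4096; // assumption: 4 KiB guest pages

/// Stand-in for the shared userfault bitmap: one bit per guest page.
struct UserfaultBitmap {
    words: Vec<AtomicU64>,
}

impl UserfaultBitmap {
    /// All bits set: every page initially requests a fault notification.
    /// Bits past `num_pages` in the last word are set too and never used.
    fn new_all_set(num_pages: usize) -> Self {
        let num_words = (num_pages + 63) / 64;
        Self {
            words: (0..num_words).map(|_| AtomicU64::new(u64::MAX)).collect(),
        }
    }

    /// Map a guest_memfd offset to (word index, bit mask).
    fn locate(offset: u64) -> (usize, u64) {
        let page = offset / PAGE_SIZE;
        ((page / 64) as usize, 1u64 << (page % 64))
    }

    /// True if the page at `offset` has not been populated yet.
    fn is_set(&self, offset: u64) -> bool {
        let (word, mask) = Self::locate(offset);
        self.words[word].load(Ordering::Acquire) & mask != 0
    }

    /// Clear the bit after populating the page. Returns true only for the
    /// caller that actually cleared it, false if it was already clear.
    fn clear(&self, offset: u64) -> bool {
        let (word, mask) = Self::locate(offset);
        self.words[word].fetch_and(!mask, Ordering::AcqRel) & mask != 0
    }
}
```

A handler following this pattern would check `is_set` before copying page content, which is what avoids overwriting a page that is already initialised.
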
These are used for communication of page faults between Firecracker and
a UFFD handler.

Signed-off-by: Nikita Kalyazin <[email protected]>
If configured, userfault bitmap is registered with KVM and controls
whether KVM will exit to userspace on a fault of the corresponding page.

We are going to allocate the bitmap in a memfd in Firecracker, set bits
for all pages to request notifications for vCPU faults and send
it to the UFFD handler to delegate clearing the bits as pages get
populated.

Since the KVM userfault patches are still in review,
set_user_memory_region2 is not aware of the userfault flag and the
userfault bitmap address in its input structure.  Define it in
Firecracker code temporarily.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is needed to instruct the kernel to exit to userspace when a vCPU
fault occurs and the corresponding bit in the userfault bitmap is set.

The userfault bitmap is allocated in a memfd by Firecracker and sent to
the UFFD handler.

This also sends 3 fds to the UFFD handler in the handshake:
 - UFFD (original)
 - guest_memfd: for the handler to be able to populate guest memory
 - userfault bitmap memfd: for the handler to be able to disable exits
   to userspace for the pages that have already been populated

Signed-off-by: Nikita Kalyazin <[email protected]>
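
On the receiving side, the handshake described above (1 fd for a regular VM, 3 fds for a Secret Free VM) could be interpreted roughly like this. `HandshakeFds` and `parse_handshake_fds` are hypothetical names used for illustration; the PR itself indexes `uffd_msg_iter.fds` directly.

```rust
use std::os::unix::io::RawFd;

/// Hypothetical view of the descriptors received in the handshake.
#[derive(Debug, PartialEq)]
struct HandshakeFds {
    uffd: RawFd,
    /// `Some((guest_memfd, userfault_bitmap_memfd))` for Secret Free VMs.
    secret_free: Option<(RawFd, RawFd)>,
}

/// Interpret the fd array by position: fds[0] is the UFFD, fds[1] the
/// guest_memfd and fds[2] the userfault bitmap memfd.
fn parse_handshake_fds(fds: &[RawFd]) -> Result<HandshakeFds, String> {
    match *fds {
        [uffd] => Ok(HandshakeFds { uffd, secret_free: None }),
        [uffd, guest_memfd, bitmap_memfd] => Ok(HandshakeFds {
            uffd,
            secret_free: Some((guest_memfd, bitmap_memfd)),
        }),
        _ => Err(format!("expected 1 or 3 fds, got {}", fds.len())),
    }
}
```

Rejecting any other count up front keeps a malformed handshake from being mistaken for a regular VM.
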
These will be used to communicate vCPU faults between vCPUs and the VM
if secret freedom is enabled.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is because vCPUs reason in GPAs while the secret-free UFFD
protocol is guest_memfd-offset-based.

TODO: add cab/sob from Patrick

Signed-off-by: Nikita Kalyazin <[email protected]>
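
A GPA to guest_memfd offset translation of the kind this commit message refers to might look like the sketch below. The `Region` struct and its fields are assumptions for illustration and do not mirror Firecracker's actual memory types.

```rust
/// Hypothetical description of one guest memory region.
struct Region {
    guest_phys_addr: u64, // first GPA covered by the region
    size: u64,            // region length in bytes
    memfd_offset: u64,    // where the region starts inside guest_memfd
}

/// Translate a guest physical address into a guest_memfd offset.
/// Returns `None` if no region contains the address.
fn gpa_to_offset(regions: &[Region], gpa: u64) -> Option<u64> {
    regions.iter().find_map(|r| {
        let delta = gpa.checked_sub(r.guest_phys_addr)?;
        (delta < r.size).then(|| r.memfd_offset + delta)
    })
}
```

The second region in a layout with a hole below it shows why the translation is needed: its GPA is large while its guest_memfd offset continues right after the first region.
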
It contains two parts:
 - external: between the VMM thread and the UFFD handler
 - internal: between vCPUs and the VMM thread

An outline of the workflow:
 - When a vCPU fault occurs, vCPU exits to userspace
 - The vCPU thread sends a message to the VMM thread via the userfault
   channel
 - The VMM thread forwards the message to the UFFD handler via the UDS
   socket
 - The UFFD handler populates the page, clears the corresponding bit in
   the userfault bitmap and sends a reply to Firecracker
 - The VMM thread receives the reply and forwards it to the vCPU via the
   userfault channel
 - The vCPU resumes execution

Signed-off-by: Nikita Kalyazin <[email protected]>
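
The workflow outlined above can be simulated end to end with threads. This is a sketch under stated assumptions: `std::sync::mpsc` channels stand in for both the userfault channel and the UDS socket, the handler does no real page population, and the fields of `FaultRequest` and `FaultReply` are invented for illustration.

```rust
use std::sync::mpsc;
use std::thread;

// Message names follow the PR; the fields are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FaultRequest {
    vcpu: u32,
    offset: u64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct FaultReply {
    vcpu: u32,
    offset: u64,
}

/// Simulate one fault round trip and return the reply seen by the vCPU.
fn simulate_fault(vcpu: u32, offset: u64) -> FaultReply {
    // Internal part: vCPU <-> VMM thread.
    let (vcpu_to_vmm_tx, vcpu_to_vmm_rx) = mpsc::channel::<FaultRequest>();
    let (vmm_to_vcpu_tx, vmm_to_vcpu_rx) = mpsc::channel::<FaultReply>();
    // External part: VMM thread <-> UFFD handler.
    let (vmm_to_handler_tx, vmm_to_handler_rx) = mpsc::channel::<FaultRequest>();
    let (handler_to_vmm_tx, handler_to_vmm_rx) = mpsc::channel::<FaultReply>();

    // UFFD handler: would populate the page and clear the bitmap bit here.
    let handler = thread::spawn(move || {
        let req = vmm_to_handler_rx.recv().unwrap();
        let reply = FaultReply { vcpu: req.vcpu, offset: req.offset };
        handler_to_vmm_tx.send(reply).unwrap();
    });

    // VMM thread: forwards the request out and the reply back.
    let vmm = thread::spawn(move || {
        let req = vcpu_to_vmm_rx.recv().unwrap();
        vmm_to_handler_tx.send(req).unwrap();
        let reply = handler_to_vmm_rx.recv().unwrap();
        vmm_to_vcpu_tx.send(reply).unwrap();
    });

    // vCPU (the caller): report the fault, then block until it is resolved.
    vcpu_to_vmm_tx.send(FaultRequest { vcpu, offset }).unwrap();
    let reply = vmm_to_vcpu_rx.recv().unwrap();

    handler.join().unwrap();
    vmm.join().unwrap();
    reply
}
```

The blocking `recv` on the vCPU side corresponds to the last two steps of the outline: the vCPU resumes only after the reply has been forwarded to it.
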
This is required by Secret Freedom to implement the userfault protocol:
vCPUs read notification of fault handling completions from the userfault
channel.

Signed-off-by: Nikita Kalyazin <[email protected]>
kvmclock is currently not supported by Secret Freedom and calling
kvmclock_ctrl will always fail.

Signed-off-by: Nikita Kalyazin <[email protected]>
In a regular VM, we mmap the memory snapshot file and supply the address
in the KVM memory slot.  In Secret Free VMs, we provide guest_memfd in
the memory slot instead.  There is no way we can restore a Secret Free
VM from a file, unless we prepopulate the guest_memfd with the file
content, which is inefficient and is not practically useful.

Signed-off-by: Nikita Kalyazin <[email protected]>
It is not supported by Secret Freedom.

Signed-off-by: Nikita Kalyazin <[email protected]>
This includes both functional and performance tests.

Signed-off-by: Nikita Kalyazin <[email protected]>
Do not add a balloon device to a Secret Free VM as it is not currently
supported.

Signed-off-by: Nikita Kalyazin <[email protected]>
When taking a snapshot from a Secret Free VM, we create a bounce buffer
to be able to pass it to the host kernel to store in a file.  Exclude it
from the memory monitor calculation.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is because the error type has changed due to the implementation of
snapshot restore support for Secret Free VMs.

Signed-off-by: Nikita Kalyazin <[email protected]>
Comment on lines +244 to +375
ioctl_iow_nr!(
KVM_SET_USER_MEMORY_REGION2,
KVMIO,
0x49,
kvm_userspace_memory_region2
);
Contributor

We can reuse this constant from kvm, only the size of the struct influences the ioctl number, but that is unchanged here
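
The point about the struct size can be checked with a small calculation. The sketch below assumes the Linux `_IOW` encoding used on x86_64 and aarch64, and reproduces both struct layouts from memory: the upstream one, and a patched one in which a `userfault_bitmap` field takes over one padding slot. Treat the field lists as assumptions to verify against the actual headers.

```rust
use std::mem::size_of;

// Assumed upstream layout of `kvm_userspace_memory_region2`.
#[repr(C)]
struct KvmUserspaceMemoryRegion2 {
    slot: u32,
    flags: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
    guest_memfd_offset: u64,
    guest_memfd: u32,
    pad1: u32,
    pad2: [u64; 14],
}

// Assumed layout with the in-review userfault patches applied:
// one padding slot is repurposed, so the total size is unchanged.
#[repr(C)]
struct KvmUserspaceMemoryRegion2Userfault {
    slot: u32,
    flags: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
    guest_memfd_offset: u64,
    guest_memfd: u32,
    pad1: u32,
    userfault_bitmap: u64,
    pad2: [u64; 13],
}

const IOC_WRITE: u64 = 1;
const KVMIO: u64 = 0xAE;

/// Linux `_IOW(ty, nr, size)`: direction, size, type and number packed into one value.
const fn iow(ty: u64, nr: u64, size: usize) -> u64 {
    (IOC_WRITE << 30) | ((size as u64) << 16) | (ty << 8) | nr
}

fn upstream_ioctl() -> u64 {
    iow(KVMIO, 0x49, size_of::<KvmUserspaceMemoryRegion2>())
}

fn patched_ioctl() -> u64 {
    iow(KVMIO, 0x49, size_of::<KvmUserspaceMemoryRegion2Userfault>())
}
```

Under these assumptions both functions return the same number, which is why the existing constant could be reused.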

Comment on lines +205 to +230
let addr = match userfault_bitmap_memfd {
Some(file) => {
// SAFETY: the arguments to mmap cannot cause any memory unsafety in the rust sense
let addr = unsafe {
libc::mmap(
std::ptr::null_mut(),
usize::try_from(file.metadata().expect("Failed to get metadata").len())
.expect("userfault bitmap file size is too large"),
libc::PROT_WRITE,
libc::MAP_SHARED,
file.as_raw_fd(),
0,
)
};

if addr == libc::MAP_FAILED {
panic!(
"Failed to mmap userfault bitmap file: {}",
std::io::Error::last_os_error()
);
}

Some(addr as u64)
}
None => None,
};
Contributor

I'd prefer if this function took a pointer to the beginning of the userfault bitmap region, and then in the for loop below we do the pointer arithmetic to get the correct offset into it for the specific guest memory region.

But argh then this entire thing needs to be unsafe (well, register_memory_region already has to be given that it takes a u64 that we're happily interpreting as a pointer and passing to the kernel for doing reads and writes on). Maybe it can take a slice? We create the memfd, and before we pass it off to KVM for doing unholy things with it, having Rust slices to it should be fine.

.set_user_memory_region2(memory_region)
.map_err(VmError::SetUserMemoryRegion)?;
}
self.set_user_memory_region2(memory_region)?;
Contributor

there's some asserts in the fallback branch which we probably wanna update to include the userfault bitmap being zero

Comment on lines +551 to +556
if vm_resources.machine_config.huge_pages.is_hugetlbfs() {
return Err(BuildMicrovmFromSnapshotErrorGuestMemoryError::File(
GuestMemoryFromFileError::HugetlbfsSnapshot,
)
.into());
}
Contributor

probably need one of these checks for secret hiding for now, until guest_memfd supports hugepages? :o

Comment on lines +582 to +583
#[cfg(target_arch = "x86_64")]
vmm.vm.set_memory_private().map_err(VmmError::Vm)?;
Contributor

can you move this to its own commit so it's easier to drop it once it's not needed anymore?

Comment on lines +503 to +505
// TODO remove these when the UFFD crate supports minor faults for guest_memfd
const UFFDIO_REGISTER_MODE_MINOR: u64 = 1 << 2;
const UFFDIO_REGISTER: u64 = 0xc020aa00;
Contributor

Can drop it once the UFFD crate supports minor mode in general, right? This is not specific to guest_memfd if I'm understanding correctly, so theoretically we could already open a PR over there to add support for this, right?

same for uffdio_continue actually

Comment on lines +606 to +608
// We prevent Rust from closing the guest_memfd file descriptor
// so the UFFD handler can use it to populate guest memory.
forget(guest_memfd_copy);
Contributor

since we only send over the raw FD number, we can just hoist guest_memfd.as_raw_fd() to the beginning of the function to store the fd number itself for sending across, before we pass ownership of the fd to create_memory. Then there's no need for clone()/forget()
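
A minimal sketch of the hoisting idea, assuming the new owner keeps the file open for as long as the number is needed. `GuestMemory`, `build`, and this `create_memory` are simplified stand-ins, not Firecracker's real signatures.

```rust
use std::fs::File;
use std::os::unix::io::{AsRawFd, RawFd};

/// Stand-in for whatever ends up owning the guest_memfd.
struct GuestMemory {
    backing: File,
}

/// Stand-in for a constructor that takes ownership of the file.
fn create_memory(file: File) -> GuestMemory {
    GuestMemory { backing: file }
}

/// Record the fd number first, then move the `File`.
/// No `try_clone()` or `forget()` is needed: the number stays valid
/// because the returned `GuestMemory` keeps the descriptor open.
fn build(guest_memfd: File) -> (GuestMemory, RawFd) {
    let raw_fd = guest_memfd.as_raw_fd();
    let memory = create_memory(guest_memfd);
    (memory, raw_fd)
}
```

The raw number carries no ownership, so it must not outlive the owning `File`; in this sketch that is guaranteed only while the returned `GuestMemory` is alive.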

Comment on lines +203 to +223
fn create_userfault_channels(
&self,
secret_free: bool,
) -> Result<(Option<UserfaultChannel>, Option<UserfaultChannel>), std::io::Error> {
if secret_free {
let (receiver_vcpu_to_vm, sender_vcpu_to_vm) = pipe2(libc::O_NONBLOCK)?;
let (receiver_vm_to_vcpu, sender_vm_to_vcpu) = pipe2(0)?;
Ok((
Some(UserfaultChannel {
sender: sender_vcpu_to_vm,
receiver: receiver_vm_to_vcpu,
}),
Some(UserfaultChannel {
sender: sender_vm_to_vcpu,
receiver: receiver_vcpu_to_vm,
}),
))
} else {
Ok((None, None))
}
}
Contributor

Alright, so I haven't completely digested this part yet, but two questions:

  • Why do we need a pipe to send stuff from the vmm thread to the vcpu thread? The vcpu is already paused here, so why can't we reuse the existing rust-channel that we use for sending the resume command?
  • Why pipes in the first place? We only need a single eventfd for the vcpu to kick the vmm thread (could even be the same eventfd for all vcpus). The data can then also be sent via a rust-channel (which iirc are completely in userspace)

Contributor

no need for co-author / SoB from me, I just did a one line bug fix

Contributor

wouldn't be required if we use rust channels instead of pipes :p

response = microvm.api.vm_config.get()
config = response.json()

return "balloon" in config and config["balloon"] is not None
Contributor

config.get('balloon') is not None?

3 participants