-
Notifications
You must be signed in to change notification settings - Fork 2k
Add support for virtio-pci transport #5240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/pcie
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/pcie #5240 +/- ##
================================================
+ Coverage 79.23% 80.32% +1.08%
================================================
Files 262 265 +3
Lines 29347 30994 +1647
================================================
+ Hits 23254 24896 +1642
- Misses 6093 6098 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
a80faa9
to
15ed3eb
Compare
61c6be1
to
9d18f11
Compare
c0ada38
to
507cbce
Compare
Define thiserror::Error and displaydoc::Display for various error types in the vended PCI crate. This way we can embed them in our error types downstream. Also export a few types and struct fields that were private and we will be needing them. Signed-off-by: Babis Chalios <[email protected]>
Instead of returning an `EventFd` type, which will actually force us to clone the file descriptor in the Firecracker side. Signed-off-by: Babis Chalios <[email protected]>
This is code we are not going to use in Firecracker. Remove it, so we can keep the crates we vend as minimal as possible, including only things we are actually using. Signed-off-by: Babis Chalios <[email protected]>
We'd like to be able to store Vm within an atomic reference so we can pass it around and share it with other components. The main issue with doing this change is that we need Vm to be `mut` during initialization and the builder.rs code was creating Vmm with Vm embedded in it. To solve this, we break down the initialization of the Vmm object. We first create its individual parts (Vm, Kvm and DeviceManager), perform any necessary initialization logic on Vm and once this done add it within an Arc. Signed-off-by: Babis Chalios <[email protected]>
Add logic to track the device interrupts used by the microVM. This is not strictly needed right now, but we will need it when adding support for MSI-X interrupts. MSI-X interrupts are configured at runtime and we need to interact with KVM to set the interruput routes. To do it, we need to keep track all of the interrupts the VM is using. Signed-off-by: Babis Chalios <[email protected]>
Enable Vm to vend and manage MSI/MSI-X interrupts. This adds the logic to create a set of MSI vectors and then handle their lifetime. Signed-off-by: Babis Chalios <[email protected]>
Vm object is now maintaining information about the interrupts (both traditional IRQs and MSI-X vectors) that are being used by microVM devices. Derive Serialize/Deserialize add logic for recreating objects for relevant types. Signed-off-by: Babis Chalios <[email protected]>
Add a VirtIO PCI transport implementation. Nothing uses it at the moment. This requires a few changes in our vended pci and vm-device crates. Add a couple of tests that ensure that PCI configuration space is what expected. We read common fields and make sure the BAR we allocate for the VirtIO device is what expected. Signed-off-by: Babis Chalios <[email protected]>
Merge the device-related errors that DeviceManager might return. This way, we can avoid adding yet another error type for PCI devices and reduce some the variants of StartMicrovmError. Suggested-by: Egor Lazarchuk <[email protected]> Signed-off-by: Babis Chalios <[email protected]>
Commit d8c2714 (refactor: use VirtioInterrupt in VirtIO devices) which refactored devices to use new VirtioInterrupt type introduced a bug with the index used to trigger a queue interrupt. Instead of using the actual queue index, we were using the index of the used descriptor. Signed-off-by: Babis Chalios <[email protected]>
Apparently, PCI needs Queue::size to be initialized to the maximum possible size supported by the device, otherwise initialization fails. Signed-off-by: Babis Chalios <[email protected]>
Remove the flags in FADT that were declaring we do not support MSI and PCI ASPM. Signed-off-by: Babis Chalios <[email protected]>
Create VirtIO devices using the PCI transport layer when user launched microVM with --enable-pci. Signed-off-by: Babis Chalios <[email protected]>
We are now calling KVM_CHECK_EXTENSION for checking the KVM_CAP_MSI_DEVID capability. We are also calling KVM_SET_GSI_ROUTING to set the interrupts routes and KVM_IRQFD to set/unset interrupt lines. Signed-off-by: Babis Chalios <[email protected]>
Add some unit tests to PciSegment. We now test that the next_device_bdf() method and the initialization logic work as expected. We also check that the configuration space of the PCI segment is correctly registered with the MMIO and, on x86, PIO bus. Signed-off-by: Babis Chalios <[email protected]>
vm-allocator now allows us to (De)serialize IdAllocator and AddressAllocator types. Add ResourceAllocator in DeviceManager snapshot state and restore it when loading a snapshot. Like this we can avoid doing the ExactMatch allocations during snapshot resumes for reserving the exact same MMIO ranges. Moreover, change DeviceManager and PciDevices to provide save/restore functionality via the Persist trait. Like that we can avoid first creating the objects and then restoring their state, overwriting their fields. Signed-off-by: Babis Chalios <[email protected]>
VirtIO MMIO restore logic activates the device the moment we restore the device state, if the device was activated when snapshotted. Move the activation responsibility to the logic the restores the MMIO transport. The reason for this change is that that's how it will be done for the PCI transport. Unifying this will allow us reusing the same types for restoring the non-transport state of devices. Note that we needed to change the way Net devices are saved/restored. RxBuffer type of Net devices holds RX descriptors that we have parsed from the Queue ahead of time. The way we restored this info was manipulating the queue to re-parse the RX descriptors during the restore phase. However, we need the device to be activated to do so, which now isn't. So, instead of storing this info inside the snapshot make sure we have flushed everything before taking the snapshot. Also, simplify a bit the types that we use for serializing/deserializing the state of a device. Signed-off-by: Babis Chalios <[email protected]>
Support serializing the device-specific and transport state of a VirtIO device that uses the PCI transport. Signed-off-by: Babis Chalios <[email protected]>
ResourceAllocator object was part of DeviceManager since it is (mainly) devices that use it. ResourceAllocator is as well the object that implements (in a dummy way, for the moment) the DeviceRelocation trait which PciDevices use to move the address space of a PciDevice when triggered from the guest. Problem with DeviceRelocation is that it also needs the Vm file descriptor to perform the relocation, because we need to move register the new IO event fd for VirtIO devices. To make things simpler, move ResourceAllocator inside the Vm object. In subsequent commit we will remove the DeviceRelocation from ResourceAllocator and move it to Vm instead. This has the nice secondary effect that we were able to simplify the signature of many device-related methods that received Vm and ResourceAllocator arguments. Signed-off-by: Babis Chalios <[email protected]>
We had previously added MMIO and Port IO buses inside ResourceAllocator so that we could implement DeviceRelocation for the type. Now, we will delegate device relocation responsibilities to ArchVm instead. That is because device relocation requires access to the Vm file descriptor as well. As a result, we can move buses to the Vm object itself. Add MMIO bus to VmCommon as both architectures use it. Add PortIO bus for x86 architecture only. Signed-off-by: Babis Chalios <[email protected]>
Implement PCI device relocation logic for Vm type. Up until now, we were only implementing it in a dummy way. Make sure we correctly deallocate previously allocate ranges and unregister them from the corresponding bus. Try to allocate new addresses and register them with the bus. Signed-off-by: Babis Chalios <[email protected]>
Refactor the test code that inserts VirtIO devices in a Vmm object and then add a test which creates a Vmm with PCI devices and then serializes and deserializes the device manager and ensures that everything is as restored as expected. Signed-off-by: Babis Chalios <[email protected]>
Add support for ITS device which provides support for MSI interrupts on ARM architecture. This is currently supported only on systems with GICv3 interrupt controller. In order to make saving/restore of ITS state work properly, we need to change the order in which we restore redistributor register GICR_CTLR. We need to make sure that this register is restored last. Otherwise, restoring GICR_PROPBASER doesn't have any effect and ITS depends on it in order to save/restore ITS tables to/from guest memory. Signed-off-by: Babis Chalios <[email protected]>
Use pci_enabled fixture for boot time, block, and network tests to create PCI microVM variants as well. Signed-off-by: Babis Chalios <[email protected]>
We only pass pci=off if PCI is disabled in Firecracker. Adapt tests and comments to reflect that. Signed-off-by: Babis Chalios <[email protected]>
src/vmm/src/vstate/vm.rs
Outdated
@@ -34,6 +60,8 @@ pub struct VmCommon { | |||
max_memslots: usize, | |||
/// The guest memory of this Vm. | |||
pub guest_memory: GuestMemoryMmap, | |||
/// Interrupts used by Vm's devices | |||
interrupts: Arc<Mutex<HashMap<u32, RoutingEntry>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually need an Arc
here if the ArchVm
is already stored into a Arc
?
/// Enable vector | ||
fn enable(&self, vmfd: &VmFd) -> Result<(), errno::Error> { | ||
if !self.enabled.load(Ordering::Acquire) { | ||
vmfd.register_irqfd(&self.event_fd, self.gsi)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to track this through the VmCommon
object? If the tracking is only for GSI, we should probably just rename that register_gsi
?
(I'm still quite confused around the gsi
vs irq
nomenclature)
impl MsiVectorGroup { | ||
/// Returns the number of vectors in this group | ||
pub fn num_vectors(&self) -> u16 { | ||
u16::try_from(self.irq_routes.len()).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why safe? can we use .expect("irq routes should be less than 16k")
or smth like that?
|
||
fn trigger( | ||
&self, | ||
index: vm_device::interrupt::InterruptIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we just import InterruptIndex
?
// in the context of a restore given the device might require some | ||
// activation, meaning it will require locking. Dropping the lock | ||
// prevents from a subtle deadlock. | ||
std::mem::drop(locked_device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can probably wrap the section we need the lock for inside a block to avoid this explicitly.
let (num_queues, device_type) = {
let locked_device = device.lock().unwrap();
(locked_device.queues().len(), locked_device.device_type())
};
let pci_device_id =
VIRTIO_PCI_DEVICE_ID_BASE + u16::try_from(device_type).unwrap();
let (class, subclass) = match device_type {
[...]
}
@@ -311,20 +261,22 @@ impl<'a> Persist<'a> for MMIODeviceManager { | |||
let _: Result<(), ()> = self.for_each_virtio_device(|_, devid, device| { | |||
let mmio_transport_locked = device.inner.lock().expect("Poisoned lock"); | |||
let transport_state = mmio_transport_locked.save(); | |||
let device_info = device.resources; | |||
let device_id = devid.clone(); | |||
|
|||
let mut locked_device = mmio_transport_locked.locked_device(); | |||
match locked_device.device_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also here, can this be a trait function?
} | ||
|
||
// Give potential deferred RX frame to guest | ||
self.rx_buffer.finish_frame(&mut self.queues[RX_INDEX]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the case in which there's some used descriptors but the next_used
wasn't advanced? rate limiting?
allocator | ||
.lock() | ||
.unwrap() | ||
.allocate(len, len, vm_allocator::AllocPolicy::ExactMatch(new_base)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's possible for the driver to relocate from 32b to 64b addresses (or at least that what happened to me when I had a conflict in the 32b address space iirc). So we may need to free an allocator and reallocate in another.
IoEventAddress::Mmio(notify_base + i as u64 * NOTIFY_OFF_MULTIPLIER); | ||
self.common | ||
.fd | ||
.unregister_ioevent(queue_evt, &io_addr, NoDatamatch)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this requires some seccomp rules iirc
I'm not sure how we could test a relocation in an integ test, but we probably should.
/// Returns the file descriptor of the ITS device, if any | ||
pub fn its_fd(&self) -> Option<&DeviceFd> { | ||
match self { | ||
Self::V2(_) => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still use GICv2? Can we just completely remove it?
Changes
This PR adds support for using a PCI transport for VirtIO devices. The PR adds PCI support for x86 and Aarch64 systems both for booted and snapshot-restored microVMs. The VirtIO-PCI transport is used whenever Firecracker is launched with PCI enabled. Otherwise, we fall back to the MMIO transport.
This PR also adds support for MSI-X interrupts for VirtIO devices. When using the PCI transport VirtIO devices always use MSI-X type of interrupts. MMIO devices still use the IRQ based ones.
Since we needed to change various things in order to support PCI along with MMIO transport the PR brings a few quality of life changes in the code base:
ResourceAllocator
, so that we free old memory and allocate new one, the MMIO and Port IO buses, so that we register the device to its new allocated address range and the KVM Vm file descriptor so that we correctly set the new IO event fds. As a result, we move buses, and resource allocator inside theVm
object and make the latter responsible for the relocation when needed. This moves simplify the signatures of a lot of functions that were using a combination ofResourceAllocator
,Vm
andBus
objects. In the future, we might want to even moveDeviceManager
inside theVm
object. This should simplify things even further.serde
support in the new version ofvm-allocator
crate, so that we don't have to perform allocations with fixed addresses upon snapshot restore.Reason
Support VirtIO devices with PCI transport
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md
.PR Checklist
tools/devtool checkstyle
to verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md
.Runbook for Firecracker API changes.
integration tests.
TODO
.rust-vmm
.