sled-agent: move instance configuration generation to Nexus #8002

Merged
gjcolombo merged 5 commits into main from gjcolombo/instance-configs-in-nexus on Apr 29, 2025

gjcolombo
Contributor

One of the determinations in RFD 505 is that Nexus should be the component that's in charge of determining how to configure a VM given a set of database records describing its instance (the Instance itself, its attached Disks and NetworkInterfaces, etc.). To summarize the rationale in the RFD, the hope is that this will promote two nice properties:

  • Local reasoning about virtual platforms: All the logic that translates instance descriptions into VM specs now lives in a single module in Nexus. In past iterations of the code, Nexus transformed database records into an intermediate sled-agent type, and sled-agent would transform those into Propolis API types, which Propolis would then use to fill in virtual hardware details. Understanding where a VM's configuration came from required the reader to look at all these components; now all the relevant logic lives in Nexus.
  • Serviceability: Putting type transformations and platform policies into sled-agent and Propolis makes them marginally more painful to update, since updating these components requires the system to migrate VMs and reboot sleds. Putting the virtual platform policy in Nexus will make it much less expensive to update in the future.

To achieve this:

  • Move sled-agent's virtual platform logic (added in #7211, "ingest new Propolis VM creation API") into a new Nexus module. Sled-agent still holds onto a bit of logic that inserts OPTE port names into instance specs before sending those specs to Propolis; this has to stay in the agent because the agent is the component that selects those port names.
  • Update the sled-agent instance registration API to take a Propolis instance spec as a parameter (and rework some other types to distinguish a bit more clearly between "Propolis VM configuration" and "sled-agent objects that need to be created to support this VM").
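
To make the split of responsibilities above concrete, here is a minimal sketch, not the code in this PR. It assumes a simplified spec type in which Nexus emits each NIC backend with an empty port name, and sled-agent fills in the OPTE port names it selected before handing the spec to Propolis. `Spec`, `Component`, and `fill_port_names` are hypothetical names invented for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a Propolis instance spec.
#[derive(Debug, Clone, PartialEq)]
pub struct Spec {
    /// Maps a component name to its description.
    pub components: BTreeMap<String, Component>,
}

/// Hypothetical component kinds; the real spec has many more.
#[derive(Debug, Clone, PartialEq)]
pub enum Component {
    NvmeDisk { backend_id: String },
    /// In this sketch, Nexus leaves `port_name` as `None` because only
    /// sled-agent knows which OPTE port it will create.
    NicBackend { port_name: Option<String> },
}

/// Sled-agent side of the sketch: fill in the OPTE port name for every
/// NIC backend in a Nexus-generated spec. `ports` maps a component name
/// to the port name the agent selected (an assumption of this sketch).
pub fn fill_port_names(
    spec: &mut Spec,
    ports: &BTreeMap<String, String>,
) -> Result<(), String> {
    for (name, component) in spec.components.iter_mut() {
        if let Component::NicBackend { port_name } = component {
            let port = ports
                .get(name)
                .ok_or_else(|| format!("no OPTE port for NIC backend {}", name))?;
            *port_name = Some(port.clone());
        }
    }
    Ok(())
}
```

Everything else in the spec passes through untouched, which is the point: the agent only patches in the names it owns.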

The main pain point in this change is that sled-agent's API now includes types that it picked up from the propolis-client API, which caused sled-agent's OpenAPI document to balloon with "duplicate" schema descriptions it inherited from propolis-client's generated types. I'm not sure if there's a great way around this (aside from changing the generated Propolis client to replace all its generated types with their "native" counterparts); I'm open to suggestions here.

Tested by booting a VM in a dev cluster, booting a comparable VM on rack2, and comparing their instance specs (as returned by Propolis's /instance/spec API) to make sure they specified the same components with the same configuration.
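
The description does not say how the two specs were compared. As one possible approach, the sketch below diffs two specs that have been flattened into maps from component name to a string rendering of its configuration. The flattened representation and the `diff_specs` helper are assumptions for illustration, not part of Propolis or Omicron.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Compare two flattened specs (component name -> rendered configuration)
/// and return one human-readable line per difference. An empty result
/// means both specs name the same components with the same configuration.
pub fn diff_specs(
    a: &BTreeMap<String, String>,
    b: &BTreeMap<String, String>,
) -> Vec<String> {
    // Visit the union of component names in sorted order.
    let names: BTreeSet<&String> = a.keys().chain(b.keys()).collect();
    let mut diffs = Vec::new();
    for name in names {
        match (a.get(name), b.get(name)) {
            (Some(x), Some(y)) if x == y => {}
            (Some(x), Some(y)) => diffs.push(format!("{}: {} != {}", name, x, y)),
            (Some(_), None) => diffs.push(format!("{}: missing from second spec", name)),
            (None, Some(_)) => diffs.push(format!("{}: missing from first spec", name)),
            // Every name came from one of the two maps.
            (None, None) => unreachable!(),
        }
    }
    diffs
}
```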

Contributor Author

@gjcolombo gjcolombo left a comment

I also want to test manually that Propolis-directed region replacements still work as intended with this change (they depend on the virtual platform module having used the relevant disk record's ID as the relevant Propolis backend ID).
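
The dependency described here can be sketched as follows, under the assumption that backends are stored in a map keyed by a spec key. `SpecKey` mirrors the name visible in the diff excerpt in this PR, but its definition here, along with `DiskRecord`, `disk_backend_key`, and `backend_for_disk`, is a hypothetical simplification (a `u128` stands in for a UUID).

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a spec key; a `u128` stands in for a UUID.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum SpecKey {
    Uuid(u128),
    Name(String),
}

/// Hypothetical stand-in for a disk database record.
pub struct DiskRecord {
    pub id: u128,
}

/// Spec-generation side: key the disk's backend by the disk record's ID.
pub fn disk_backend_key(disk: &DiskRecord) -> SpecKey {
    SpecKey::Uuid(disk.id)
}

/// Replacement side: find a disk's backend knowing only the disk record.
/// This works only if the generator used the record's ID as the key.
pub fn backend_for_disk<'a>(
    backends: &'a BTreeMap<SpecKey, String>,
    disk: &DiskRecord,
) -> Option<&'a String> {
    backends.get(&SpecKey::Uuid(disk.id))
}
```

If the generator keyed the backend by anything else (a name, say), the lookup would return `None`, which is the failure mode the manual test is meant to rule out.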

@gjcolombo
Contributor Author

This will need a fresh commit hash/SHA from the Propolis repo after oxidecomputer/propolis#899 merges, but I think it is otherwise more or less ready for review (though it could probably use some unit tests of the new virtual platform logic...).

@gjcolombo gjcolombo requested a review from hawkw April 24, 2025 19:47
@iximeow iximeow self-requested a review April 24, 2025 20:21
@gjcolombo
Contributor Author

@hawkw - thanks for taking a look! I think all your feedback is addressed in 31ee29c.

I've also updated this change to point to the latest mainline Propolis commit, which picks up propolis#899 (re-export Propolis client types for sled-agent to use in its API), propolis#900 (Crucible bump), and propolis#901 (packaging fix). This change now updates Omicron's Crucible dependency as well (cc @leftwo).

@gjcolombo gjcolombo requested a review from hawkw April 28, 2025 23:32
Member

@hawkw hawkw left a comment

Looks good to me --- I dunno if @iximeow wants to review this as well?

Member

@iximeow iximeow left a comment

thanks for giving me a second to take a look through this too. it's good!!

 #[derive(Clone, Debug, Serialize, Deserialize, JsonSchema)]
-pub struct InstanceHardware {
-    pub properties: InstanceProperties,
+pub struct InstanceSledLocalConfig {
Member

i really like splitting apart "externally-visible foundation for this VM" and "stuff we need to plumb for the VM to have the universe a user thinks it should". i dunno how much you intended that up front, but InstanceHardware being both was a little confusing.

Contributor Author

I think I stumbled across this half by accident (ISTR it fell out of an earlier draft where I was still trying to use some generated propolis-client types for certain things). But I'm with you--I'm really happy with how it turned out!
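
For readers skimming this thread, the split being praised can be sketched roughly like this. All type and field names below are hypothetical and chosen only to illustrate the idea of keeping "what the VM looks like" apart from "what the sled has to set up for it"; they are not the definitions in this PR.

```rust
/// Hypothetical stand-in for the Propolis VM configuration from Nexus.
#[derive(Debug, Clone, PartialEq)]
pub struct VmSpecSketch {
    pub cpus: u8,
    pub memory_mib: u64,
}

/// Hypothetical stand-in for sled-local configuration: things sled-agent
/// must create or plumb so the VM sees the environment the user expects.
#[derive(Debug, Clone, PartialEq)]
pub struct SledLocalConfigSketch {
    pub hostname: String,
    pub nic_names: Vec<String>,
}

/// Hypothetical registration request carrying both halves separately.
pub struct RegisterRequestSketch {
    pub vm_spec: VmSpecSketch,
    pub local_config: SledLocalConfigSketch,
}

/// Lists the sled-side objects the agent would create for this request.
/// It reads only `local_config`; the VM spec is opaque to this step.
pub fn sled_objects_to_create(req: &RegisterRequestSketch) -> Vec<String> {
    req.local_config
        .nic_names
        .iter()
        .map(|nic| format!("opte-port:{}", nic))
        .collect()
}
```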

Comment on lines +210 to +214
let pci_path = slot_to_pci_bdf(slot, PciDeviceKind::Disk)?;
let device = ComponentV0::NvmeDisk(NvmeDisk {
    backend_id: SpecKey::Uuid(disk.id()),
    pci_path,
    serial_number: zero_padded_nvme_serial_from_str(
Member

yeah having a box to put this kind of Weird Stuff seems like a really nice separation too.

@gjcolombo gjcolombo force-pushed the gjcolombo/instance-configs-in-nexus branch from 5d08f98 to 4443230 Compare April 29, 2025 16:32
@gjcolombo gjcolombo merged commit a617f0f into main Apr 29, 2025
19 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/instance-configs-in-nexus branch April 29, 2025 23:17
3 participants