
Commit 9b595e9

Perform instance state transitions in instance-update saga (#5749)
A number of bugs relating to guest instance lifecycle management have been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or `Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve. Broadly, these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently split between Nexus and sled-agent(s).** Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, its target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may generate state updates that conflict with each other. And splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed reliably.** Instance state transitions triggered by user actions, such as `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is *not* the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network-partitioned, or encounters a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed.

This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly), and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update.
- **When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record**, so that fewer state transitions *require* changes to the `instance` record. PR #5854 laid the groundwork for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga.** I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death.
- **Consistency of instance state changes is ensured by distributed locking.** State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a *different* Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it *should* have a "happens-after" relationship with the other state transition. And some operations may contradict each other when performed concurrently. To prevent these race conditions, this PR has the dubious honor of using the first _distributed lock_ in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 --- see that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates.** To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas.

Currently, the instance update saga runs in the following cases (a short decision sketch follows this message):

- An instance's active VMM transitions to `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked.
- Either side of an instance's live migration reports that the migration has completed successfully.
- Either side of an instance's live migration reports that the migration has failed.

The inner workings of the instance-update saga itself are fairly complex, and it has some interesting idiosyncrasies relative to the existing sagas. I've written up a [lengthy comment] that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message.

[lengthy comment]: https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
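To make the trigger conditions above concrete, here is a minimal decision sketch in Rust. This is not code from this commit: `VmmState`, `MigrationState`, and `needs_update_saga` are simplified, hypothetical stand-ins, and the real checks live in the `instance-updater` background task and the handling of sled-agent reports. It only restates the three cases listed above.

```rust
// Hypothetical stand-ins for the real model types; a sketch only.
#[derive(Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Starting,
    Running,
    Stopping,
    Destroyed,
}

#[derive(Clone, Copy, PartialEq, Eq)]
enum MigrationState {
    Pending,
    InProgress,
    Completed,
    Failed,
}

/// Returns whether an `instance-update` saga should be started, given the
/// active VMM's state and any reported inbound/outbound migration states.
fn needs_update_saga(
    active_vmm: VmmState,
    migration_in: Option<MigrationState>,
    migration_out: Option<MigrationState>,
) -> bool {
    // A migration that has completed or failed is terminal and needs an update.
    let terminated = |m: Option<MigrationState>| {
        matches!(m, Some(MigrationState::Completed | MigrationState::Failed))
    };
    active_vmm == VmmState::Destroyed
        || terminated(migration_in)
        || terminated(migration_out)
}
```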
1 parent d391e5c commit 9b595e9


65 files changed (+7305 −3018 lines)

clients/nexus-client/src/lib.rs

Lines changed: 2 additions & 31 deletions
```diff
@@ -122,22 +122,6 @@ impl From<types::VmmState> for omicron_common::api::internal::nexus::VmmState {
     }
 }
 
-impl From<omicron_common::api::internal::nexus::InstanceRuntimeState>
-    for types::InstanceRuntimeState
-{
-    fn from(
-        s: omicron_common::api::internal::nexus::InstanceRuntimeState,
-    ) -> Self {
-        Self {
-            dst_propolis_id: s.dst_propolis_id,
-            gen: s.gen,
-            migration_id: s.migration_id,
-            propolis_id: s.propolis_id,
-            time_updated: s.time_updated,
-        }
-    }
-}
-
 impl From<omicron_common::api::internal::nexus::VmmRuntimeState>
     for types::VmmRuntimeState
 {
@@ -153,10 +137,10 @@ impl From<omicron_common::api::internal::nexus::SledInstanceState>
         s: omicron_common::api::internal::nexus::SledInstanceState,
     ) -> Self {
         Self {
-            instance_state: s.instance_state.into(),
             propolis_id: s.propolis_id,
             vmm_state: s.vmm_state.into(),
-            migration_state: s.migration_state.map(Into::into),
+            migration_in: s.migration_in.map(Into::into),
+            migration_out: s.migration_out.map(Into::into),
         }
     }
 }
@@ -169,26 +153,13 @@ impl From<omicron_common::api::internal::nexus::MigrationRuntimeState>
     ) -> Self {
         Self {
            migration_id: s.migration_id,
-            role: s.role.into(),
            state: s.state.into(),
            gen: s.gen,
            time_updated: s.time_updated,
        }
    }
 }
 
-impl From<omicron_common::api::internal::nexus::MigrationRole>
-    for types::MigrationRole
-{
-    fn from(s: omicron_common::api::internal::nexus::MigrationRole) -> Self {
-        use omicron_common::api::internal::nexus::MigrationRole as Input;
-        match s {
-            Input::Source => Self::Source,
-            Input::Target => Self::Target,
-        }
-    }
-}
-
 impl From<omicron_common::api::internal::nexus::MigrationState>
     for types::MigrationState
 {
```

clients/sled-agent-client/src/lib.rs

Lines changed: 64 additions & 15 deletions
```diff
@@ -5,6 +5,9 @@
 //! Interface for making API requests to a Sled Agent
 
 use async_trait::async_trait;
+use schemars::JsonSchema;
+use serde::Deserialize;
+use serde::Serialize;
 use std::convert::TryFrom;
 use uuid::Uuid;
 
@@ -162,10 +165,10 @@ impl From<types::SledInstanceState>
 {
     fn from(s: types::SledInstanceState) -> Self {
         Self {
-            instance_state: s.instance_state.into(),
             propolis_id: s.propolis_id,
             vmm_state: s.vmm_state.into(),
-            migration_state: s.migration_state.map(Into::into),
+            migration_in: s.migration_in.map(Into::into),
+            migration_out: s.migration_out.map(Into::into),
         }
     }
 }
@@ -177,25 +180,12 @@ impl From<types::MigrationRuntimeState>
         Self {
             migration_id: s.migration_id,
             state: s.state.into(),
-            role: s.role.into(),
             gen: s.gen,
             time_updated: s.time_updated,
         }
     }
 }
 
-impl From<types::MigrationRole>
-    for omicron_common::api::internal::nexus::MigrationRole
-{
-    fn from(r: types::MigrationRole) -> Self {
-        use omicron_common::api::internal::nexus::MigrationRole as Output;
-        match r {
-            types::MigrationRole::Source => Output::Source,
-            types::MigrationRole::Target => Output::Target,
-        }
-    }
-}
-
 impl From<types::MigrationState>
     for omicron_common::api::internal::nexus::MigrationState
 {
@@ -457,12 +447,29 @@ impl From<types::SledIdentifiers>
 /// are bonus endpoints, not generated in the real client.
 #[async_trait]
 pub trait TestInterfaces {
+    async fn instance_single_step(&self, id: Uuid);
     async fn instance_finish_transition(&self, id: Uuid);
+    async fn instance_simulate_migration_source(
+        &self,
+        id: Uuid,
+        params: SimulateMigrationSource,
+    );
     async fn disk_finish_transition(&self, id: Uuid);
 }
 
 #[async_trait]
 impl TestInterfaces for Client {
+    async fn instance_single_step(&self, id: Uuid) {
+        let baseurl = self.baseurl();
+        let client = self.client();
+        let url = format!("{}/instances/{}/poke-single-step", baseurl, id);
+        client
+            .post(url)
+            .send()
+            .await
+            .expect("instance_single_step() failed unexpectedly");
+    }
+
     async fn instance_finish_transition(&self, id: Uuid) {
         let baseurl = self.baseurl();
         let client = self.client();
@@ -484,4 +491,46 @@ impl TestInterfaces for Client {
             .await
             .expect("disk_finish_transition() failed unexpectedly");
     }
+
+    async fn instance_simulate_migration_source(
+        &self,
+        id: Uuid,
+        params: SimulateMigrationSource,
+    ) {
+        let baseurl = self.baseurl();
+        let client = self.client();
+        let url = format!("{baseurl}/instances/{id}/sim-migration-source");
+        client
+            .post(url)
+            .json(&params)
+            .send()
+            .await
+            .expect("instance_simulate_migration_source() failed unexpectedly");
+    }
+}
+
+/// Parameters to the `/instances/{id}/sim-migration-source` test API.
+///
+/// This message type is not included in the OpenAPI spec, because this API
+/// exists only in test builds.
+#[derive(Serialize, Deserialize, JsonSchema)]
+pub struct SimulateMigrationSource {
+    /// The ID of the migration out of the instance's current active VMM.
+    pub migration_id: Uuid,
+    /// What migration result (success or failure) to simulate.
+    pub result: SimulatedMigrationResult,
+}
+
+/// The result of a simulated migration out from an instance's current active
+/// VMM.
+#[derive(Serialize, Deserialize, JsonSchema)]
+pub enum SimulatedMigrationResult {
+    /// Simulate a successful migration out.
+    Success,
+    /// Simulate a failed migration out.
+    ///
+    /// # Note
+    ///
+    /// This is not currently implemented by the simulated sled-agent.
+    Failure,
 }
```
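A hedged usage sketch of the new test interface added above. Only `TestInterfaces`, `instance_simulate_migration_source`, `instance_finish_transition`, `SimulateMigrationSource`, and `SimulatedMigrationResult` come from this change; the wrapper function, its name, and the assumption that the same `id` keys both calls are illustrative.

```rust
use sled_agent_client::{
    Client, SimulateMigrationSource, SimulatedMigrationResult, TestInterfaces,
};
use uuid::Uuid;

/// Drive the simulated sled-agent through a successful migration out of the
/// instance's current active VMM, then poke it to finish the transition.
/// (Illustrative wrapper; only the trait methods and parameter types are
/// from this change.)
async fn simulate_successful_migration_out(
    client: &Client,
    id: Uuid,           // the ID the simulated sled-agent uses for this instance
    migration_id: Uuid, // the migration out of the current active VMM
) {
    client
        .instance_simulate_migration_source(
            id,
            SimulateMigrationSource {
                migration_id,
                result: SimulatedMigrationResult::Success,
            },
        )
        .await;
    client.instance_finish_transition(id).await;
}
```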

common/src/api/internal/nexus.rs

Lines changed: 26 additions & 33 deletions
```diff
@@ -117,18 +117,38 @@ pub struct VmmRuntimeState {
 /// specific VMM and the instance it incarnates.
 #[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
 pub struct SledInstanceState {
-    /// The sled's conception of the state of the instance.
-    pub instance_state: InstanceRuntimeState,
-
     /// The ID of the VMM whose state is being reported.
     pub propolis_id: PropolisUuid,
 
     /// The most recent state of the sled's VMM process.
     pub vmm_state: VmmRuntimeState,
 
-    /// The current state of any in-progress migration for this instance, as
-    /// understood by this sled.
-    pub migration_state: Option<MigrationRuntimeState>,
+    /// The current state of any inbound migration to this VMM.
+    pub migration_in: Option<MigrationRuntimeState>,
+
+    /// The state of any outbound migration from this VMM.
+    pub migration_out: Option<MigrationRuntimeState>,
+}
+
+#[derive(Copy, Clone, Debug, Default)]
+pub struct Migrations<'state> {
+    pub migration_in: Option<&'state MigrationRuntimeState>,
+    pub migration_out: Option<&'state MigrationRuntimeState>,
+}
+
+impl Migrations<'_> {
+    pub fn empty() -> Self {
+        Self { migration_in: None, migration_out: None }
+    }
+}
+
+impl SledInstanceState {
+    pub fn migrations(&self) -> Migrations<'_> {
+        Migrations {
+            migration_in: self.migration_in.as_ref(),
+            migration_out: self.migration_out.as_ref(),
+        }
+    }
 }
 
 /// An update from a sled regarding the state of a migration, indicating the
@@ -137,7 +157,6 @@ pub struct SledInstanceState {
 pub struct MigrationRuntimeState {
     pub migration_id: Uuid,
     pub state: MigrationState,
-    pub role: MigrationRole,
     pub gen: Generation,
 
     /// Timestamp for the migration state update.
@@ -192,32 +211,6 @@ impl fmt::Display for MigrationState {
     }
 }
 
-#[derive(
-    Clone, Copy, Debug, PartialEq, Eq, Deserialize, Serialize, JsonSchema,
-)]
-#[serde(rename_all = "snake_case")]
-pub enum MigrationRole {
-    /// This update concerns the source VMM of a migration.
-    Source,
-    /// This update concerns the target VMM of a migration.
-    Target,
-}
-
-impl MigrationRole {
-    pub fn label(&self) -> &'static str {
-        match self {
-            Self::Source => "source",
-            Self::Target => "target",
-        }
-    }
-}
-
-impl fmt::Display for MigrationRole {
-    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
-        f.write_str(self.label())
-    }
-}
-
 // Oximeter producer/collector objects.
 
 /// The kind of metric producer this is.
```
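A small sketch of how a caller might consume the reshaped `SledInstanceState`. The `log_report` function is illustrative and assumes the usual `Display`/`Debug` derives on these types; only the `migration_in`/`migration_out` fields, the `Migrations` view, and the `migrations()` accessor come from this diff.

```rust
use omicron_common::api::internal::nexus::{Migrations, SledInstanceState};

/// Log what a sled-agent reported for one VMM (illustrative helper).
fn log_report(report: &SledInstanceState) {
    println!(
        "VMM {} reported state {:?}",
        report.propolis_id, report.vmm_state,
    );
    // Borrow both optional migration states together.
    let Migrations { migration_in, migration_out } = report.migrations();
    if let Some(m) = migration_in {
        println!("  inbound migration {}: {}", m.migration_id, m.state);
    }
    if let Some(m) = migration_out {
        println!("  outbound migration {}: {}", m.migration_id, m.state);
    }
}
```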

dev-tools/omdb/src/bin/omdb/nexus.rs

Lines changed: 82 additions & 6 deletions
```diff
@@ -929,6 +929,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
             /// number of stale instance metrics that were deleted
             pruned_instances: usize,
 
+            /// update sagas queued due to instance updates.
+            update_sagas_queued: usize,
+
             /// instance states from completed checks.
             ///
             /// this is a mapping of stringified instance states to the number
@@ -970,6 +973,7 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
             ),
             Ok(TaskSuccess {
                 total_instances,
+                update_sagas_queued,
                 pruned_instances,
                 instance_states,
                 failed_checks,
@@ -987,7 +991,7 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
                 for (state, count) in &instance_states {
                     println!(" -> {count} instances {state}")
                 }
-
+                println!(" update sagas queued: {update_sagas_queued}");
                 println!(" failed checks: {total_failures}");
                 for (failure, count) in &failed_checks {
                     println!(" -> {count} {failure}")
@@ -1239,11 +1243,6 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
     } else if name == "lookup_region_port" {
         match serde_json::from_value::<LookupRegionPortStatus>(details.clone())
         {
-            Err(error) => eprintln!(
-                "warning: failed to interpret task details: {:?}: {:?}",
-                error, details
-            ),
-
             Ok(LookupRegionPortStatus { found_port_ok, errors }) => {
                 println!(" total filled in ports: {}", found_port_ok.len());
                 for line in &found_port_ok {
@@ -1255,6 +1254,83 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
                     println!(" > {line}");
                 }
             }
+
+            Err(error) => eprintln!(
+                "warning: failed to interpret task details: {:?}: {:?}",
+                error, details,
+            ),
+        }
+    } else if name == "instance_updater" {
+        #[derive(Deserialize)]
+        struct UpdaterStatus {
+            /// number of instances found with destroyed active VMMs
+            destroyed_active_vmms: usize,
+
+            /// number of instances found with terminated active migrations
+            terminated_active_migrations: usize,
+
+            /// number of update sagas started.
+            sagas_started: usize,
+
+            /// number of sagas completed successfully
+            sagas_completed: usize,
+
+            /// number of sagas which failed
+            sagas_failed: usize,
+
+            /// number of sagas which could not be started
+            saga_start_failures: usize,
+
+            /// the last error that occurred during execution.
+            error: Option<String>,
+        }
+        match serde_json::from_value::<UpdaterStatus>(details.clone()) {
+            Err(error) => eprintln!(
+                "warning: failed to interpret task details: {:?}: {:?}",
+                error, details
+            ),
+            Ok(UpdaterStatus {
+                destroyed_active_vmms,
+                terminated_active_migrations,
+                sagas_started,
+                sagas_completed,
+                sagas_failed,
+                saga_start_failures,
+                error,
+            }) => {
+                if let Some(error) = error {
+                    println!(" task did not complete successfully!");
+                    println!(" most recent error: {error}");
+                }
+
+                println!(
+                    " total instances in need of updates: {}",
+                    destroyed_active_vmms + terminated_active_migrations
+                );
+                println!(
+                    " instances with destroyed active VMMs: {}",
+                    destroyed_active_vmms,
+                );
+                println!(
+                    " instances with terminated active migrations: {}",
+                    terminated_active_migrations,
+                );
+                println!(" update sagas started: {sagas_started}");
+                println!(
+                    " update sagas completed successfully: {}",
+                    sagas_completed,
+                );
+
+                let total_failed = sagas_failed + saga_start_failures;
+                if total_failed > 0 {
+                    println!(" unsuccessful update sagas: {total_failed}");
+                    println!(
+                        " sagas which could not be started: {}",
+                        saga_start_failures
+                    );
+                    println!(" sagas failed: {sagas_failed}");
+                }
+            }
         };
     } else {
         println!(
```
