Skip to content

Commit dfdfe77

Browse files
author
lif
committed
Add some unit tests for sled-agent Instance creation
At time of writing, instance creation roughly looks like: - nexus -> sled-agent: `instance_put_state` - sled-agent: `InstanceManager::ensure_state` - sled-agent: `Instance::propolis_ensure` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: `Instance::setup_propolis_locked` (*blocking!*) - `RunningZone::install` and `Zones::boot` - `illumos_utils::svc::wait_for_service` - `self::wait_for_http_server` for propolis-server itself - sled-agent: `Instance::ensure_propolis_and_tasks` - sled-agent: spawn `Instance::monitor_state_task` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: return ok result - nexus: `handle_instance_put_result` Or at least, it does in the happy path. omicron#3927 saw propolis zone creation take longer than the minute nexus's call to sled-agent's `instance_put_state`. That might've looked something like: - nexus -> sled-agent: `instance_put_state` - sled-agent: `InstanceManager::ensure_state` - sled-agent: `Instance::propolis_ensure` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: `Instance::setup_propolis_locked` (*blocking!*) - `RunningZone::install` and `Zones::boot` - nexus: i've been waiting a whole minute for this. connection timeout! - nexus: `handle_instance_put_result` - sled-agent: [...] return... oh, they hung up. :( To avoid this timeout being implicit at the *Dropshot configuration* layer (that is to say, we should still have *some* timeout), we could consider a small refactor to make `instance_put_state` not a blocking call -- especially since it's already sending nexus updates on its progress via out-of-band `cpapi_instances_put` calls! That might look something like: - nexus -> sled-agent: `instance_put_state` - sled-agent: `InstanceManager::ensure_state` - sled-agent: spawn { - sled-agent: `Instance::propolis_ensure` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent: `Instance::setup_propolis_locked` (blocking!) - sled-agent: `Instance::ensure_propolis_and_tasks` - sled-agent: spawn `Instance::monitor_state_task` - sled-agent -> nexus: `cpapi_instances_put` (if not migrating) - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call (With a way for nexus to cancel an instance creation by ID, and a timeout in sled-agent itself for terminating the attempt and reporting the failure back to nexus, and a shorter threshold for logging the event of an instance creation taking a long time.) Before such a change, though, we should really have some more tests around sled-agent's instance creation code at all! So here's a few.
1 parent 0c41f95 commit dfdfe77

File tree

4 files changed

+497
-3
lines changed

4 files changed

+497
-3
lines changed

sled-agent/src/fakes/nexus.rs

Lines changed: 38 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -8,13 +8,18 @@
88
//! to operate correctly.
99
1010
use dropshot::{
11-
endpoint, ApiDescription, FreeformBody, HttpError, HttpResponseOk, Path,
12-
RequestContext,
11+
endpoint, ApiDescription, FreeformBody, HttpError, HttpResponseOk,
12+
HttpResponseUpdatedNoContent, Path, RequestContext, TypedBody,
1313
};
1414
use hyper::Body;
1515
use internal_dns::ServiceName;
1616
use omicron_common::api::external::Error;
17-
use omicron_common::api::internal::nexus::UpdateArtifactId;
17+
use omicron_common::api::internal::nexus::{
18+
SledInstanceState, UpdateArtifactId,
19+
};
20+
use schemars::JsonSchema;
21+
use serde::Deserialize;
22+
use uuid::Uuid;
1823

1924
/// Implements a fake Nexus.
2025
///
@@ -28,6 +33,13 @@ pub trait FakeNexusServer: Send + Sync {
2833
) -> Result<Vec<u8>, Error> {
2934
Err(Error::internal_error("Not implemented"))
3035
}
36+
fn cpapi_instances_put(
37+
&self,
38+
_instance_id: Uuid,
39+
_new_runtime_state: SledInstanceState,
40+
) -> Result<(), Error> {
41+
Err(Error::internal_error("Not implemented"))
42+
}
3143
}
3244

3345
/// Describes the server context type.
@@ -52,9 +64,32 @@ async fn cpapi_artifact_download(
5264
))
5365
}
5466

67+
#[derive(Deserialize, JsonSchema)]
68+
struct InstancePathParam {
69+
instance_id: Uuid,
70+
}
71+
72+
#[endpoint {
73+
method = PUT,
74+
path = "/instances/{instance_id}",
75+
}]
76+
async fn cpapi_instances_put(
77+
request_context: RequestContext<ServerContext>,
78+
path_params: Path<InstancePathParam>,
79+
new_runtime_state: TypedBody<SledInstanceState>,
80+
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
81+
let context = request_context.context();
82+
context.cpapi_instances_put(
83+
path_params.into_inner().instance_id,
84+
new_runtime_state.into_inner(),
85+
)?;
86+
Ok(HttpResponseUpdatedNoContent())
87+
}
88+
5589
fn api() -> ApiDescription<ServerContext> {
5690
let mut api = ApiDescription::new();
5791
api.register(cpapi_artifact_download).unwrap();
92+
api.register(cpapi_instances_put).unwrap();
5893
api
5994
}
6095

0 commit comments

Comments
 (0)