Cut a few seconds off of every test using Cockroach (#6934)

smklein · web-flow · commit dceed974ee3b · 2024-10-26T17:07:58.000-07:00
This one has some history!

While working on some database-related APIs, I noticed that my tests using CockroachDB were a **lot** slower than other tests. To some degree, I expect this, but also, this was on the order of ~5-10 seconds per test doing very little other than CockroachDB setup and teardown.

After doing a little profiling, I noticed that tests took several seconds to perform teardown, which increases significantly if any schema changes occurred.

Why does teardown take so long? Well, it turns out we are sending `SIGTERM` to the CockroachDB process to gracefully terminate it, instead of `SIGKILL`, which would terminate it much more abruptly.

This is where the history comes in: Gracefully terminating CockroachDB was a choice we made a few years ago to avoid a test flake: #540. Basically, when creating the "seed database" -- a database where we apply schema changes that are shared between many tests -- we want to gracefully terminate to avoid leaving the database in a "dirty state", where it might need to flush work and cleanup intermediate state. In the case of #540, that "dirty intermediate state" was an absolute path, which meant copies of that seed database trampled on each other if graceful shutdown did not complete.

Our approach was to apply graceful termination to all CockroachDB teardown invocations, but this was overkill. Only the seed database expects to have storage be in-use after the call to `cleanup` -- all other test-only invocations expect to immediately remove their storage. They don't need to terminate gracefully, and arguably, should just exit as quickly as they can.

This PR changes the disposition:

- `cleanup_gracefully` uses `SIGTERM`, and waits for graceful cleanup. This is still used when constructing the seed db.
- `cleanup` uses `SIGKILL`, and kills the database immediately. This is now used for all other use-cases.
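The split between the two signals can be sketched outside of this codebase. The following is a minimal, std-only illustration and not the code in this PR: `stop` and `spawn_sleeper` are hypothetical helpers, `sleep 30` stands in for the CockroachDB process, and the `SIGTERM` path shells out to the `kill` utility (assumed to be on `PATH`) because the standard library only exposes `SIGKILL` via `Child::kill`. It assumes a Unix host.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt;
use std::process::{Child, Command};

/// Which signal to stop the child with (same shape as the enum in the diff).
#[derive(Copy, Clone, Debug)]
enum Signal {
    Kill,
    Terminate,
}

/// Hypothetical helper: sends `signal` to `child`, waits for it to exit, and
/// returns the number of the signal that ended the process, if any.
fn stop(mut child: Child, signal: Signal) -> io::Result<Option<i32>> {
    match signal {
        // On Unix, `Child::kill` sends SIGKILL: no chance to flush state.
        Signal::Kill => child.kill()?,
        // std has no SIGTERM API, so this sketch invokes `kill -TERM <pid>`.
        Signal::Terminate => {
            let status = Command::new("kill")
                .arg("-TERM")
                .arg(child.id().to_string())
                .status()?;
            if !status.success() {
                return Err(io::Error::new(
                    io::ErrorKind::Other,
                    "failed to send SIGTERM",
                ));
            }
        }
    }
    // Always reap the child before removing any storage it was using.
    Ok(child.wait()?.signal())
}

/// Hypothetical stand-in for a long-running database process.
fn spawn_sleeper() -> io::Result<Child> {
    Command::new("sleep").arg("30").spawn()
}

fn main() -> io::Result<()> {
    let killed = stop(spawn_sleeper()?, Signal::Kill)?;
    let terminated = stop(spawn_sleeper()?, Signal::Terminate)?;
    println!("SIGKILL path ended with signal {killed:?}");
    println!("SIGTERM path ended with signal {terminated:?}");
    Ok(())
}
```

A process like `sleep` exits immediately on either signal. The cost difference described above comes from CockroachDB handling `SIGTERM` by doing shutdown work before it exits, which is the time `wait` ends up absorbing.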
As an example of the performance difference, here's a comparison for some datastore tests:

## Before

```
SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test
PASS [ 6.996s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
PASS [ 7.344s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
PASS [ 8.609s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
Summary [ 11.386s] 3 tests run: 3 passed, 228 skipped
```

## After

```
SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test
PASS [ 2.087s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
PASS [ 3.054s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
PASS [ 4.442s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
Summary [ 7.550s] 3 tests run: 3 passed, 228 skipped
```

diff --git a/dev-tools/db-dev/src/main.rs b/dev-tools/db-dev/src/main.rs
@@ -137,7 +137,6 @@ impl DbRunArgs {
         // receive SIGINT.
         tokio::select! {
             _ = db_instance.wait_for_shutdown() => {
-                db_instance.cleanup().await.context("clean up after shutdown")?;
                 bail!(
                     "db-dev: database shut down unexpectedly \
                     (see error output above)"
diff --git a/test-utils/src/dev/db.rs b/test-utils/src/dev/db.rs
@@ -514,6 +514,21 @@ pub enum CockroachStartError {
     },
 }
 
+#[derive(Copy, Clone, Debug)]
+enum Signal {
+    Kill,
+    Terminate,
+}
+
+impl From<Signal> for libc::c_int {
+    fn from(signal: Signal) -> Self {
+        match signal {
+            Signal::Kill => libc::SIGKILL,
+            Signal::Terminate => libc::SIGTERM,
+        }
+    }
+}
+
 /// Manages a CockroachDB process running as a single-node cluster
 ///
 /// You are **required** to invoke [`CockroachInstance::wait_for_shutdown()`] or
@@ -578,7 +593,8 @@ impl CockroachInstance {
         client.cleanup().await.context("cleaning up after wipe")
     }
 
-    /// Waits for the child process to exit
+    /// Waits for the child process to exit, and cleans up its temporary
+    /// storage.
     ///
     /// Note that CockroachDB will normally run forever unless the caller
     /// arranges for it to be shutdown.
@@ -593,24 +609,43 @@ impl CockroachInstance {
                 .await;
         }
         self.child_process = None;
+
+        // It shouldn't really matter which cleanup API we use, since
+        // the child process is gone anyway.
         self.cleanup().await
     }
 
-    /// Cleans up the child process and temporary directory
+    /// Gracefully cleans up the child process and temporary directory
+    ///
+    /// If the child process is still running, it will be killed with SIGTERM and
+    /// this function will wait for it to exit.  Then the temporary directory
+    /// will be cleaned up.
+    pub async fn cleanup_gracefully(&mut self) -> Result<(), anyhow::Error> {
+        self.cleanup_inner(Signal::Terminate).await
+    }
+
+    /// Quickly cleans up the child process and temporary directory
     ///
     /// If the child process is still running, it will be killed with SIGKILL and
     /// this function will wait for it to exit.  Then the temporary directory
     /// will be cleaned up.
     pub async fn cleanup(&mut self) -> Result<(), anyhow::Error> {
-        // SIGTERM the process and wait for it to exit so that we can remove the
+        self.cleanup_inner(Signal::Kill).await
+    }
+
+    async fn cleanup_inner(
+        &mut self,
+        signal: Signal,
+    ) -> Result<(), anyhow::Error> {
+        // Kill the process and wait for it to exit so that we can remove the
         // temporary directory that we may have used to store its data.  We
         // don't care what the result of the process was.
         if let Some(child_process) = self.child_process.as_mut() {
             let pid = child_process.id().expect("Missing child PID") as i32;
             let success =
-                0 == unsafe { libc::kill(pid as libc::pid_t, libc::SIGTERM) };
+                0 == unsafe { libc::kill(pid as libc::pid_t, signal.into()) };
             if !success {
-                bail!("Failed to send SIGTERM to DB");
+                bail!("Failed to send {signal:?} to DB");
             }
             child_process.wait().await.context("waiting for child process")?;
             self.child_process = None;
diff --git a/test-utils/src/dev/seed.rs b/test-utils/src/dev/seed.rs
@@ -176,7 +176,7 @@ pub async fn test_setup_database_seed(
     )
     .await
     .context("failed to setup database")?;
-    db.cleanup().await.context("failed to cleanup database")?;
+    db.cleanup_gracefully().await.context("failed to cleanup database")?;
 
     // See https://github.com/cockroachdb/cockroach/issues/74231 for context on
     // this. We use this assertion to check that our seed directory won't point
