Skip to content

Commit dceed97

Browse files
authored
Cut a few seconds off of every test using Cockroach (#6934)
This one has some history! While working on some database-related APIs, I noticed that my tests using CockroachDB were a **lot** slower than other tests. To some degree, I expect this, but also, this was on the order of ~5-10 seconds per test doing very little other than CockroachDB setup and teardown. After doing a little profiling, I noticed that tests took several seconds to perform teardown, which increases significantly if any schema changes occurred. Why does teardown take so long? Well, it turns out we are sending `SIGTERM` to the CockroachDB process to gracefully terminate it, instead of `SIGKILL`, which would terminate it much more abruptly. This is where the history comes in: Gracefully terminating CockroachDB was a choice we made a few years ago to avoid a test flake: #540. Basically, when creating the "seed database" -- a database where we apply schema changes that are shared between many tests -- we want to gracefully terminate to avoid leaving the database in a "dirty state", where it might need to flush work and cleanup intermediate state. In the case of #540, that "dirty intermediate state" was an absolute path, which meant copies of that seed database trampled on each other if graceful shutdown did not complete. Our approach was to apply graceful termination to all CockroachDB teardown invocations, but this was overkill. Only the seed database expects to have storage be in-use after the call to `cleanup` -- all other test-only invocations expect to immediately remove their storage. They don't need to terminate gracefully, and arguably, should just exit as quickly as they can. This PR changes the disposition: - `cleanup_gracefully` uses `SIGTERM`, and waits for graceful cleanup. This is still used when constructing the seed db. - `cleanup` uses `SIGKILL`, and kills the database immediately. This is now used for all other use-cases. As an example in the performance difference, here's a comparison for some datastore tests: ## Before ``` SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test PASS [ 6.996s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version PASS [ 7.344s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress PASS [ 8.609s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ------------ Summary [ 11.386s] 3 tests run: 3 passed, 228 skipped ``` ## After ``` SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test PASS [ 2.087s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version PASS [ 3.054s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress PASS [ 4.442s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ------------ Summary [ 7.550s] 3 tests run: 3 passed, 228 skipped ```
1 parent 38a34e1 commit dceed97

File tree

3 files changed

+41
-7
lines changed

3 files changed

+41
-7
lines changed

dev-tools/db-dev/src/main.rs

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -137,7 +137,6 @@ impl DbRunArgs {
137137
// receive SIGINT.
138138
tokio::select! {
139139
_ = db_instance.wait_for_shutdown() => {
140-
db_instance.cleanup().await.context("clean up after shutdown")?;
141140
bail!(
142141
"db-dev: database shut down unexpectedly \
143142
(see error output above)"

test-utils/src/dev/db.rs

Lines changed: 40 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -514,6 +514,21 @@ pub enum CockroachStartError {
514514
},
515515
}
516516

517+
#[derive(Copy, Clone, Debug)]
518+
enum Signal {
519+
Kill,
520+
Terminate,
521+
}
522+
523+
impl From<Signal> for libc::c_int {
524+
fn from(signal: Signal) -> Self {
525+
match signal {
526+
Signal::Kill => libc::SIGKILL,
527+
Signal::Terminate => libc::SIGTERM,
528+
}
529+
}
530+
}
531+
517532
/// Manages a CockroachDB process running as a single-node cluster
518533
///
519534
/// You are **required** to invoke [`CockroachInstance::wait_for_shutdown()`] or
@@ -578,7 +593,8 @@ impl CockroachInstance {
578593
client.cleanup().await.context("cleaning up after wipe")
579594
}
580595

581-
/// Waits for the child process to exit
596+
/// Waits for the child process to exit, and cleans up its temporary
597+
/// storage.
582598
///
583599
/// Note that CockroachDB will normally run forever unless the caller
584600
/// arranges for it to be shutdown.
@@ -593,24 +609,43 @@ impl CockroachInstance {
593609
.await;
594610
}
595611
self.child_process = None;
612+
613+
// It shouldn't really matter which cleanup API we use, since
614+
// the child process is gone anyway.
596615
self.cleanup().await
597616
}
598617

599-
/// Cleans up the child process and temporary directory
618+
/// Gracefully cleans up the child process and temporary directory
619+
///
620+
/// If the child process is still running, it will be killed with SIGTERM and
621+
/// this function will wait for it to exit. Then the temporary directory
622+
/// will be cleaned up.
623+
pub async fn cleanup_gracefully(&mut self) -> Result<(), anyhow::Error> {
624+
self.cleanup_inner(Signal::Terminate).await
625+
}
626+
627+
/// Quickly cleans up the child process and temporary directory
600628
///
601629
/// If the child process is still running, it will be killed with SIGKILL and
602630
/// this function will wait for it to exit. Then the temporary directory
603631
/// will be cleaned up.
604632
pub async fn cleanup(&mut self) -> Result<(), anyhow::Error> {
605-
// SIGTERM the process and wait for it to exit so that we can remove the
633+
self.cleanup_inner(Signal::Kill).await
634+
}
635+
636+
async fn cleanup_inner(
637+
&mut self,
638+
signal: Signal,
639+
) -> Result<(), anyhow::Error> {
640+
// Kill the process and wait for it to exit so that we can remove the
606641
// temporary directory that we may have used to store its data. We
607642
// don't care what the result of the process was.
608643
if let Some(child_process) = self.child_process.as_mut() {
609644
let pid = child_process.id().expect("Missing child PID") as i32;
610645
let success =
611-
0 == unsafe { libc::kill(pid as libc::pid_t, libc::SIGTERM) };
646+
0 == unsafe { libc::kill(pid as libc::pid_t, signal.into()) };
612647
if !success {
613-
bail!("Failed to send SIGTERM to DB");
648+
bail!("Failed to send {signal:?} to DB");
614649
}
615650
child_process.wait().await.context("waiting for child process")?;
616651
self.child_process = None;

test-utils/src/dev/seed.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -176,7 +176,7 @@ pub async fn test_setup_database_seed(
176176
)
177177
.await
178178
.context("failed to setup database")?;
179-
db.cleanup().await.context("failed to cleanup database")?;
179+
db.cleanup_gracefully().await.context("failed to cleanup database")?;
180180

181181
// See https://github.com/cockroachdb/cockroach/issues/74231 for context on
182182
// this. We use this assertion to check that our seed directory won't point

0 commit comments

Comments
 (0)