deploy 5 CockroachDB nodes from RSS #3450

Conversation
This needs a bit more work. Sled Agent makes some assumptions that either need to be relaxed or worked around (e.g., that there will only be one CockroachDB dataset in a given zpool).

Is this still an issue? I saw you bumped the number of zpools in the simulated environment from three to five, which looks like the right call to me.

No, this is fixed with a change to the dependent PR (#3446) to allocate from separate zpools when possible, plus a change in this PR to set up more virtual zpools.
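The "allocate from separate zpools when possible" behavior could be sketched roughly as follows. This is a hypothetical illustration, not the actual #3446 code; the function name and pool names are made up. The idea is simply round-robin placement: each dataset lands on its own zpool, and pools are reused only when there are fewer pools than datasets.

```rust
// Hypothetical sketch (not the real sled-agent allocator): choose a
// target zpool for each of `n` datasets, preferring distinct pools and
// doubling up only when there are fewer pools than datasets.
fn choose_zpools(pools: &[&str], n: usize) -> Vec<String> {
    (0..n).map(|i| pools[i % pools.len()].to_string()).collect()
}

fn main() {
    let pools = ["oxp_a", "oxp_b", "oxp_c", "oxp_d", "oxp_e"];
    // With 5 pools and 5 CockroachDB nodes, every dataset gets its own pool.
    let placement = choose_zpools(&pools, 5);
    println!("{placement:?}");
}
```

With only 3 pools, the same logic would place 5 datasets by reusing two pools, which is why the simulated environment was bumped to 5 zpools.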
@@ -47,7 +47,7 @@ function ensure_zpools {
     echo "Zpool: [$ZPOOL]"
     VDEV_PATH="$OMICRON_TOP/$ZPOOL.vdev"
     if ! [[ -f "$VDEV_PATH" ]]; then
-        dd if=/dev/zero of="$VDEV_PATH" bs=1 count=0 seek=10G
+        dd if=/dev/zero of="$VDEV_PATH" bs=1 count=0 seek=6G
I tried to keep the total space used by the zpools the same. It used to be 3 x 10G; now it's 5 x 6G (30G either way).
"oxp_14b4dc87-ab46-49fb-a4b4-d361ae214c03", | ||
"oxp_24b4dc87-ab46-49fb-a4b4-d361ae214c03", |
We need two more pools in the system in order to deploy a 5-node CockroachDB cluster.
@@ -240,7 +240,7 @@ pub async fn build_client() -> Result<oxide_client::Client> {
         })
     },
     &Duration::from_secs(1),
-    &Duration::from_secs(300),
+    &Duration::from_secs(600),
We're deploying two extra CockroachDB nodes, and it can take a little longer for the system to set up now.
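The constants being changed feed a poll-until-ready retry helper: check a condition once per interval until it succeeds or a deadline passes. A minimal synchronous sketch of that pattern (not the actual omicron helper, whose exact signature may differ; the real call sites use a 1-second interval and the 600-second cap being added here):

```rust
use std::time::{Duration, Instant};

// Sketch of a poll-with-timeout helper: evaluate `cond` once per
// `interval` until it returns true (Ok) or `timeout` elapses (Err).
fn wait_for<F: FnMut() -> bool>(
    mut cond: F,
    interval: Duration,
    timeout: Duration,
) -> Result<(), ()> {
    let deadline = Instant::now() + timeout;
    loop {
        if cond() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(());
        }
        std::thread::sleep(interval);
    }
}

fn main() {
    // Simulate a service that becomes ready on the third poll. Short
    // durations stand in for the real 1s interval / 600s timeout.
    let mut attempts = 0;
    let result = wait_for(
        || {
            attempts += 1;
            attempts >= 3
        },
        Duration::from_millis(10),
        Duration::from_millis(500),
    );
    assert!(result.is_ok());
}
```

Doubling only the overall timeout (not the poll interval) keeps the happy path just as fast while giving the larger cluster more headroom to finish bringup.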
I can look into parallelizing service bringup on the sled-agent side, if that would help mitigate?
I expect it would. No pressure on my part in terms of urgency.
Sounds good. Let's not block this PR on it, but I can do this as a follow-up.
Looks great!
 * Configure a replication factor of 5 to ensure that the system can maintain
 * availability in the face of any two node failures.
 */
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
For a local testing CRDB -- e.g., one launched with the omicron-dev tool -- what does this do?
Good question. I expect it will have no practical impact. The default replication factor is 3, which is already more nodes than we ever have today. The impact is that CockroachDB considers all of its data under-replicated. It reports it as such (e.g., here) and if new nodes show up, it will try to replicate data until nothing is under-replicated.
One could imagine that this would instead disallow you from writing any data to the cluster when it can't satisfy this constraint. That doesn't seem to be how it works. If it did, the tests wouldn't run at all, I think.
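The availability claim in the SQL comment above follows from majority-quorum arithmetic: a range with n replicas stays writable as long as a majority survives, so it tolerates floor((n - 1) / 2) node failures. A tiny sketch of that calculation:

```rust
// Majority-quorum fault tolerance: a range with `replicas` copies can
// lose floor((replicas - 1) / 2) nodes and still form a majority.
fn tolerated_failures(replicas: u32) -> u32 {
    (replicas - 1) / 2
}

fn main() {
    // The default factor of 3 tolerates 1 failure; 5 tolerates 2,
    // matching the "any two node failures" goal of this PR.
    println!("3 replicas tolerate {} failure(s)", tolerated_failures(3));
    println!("5 replicas tolerate {} failure(s)", tolerated_failures(5));
}
```

This is also why under-replication on a small test cluster is merely reported rather than fatal: writes still succeed with the replicas that exist, and CockroachDB up-replicates as nodes become available.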
Depends on #3446. Replaces #2956. Fixes #727.
This changes RSS to deploy 5 CockroachDB nodes and sets the default replication factor to 5. Quoting the CockroachDB production checklist: