Robustness test flake due to error when reading WAL #19674
Comments
I'll take this up
/assign
I tested it locally, and it is fine to swallow errors when reading the WAL file and return the last valid entries. The question remains why the entries differ when only the first 935 entries are compared.
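To make "swallow the error and return the last valid entries" concrete, here is a minimal sketch of that pattern. It is not the etcd WAL code: `Entry`, `errCorrupt`, `readValidPrefix`, and `sliceReader` are hypothetical names used only for illustration, and corruption is simulated with an empty record.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry is a hypothetical stand-in for a WAL record (not etcd's raftpb.Entry).
type Entry struct {
	Index uint64
	Data  string
}

// errCorrupt is a hypothetical error standing in for a WAL decode failure.
var errCorrupt = errors.New("corrupt record")

// readValidPrefix calls next until it reports an error or the end of the log.
// It returns every entry decoded before that point together with the error
// (nil on a clean end), so the caller can decide whether to swallow it.
func readValidPrefix(next func() (Entry, bool, error)) ([]Entry, error) {
	var entries []Entry
	for {
		e, ok, err := next()
		if err != nil {
			return entries, err
		}
		if !ok {
			return entries, nil
		}
		entries = append(entries, e)
	}
}

// sliceReader builds a next function over in-memory records.
// A record with empty Data simulates a corrupted entry.
func sliceReader(records []Entry) func() (Entry, bool, error) {
	i := 0
	return func() (Entry, bool, error) {
		if i >= len(records) {
			return Entry{}, false, nil
		}
		r := records[i]
		i++
		if r.Data == "" {
			return Entry{}, false, errCorrupt
		}
		return r, true, nil
	}
}

func main() {
	records := []Entry{{1, "a"}, {2, "b"}, {3, ""}, {4, "d"}}
	entries, err := readValidPrefix(sliceReader(records))
	// The caller keeps the valid prefix and only logs the error.
	fmt.Println("valid entries:", len(entries), "error:", err)
}
```

Returning the error alongside the prefix, rather than dropping it inside the reader, keeps the decision to swallow it in the caller.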
With #19684, you can see that for member 0, 1,
for member 2
The last entry in member 2 is uncommitted. How about just skipping the member if
It should work for robustness; however, I don't understand why we return an error when there are uncommitted entries.
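One way to keep an uncommitted tail from affecting the comparison, sketched under my own assumptions rather than taken from the robustness test code, is to trim each member's entries to the commit index before comparing. `Entry`, `committedPrefix`, and `firstDivergence` are hypothetical names, and entries are assumed to be sorted by index.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for a WAL record (not etcd's raftpb.Entry).
type Entry struct {
	Index uint64
	Data  string
}

// committedPrefix returns the entries with Index <= commitIndex.
// It assumes entries are sorted by Index in ascending order.
func committedPrefix(entries []Entry, commitIndex uint64) []Entry {
	n := 0
	for n < len(entries) && entries[n].Index <= commitIndex {
		n++
	}
	return entries[:n]
}

// firstDivergence returns the first position at which a and b differ within
// their common length, or -1 if they agree over that range.
func firstDivergence(a, b []Entry) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	return -1
}

func main() {
	member0 := []Entry{{1, "a"}, {2, "b"}, {3, "c"}}
	member2 := []Entry{{1, "a"}, {2, "b"}, {3, "x"}} // last entry uncommitted
	const commitIndex = 2

	fmt.Println("raw divergence at:", firstDivergence(member0, member2))
	fmt.Println("committed divergence at:",
		firstDivergence(committedPrefix(member0, commitIndex), committedPrefix(member2, commitIndex)))
}
```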
Which Github Action / Prow Jobs are flaking?
ci-etcd-robustness-main-arm64
Which tests are flaking?
Robustness test
Github Action / Prow Job link
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640
Reason for failure (if possible)
Issue investigated as part of robustness tests meeting on March 26th.
Failed scenario https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640
Things we confirmed:
etcd-dump-log is able to read the WAL without a problem and returned commitIndex 2936.
An error occurred when reading WAL entries, "wal: slice bounds out of range", implying that entry 2448 is corrupted, even when etcd-dump-log worked.
ErrSliceOutOfRange is swallowed.
From experience, I have frequently observed errors when reading the WAL in a 3-member cluster. I expect that a failpoint causing etcd to crash might be disrupting WAL writes, resulting in a corrupted state. This is still OK as long as it happens to just one member and the remaining two are OK.
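The "one corrupted member is tolerable" condition can be written as a majority check over per-member read results. This is only a sketch of the reasoning, not the test's actual logic; `quorum`, `enoughReadable`, and `errRead` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errRead is a hypothetical error standing in for a failed WAL read.
var errRead = errors.New("wal read failed")

// quorum returns the majority size for a cluster of n members.
func quorum(n int) int {
	return n/2 + 1
}

// enoughReadable reports whether the members whose WAL was read without
// error still form a majority. readErrs holds one result per member.
func enoughReadable(readErrs []error) bool {
	ok := 0
	for _, err := range readErrs {
		if err == nil {
			ok++
		}
	}
	return ok >= quorum(len(readErrs))
}

func main() {
	// One of three members failed to read its WAL: still a majority.
	fmt.Println(enoughReadable([]error{nil, nil, errRead}))
	// Two of three failed: no majority left to trust.
	fmt.Println(enoughReadable([]error{nil, errRead, errRead}))
}
```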
Things to do:
Anything else we need to know?
No response