
Robustness test flake due to error when reading WAL #19674

Open
serathius opened this issue Mar 26, 2025 · 5 comments

Comments

@serathius
Member

Which Github Action / Prow Jobs are flaking?

ci-etcd-robustness-main-arm64

Which tests are flaking?

Robustness test

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640

Reason for failure (if possible)

Issue investigated as part of robustness tests meeting on March 26th.

Failed scenario https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640

unexpected differences between wal entries, diff:
        	            	  []model.EtcdRequest{
        	            	  	... // 922 identical elements
        	            	  	{Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key2", Value: {Value: "925"}}}}}},
        	            	  	{Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key7", Value: {Hash: 1602287042}}}}}},
        	            	  	{
        	            	  		... // 2 identical fields
        	            	  		LeaseRevoke: nil,
        	            	  		Range:       nil,
        	            	  		Txn: &model.TxnRequest{
        	            	  			Conditions: nil,
        	            	  			OperationsOnSuccess: []model.EtcdOperation{
        	            	  				{
        	            	  					Type:  "put-operation",
        	            	  					Range: {},
        	            	  					Put: model.PutOptions{
        	            	- 						Key: "key5",
        	            	+ 						Key: "key7",
        	            	  						Value: model.ValueOrHash{
        	            	- 							Value: "",
        	            	+ 							Value: "927",
        	            	- 							Hash:  876259569,
        	            	+ 							Hash:  0,
        	            	  						},
        	            	  						LeaseID: 0,
        	            	  					},
        	            	  					Delete: {},
        	            	  				},
        	            	  			},
        	            	  			OperationsOnFailure: nil,
        	            	  		},
        	            	  		Defragment: nil,
        	            	  		Compact:    nil,
        	            	  	},
        	            	  }

Things we confirmed:

  • It's not a bug in etcd; the history is linearizable if we disable reading the WAL.
  • etcd-dump-log is able to read the WAL without a problem and returned commitIndex 2936.
  • It's not an issue with comparing uncommitted entries; however, the current logic doesn't support them.
  • Reading the WAL in the robustness tests fails with Error occurred when reading WAL entries wal: slice bounds out of range, implying that entry 2448 is corrupted, even though etcd-dump-log works.
  • The ErrSliceOutOfRange error is swallowed.

From experience, I have frequently observed errors when reading the WAL in a 3-member cluster. I expect that the failpoint causing the etcd crash might disrupt writing the WAL, resulting in a corrupted state. This is still OK as long as it happens to just one member and the remaining two are fine.

Things to do:

  • Don't swallow errors when reading the WAL file.
  • Skip uncommitted entries by breaking the loop once entry.Index > hardState.Commit.
  • Investigate why etcd-dump-log reads the WAL without a problem.
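The uncommitted-entry skip from the list above can be sketched as follows. This is a minimal model (Entry and HardState stand in for the raftpb types), not the actual robustness-test code:

```go
package main

import "fmt"

// Entry and HardState are minimal stand-ins for the raftpb types; only the
// fields needed to illustrate the skip are modeled.
type Entry struct {
	Index uint64
}

type HardState struct {
	Commit uint64
}

// committedEntries stops at the first entry past hardState.Commit, so the
// comparison never touches the uncommitted tail of the WAL.
func committedEntries(ents []Entry, hs HardState) []Entry {
	var out []Entry
	for _, e := range ents {
		if e.Index > hs.Commit {
			break // uncommitted from here on; stop reading
		}
		out = append(out, e)
	}
	return out
}

func main() {
	// Entries 933-935, but only up to index 934 is committed.
	ents := []Entry{{Index: 933}, {Index: 934}, {Index: 935}}
	fmt.Println(len(committedEntries(ents, HardState{Commit: 934}))) // prints 2
}
```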

Anything else we need to know?

No response

@abdurrehman107
Member

I'll take this up.
cc @serathius

@abdurrehman107
Member

/assign

@siyuanfoundation
Contributor

siyuanfoundation commented Mar 26, 2025

I tested it locally, and etcd-dump-log fails with the same error Error occurred when reading WAL entries wal: slice bounds out of range for \server-TestRobustnessExploratoryEtcdHighTrafficClusterOfSize3-test-2\. This is expected if the node was recovered and received a snapshot from its peers.

It is fine to swallow errors when reading the WAL file and return the last valid entries. The question remains: why are there different entries when comparing only the first 935 entries?

@siyuanfoundation
Contributor

With #19684, you can see that for members 0 and 1,

   3	       933	norm	header:<ID:62089223431686296 > put:<key:"key7" value:"htmtapc...
   5	       934	norm
   5	       935	norm	header:<ID:6577953459306661678 > put:<key:"key7" value:"927" >
   5	       936	norm	header:<ID:6577953459306661680 > put:<key:"key9" value:"928" >
   5	       937	norm	header:<ID:6577953459306661681 > put:<key:"key9" value:"930" >

for member 2

   3	       930	norm	header:<ID:11010058442592651283 > put:<key:"key7" value:"923" >
   3	       931	norm	header:<ID:6577953459306661672 > put:<key:"key8" value:"924" >
   3	       932	norm	header:<ID:62089223431686295 > put:<key:"key2" value:"925" >
   3	       933	norm	header:<ID:62089223431686296 > put:<key:"key7" value:"htmta...
   3	       934	norm	header:<ID:6577953459306661674 > put:<key:"key5" value:"jvco...

The last entry in member 2 is uncommitted.

How about just skipping the member when the slice bounds are out of range? @serathius

@serathius
Member Author

How about just skipping the member if slice bounds out of range?

It should work for the robustness tests; however, I don't understand why we return an error when there are uncommitted entries.
