
Robustness test flake due to error when reading WAL #19674

Open
serathius opened this issue Mar 26, 2025 · 5 comments

Comments

@serathius
Member

Which Github Action / Prow Jobs are flaking?

ci-etcd-robustness-main-arm64

Which tests are flaking?

Robustness test

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640

Reason for failure (if possible)

Issue investigated as part of robustness tests meeting on March 26th.

Failed scenario https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640

unexpected differences between wal entries, diff:
        	            	  []model.EtcdRequest{
        	            	  	... // 922 identical elements
        	            	  	{Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key2", Value: {Value: "925"}}}}}},
        	            	  	{Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key7", Value: {Hash: 1602287042}}}}}},
        	            	  	{
        	            	  		... // 2 identical fields
        	            	  		LeaseRevoke: nil,
        	            	  		Range:       nil,
        	            	  		Txn: &model.TxnRequest{
        	            	  			Conditions: nil,
        	            	  			OperationsOnSuccess: []model.EtcdOperation{
        	            	  				{
        	            	  					Type:  "put-operation",
        	            	  					Range: {},
        	            	  					Put: model.PutOptions{
        	            	- 						Key: "key5",
        	            	+ 						Key: "key7",
        	            	  						Value: model.ValueOrHash{
        	            	- 							Value: "",
        	            	+ 							Value: "927",
        	            	- 							Hash:  876259569,
        	            	+ 							Hash:  0,
        	            	  						},
        	            	  						LeaseID: 0,
        	            	  					},
        	            	  					Delete: {},
        	            	  				},
        	            	  			},
        	            	  			OperationsOnFailure: nil,
        	            	  		},
        	            	  		Defragment: nil,
        	            	  		Compact:    nil,
        	            	  	},
        	            	  }

Things we confirmed:

  • It's not a bug in etcd; the history is linearizable if we disable reading the WAL.
  • etcd-dump-log is able to read the WAL without a problem and returned commitIndex 2936.
  • It's not an issue with comparing uncommitted entries; however, the current logic doesn't support them.
  • Reading the WAL in the robustness tests fails with Error occurred when reading WAL entries wal: slice bounds out of range, implying that entry 2448 is corrupted, even though etcd-dump-log works.
  • The ErrSliceOutOfRange error is swallowed.

From experience, I have frequently observed errors when reading the WAL in a 3-member cluster. I expect that the failpoint causing the etcd crash might disrupt writing the WAL, resulting in a corrupted state. This is still OK as long as it happens to just one member and the remaining two are fine.

Things to do:

  • Don't swallow errors when reading the WAL file.
  • Skip uncommitted entries by breaking the loop once entry.Index > hardState.Commit.
  • Investigate why etcd-dump-log reads the WAL without a problem.
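The uncommitted-entry skip from the list above can be sketched as follows. This is a minimal model (Entry and HardState stand in for the raftpb types), not the actual robustness-test code:

```go
package main

import "fmt"

// Entry and HardState are minimal stand-ins for the raftpb types; only the
// fields needed to illustrate the skip are modeled.
type Entry struct {
	Index uint64
}

type HardState struct {
	Commit uint64
}

// committedEntries stops at the first entry past hardState.Commit, so the
// comparison never touches the uncommitted tail of the WAL.
func committedEntries(ents []Entry, hs HardState) []Entry {
	var out []Entry
	for _, e := range ents {
		if e.Index > hs.Commit {
			break // uncommitted from here on; stop reading
		}
		out = append(out, e)
	}
	return out
}

func main() {
	// Entries 933-935, but only up to index 934 is committed.
	ents := []Entry{{Index: 933}, {Index: 934}, {Index: 935}}
	fmt.Println(len(committedEntries(ents, HardState{Commit: 934}))) // prints 2
}
```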

Anything else we need to know?

No response

@abdurrehman107
Member

I'll take this up.
cc @serathius

@abdurrehman107
Member

/assign

@siyuanfoundation
Contributor

siyuanfoundation commented Mar 26, 2025

I tested it locally, and etcd-dump-log fails with the same error Error occurred when reading WAL entries wal: slice bounds out of range for \server-TestRobustnessExploratoryEtcdHighTrafficClusterOfSize3-test-2\. This is expected if the node was recovered and received a snapshot from its peers.

It is fine to swallow errors when reading the WAL file and return the last valid entries. The question remains: why are there different entries when comparing only the first 935 entries?

@siyuanfoundation
Contributor

With #19684, you can see that for members 0 and 1,

   3	       933	norm	header:<ID:62089223431686296 > put:<key:"key7" value:"htmtapc...
   5	       934	norm
   5	       935	norm	header:<ID:6577953459306661678 > put:<key:"key7" value:"927" >
   5	       936	norm	header:<ID:6577953459306661680 > put:<key:"key9" value:"928" >
   5	       937	norm	header:<ID:6577953459306661681 > put:<key:"key9" value:"930" >

for member 2

   3	       930	norm	header:<ID:11010058442592651283 > put:<key:"key7" value:"923" >
   3	       931	norm	header:<ID:6577953459306661672 > put:<key:"key8" value:"924" >
   3	       932	norm	header:<ID:62089223431686295 > put:<key:"key2" value:"925" >
   3	       933	norm	header:<ID:62089223431686296 > put:<key:"key7" value:"htmta...
   3	       934	norm	header:<ID:6577953459306661674 > put:<key:"key5" value:"jvco...

The last entry in member 2 is uncommitted.

How about just skipping the member when the slice bounds are out of range? @serathius

@serathius
Member Author

How about just skipping the member if slice bounds out of range?

It should work for the robustness tests; however, I don't understand why we return an error when there are uncommitted entries.
