CA+Rancher is deleting random nodes #8015

Open
SnelsSM opened this issue Apr 8, 2025 · 0 comments
Labels: area/cluster-autoscaler, kind/bug

Comments


SnelsSM commented Apr 8, 2025

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

Component version: v1.31.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.31.4+k3s1

What environment is this in?:
Rancher v2.10.2
Harvester v1.4.1

What did you expect to happen?:

When used with Rancher, the cluster autoscaler should not initiate the deletion of nodes that do not meet the scale-down criteria.

What happened instead?:

When used with Rancher, the cluster autoscaler appears to trigger the deletion of a node that should not be deleted. (There is no direct indication of the deletion in its logs, so the logic error seems to be somewhere in between.)
The problem does not occur without the cluster autoscaler.

I have tried different settings, but the result is the same.
The current arguments are:
scale-down-utilization-threshold: 0.01
scale-down-unneeded-time: 1h
scale-down-delay-after-add: 30m
skip-nodes-with-local-storage: true

Even though the pods use emptyDir volumes (with skip-nodes-with-local-storage: true) and their CPU/memory requests keep utilization well above the threshold, it makes no difference: the node is still deleted.
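
To spell out what I expect from these settings: with scale-down-utilization-threshold: 0.01, a node whose requested CPU is ~91% of allocatable (see the eligibility.go line in the debug log below) should never be a scale-down candidate. A minimal sketch of that check in Go, my own illustration rather than the actual autoscaler source:

package main

import "fmt"

// Simplified illustration of the scale-down eligibility check: a node is a
// removal candidate only if both CPU and memory utilization
// (requests / allocatable) are below scale-down-utilization-threshold.
func nodeRemovable(cpuUtil, memUtil, threshold float64) bool {
	return cpuUtil < threshold && memUtil < threshold
}

func main() {
	// Values from the log below: 91.3% CPU requested vs. a 0.01 threshold.
	fmt.Println(nodeRemovable(0.913043, 0.5, 0.01)) // false: the node must stay
}

So by the autoscaler's own eligibility logic the node is unremovable, which is exactly what the log says, and yet the underlying Machine still ends up in a deleting state.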

From the debug log:
2025-04-08 04:02:08.261	I0408 04:02:08.261178       1 rancher_provider.go:170] found pool "worker-dyn" via machine "jenkins-local-2-worker-dyn-wk6x7-s82ff"
2025-04-08 04:02:08.261	I0408 04:02:08.261199       1 eligibility.go:162] Node jenkins-local-2-worker-dyn-wk6x7-s82ff unremovable: cpu requested (91.3043% of allocatable) is above the scale-down utilization threshold
2025-04-08 04:02:08.261	I0408 04:02:08.261229       1 static_autoscaler.go:623] Scale down status: lastScaleUpTime=2025-04-08 03:13:11.691708094 +0000 UTC m=+57967.285067686 lastScaleDownDeleteTime=2025-04-08 04:01:58.155727444 +0000 UTC m=+60893.749087046 lastScaleDownFailTime=2025-04-07 10:07:23.731699145 +0000 UTC m=-3580.674941223 scaleDownForbidden=false scaleDownInCooldown=false
2025-04-08 04:02:08.261	I0408 04:02:08.261490       1 rancher_provider.go:170] found pool "worker-dyn" via machine "jenkins-local-2-worker-dyn-wk6x7-s82ff"
...
2025-04-08 04:02:18.306	I0408 04:02:18.306101       1 request.go:1351] Response Body: {"apiVersion":"cluster.x-k8s.io/v1beta1","items":[{...
"name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","ownerReferences":[{"apiVersion":"cluster.x-k8s.io/v1beta1","blockOwnerDeletion":true,"controller":true,"kind":"MachineSet","name":"jenkins-local-2-worker-dyn-wk6x7","uid":"06f5c8e5-e5da-4448-95e4-ebd5b385201e"}],"resourceVersion":"1306230318","uid":"4a2e5fc6-f85b-4912-9509-d1caac023bc1"},"spec":{"bootstrap":{"configRef":{"apiVersion":"rke.cattle.io/v1","kind":"RKEBootstrap","name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","uid":"23844a56-22f2-47c6-b623-94333add7e33"},"dataSecretName":"jenkins-local-2-worker-dyn-wk6x7-s82ff-machine-bootstrap"},"clusterName":"jenkins-local-2","infrastructureRef":{"apiVersion":"rke-machine.cattle.io/v1","kind":"HarvesterMachine","name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","uid":"4f27e5a9-f0fa-4b82-a469-c23ce8efe370"},"nodeDeletionTimeout":"10s","providerID":"k3s://jenkins-local-2-worker-dyn-wk6x7-s82ff"},"status":{"addresses":[{"address":"10.107.5.181","type":"InternalIP"},{"address":"jenkins-local-2-worker-dyn-wk6x7-s82ff","type":"Hostname"}],"bootstrapReady":true,"conditions":[{"lastTransitionTime":"2025-04-08T02:57:58Z","status":"True","type":"Ready"},{"lastTransitionTime":"2025-04-08T02:57:57Z","status":"True","type":"BootstrapReady"},{"lastTransitionTime":"2025-04-08T04:01:56Z","message":"**deleting server [fleet-default/jenkins-local-2-worker-dyn-wk6x7-s82ff] of kind (HarvesterMachine) for machine jenkins-local-2-worker-dyn-wk6x7-s82ff in infrastructure provider"**,"status":"False","type":"InfrastructureReady"},{"lastTransitionTime":"2025-04-08T04:01:55Z","reason":"Deleting","severity":"Info","status":"False","type":"NodeHealthy"},{"lastTransitionTime":"2025-04-08T02:58:21Z","status":"True","type":"PlanApplied"},{"lastTransitionTime":"2025-04-08T04:01:55Z","status":"True","type":"PreDrainDeleteHookSucceeded"},{"lastTransitionTime":"2025-04-08T04:01:55Z","status":"True","type":"PreTerminateDeleteHookSucceeded"}

How to reproduce it (as minimally and precisely as possible):
Rancher + Harvester as the infrastructure provider + cluster autoscaler.
I haven't been able to find an exact reproduction scenario. It happens randomly.

My assumption is that the cluster autoscaler triggers Rancher to reduce the pool size (although it is not clear why), and Rancher then deletes a completely random node from the pool.
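
If that assumption is correct, one detail may explain the "random" choice: as far as I understand the Cluster API machinery behind Rancher's machine pools, the caller is supposed to mark the specific Machine it wants removed with the cluster.x-k8s.io/delete-machine annotation before lowering the replica count; without that annotation the MachineSet controller picks a victim according to its deletePolicy (Random by default). A toy sketch of that behaviour in Go, my own illustration and not actual Rancher or autoscaler code (machine names are made up):

package main

import (
	"fmt"
	"math/rand"
)

const deleteMachineAnnotation = "cluster.x-k8s.io/delete-machine"

type machine struct {
	name        string
	annotations map[string]string
}

// pickVictim loosely mimics how a MachineSet controller could choose which
// machine to remove on scale-down: machines carrying the delete annotation
// are removed first; otherwise the default policy effectively picks an
// arbitrary machine from the pool.
func pickVictim(pool []machine) machine {
	for _, m := range pool {
		if _, ok := m.annotations[deleteMachineAnnotation]; ok {
			return m
		}
	}
	return pool[rand.Intn(len(pool))] // no annotation: effectively random
}

func main() {
	pool := []machine{
		{name: "worker-dyn-aaaaa", annotations: map[string]string{}},
		{name: "worker-dyn-bbbbb", annotations: map[string]string{}},
		{name: "worker-dyn-ccccc", annotations: map[string]string{}}, // imagine this one is 91% utilized
	}

	fmt.Println("victim without annotation:", pickVictim(pool).name)

	// What I would expect a scale-down to do: mark the machine it actually
	// evaluated as unneeded, then lower the replica count.
	pool[0].annotations[deleteMachineAnnotation] = "true"
	fmt.Println("victim with annotation:   ", pickVictim(pool).name)
}

If the pool size is reduced without the evaluated machine being marked first, the controller's choice would look exactly like the random deletions I am seeing.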

SnelsSM added the kind/bug label on Apr 8, 2025