Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
Component version: v1.31.0
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.31.4+k3s1
What environment is this in?:
Rancher v2.10.2
Harvester v1.4.1
What did you expect to happen?:
When used with Rancher, the cluster autoscaler should not initiate the deletion of nodes that it itself considers unremovable (for example, a node whose requested CPU is above the scale-down utilization threshold).
What happened instead?:
When used with Rancher, the cluster autoscaler ends up triggering the deletion of a node that should not be deleted (there is no log line that directly announces the deletion, so the logical error appears to be somewhere in between).
The problem does not occur without the cluster autoscaler.
I have tried different settings, but the result is always the same.
The current arguments are as follows (a sketch of the full invocation is shown after the list):
scale-down-utilization-threshold: 0.01
scale-down-unneeded-time: 1h
scale-down-delay-after-add: 30m
skip-nodes-with-local-storage: true
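For completeness, here is roughly how those settings translate into the cluster-autoscaler invocation (a sketch only; the Rancher cloud-config and all other flags are left out):

```bash
# Sketch of the invocation corresponding to the settings above
# (only the flags listed in this issue; everything else omitted).
./cluster-autoscaler \
  --cloud-provider=rancher \
  --scale-down-utilization-threshold=0.01 \
  --scale-down-unneeded-time=1h \
  --scale-down-delay-after-add=30m \
  --skip-nodes-with-local-storage=true
```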
Even though the pods use emptyDir volumes (which skip-nodes-with-local-storage: true should protect) and the node's CPU/memory requests are above the utilization threshold, it makes no difference: the node is still removed.
From the debug log:
2025-04-08 04:02:08.261 I0408 04:02:08.261178 1 rancher_provider.go:170] found pool "worker-dyn" via machine "jenkins-local-2-worker-dyn-wk6x7-s82ff"
2025-04-08 04:02:08.261 I0408 04:02:08.261199 1 eligibility.go:162] Node jenkins-local-2-worker-dyn-wk6x7-s82ff unremovable: cpu requested (91.3043% of allocatable) is above the scale-down utilization threshold
2025-04-08 04:02:08.261 I0408 04:02:08.261229 1 static_autoscaler.go:623] Scale down status: lastScaleUpTime=2025-04-08 03:13:11.691708094 +0000 UTC m=+57967.285067686 lastScaleDownDeleteTime=2025-04-08 04:01:58.155727444 +0000 UTC m=+60893.749087046 lastScaleDownFailTime=2025-04-07 10:07:23.731699145 +0000 UTC m=-3580.674941223 scaleDownForbidden=false scaleDownInCooldown=false
2025-04-08 04:02:08.261 I0408 04:02:08.261490 1 rancher_provider.go:170] found pool "worker-dyn" via machine "jenkins-local-2-worker-dyn-wk6x7-s82ff"
...
2025-04-08 04:02:18.306 I0408 04:02:18.306101 1 request.go:1351] Response Body: {"apiVersion":"cluster.x-k8s.io/v1beta1","items":[{...
"name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","ownerReferences":[{"apiVersion":"cluster.x-k8s.io/v1beta1","blockOwnerDeletion":true,"controller":true,"kind":"MachineSet","name":"jenkins-local-2-worker-dyn-wk6x7","uid":"06f5c8e5-e5da-4448-95e4-ebd5b385201e"}],"resourceVersion":"1306230318","uid":"4a2e5fc6-f85b-4912-9509-d1caac023bc1"},"spec":{"bootstrap":{"configRef":{"apiVersion":"rke.cattle.io/v1","kind":"RKEBootstrap","name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","uid":"23844a56-22f2-47c6-b623-94333add7e33"},"dataSecretName":"jenkins-local-2-worker-dyn-wk6x7-s82ff-machine-bootstrap"},"clusterName":"jenkins-local-2","infrastructureRef":{"apiVersion":"rke-machine.cattle.io/v1","kind":"HarvesterMachine","name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","uid":"4f27e5a9-f0fa-4b82-a469-c23ce8efe370"},"nodeDeletionTimeout":"10s","providerID":"k3s://jenkins-local-2-worker-dyn-wk6x7-s82ff"},"status":{"addresses":[{"address":"10.107.5.181","type":"InternalIP"},{"address":"jenkins-local-2-worker-dyn-wk6x7-s82ff","type":"Hostname"}],"bootstrapReady":true,"conditions":[{"lastTransitionTime":"2025-04-08T02:57:58Z","status":"True","type":"Ready"},{"lastTransitionTime":"2025-04-08T02:57:57Z","status":"True","type":"BootstrapReady"},{"lastTransitionTime":"2025-04-08T04:01:56Z","message":"**deleting server [fleet-default/jenkins-local-2-worker-dyn-wk6x7-s82ff] of kind (HarvesterMachine) for machine jenkins-local-2-worker-dyn-wk6x7-s82ff in infrastructure provider"**,"status":"False","type":"InfrastructureReady"},{"lastTransitionTime":"2025-04-08T04:01:55Z","reason":"Deleting","severity":"Info","status":"False","type":"NodeHealthy"},{"lastTransitionTime":"2025-04-08T02:58:21Z","status":"True","type":"PlanApplied"},{"lastTransitionTime":"2025-04-08T04:01:55Z","status":"True","type":"PreDrainDeleteHookSucceeded"},{"lastTransitionTime":"2025-04-08T04:01:55Z","status":"True","type":"PreTerminateDeleteHookSucceeded"}
How to reproduce it (as minimally and precisely as possible):
Rancher + Harvester (as the machine provider) + cluster autoscaler.
I haven't been able to find an exact reproduction scenario. It happens randomly.
My assumption is that the cluster autoscaler asks Rancher to reduce the pool size (though it is not clear why), and Rancher then deletes an effectively random node from the pool.
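If that assumption is right, one way to narrow it down is to check, at the moment the deletion starts, whether the specific machine was targeted or whether only the MachineSet replica count was lowered. This is only a sketch and assumes the scale-down goes through the Cluster API objects shown in the log; the annotation checked is the standard Cluster API one, which may or may not be what Rancher uses here:

```bash
# Was this specific Machine targeted? A deletionTimestamp or the standard Cluster API
# "cluster.x-k8s.io/delete-machine" annotation would point to a targeted delete.
kubectl -n fleet-default get machine.cluster.x-k8s.io jenkins-local-2-worker-dyn-wk6x7-s82ff \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.annotations}{"\n"}'

# Or was only the owning MachineSet (from the ownerReferences above) scaled down,
# leaving it to pick a victim on its own?
kubectl -n fleet-default get machineset.cluster.x-k8s.io jenkins-local-2-worker-dyn-wk6x7 \
  -o jsonpath='{.spec.replicas}{"\n"}'
```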