CA+Rancher is deleting random nodes #8015

Open
SnelsSM opened this issue Apr 8, 2025 · 0 comments
Labels: area/cluster-autoscaler, kind/bug

Comments


SnelsSM commented Apr 8, 2025

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

Component version: v1.31.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.31.4+k3s1

What environment is this in?:
Rancher v2.10.2
Harvester v1.4.1

What did you expect to happen?:

When used with Rancher, the cluster autoscaler should not initiate the deletion of nodes that do not meet the scale-down criteria.

What happened instead?:

When used with Rancher, the cluster autoscaler appears to trigger the deletion of a node that should not be deleted. (There is no direct indication of the deletion in its logs, so the logic error seems to be somewhere in between.)
The problem does not occur without the cluster autoscaler.

I have tried different settings, but the result is the same.
The current arguments are:
scale-down-utilization-threshold: 0.01
scale-down-unneeded-time: 1h
scale-down-delay-after-add: 30m
skip-nodes-with-local-storage: true

Even though the pods use emptyDir volumes (with skip-nodes-with-local-storage: true) and their CPU/memory requests keep utilization well above the threshold, it makes no difference: the node is still deleted.
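
To spell out what I expect from these settings: with scale-down-utilization-threshold: 0.01, a node whose requested CPU is ~91% of allocatable (see the eligibility.go line in the debug log below) should never be a scale-down candidate. A minimal sketch of that check in Go, my own illustration rather than the actual autoscaler source:

package main

import "fmt"

// Simplified illustration of the scale-down eligibility check: a node is a
// removal candidate only if both CPU and memory utilization
// (requests / allocatable) are below scale-down-utilization-threshold.
func nodeRemovable(cpuUtil, memUtil, threshold float64) bool {
	return cpuUtil < threshold && memUtil < threshold
}

func main() {
	// Values from the log below: 91.3% CPU requested vs. a 0.01 threshold.
	fmt.Println(nodeRemovable(0.913043, 0.5, 0.01)) // false: the node must stay
}

So by the autoscaler's own eligibility logic the node is unremovable, which is exactly what the log says, and yet the underlying Machine still ends up in a deleting state.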

From the debug log:
2025-04-08 04:02:08.261	I0408 04:02:08.261178       1 rancher_provider.go:170] found pool "worker-dyn" via machine "jenkins-local-2-worker-dyn-wk6x7-s82ff"
2025-04-08 04:02:08.261	I0408 04:02:08.261199       1 eligibility.go:162] Node jenkins-local-2-worker-dyn-wk6x7-s82ff unremovable: cpu requested (91.3043% of allocatable) is above the scale-down utilization threshold
2025-04-08 04:02:08.261	I0408 04:02:08.261229       1 static_autoscaler.go:623] Scale down status: lastScaleUpTime=2025-04-08 03:13:11.691708094 +0000 UTC m=+57967.285067686 lastScaleDownDeleteTime=2025-04-08 04:01:58.155727444 +0000 UTC m=+60893.749087046 lastScaleDownFailTime=2025-04-07 10:07:23.731699145 +0000 UTC m=-3580.674941223 scaleDownForbidden=false scaleDownInCooldown=false
2025-04-08 04:02:08.261	I0408 04:02:08.261490       1 rancher_provider.go:170] found pool "worker-dyn" via machine "jenkins-local-2-worker-dyn-wk6x7-s82ff"
...
2025-04-08 04:02:18.306	I0408 04:02:18.306101       1 request.go:1351] Response Body: {"apiVersion":"cluster.x-k8s.io/v1beta1","items":[{...
"name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","ownerReferences":[{"apiVersion":"cluster.x-k8s.io/v1beta1","blockOwnerDeletion":true,"controller":true,"kind":"MachineSet","name":"jenkins-local-2-worker-dyn-wk6x7","uid":"06f5c8e5-e5da-4448-95e4-ebd5b385201e"}],"resourceVersion":"1306230318","uid":"4a2e5fc6-f85b-4912-9509-d1caac023bc1"},"spec":{"bootstrap":{"configRef":{"apiVersion":"rke.cattle.io/v1","kind":"RKEBootstrap","name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","uid":"23844a56-22f2-47c6-b623-94333add7e33"},"dataSecretName":"jenkins-local-2-worker-dyn-wk6x7-s82ff-machine-bootstrap"},"clusterName":"jenkins-local-2","infrastructureRef":{"apiVersion":"rke-machine.cattle.io/v1","kind":"HarvesterMachine","name":"jenkins-local-2-worker-dyn-wk6x7-s82ff","namespace":"fleet-default","uid":"4f27e5a9-f0fa-4b82-a469-c23ce8efe370"},"nodeDeletionTimeout":"10s","providerID":"k3s://jenkins-local-2-worker-dyn-wk6x7-s82ff"},"status":{"addresses":[{"address":"10.107.5.181","type":"InternalIP"},{"address":"jenkins-local-2-worker-dyn-wk6x7-s82ff","type":"Hostname"}],"bootstrapReady":true,"conditions":[{"lastTransitionTime":"2025-04-08T02:57:58Z","status":"True","type":"Ready"},{"lastTransitionTime":"2025-04-08T02:57:57Z","status":"True","type":"BootstrapReady"},{"lastTransitionTime":"2025-04-08T04:01:56Z","message":"**deleting server [fleet-default/jenkins-local-2-worker-dyn-wk6x7-s82ff] of kind (HarvesterMachine) for machine jenkins-local-2-worker-dyn-wk6x7-s82ff in infrastructure provider"**,"status":"False","type":"InfrastructureReady"},{"lastTransitionTime":"2025-04-08T04:01:55Z","reason":"Deleting","severity":"Info","status":"False","type":"NodeHealthy"},{"lastTransitionTime":"2025-04-08T02:58:21Z","status":"True","type":"PlanApplied"},{"lastTransitionTime":"2025-04-08T04:01:55Z","status":"True","type":"PreDrainDeleteHookSucceeded"},{"lastTransitionTime":"2025-04-08T04:01:55Z","status":"True","type":"PreTerminateDeleteHookSucceeded"}

How to reproduce it (as minimally and precisely as possible):
Rancher + Harvester as the infrastructure provider + cluster autoscaler.
I haven't been able to find an exact reproduction scenario. It happens randomly.

My assumption is that the cluster autoscaler triggers Rancher to reduce the pool size (although it is not clear why), and Rancher then deletes a completely random node from the pool.
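
If that assumption is correct, one detail may explain the "random" choice: as far as I understand the Cluster API machinery behind Rancher's machine pools, the caller is supposed to mark the specific Machine it wants removed with the cluster.x-k8s.io/delete-machine annotation before lowering the replica count; without that annotation the MachineSet controller picks a victim according to its deletePolicy (Random by default). A toy sketch of that behaviour in Go, my own illustration and not actual Rancher or autoscaler code (machine names are made up):

package main

import (
	"fmt"
	"math/rand"
)

const deleteMachineAnnotation = "cluster.x-k8s.io/delete-machine"

type machine struct {
	name        string
	annotations map[string]string
}

// pickVictim loosely mimics how a MachineSet controller could choose which
// machine to remove on scale-down: machines carrying the delete annotation
// are removed first; otherwise the default policy effectively picks an
// arbitrary machine from the pool.
func pickVictim(pool []machine) machine {
	for _, m := range pool {
		if _, ok := m.annotations[deleteMachineAnnotation]; ok {
			return m
		}
	}
	return pool[rand.Intn(len(pool))] // no annotation: effectively random
}

func main() {
	pool := []machine{
		{name: "worker-dyn-aaaaa", annotations: map[string]string{}},
		{name: "worker-dyn-bbbbb", annotations: map[string]string{}},
		{name: "worker-dyn-ccccc", annotations: map[string]string{}}, // imagine this one is 91% utilized
	}

	fmt.Println("victim without annotation:", pickVictim(pool).name)

	// What I would expect a scale-down to do: mark the machine it actually
	// evaluated as unneeded, then lower the replica count.
	pool[0].annotations[deleteMachineAnnotation] = "true"
	fmt.Println("victim with annotation:   ", pickVictim(pool).name)
}

If the pool size is reduced without the evaluated machine being marked first, the controller's choice would look exactly like the random deletions I am seeing.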

SnelsSM added the kind/bug label on Apr 8, 2025