Cluster autoscaler is not able to scale down Rancher managed cluster #7981

Open
dirkdaems opened this issue Mar 26, 2025 · 3 comments
Labels
area/cluster-autoscaler area/provider/rancher kind/bug Categorizes issue or PR as related to a bug.

Comments


dirkdaems commented Mar 26, 2025

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Component version: v1.29.0

What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.29.4+rke2r1

What environment is this in?:
Rancher managed Kubernetes cluster on an OpenStack based cloud at CloudFerro.

What did you expect to happen?:
After the workload stops, the autoscaler should scale the unneeded worker nodes back down.

What happened instead?:
The autoscaler was not able to scale down the worker nodes.

How to reproduce it (as minimally and precisely as possible):

  • Deploy a Rancher managed Kubernetes cluster on an OpenStack based cloud.
  • Start a workload that exceeds the quota or exhausts the OpenStack resources.
  • Stop the workload.
  • The autoscaler is no longer able to scale the cluster down.

Anything else we need to know?:

When the issue occurs, log entries like the following appear in the autoscaler pod logs:
I0326 12:39:22.919386 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-worker-eo2a-xlarge-c29f50b3-s74qs.novalocal", UID:"c23b0b6d-4f0c-484d-a5c5-c7c896c08be1", APIVersion:"v1", ResourceVersion:"101114911", FieldPath:""}): type: 'Warning' reason: 'ScaleDownFailed' failed to delete empty node: failed to delete nodes from group worker-eo2a-xlarge: could not find providerID in machine: k8s-worker-eo2a-xlarge-78b857bb76x5hgc6-gz6t8/fleet-default
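
The error says the autoscaler could not find a providerID on the machine object in the fleet-default namespace. A quick way to check this is to list the machine objects and their spec.providerID fields; this is a sketch, and the resource kind `machines.cluster.x-k8s.io` is an assumption based on the fleet-default namespace in the log (Cluster API machines as provisioned by Rancher):

```shell
# Against a live cluster one would run something like (assumed resource kind):
#   kubectl get machines.cluster.x-k8s.io -n fleet-default \
#     -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID
#
# Offline illustration of the same check on a sample manifest: a machine whose
# spec has no providerID is exactly what the autoscaler trips over.
python3 - <<'EOF'
machine = {"metadata": {"name": "k8s-worker-example"}, "spec": {}}  # hypothetical
provider_id = machine["spec"].get("providerID")
print("missing" if not provider_id else provider_id)  # prints: missing
EOF
```

A machine stuck with `<none>`/missing in the PROVIDERID column matches the failure in the log above.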

We previously logged #6778, which was closed because we thought the issue had been fixed by upgrading. We have now seen it again, so we are filing this ticket.

The autoscaler Grafana dashboard typically shows that the autoscaler is aware of the unneeded nodes, but scale-down keeps failing, most likely because of this providerID issue:

[Image: Grafana dashboard showing unneeded nodes while scale-down fails]

@dirkdaems dirkdaems added the kind/bug Categorizes issue or PR as related to a bug. label Mar 26, 2025
@Shubham82 (Contributor)

/area provider/rancher
/area cluster-autoscaler

@Shubham82 (Contributor)

cc @ctrox

@dirkdaems (Author)

When this happens, the corresponding Rancher machine Kubernetes resource cannot be deleted:
[Image: Rancher machine resource stuck and unable to be deleted]

To resolve the issue, the finalizers on the Rancher machine Kubernetes resource have to be removed manually.
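
The finalizer removal can be done with a merge patch; this is a sketch, and both the resource kind `machines.cluster.x-k8s.io` and the machine name are assumptions based on the error log (adjust to your cluster):

```shell
# On a live cluster the patch would look roughly like this:
#   kubectl patch machines.cluster.x-k8s.io <stuck-machine-name> \
#     -n fleet-default --type=merge -p '{"metadata":{"finalizers":[]}}'
#
# Offline preview of what that merge patch does to the object's metadata:
python3 - <<'EOF'
import json
machine = {"metadata": {"name": "example-machine",              # hypothetical
                        "finalizers": ["machine.cluster.x-k8s.io"]}}
# The merge patch replaces the finalizers list with an empty one, which lets
# the API server finish deleting the stuck object.
machine["metadata"]["finalizers"] = []
print(json.dumps(machine["metadata"]))
EOF
```

Note that clearing finalizers skips whatever cleanup those finalizers guard, so this is a workaround for the stuck state, not a fix for the underlying providerID bug.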
