Skip to content

Use ReaderFailOnMissingInformer for cache options #315

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bogdando
Copy link
Contributor

When controller makes queries with client.{Get,List} on resources haven’t been declared upfront, controller-runtime will initialize an informer on-the-fly and block on warming up its cache. This leads to issues like:

  • controller-runtime starting a watch for a resource type and start caching all its objects in memory (even if you were trying to query only one resource), potentially leading to the process running out of memory.
  • unpredictable reconciliation times while the informer cache is syncing, during which your worker goroutine will be blocked from reconciling other resources.

@bogdando bogdando requested a review from stuggi April 28, 2025 15:59
Copy link
Contributor

openshift-ci bot commented Apr 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bogdando

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bogdando
Copy link
Contributor Author

bogdando commented Apr 28, 2025

This is related to recommendations made here https://ahmet.im/blog/controller-pitfalls/#understand-the-cached-clients
Let's see if this works. If it fails with missing informers, when we should adjust ctrl.NewControllerManagedBy
to start building always static caches upfront.
We could extend this to other operators as well.
@stuggi @gibizer @mrkisaolamb @dprince @bshephar wdyt?

When controller makes queries with client.{Get,List} on resources
haven’t been declared upfront, controller-runtime will initialize an
informer on-the-fly and block on warming up its cache. This leads to
issues like:

* controller-runtime starting a watch for a resource type and start
  caching all its objects in memory (even if you were trying to query
  only one resource), potentially leading to the process running
  out of memory.
* unpredictable reconciliation times while the informer cache is
  syncing, during which your worker goroutine will be blocked from
  reconciling other resources.

Signed-off-by: Bohdan Dobrelia <[email protected]>
Copy link
Contributor

openshift-ci bot commented Apr 28, 2025

@bogdando: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/placement-operator-build-deploy-kuttl c2e1165 link true /test placement-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@stuggi
Copy link
Contributor

stuggi commented Apr 29, 2025

With this change all the operators would have to watch e.g. KeystoneAPI (which they currently don't), right?

        +      Kind=KeystoneAPI is not cached
        +    reason: Error
        +    severity: Warning
        +    status: "False"
             type: ServiceConfigReady

I don't think they strictly have to. They do the get to validate that a keystoneAPI exists and get the endpoints from there, but as soon they exist, they are static and would not have to watch and reconcile on changes.

@mrkisaolamb
Copy link
Contributor

I'm not sure if we will see the gains mentioned in the blog post. We are only checking Keystone and MariaDB resources, and even if a resource is not immediately available, we simply reconcile again. Once these resources are created, they will only be deleted when we delete the Placement resource — we don't expect them to be removed otherwise.

Also, if we really want to disable caching completely, we would need to make more changes, because clearly something else is happening — the kuttl test failures seem to be directly related to these changes.

BTW Thanks Bogdan to bring this up, it's definitely good to keep this behavior in mind.

@bogdando
Copy link
Contributor Author

I'm not sure if we will see the gains mentioned in the blog post. We are only checking Keystone and MariaDB resources, and even if a resource is not immediately available, we simply reconcile again. Once these resources are created, they will only be deleted when we delete the Placement resource — we don't expect them to be removed otherwise.

Also, if we really want to disable caching completely, we would need to make more changes, because clearly something else is happening — the kuttl test failures seem to be directly related to these changes.

BTW Thanks Bogdan to bring this up, it's definitely good to keep this behavior in mind.

No, I don't think we should disable caching completely

@bogdando
Copy link
Contributor Author

With this change all the operators would have to watch e.g. KeystoneAPI (which they currently don't), right?

        +      Kind=KeystoneAPI is not cached
        +    reason: Error
        +    severity: Warning
        +    status: "False"
             type: ServiceConfigReady

I don't think they strictly have to. They do the get to validate that a keystoneAPI exists and get the endpoints from there, but as soon they exist, they are static and would not have to watch and reconcile on changes.

We don't have a lot of KeystoneAPI to watch, so I don't see this potential change as a problem. On the other hand, by following the rule of it is better to be explicit than implicit, this provides a guardrail against future additions of list()/get() operations without updating the "allow list" first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants