Use ReaderFailOnMissingInformer for cache options #315

bogdando · 2025-04-28T15:59:47Z

When a controller calls client.{Get,List} on resources that haven't been declared upfront, controller-runtime initializes an informer on the fly and blocks while its cache warms up. This leads to issues like:

* controller-runtime starts a watch for the resource type and caches all of its objects in memory (even if you only wanted to query one resource), potentially causing the process to run out of memory.
* reconciliation times become unpredictable while the informer cache is syncing, during which the worker goroutine is blocked from reconciling other resources.
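For reference, a minimal sketch of how the option named in the PR title can be enabled when the manager is constructed. It assumes a controller-runtime release whose `cache.Options` has the `ReaderFailOnMissingInformer` field; the surrounding `main` function is illustrative and is not taken from this PR's diff.

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Assumption: a kubeconfig or in-cluster config is available.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Cached reads for a kind without an existing informer return
			// an error instead of starting a new informer lazily.
			ReaderFailOnMissingInformer: true,
		},
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered against mgr here, before Start.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With this setting, every kind a reconciler reads through the cached client needs an informer set up beforehand, for example through a watch declared on the controller builder.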

openshift-ci · 2025-04-28T15:59:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bogdando

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bogdando]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bogdando · 2025-04-28T16:02:51Z

This is related to the recommendations made here: https://ahmet.im/blog/controller-pitfalls/#understand-the-cached-clients
Let's see if this works. If it fails with missing informers, then we should adjust ctrl.NewControllerManagedBy
to always build static caches upfront.
We could extend this to other operators as well.
@stuggi @gibizer @mrkisaolamb @dprince @bshephar wdyt?

When a controller calls client.{Get,List} on resources that haven't been declared upfront, controller-runtime initializes an informer on the fly and blocks while its cache warms up. This leads to issues like:

* controller-runtime starts a watch for the resource type and caches all of its objects in memory (even if you only wanted to query one resource), potentially causing the process to run out of memory.
* reconciliation times become unpredictable while the informer cache is syncing, during which the worker goroutine is blocked from reconciling other resources.

Signed-off-by: Bohdan Dobrelia <[email protected]>

openshift-ci · 2025-04-28T17:09:29Z

@bogdando: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/placement-operator-build-deploy-kuttl | `c2e1165` | link | true | `/test placement-operator-build-deploy-kuttl` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

stuggi · 2025-04-29T07:18:04Z

With this change all the operators would have to watch e.g. KeystoneAPI (which they currently don't), right?

        +      Kind=KeystoneAPI is not cached
        +    reason: Error
        +    severity: Warning
        +    status: "False"
             type: ServiceConfigReady

I don't think they strictly have to. They do the Get to validate that a KeystoneAPI exists and to read its endpoints, but as soon as those exist they are static, so the operators would not need to watch them and reconcile on changes.

mrkisaolamb · 2025-04-29T07:56:54Z

I'm not sure if we will see the gains mentioned in the blog post. We are only checking Keystone and MariaDB resources, and even if a resource is not immediately available, we simply reconcile again. Once these resources are created, they will only be deleted when we delete the Placement resource; we don't expect them to be removed otherwise.

Also, if we really want to disable caching completely, we would need to make more changes, because clearly something else is happening: the kuttl test failures seem to be directly related to these changes.

BTW, thanks Bogdan for bringing this up; it's definitely good to keep this behavior in mind.

bogdando · 2025-04-29T13:10:54Z

> I'm not sure if we will see the gains mentioned in the blog post. We are only checking Keystone and MariaDB resources, and even if a resource is not immediately available, we simply reconcile again. Once these resources are created, they will only be deleted when we delete the Placement resource; we don't expect them to be removed otherwise.
>
> Also, if we really want to disable caching completely, we would need to make more changes, because clearly something else is happening: the kuttl test failures seem to be directly related to these changes.
>
> BTW, thanks Bogdan for bringing this up; it's definitely good to keep this behavior in mind.

No, I don't think we should disable caching completely.

bogdando · 2025-04-29T13:13:23Z

> With this change all the operators would have to watch e.g. KeystoneAPI (which they currently don't), right?
>
>         +      Kind=KeystoneAPI is not cached
>         +    reason: Error
>         +    severity: Warning
>         +    status: "False"
>              type: ServiceConfigReady
>
> I don't think they strictly have to. They do the Get to validate that a KeystoneAPI exists and to read its endpoints, but as soon as those exist they are static, so the operators would not need to watch them and reconcile on changes.

We don't have many KeystoneAPI objects to watch, so I don't see this potential change as a problem. On the other hand, following the rule that explicit is better than implicit, this provides a guardrail against future additions of list()/get() operations that are made without updating the "allow list" first.

bogdando requested a review from stuggi April 28, 2025 15:59

openshift-ci bot requested review from frenzyfriday and lewisdenny April 28, 2025 15:59

openshift-ci bot added the approved label Apr 28, 2025

bogdando force-pushed the static_caching branch from a3a7cfc to c2e1165 Compare April 28, 2025 16:07
