Skip to content

feat: primary resource caching for followup reconciliation(s) #2761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

csviri
Copy link
Collaborator

@csviri csviri commented Apr 11, 2025

It provides facilities to cache the primary resource for follow-up reconciliations, thus ensuring that the reconciler always handles the up-to-date primary.

Later, we might extend this to the resource part, not just status (in terms of utils, cache already supports that) - now the utility methods focus on status.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 11, 2025
@csviri csviri changed the title support reconciler feat: resource cache Apr 11, 2025
@csviri csviri linked an issue Apr 11, 2025 that may be closed by this pull request
@csviri csviri changed the title feat: resource cache [WIP] feat: resource cache Apr 11, 2025
@csviri csviri changed the title [WIP] feat: resource cache [WIP] feat: primary resource caching Apr 14, 2025
@csviri csviri force-pushed the support-reconciler branch from 9019eb9 to f8b6dc6 Compare April 14, 2025 10:13
@csviri csviri marked this pull request as ready for review April 15, 2025 12:11
@openshift-ci openshift-ci bot requested review from metacosm and xstefank April 15, 2025 12:11
@csviri csviri changed the title [WIP] feat: primary resource caching feat: primary resource caching Apr 15, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2025
@csviri csviri changed the title feat: primary resource caching feat: primary resource caching for followup reconciliation Apr 15, 2025
csviri added 8 commits April 15, 2025 14:18
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
@csviri csviri force-pushed the support-reconciler branch from 19dd31c to 3b99f78 Compare April 15, 2025 12:18
@csviri csviri changed the title feat: primary resource caching for followup reconciliation feat: primary resource caching for followup reconciliation(s) Apr 15, 2025
csviri added 2 commits April 15, 2025 14:21
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
csviri and others added 2 commits April 15, 2025 17:21
…tor/api/reconciler/PrimaryUpdateAndCacheUtils.java

Co-authored-by: Martin Stefanko <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
@csviri csviri requested a review from xstefank April 15, 2025 15:23
Signed-off-by: Attila Mészáros <[email protected]>
csviri and others added 2 commits April 15, 2025 17:39
Signed-off-by: Attila Mészáros <[email protected]>
…tor/api/reconciler/PrimaryUpdateAndCacheUtils.java

Co-authored-by: Antonio <[email protected]>
Copy link
Collaborator

@xstefank xstefank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@afalhambra-hivemq afalhambra-hivemq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using the PatchAndCacheStatusWithLock approach, we will need to explicitly use optimistic locking here, which in theory it should be fine.
However, the default retry mechanism provided by the JOSDK when a 409 conflict happen doesn't pick the latest resource status, forcing us to implement a custom retry mechanism for this. Otherwise the default retry mechanism is not useful here.

Anyways, I can raise a separate issue for that default mechanism to see if it can be improved there.

Beside that, great work and thanks a lot for this PR.

(P p, KubernetesClient c) -> c.resource(primary).updateStatus());
}

public static <P extends HasMetadata> P patchAndCacheStatus(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add small JavaDoc for this exposed method as well?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still would be nice to have a JavaDoc for this public method.

public class PeriodicTriggerEventSource<P extends HasMetadata>
extends AbstractEventSource<Void, P> {

public static final int DEFAULT_PERIOD = 30;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final int DEFAULT_PERIOD = 30;
private static final int DEFAULT_PERIOD = 30;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for tests, and for reusability would stick with the public approach

.untilAsserted(
() -> {
assertThat(reconciler.errorPresent).isFalse();
assertThat(reconciler.latestValue).isGreaterThan(10);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe to make sure and guarantee no status update is lost, we could check an array or a list of values from 1 to 10 are present (all of them in particular order).
Otherwise, this assert isGreaterThan(10) will eventually be true regardless of the caching.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there is that is error check, what should induce that there was monotonic, but yeah definitelly can improve the azsserts

csviri and others added 2 commits April 16, 2025 10:02
…aseapi/statuscache/primarycache/StatusPatchPrimaryCacheIT.java

Co-authored-by: Antonio <[email protected]>
…tor/api/reconciler/PrimaryUpdateAndCacheUtils.java

Co-authored-by: Antonio <[email protected]>
@csviri
Copy link
Collaborator Author

csviri commented Apr 16, 2025

When using the PatchAndCacheStatusWithLock approach, we will need to explicitly use optimistic locking here, which in theory it should be fine. However, the default retry mechanism provided by the JOSDK when a 409 conflict happen doesn't pick the latest resource status, forcing us to implement a custom retry mechanism for this. Otherwise the default retry mechanism is not useful here.

Anyways, I can raise a separate issue for that default mechanism to see if it can be improved there.

Beside that, great work and thanks a lot for this PR.

Probably the best would be to not use the approach with the lock, but the lockless alternative.

@afalhambra-hivemq
Copy link

Probably the best would be to not use the approach with the lock, but the lockless alternative.

The only drawback here with the lockless alternative is that it will require us to handle the cache in the operator itself

@csviri
Copy link
Collaborator Author

csviri commented Apr 16, 2025

Probably the best would be to not use the approach with the lock, but the lockless alternative.

The only drawback here with the lockless alternative is that it will require us to handle the cache in the operator itself

Yes, but that is as simple as in the integration test. We might interate on this in the future, but in the short term I think that is best for you.

@csviri
Copy link
Collaborator Author

csviri commented Apr 16, 2025

Just thinking more about this, that the locked version in this form is not very useful. Since it fails to fulfill its purpose, thus to make sure that the resource is updated with a generated id.

On the other hand we could do something that (now deprecated in client) is the resource(resource).replace(), that would basically retry locally the update, thus if fails read the resource and try again in place.

Some kind of retry would be nice also for patch withouth the lock. So will spend some more time on this to improve.

@afalhambra-hivemq
Copy link

resource(resource).replace(), that would basically retry locally the update

Curious here, what do you mean by retry locally?

Some kind of retry would be nice also for patch withouth the lock. So will spend some more time on this to improve.

That would be awesome! Thanks

csviri and others added 8 commits April 17, 2025 11:34
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
Signed-off-by: Attila Mészáros <[email protected]>
@csviri
Copy link
Collaborator Author

csviri commented Apr 22, 2025

Curious here, what do you mean by retry locally?

Basically that you configure fabric8 client to retry.

@csviri
Copy link
Collaborator Author

csviri commented Apr 22, 2025

@afalhambra-hivemq @metacosm @xstefank added docs and made some improvements.

(P p, KubernetesClient c) -> c.resource(primary).updateStatus());
}

public static <P extends HasMetadata> P patchAndCacheStatus(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still would be nice to have a JavaDoc for this public method.

* Updates the resource and adds it to the {@link PrimaryResourceCache} provided. Optimistic
* locking is not required.
*
* @param primary resource*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param primary resource*
* @param primary resource

* Patches the resource with JSON Merge patch and adds it to the {@link PrimaryResourceCache}
* provided. Optimistic locking is not required.
*
* @param primary resource*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param primary resource*
* @param primary resource

* Patches the resource with JSON Patch and adds it to the {@link PrimaryResourceCache} provided.
* Optimistic locking is not required.
*
* @param primary resource*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param primary resource*
* @param primary resource

private static final Logger log = LoggerFactory.getLogger(PrimaryUpdateAndCacheUtils.class);

/**
* Makes sure that the up-to-date primary resource will be present during the next reconciliation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the signature of the method has changed and the checkResourceVersionPresent method is removed, would be nice if we make it clear and explicit in the JavaDocs which methods require optimistic locking and which don't?


Therefore,
the framework provides facilities
to cover these use cases withing [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java#L16).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to cover these use cases withing [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java#L16).
to cover these use cases with [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java#L16).

Comment on lines +219 to +221
recent version of the resource. Note that it is not necessarily the version from the update response, it can be newer
since other parties can do additional updates meanwhile, but if not explicitly modified, it will contain the up-to-date
status.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it is not necessarily the version from the update response, it can be newer
since other parties can do additional updates meanwhile, but if not explicitly modified, it will contain the up-to-date
status.

Not so clear, what you mean here. Updated resource version?

In other words, when to evict the resource from the cache. Typically, as show in the [integration test](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework/src/test/java/io/javaoperatorsdk/operator/baseapi/statuscache/primarycache)
you can have a counter in status to check on that.

Since all of this happens explicitly, you cannot use it for now with managed dependent resources and workflows.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of this happens explicitly, you cannot use it for now with managed dependent resources and workflows.

Can you please elaborate? This PrimaryUpdateAndCacheUtils.ssaPatchAndCacheStatus(primary, freshCopy, context, cache); cannot be used under what conditions?

}
```

In the background `PrimaryUpdateAndCacheUtils.ssaPatchAndCacheStatusWith` puts the result of the update into an internal

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In the background `PrimaryUpdateAndCacheUtils.ssaPatchAndCacheStatusWith` puts the result of the update into an internal
In the background `PrimaryUpdateAndCacheUtils.ssaPatchAndCacheStatus` puts the result of the update into an internal

#### Additional remarks

As shown in the integration tests, there is no optimistic locking used when updating the
[resource](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework/src/test/java/io/javaoperatorsdk/operator/baseapi/statuscache/internal/StatusPatchCacheReconciler.java#L41)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular IT, optimistic locking is in place, or am completely I wrong here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants