Skip to content

trinolb keeps hanging when api server is not available #83

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
maxgruber19 opened this issue Apr 14, 2025 · 11 comments
Open

trinolb keeps hanging when api server is not available #83

maxgruber19 opened this issue Apr 14, 2025 · 11 comments

Comments

@maxgruber19
Copy link

maxgruber19 commented Apr 14, 2025

tonight we had some issues with the kubeapi server (not available because network issue for ~1-2 min) which is used by trinolb to fetch health state of connected trino clusters (CRs). it seems like there is no timeout / retry configured for that type of request, which leads to trinolb waiting forever (30 min in our case, then i restarted the sts manually) to get an answer of the api server.

important to know may be that were using stackable autoscaler

is there a timeout set for kubeapi server calls?

@sbernauer
Copy link
Member

Hi @maxgruber19, interesting to hear about that.
I had a look at the Kubernetes client we are using and has some default timeout configs:

https://github.com/kube-rs/kube/blob/d171d2620e8ad82235230fe589bbea7c9306963d/kube-client/src/config/mod.rs#L390C4-L393

So I'm not sure why this hangs trino-lb

@maxgruber19
Copy link
Author

thanks for clarification, "unfortunately" we fixed the cause for the network issues already, so its hard for us to reproduce the error.

I collected the logs from monday, about 8:21 we had the problem, 20 mins later we manually restarted the pods. when collecting the logs i found an error while communicating with redis. maybe thats the cause? but if redis is not available usually there are tons of errors. do you maybe have a better idea after scanning the logs?

1744620265843	2025-04-14T08:44:25.843233Z  INFO trino_lb::http_server: Shut down
1744620265842	2025-04-14T08:44:25.842275Z  INFO Checking for leftover queued queries:delete_queued_queries_not_accessed_after{not_accessed_after=SystemTime { tv_sec: 1744619965, tv_nsec: 839571696 }}:delete_queued_queries_not_accessed_after_for_cluster_group{cluster_group="default" not_accessed_after=SystemTime { tv_sec: 1744619965, tv_nsec: 839571696 }}: trino_lb_persistence::redis: Deleted all queries that were not accessed after cluster_group="default" removed=0 not_accessed_after=SystemTime { tv_sec: 1744619965, tv_nsec: 839571696 }
1744620265842	2025-04-14T08:44:25.842237Z  INFO Checking for leftover queued queries:delete_queued_queries_not_accessed_after{not_accessed_after=SystemTime { tv_sec: 1744619965, tv_nsec: 839571696 }}:delete_queued_queries_not_accessed_after_for_cluster_group{cluster_group="fallback" not_accessed_after=SystemTime { tv_sec: 1744619965, tv_nsec: 839571696 }}: trino_lb_persistence::redis: Deleted all queries that were not accessed after cluster_group="fallback" removed=0 not_accessed_after=SystemTime { tv_sec: 1744619965, tv_nsec: 839571696 }
1744620265842	2025-04-14T08:44:25.842234Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744620265842	2025-04-14T08:44:25.842220Z  INFO Fetching current query counters: trino_lb::maintenance::query_count_fetcher: QueryCountFetcher: Did not update query counters, as there already was a update Ok(59.520218472s) ago.
1744620265839	2025-04-14T08:44:25.839352Z  INFO trino_lb::http_server: Waiting for all connections to close connections=2
1744620265839	2025-04-14T08:44:25.839325Z  INFO trino_lb::http_server: Shutting down gracefully
1744618919876	2025-04-14T08:21:59.876323Z ERROR trino_lb::scaling: Scaler: reconciled failed error=ReadCurrentClusterStateFromPersistence { source: RedisError { source: GetClusterState { source: Connection reset by peer (os error 104) } }, cluster: "default-1" }
1744618918890	2025-04-14T08:21:58.890705Z  INFO trino_lb::scaling: Scaler: reconciled
1744618918890	2025-04-14T08:21:58.890697Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618918890	2025-04-14T08:21:58.890664Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618918877	2025-04-14T08:21:58.877292Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618917890	2025-04-14T08:21:57.890016Z  INFO trino_lb::scaling: Scaler: reconciled
1744618917890	2025-04-14T08:21:57.890009Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618917890	2025-04-14T08:21:57.889949Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618917876	2025-04-14T08:21:57.876681Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618916907	2025-04-14T08:21:56.907217Z  INFO trino_lb::scaling: Scaler: reconciled
1744618916907	2025-04-14T08:21:56.907199Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618916907	2025-04-14T08:21:56.907146Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618916891	2025-04-14T08:21:56.891407Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618915891	2025-04-14T08:21:55.890959Z  INFO trino_lb::scaling: Scaler: reconciled
1744618915891	2025-04-14T08:21:55.890952Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618915891	2025-04-14T08:21:55.890931Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618915878	2025-04-14T08:21:55.877834Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618914889	2025-04-14T08:21:54.889642Z  INFO trino_lb::scaling: Scaler: reconciled
1744618914889	2025-04-14T08:21:54.889634Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618914889	2025-04-14T08:21:54.889612Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618914876	2025-04-14T08:21:54.876575Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618913892	2025-04-14T08:21:53.892208Z  INFO trino_lb::scaling: Scaler: reconciled
1744618913892	2025-04-14T08:21:53.892202Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618913892	2025-04-14T08:21:53.892179Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618913878	2025-04-14T08:21:53.878593Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618912890	2025-04-14T08:21:52.890052Z  INFO trino_lb::scaling: Scaler: reconciled
1744618912890	2025-04-14T08:21:52.890043Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618912890	2025-04-14T08:21:52.890008Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618912877	2025-04-14T08:21:52.877102Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618911890	2025-04-14T08:21:51.890264Z  INFO trino_lb::scaling: Scaler: reconciled
1744618911890	2025-04-14T08:21:51.890256Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618911890	2025-04-14T08:21:51.890229Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618911877	2025-04-14T08:21:51.877186Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618910892	2025-04-14T08:21:50.892751Z  INFO trino_lb::scaling: Scaler: reconciled
1744618910892	2025-04-14T08:21:50.892738Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618910892	2025-04-14T08:21:50.892701Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618910878	2025-04-14T08:21:50.878605Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618909892	2025-04-14T08:21:49.892141Z  INFO trino_lb::scaling: Scaler: reconciled
1744618909892	2025-04-14T08:21:49.892133Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618909892	2025-04-14T08:21:49.892104Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618909879	2025-04-14T08:21:49.878845Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618908892	2025-04-14T08:21:48.892479Z  INFO trino_lb::scaling: Scaler: reconciled
1744618908892	2025-04-14T08:21:48.892472Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618908892	2025-04-14T08:21:48.892449Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618908879	2025-04-14T08:21:48.879512Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618907889	2025-04-14T08:21:47.889235Z  INFO trino_lb::scaling: Scaler: reconciled
1744618907889	2025-04-14T08:21:47.889228Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618907889	2025-04-14T08:21:47.889205Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618907876	2025-04-14T08:21:47.875913Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618906889	2025-04-14T08:21:46.889463Z  INFO trino_lb::scaling: Scaler: reconciled
1744618906889	2025-04-14T08:21:46.889456Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618906889	2025-04-14T08:21:46.889434Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618906876	2025-04-14T08:21:46.876222Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618905890	2025-04-14T08:21:45.890340Z  INFO trino_lb::scaling: Scaler: reconciled
1744618905890	2025-04-14T08:21:45.890330Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618905890	2025-04-14T08:21:45.890305Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618905877	2025-04-14T08:21:45.877286Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618904889	2025-04-14T08:21:44.889485Z  INFO trino_lb::scaling: Scaler: reconciled
1744618904889	2025-04-14T08:21:44.889475Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618904889	2025-04-14T08:21:44.889449Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618904876	2025-04-14T08:21:44.876658Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618903889	2025-04-14T08:21:43.889534Z  INFO trino_lb::scaling: Scaler: reconciled
1744618903889	2025-04-14T08:21:43.889527Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618903889	2025-04-14T08:21:43.889501Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618903876	2025-04-14T08:21:43.876404Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618902890	2025-04-14T08:21:42.890076Z  INFO trino_lb::scaling: Scaler: reconciled
1744618902890	2025-04-14T08:21:42.890069Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618902890	2025-04-14T08:21:42.890047Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618902877	2025-04-14T08:21:42.876729Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Unhealthy}
1744618901890	2025-04-14T08:21:41.890375Z  INFO trino_lb::scaling: Scaler: reconciled
1744618901890	2025-04-14T08:21:41.890368Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618901890	2025-04-14T08:21:41.890342Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618901876	2025-04-14T08:21:41.876607Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Ready}
1744618900889	2025-04-14T08:21:40.889717Z  INFO trino_lb::scaling: Scaler: reconciled
1744618900889	2025-04-14T08:21:40.889710Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618900889	2025-04-14T08:21:40.889674Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618900876	2025-04-14T08:21:40.876141Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Ready}
1744618899891	2025-04-14T08:21:39.891164Z  INFO trino_lb::scaling: Scaler: reconciled
1744618899891	2025-04-14T08:21:39.891157Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618899891	2025-04-14T08:21:39.891138Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618899877	2025-04-14T08:21:39.877706Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Ready}
1744618898891	2025-04-14T08:21:38.891703Z  INFO trino_lb::scaling: Scaler: reconciled
1744618898891	2025-04-14T08:21:38.891695Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618898891	2025-04-14T08:21:38.891670Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618898878	2025-04-14T08:21:38.878563Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Ready}
1744618897889	2025-04-14T08:21:37.889255Z  INFO trino_lb::scaling: Scaler: reconciled
1744618897889	2025-04-14T08:21:37.889246Z  INFO Scaler::reconcile: trino_lb::scaling: All cluster groups reconciled
1744618897889	2025-04-14T08:21:37.889215Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Applied all target cluster states
1744618897876	2025-04-14T08:21:37.876828Z  INFO Scaler::reconcile:Scaler::reconcile_cluster_group{cluster_group="default"}: trino_lb::scaling: Current cluster states current_states={"default-1": Ready}

@sbernauer
Copy link
Member

I can see your point.
The loop never stops, so something inside the loop must have been hanging.
It would have been interesting what the trace log level would have told us, but hard to get after the fact...
Maybe it's the redis client?

@maxgruber19
Copy link
Author

maxgruber19 commented Apr 16, 2025

could be, the redis error is the last log before hanging forever. we will try to reproduce that and i'll get in touch with you after that, thank you for now!

@sbernauer may be a stupid question but how to increase log level to trace, didnt find something in the docs / code

@sbernauer
Copy link
Member

You need to set the env variable RUST_LOG to trace (or debug): https://rust-lang-nursery.github.io/rust-cookbook/development_tools/debugging/config_log.html

@maxgruber19
Copy link
Author

We tried out best to reproduce that but no success, everything works as desired now. Tracing worked, thanks for that hint.

I'll close that for now, if we have that issue again I'll let you know, sorry for bothering

@maxgruber19 maxgruber19 closed this as not planned Won't fix, can't repro, duplicate, stale Apr 17, 2025
@sbernauer
Copy link
Member

Thanks a lot for the update! I really hope it does not happen again 😅

@maxgruber19
Copy link
Author

maxgruber19 commented May 7, 2025

We had a similar issue couple of minutes ago, of course tracing was not enabled anymore... 😄 last message in the log is a redis error once again, seems like trino lb is not recovering from redis problems as intended? are retries and timeouts configured for redis as well? i've checked https://github.com/stackabletech/trino-lb/blob/main/trino-lb-persistence/src/redis/mod.rs but didn't find any, that may be caused by my great rust knowledge

after 2 hours the pod is still hanging at that point

2025-05-07T11:00:43.566371Z ERROR trino_lb::scaling: Scaler: reconciled failed error=SetCurrentClusterStateInPersistence { source: RedisError { source: SetClusterState { source: broken pipe } }, cluster: "default-1" }

@maxgruber19 maxgruber19 reopened this May 7, 2025
@sbernauer
Copy link
Member

We are using the ConnectionManager, which should fail the first request, but automatically re-connect in the background...
It has some defaults on connection retries: https://docs.rs/redis/latest/src/redis/aio/connection_manager.rs.html#87-91

    const DEFAULT_CONNECTION_RETRY_EXPONENT_BASE: u64 = 2;
    const DEFAULT_CONNECTION_RETRY_FACTOR: u64 = 100;
    const DEFAULT_NUMBER_OF_CONNECTION_RETRIES: usize = 6;
    const DEFAULT_RESPONSE_TIMEOUT: Option<std::time::Duration> = None;
    const DEFAULT_CONNECTION_TIMEOUT: Option<std::time::Duration> = None;

However, it has no response or connection timeout by default 🤔 I will try to set a default of e.g. 10s and test it out.

@sbernauer
Copy link
Member

I opened #85 for setting timeouts for the Redis connection and pushed it as oci.stackable.tech/sandbox/trino-lb:0.0.0-pr85.
I played around with it for a bit and honestly couldn't notice a difference. Would you be happy to test it out? Or do you think we should merge it as is as it should not worsen the situation and release it?

@maxgruber19
Copy link
Author

Thanks for that, it's a bit hard to test because I'm still not sure what the reason for the hang is. I'd be fine with releasing it because it's not worsening anything

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants