trino-lb keeps hanging when API server is not available #83
Comments
Hi @maxgruber19, interesting to hear about that. I'm not sure yet why this hangs trino-lb.
Thanks for the clarification. "Unfortunately" we already fixed the cause of the network issues, so it's hard for us to reproduce the error. I collected the logs from Monday; around 8:21 we had the problem, and 20 minutes later we manually restarted the pods. While collecting the logs I found an error while communicating with Redis. Maybe that's the cause? But if Redis is not available there are usually tons of errors. Do you maybe have a better idea after scanning the logs?
I can see your point.
Could be; the Redis error is the last log line before it hangs forever. We will try to reproduce that, and I'll get in touch with you after that. Thank you for now! @sbernauer, maybe a stupid question, but how do I increase the log level to trace? I didn't find anything in the docs / code.
You need to set the env variable.
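The exact variable name is not spelled out above; assuming trino-lb follows the common tracing-subscriber EnvFilter pattern, it would be RUST_LOG (e.g. `RUST_LOG=trace`). A minimal sketch of that pattern, not verified against trino-lb's actual initialization code:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Sketch of the common pattern (an assumption, not trino-lb's code):
    // the filter level is read from the RUST_LOG environment variable,
    // e.g. `RUST_LOG=trace ./trino-lb`.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    tracing::trace!("trace logging active");
}
```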
We tried our best to reproduce that, but no success; everything works as desired now. Tracing worked, thanks for that hint. I'll close this for now; if we hit the issue again I'll let you know. Sorry for bothering you!
Thanks a lot for the update! I really hope it does not happen again 😅
We had a similar issue a couple of minutes ago; of course tracing was not enabled anymore... 😄 The last message in the log is a Redis error once again. It seems like trino-lb is not recovering from Redis problems as intended? Are retries and timeouts configured for Redis as well? I've checked https://github.com/stackabletech/trino-lb/blob/main/trino-lb-persistence/src/redis/mod.rs but didn't find any, though that may be down to my great Rust knowledge. After two hours the pod is still hanging at that point.
We are using the ConnectionManager, which should fail the first request but automatically re-connect in the background. These are the defaults from the redis crate:

```rust
const DEFAULT_CONNECTION_RETRY_EXPONENT_BASE: u64 = 2;
const DEFAULT_CONNECTION_RETRY_FACTOR: u64 = 100;
const DEFAULT_NUMBER_OF_CONNECTION_RETRIES: usize = 6;
const DEFAULT_RESPONSE_TIMEOUT: Option<std::time::Duration> = None;
const DEFAULT_CONNECTION_TIMEOUT: Option<std::time::Duration> = None;
```

However, it has no response or connection timeout by default 🤔 I will try to set a default of e.g. 10s and test it out.
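A minimal sketch of what setting those two timeouts could look like via the redis crate's ConnectionManagerConfig (the 10 s values and the helper function are illustrative, not necessarily what the PR contains):

```rust
use std::time::Duration;

use redis::{
    aio::{ConnectionManager, ConnectionManagerConfig},
    Client,
};

// Illustrative helper (not trino-lb code): keep the default retry behavior
// but add explicit response and connection timeouts, so a dead Redis fails
// fast instead of blocking a request forever.
async fn connect_with_timeouts(url: &str) -> redis::RedisResult<ConnectionManager> {
    let client = Client::open(url)?;
    let config = ConnectionManagerConfig::new()
        .set_response_timeout(Duration::from_secs(10))
        .set_connection_timeout(Duration::from_secs(10));
    ConnectionManager::new_with_config(client, config).await
}
```

With the defaults above, reconnect attempts back off at roughly 100 ms × 2^attempt for up to 6 retries, but without the two timeouts each individual request can still block indefinitely.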
I opened #85 for setting timeouts for the Redis connection and pushed it as …
Thanks for that. It's a bit hard to test, because I'm still not sure what the reason for the hang is. I'd be fine with releasing it, because it's not making anything worse.
Tonight we had some issues with the kube-apiserver (not available for ~1-2 minutes because of a network issue), which trino-lb uses to fetch the health state of the connected Trino clusters (CRs). It seems like there is no timeout / retry configured for that type of request, which leads to trino-lb waiting forever for an answer from the API server (30 minutes in our case, then I restarted the StatefulSet manually).
It may be important to know that we're using the Stackable autoscaler.
Is there a timeout set for kube-apiserver calls?
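For reference, a minimal sketch of bounding a kube-rs API call with tokio's timeout, so a hanging kube-apiserver connection surfaces as an error instead of stalling the caller forever. The Pod type, the resource name, and the 10 s value are stand-ins for trino-lb's actual CR lookup, which I haven't checked:

```rust
use std::time::Duration;

use k8s_openapi::api::core::v1::Pod; // stand-in for the actual Trino cluster CR
use kube::{Api, Client};
use tokio::time::timeout;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // Bound the request: an apiserver outage then shows up as an Elapsed
    // error after 10 s instead of the health check waiting indefinitely.
    match timeout(Duration::from_secs(10), pods.get("trino-coordinator-0")).await {
        Ok(Ok(pod)) => println!("fetched {:?}", pod.metadata.name),
        Ok(Err(err)) => eprintln!("API error: {err}"),
        Err(_) => eprintln!("kube-apiserver call timed out"),
    }
    Ok(())
}
```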