Description
I used confluent-kafka-python to write a worker that processes messages from a specific Kafka topic. To implement exactly-once semantics (EOS), I enabled transactions and configured the producer and consumer accordingly. Since the tasks can take a long time to process, I set the producer's transaction.timeout.ms to 3 hours and adjusted the broker's transaction.max.timeout.ms to allow it.
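For context, a minimal sketch of how my consume-process-produce loop is wired up. The broker address, topic names, group id, and transactional id below are placeholders, not my real values (my actual config is the {...} in the checklist); the 3-hour transaction.timeout.ms is the setting described above:

```python
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "10.11.123.132:9092"  # placeholder

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "worker-group",           # placeholder
    "enable.auto.commit": False,          # offsets are committed via the transaction
    "isolation.level": "read_committed",  # only read committed records
    "auto.offset.reset": "earliest",
})

producer = Producer({
    "bootstrap.servers": BOOTSTRAP,
    "transactional.id": "worker-txn-0",   # placeholder
    "transaction.timeout.ms": 3 * 60 * 60 * 1000,  # 3 hours, as described above
})


def process(payload: bytes) -> bytes:
    """Stand-in for the long-running task (can take hours)."""
    return payload


producer.init_transactions()
consumer.subscribe(["input-topic"])       # placeholder topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("output-topic", process(msg.value()))  # placeholder topic
    # Commit the consumed offsets within the same transaction (EOS).
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```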
Currently, the worker runs fine on my local machine. However, when I run it in a local Docker container or in a production environment on Kubernetes, I often see the following messages:
%4|1738575666.358|REQTMOUT|rdkafka#producer-2| [thrd:TxnCoordinator]: TxnCoordinator/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575666.358|FAIL|rdkafka#producer-2| [thrd:TxnCoordinator]: TxnCoordinator: 10.11.123.134:9092: 1 request(s) timed out: disconnect (after 58468ms in state UP)
%4|1738575666.411|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575666.411|FAIL|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: 1 request(s) timed out: disconnect (after 26197ms in state UP)
%4|1738575667.420|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575667.420|FAIL|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: 1 request(s) timed out: disconnect (after 393ms in state UP, 1 identical error(s) suppressed)
%4|1738575667.924|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.132:9092/bootstrap]: 10.11.123.132:9092/1: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575667.924|FAIL|rdkafka#producer-2| [thrd:10.11.123.132:9092/bootstrap]: 10.11.123.132:9092/1: 1 request(s) timed out: disconnect (after 60053ms in state UP)
%4|1738575668.925|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575668.925|FAIL|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: 1 request(s) timed out: disconnect (after 486ms in state UP)
%4|1738575669.933|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575669.934|FAIL|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: 1 request(s) timed out: disconnect (after 994ms in state UP, 1 identical error(s) suppressed)
%4|1738575670.942|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%4|1738575671.952|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
It seems the producer is hitting timeouts when sending requests. From my online research, the requests are eventually sent successfully thanks to Kafka's retry mechanism, but the issue keeps recurring, at least once every 15 minutes. When I restart the worker on my local machine, these issues disappear.
Another strange observation: when I reduce transaction.timeout.ms to a lower value (e.g., 30 minutes in my tests), these timeout messages no longer appear. As far as I know, though, this setting should not affect whether a request times out.
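For concreteness, the two values I tested correspond to these producer settings (in milliseconds):

```python
# transaction.timeout.ms values tested (milliseconds).
LONG_TXN_TIMEOUT_MS = 3 * 60 * 60 * 1000  # 3 hours: REQTMOUT/FAIL messages appear
SHORT_TXN_TIMEOUT_MS = 30 * 60 * 1000     # 30 minutes: messages no longer appear
```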
Has anyone else encountered a similar issue or knows the reason behind this behavior?
Checklist
Please provide the following information:
confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 2.4.0
Apache Kafka broker version: 3.9.0
Client configuration: {...}
Operating system: Linux (k8s), macOS (local machine)
Provide client logs (with 'debug': '..' as necessary; a sample debug configuration is sketched after this checklist)
Provide broker log excerpts
Critical issue
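For the client-logs item, I can re-run the worker with librdkafka debug output enabled; which debug contexts are most useful here is a guess on my part:

```python
from confluent_kafka import Producer

# Same producer as in the sketch above, with librdkafka debug logging on.
# The context selection ("eos,broker,protocol") is my guess at what is
# relevant to the TxnCoordinator disconnects.
producer = Producer({
    "bootstrap.servers": "10.11.123.132:9092",  # placeholder
    "transactional.id": "worker-txn-0",         # placeholder
    "transaction.timeout.ms": 3 * 60 * 60 * 1000,
    "debug": "eos,broker,protocol",
})
```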