raise exception after exiting watcher.each #214


Merged: 4 commits merged into fabric8io:master on Mar 18, 2020

Conversation

@mszabo (Contributor) commented Feb 26, 2020

fixes #213

@@ -119,6 +119,7 @@ def process_namespace_watcher_notices(watcher)
         @stats.bump(:namespace_cache_watch_ignored)
       end
     end
+    raise
Contributor:

Won't this simply fall through all the time and raise an exception on every notification? Should we be evaluating for a specific notice.type, or at least have continue statements above?

Contributor:

As I suspected:

Error: test: pod MODIFIED cached when hostname matches(DefaultPodWatchStrategyTest)
: RuntimeError: 
/home/circleci/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:128:in `process_pod_watcher_notices'
/home/circleci/fluen

@jcantrill (Contributor) commented Feb 26, 2020:

It's probably something more like:

when 'ERROR'
  message = notice['object']['message'] if notice['object'] && notice['object']['message']
  raise "Error while watching namespaces: #{message}"

@mszabo (Author):

I was thinking about doing it with when 'ERROR', but then it can still exit the watcher.each block without an exception, which can cause an endless loop without backoff.

I don't know how this testing framework works, but in reality it stays inside the watcher.each block as long as the connection is open. Watching opens a long-running connection, and the API server usually kills the connection after 40-50 minutes.

Contributor:

Hmm... I thought the notice.type value could only be one of these three: ADDED, MODIFIED, DELETED.

Reference: https://github.com/abonas/kubeclient/blame/28dfc4d538d72127015dc9b90e6b776c4b0fb986/test/test_watch.rb#L8

@jcantrill 's suggestion in https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/214/files#r384729998 makes sense to me.

Looking at the issue reported at #213 (comment), it's not an unexpected exception. Besides, if an exception were thrown, the existing rescue block should have caught it.

@mszabo (Author):

There is no exception thrown when the connection is closed.

When you open the HTTP connection for the watcher you get a 200 OK reply from the server, but the connection stays open and the events are streamed over that same connection. Kubernetes will close it after around 1 hour (at least on our cluster).

Once it's closed there is no exception; the block finishes normally.

So the thread loop runs again, but because there was no exception the rescue block did not execute and nothing set pod_watcher to nil. It keeps using the resource_version from the first run.
The Kubernetes API remembers old resource versions for a while, but after some time it returns an ERROR notice with a "too old resource version" message and closes the connection (again, no exception is raised).

When this happens the thread loop starts the watcher again with the old resource version, gets the error again, and this continues in a loop. It results in high load on the API server with no way to recover except restarting fluentd.

I can rework this PR to use the when 'ERROR' case, but then the issue can still happen if the connection is closed by the API server without returning an ERROR notice or throwing an exception.

I think an exception should be raised whenever the watcher.each block is exited, because it means the long-running connection was closed gracefully.
Unfortunately I am not sure how I can make it work with the tests.
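
To make the intended control flow concrete, here is a minimal sketch, with hypothetical helper names, of the pattern described above (pods shown; namespaces are analogous): the notice processor raises once watcher.each returns, and the outer thread loop's rescue discards the watcher so a fresh resourceVersion is fetched before watching again.

def process_pod_watcher_notices(watcher)
  watcher.each do |notice|
    # ... handle ADDED / MODIFIED / DELETED / ERROR notices ...
  end
  # watcher.each only returns once the long-running connection is closed,
  # so treat that as a restart condition rather than falling through silently.
  raise 'Watch connection was closed by the API server'
end

def set_up_pod_thread
  pod_watcher = nil
  loop do
    begin
      pod_watcher ||= get_pods_and_start_watcher # queries a fresh resourceVersion
      process_pod_watcher_notices(pod_watcher)
    rescue StandardError
      # The real plugin also counts retries and backs off here.
      pod_watcher = nil # ensures the next iteration re-queries the resourceVersion
      sleep(1)
    end
  end
end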

Contributor:

Re #214 (comment): Ah, I see. That sounds good to me. Instead of a simple raise, can we give it some message to help with debugging / troubleshooting when the issue happens?

As for the fact that it returned an ERROR notice, it seems like there is some discussion at kubeclient/pull/275. Sounds like it might be fixed there.

@qingling128 (Contributor):

Perhaps we can get this fix merged the way it is right now to handle the ERROR type of notice and iterate on it later once the dust settles for ManageIQ/kubeclient#275?

@qingling128 (Contributor) left a review:

LGTM.

@mszabo (Author) commented Mar 6, 2020:

@qingling128 Looking at the code again, I think Thread.current[:pod_watch_retry_count] should be reset when non-error events are received; otherwise that counter will just keep increasing whenever the API server responds with 410 Gone, and after a couple of iterations the code will raise Fluent::UnrecoverableError. What do you think?

@qingling128 (Contributor):

@mszabo - Good point. What do you think about resetting it to 0 in the get_pods_and_start_watcher method instead of on every non-error event? That way, if the events are a mix of good ones and bad ones, we can still guarantee that we will eventually restart.
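
A sketch, with assumed option and variable names, of the reset point being discussed here: clear the per-thread retry counter only after a pod list (and therefore a fresh resourceVersion) has been obtained from the API server.

def get_pods_and_start_watcher
  options = {} # the real plugin passes selectors here
  pods = @client.get_pods(options)
  # A successful query means the watch can resume from a known-good
  # resourceVersion, so it is safe to forget previous failures.
  Thread.current[:pod_watch_retry_count] = 0
  options[:resource_version] = pods.resourceVersion
  @client.watch_pods(options)
end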

@mszabo (Author) commented Mar 9, 2020:

@qingling128 Sounds good. I tried to add it after the resourceVersion is successfully queried from the API server, but it makes the "pod watch retries when exceptions are encountered" test hang forever, and I am not sure how to fix the test.

@qingling128 (Contributor):

@mszabo - Interesting. I took a closer look. This (https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/215/files) seems to work and should be equivalent. Also, now that I think about it, once an Exception is raised, Fluentd will restart, which resets the variable anyway. So the change is probably not needed in the first place.

BTW, I'm not entirely sure why the test failed in the other approach. I guess it might be related to how the stubs work or something.

@jcantrill (Contributor):

@mszabo can you post unit test results until we can properly fix up the CI?

@qingling128 (Contributor):

I could view the test results in this PR. Quoted below:

#!/bin/bash -eo pipefail
bundle exec rake test
Run options: --seed 21318

# Running tests:



Finished tests in 0.000154s, 0.0000 tests/s, 0.0000 assertions/s.

0 tests, 0 assertions, 0 failures, 0 errors, 0 skips
Loaded suite /home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/rake-13.0.1/lib/rake/rake_test_loader
Started
...................................................rake aborted!
Interrupt: 
/home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:63:in `load'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:63:in `kernel_load'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:28:in `run'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli.rb:476:in `exec'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor.rb:399:in `dispatch'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli.rb:30:in `dispatch'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/base.rb:476:in `start'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli.rb:24:in `start'
/usr/local/bundle/gems/bundler-2.1.4/exe/bundle:46:in `block in <top (required)>'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/friendly_errors.rb:123:in `with_friendly_errors'
/usr/local/bundle/gems/bundler-2.1.4/exe/bundle:34:in `<top (required)>'
/usr/local/bundle/bin/bundle:23:in `load'
/usr/local/bundle/bin/bundle:23:in `<main>'
Tasks: TOP => test => base_test
(See full trace by running task with --trace)
F
===============================================================================
Failure: test: pod watch retries when exceptions are encountered(DefaultPodWatchStrategyTest)
/home/circleci/fluent-plugin-kubernetes_metadata_filter/test/plugin/test_watch_pods.rb:216:in `block (3 levels) in <class:DefaultPodWatchStrategyTest>'
     213:     test 'pod watch retries when exceptions are encountered' do
     214:       @client.stub :get_pods, @initial do
     215:         @client.stub :watch_pods, [[@created, @exception_raised]] do
  => 216:           assert_raise Fluent::UnrecoverableError do
     217:             set_up_pod_thread
     218:           end
     219:           assert_equal(3, @stats[:pod_watch_failures])
/home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/minitest-4.7.5/lib/minitest/mock.rb:187:in `stub'
/home/circleci/fluent-plugin-kubernetes_metadata_filter/test/plugin/test_watch_pods.rb:215:in `block (2 levels) in <class:DefaultPodWatchStrategyTest>'
/home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/minitest-4.7.5/lib/minitest/mock.rb:187:in `stub'
/home/circleci/fluent-plugin-kubernetes_metadata_filter/test/plugin/test_watch_pods.rb:214:in `block in <class:DefaultPodWatchStrategyTest>'

<Fluent::UnrecoverableError> expected but was
<SignalException(<SIGHUP>)>

diff:
? F    lu  e    nt::   UnrecoverableError
? Signa Exc ptio (<SIGH P>)              
===============================================================================
.....
Too long with no output (exceeded 10m0s): context deadline exceeded

I was able to reproduce the error above locally. The tests simply hung indefinitely until I manually killed them.

BTW, CI seems fixed already (at least it's passing for the master branch when I tried it). The experiment PR I tried also passed: #215.

@mszabo (Author) commented Mar 10, 2020:

> @mszabo - Interesting. I took a closer look. This (https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/215/files) seems to work and should be equivalent. Also, now that I think about it, once an Exception is raised, Fluentd will restart, which resets the variable anyway. So the change is probably not needed in the first place.

@qingling128 I don't think #215 is equivalent. As you said, Fluent::UnrecoverableError will restart the process and the variable will reset anyway, so there is no point in resetting the variable just before throwing Fluent::UnrecoverableError.

The original problem is that the Kubernetes API server will send a 410 Gone ERROR (note: 410 is not an HTTP response code here) every 40-50 minutes. This PR handles that by raising an exception, which is caught, and the retry is executed (a new resourceVersion is polled from the API server). If the retry counter is never reset, then after 10 "410 Gone" messages we will get a Fluent::UnrecoverableError and a process restart even though it's unnecessary. So I think the right place to reset the retry counter is after the new resourceVersion is successfully queried from the API server.

I believe the test gets into an endless loop because it always returns a mocked resourceVersion; in reality, when a new resourceVersion is returned, the watch should continue from there like an endless loop, so this is normal. I'm just not sure how to fix the test. Fluent::UnrecoverableError should only be seen when the API server is unavailable long enough to fail all the retries.
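
To illustrate the accounting being described, a simplified sketch with assumed names, an illustrative retry limit, and assumed thread-local storage for the backoff interval: every failure bumps the per-thread counter, so routine "410 Gone" ERROR notices arriving every 40-50 minutes would eventually trip the limit unless something on the success path resets the counter.

MAX_POD_WATCH_RETRIES = 10 # illustrative limit

def note_pod_watch_failure(error)
  Thread.current[:pod_watch_retry_count] ||= 0
  Thread.current[:pod_watch_retry_count] += 1
  if Thread.current[:pod_watch_retry_count] > MAX_POD_WATCH_RETRIES
    # Only an API server that stays unhealthy long enough to exhaust every
    # retry should land here and force a fluentd restart.
    # Fluent::UnrecoverableError is provided by fluentd (fluent/error).
    raise Fluent::UnrecoverableError, "giving up watching pods: #{error.message}"
  end
  # Back off before retrying; see #219 for resetting the backoff interval on success.
  sleep(Thread.current[:pod_watch_retry_backoff_interval] || 1)
end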

@qingling128 (Contributor):

@mszabo - Ah, so it sounds like the 410 Gone ERROR is a recoverable error? I thought it was not a recoverable error. If it's recoverable, we can do 1382745 instead.

@mszabo (Author) commented Mar 11, 2020:

@qingling128 Yes, it's recoverable; we just need to refresh the resourceVersion. The commit you mentioned could also help the situation, but get_pods_and_start_watcher is a better place to reset the counter in my opinion; I think the test is wrong here. Anyway, I am happy with 1382745 as well, and the tests can stay as they are with it.

@mszabo (Author) commented Mar 17, 2020:

I removed the counter resets from this PR, as they are handled in 1382745.

@jcantrill merged commit f67dee0 into fabric8io:master on Mar 18, 2020
@qingling128 (Contributor):

FYI, I set up some experiments that ran overnight, and verified that the latest commit (which includes this fix) has better performance than 2.4.4.

The two commit hashes of the branches in comparison:

Commit hash of lingshi-2-4--4 branch:

lingshi@lingshi-dev:~/repo/fluent-plugin-kubernetes_metadata_filter$ git log -n 1
commit 70e2a15873153c1130a4e49e52c971138dbf7cf8 (HEAD -> lingshi-2-4--4, tag: v2.4.4, origin/lingshi-2-4--4)
Author: Jeff Cantrill <[email protected]>
Date:   Thu Mar 12 16:47:29 2020 -0400

    v2.4.4

Commit hash of lingshi-add-pod-ip branch:

$ git log -n 1
commit d3832d32425a4fb4cf9f0158f39e0e0d79930f33 (HEAD -> lingshi-add-pod-ip, origin/lingshi-add-pod-ip)
Author: Maxim Makarov <[email protected]>
Date:   Fri Mar 20 17:47:53 2020 +0300

    Add pod IP to kubernetes metadata (#218)

    * Add pod IP to kubernetes metadata

CPU comparison: [chart]

Memory comparison: [chart]

Fluentd log entry processing capability comparison: [chart]

Retries still work: [chart]

No container restarts: [chart]

Thanks @mszabo and @jcantrill !

One thing I noticed is that pod_watch_retry_backoff_interval did not get reset to 0 once the retry succeeded (not related to this PR), and I sent a fix: #219

@jcantrill (Contributor):

@qingling128 great work!

@qingling128 (Contributor):

Thank you!

I've verified the performance again as shown below after #219 got merged.

lingshi-70e2a1-244 is the 2.4.4 hash. lingshi-faff00 is the latest hash in the master branch.

CPU: [chart]

Memory: [chart]

Log throughput: [chart]

Retries are functional: [chart]

@jcantrill - Now that it looks all good, could you help create a new release when you get a sec?

Successfully merging this pull request may close these issues:

High request rate towards Kubernetes API server