raise exception after exiting watcher.each #214


Merged: 4 commits merged into fabric8io:master on Mar 18, 2020

Conversation

@mszabo (Contributor) commented Feb 26, 2020

fixes #213

@@ -119,6 +119,7 @@ def process_namespace_watcher_notices(watcher)
         @stats.bump(:namespace_cache_watch_ignored)
       end
     end
+    raise
Contributor:

Won't this simply fall through all the time and raise an exception on every notification? Should we be evaluating for a specific notice.type, or at least have continue statements above?

Contributor:

As I suspected:

Error: test: pod MODIFIED cached when hostname matches(DefaultPodWatchStrategyTest)
: RuntimeError: 
/home/circleci/fluent-plugin-kubernetes_metadata_filter/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:128:in `process_pod_watcher_notices'
/home/circleci/fluen

@jcantrill (Contributor) commented Feb 26, 2020:

It's probably something more like:

when 'ERROR'
  message = notice['object']['message'] if notice['object'] && notice['object']['message']
  raise "Error while watching namespaces: #{message}"

@mszabo (Author):

I was thinking about doing it with when 'ERROR', but then it can still exit the watcher.each block without an exception, which can cause an endless loop without backoff.

I don't know how this testing framework works, but in reality it stays inside the watcher.each block as long as the connection is open. Watching opens a long-running connection, and the API server usually kills the connection after 40-50 minutes.

Contributor:

Hmm... I thought the notice.type value could only be one of these three: ADDED, MODIFIED, DELETED.

Reference: https://github.com/abonas/kubeclient/blame/28dfc4d538d72127015dc9b90e6b776c4b0fb986/test/test_watch.rb#L8

@jcantrill 's suggestion in https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/214/files#r384729998 makes sense to me.

Looking at the issue reported at #213 (comment), it's not an unexpected exception. Besides, if an exception were thrown, the existing rescue block should have caught it.

@mszabo (Author):

There is no exception thrown when the connection is closed.

When you open the HTTP connection for the watcher you get a 200 OK reply from the server, but the connection stays open and the events are streamed over that same connection. Kubernetes will close it after around 1 hour (at least on our cluster).

Once it's closed there is no exception; the block finishes normally.

So the thread loop runs again, but because there was no exception the rescue block did not execute and nothing set pod_watcher to nil. It keeps using the resource_version from the first run.
The Kubernetes API remembers old resource versions for a while, but after some time it returns an ERROR notice with a "too old resource version" message and closes the connection (again, no exception is raised).

When this happens the thread loop starts the watcher again with the old resource version, gets the error again, and this continues in a loop. It results in high load on the API server with no way to recover except restarting fluentd.

I can rework this PR to use the when 'ERROR' case, but then the issue can still happen if the connection is closed by the API server without returning an ERROR notice or throwing an exception.

I think an exception should be raised whenever the watcher.each block is exited, because it means the long-running connection was closed gracefully.
Unfortunately I am not sure how I can make it work with the tests.
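
To make the intended control flow concrete, here is a minimal sketch, with hypothetical helper names, of the pattern described above (pods shown; namespaces are analogous): the notice processor raises once watcher.each returns, and the outer thread loop's rescue discards the watcher so a fresh resourceVersion is fetched before watching again.

def process_pod_watcher_notices(watcher)
  watcher.each do |notice|
    # ... handle ADDED / MODIFIED / DELETED / ERROR notices ...
  end
  # watcher.each only returns once the long-running connection is closed,
  # so treat that as a restart condition rather than falling through silently.
  raise 'Watch connection was closed by the API server'
end

def set_up_pod_thread
  pod_watcher = nil
  loop do
    begin
      pod_watcher ||= get_pods_and_start_watcher # queries a fresh resourceVersion
      process_pod_watcher_notices(pod_watcher)
    rescue StandardError
      # The real plugin also counts retries and backs off here.
      pod_watcher = nil # ensures the next iteration re-queries the resourceVersion
      sleep(1)
    end
  end
end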

Contributor:

Re #214 (comment): Ah, I see. That sounds good to me. Instead of a simple raise, can we give it some message to help with debugging / troubleshooting when the issue happens?

As for the fact that it returned an ERROR notice, it seems like there is some discussion at kubeclient/pull/275. Sounds like it might be fixed there.

@qingling128 (Contributor):

Perhaps we can get this fix merged the way it is right now to handle the ERROR type of notice and iterate on it later once the dust settles for ManageIQ/kubeclient#275?

@qingling128 (Contributor) left a review:

LGTM.

@mszabo (Author) commented Mar 6, 2020:

@qingling128 Looking at the code again, I think Thread.current[:pod_watch_retry_count] should be reset when non-error events are received; otherwise that counter will just keep increasing whenever the API server responds with 410 Gone, and after a couple of iterations the code will raise Fluent::UnrecoverableError. What do you think?

@qingling128 (Contributor):

@mszabo - Good point. What do you think about resetting it to 0 in the get_pods_and_start_watcher method instead of on every non-error event? That way, if the events are a mix of good ones and bad ones, we can still guarantee that we will eventually restart.
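
A sketch, with assumed option and variable names, of the reset point being discussed here: clear the per-thread retry counter only after a pod list (and therefore a fresh resourceVersion) has been obtained from the API server.

def get_pods_and_start_watcher
  options = {} # the real plugin passes selectors here
  pods = @client.get_pods(options)
  # A successful query means the watch can resume from a known-good
  # resourceVersion, so it is safe to forget previous failures.
  Thread.current[:pod_watch_retry_count] = 0
  options[:resource_version] = pods.resourceVersion
  @client.watch_pods(options)
end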

@mszabo (Author) commented Mar 9, 2020:

@qingling128 Sounds good. I tried to add it after the resourceVersion is successfully queried from the API server, but it makes the "pod watch retries when exceptions are encountered" test hang forever, and I am not sure how to fix the test.

@qingling128 (Contributor):

@mszabo - Interesting. I took a closer look. This (https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/215/files) seems to work and should be equivalent. Also, now that I think about it, once an Exception is raised, Fluentd will restart, which resets the variable anyway. So the change is probably not needed in the first place.

BTW, I'm not entirely sure why the test failed in the other approach. I guess it might be related to how the stubs work or something.

@jcantrill (Contributor):

@mszabo can you post unit test results until we can properly fix up the CI?

@qingling128 (Contributor):

I could view the test results in this PR. Quoted below:

#!/bin/bash -eo pipefail
bundle exec rake test
Run options: --seed 21318

# Running tests:



Finished tests in 0.000154s, 0.0000 tests/s, 0.0000 assertions/s.

0 tests, 0 assertions, 0 failures, 0 errors, 0 skips
Loaded suite /home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/rake-13.0.1/lib/rake/rake_test_loader
Started
...................................................rake aborted!
Interrupt: 
/home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:63:in `load'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:63:in `kernel_load'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:28:in `run'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli.rb:476:in `exec'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor.rb:399:in `dispatch'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli.rb:30:in `dispatch'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/base.rb:476:in `start'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/cli.rb:24:in `start'
/usr/local/bundle/gems/bundler-2.1.4/exe/bundle:46:in `block in <top (required)>'
/usr/local/bundle/gems/bundler-2.1.4/lib/bundler/friendly_errors.rb:123:in `with_friendly_errors'
/usr/local/bundle/gems/bundler-2.1.4/exe/bundle:34:in `<top (required)>'
/usr/local/bundle/bin/bundle:23:in `load'
/usr/local/bundle/bin/bundle:23:in `<main>'
Tasks: TOP => test => base_test
(See full trace by running task with --trace)
F
===============================================================================
Failure: test: pod watch retries when exceptions are encountered(DefaultPodWatchStrategyTest)
/home/circleci/fluent-plugin-kubernetes_metadata_filter/test/plugin/test_watch_pods.rb:216:in `block (3 levels) in <class:DefaultPodWatchStrategyTest>'
     213:     test 'pod watch retries when exceptions are encountered' do
     214:       @client.stub :get_pods, @initial do
     215:         @client.stub :watch_pods, [[@created, @exception_raised]] do
  => 216:           assert_raise Fluent::UnrecoverableError do
     217:             set_up_pod_thread
     218:           end
     219:           assert_equal(3, @stats[:pod_watch_failures])
/home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/minitest-4.7.5/lib/minitest/mock.rb:187:in `stub'
/home/circleci/fluent-plugin-kubernetes_metadata_filter/test/plugin/test_watch_pods.rb:215:in `block (2 levels) in <class:DefaultPodWatchStrategyTest>'
/home/circleci/fluent-plugin-kubernetes_metadata_filter/vendor/bundle/ruby/2.4.0/gems/minitest-4.7.5/lib/minitest/mock.rb:187:in `stub'
/home/circleci/fluent-plugin-kubernetes_metadata_filter/test/plugin/test_watch_pods.rb:214:in `block in <class:DefaultPodWatchStrategyTest>'

<Fluent::UnrecoverableError> expected but was
<SignalException(<SIGHUP>)>

diff:
? F    lu  e    nt::   UnrecoverableError
? Signa Exc ptio (<SIGH P>)              
===============================================================================
.....
Too long with no output (exceeded 10m0s): context deadline exceeded

I was able to reproduce the error above locally. The tests simply hung indefinitely until I manually killed them.

BTW, CI seems fixed already (at least it's passing for the master branch when I tried it). The experiment PR I tried also passed: #215.

@mszabo (Author) commented Mar 10, 2020:

> @mszabo - Interesting. I took a closer look. This (https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/215/files) seems to work and should be equivalent. Also, now that I think about it, once an Exception is raised, Fluentd will restart, which resets the variable anyway. So the change is probably not needed in the first place.

@qingling128 I don't think #215 is equivalent. As you said, Fluent::UnrecoverableError will restart the process and the variable will reset anyway, so there is no point in resetting the variable just before throwing Fluent::UnrecoverableError.

The original problem is that the Kubernetes API server will send a 410 Gone ERROR (note: 410 is not an HTTP response code here) every 40-50 minutes. This PR handles that by raising an exception, which is caught, and the retry is executed (a new resourceVersion is polled from the API server). If the retry counter is never reset, then after 10 "410 Gone" messages we will get a Fluent::UnrecoverableError and a process restart even though it's unnecessary. So I think the right place to reset the retry counter is after the new resourceVersion is successfully queried from the API server.

I believe the test gets into an endless loop because it always returns a mocked resourceVersion; in reality, when a new resourceVersion is returned, the watch should continue from there like an endless loop, so this is normal. I'm just not sure how to fix the test. Fluent::UnrecoverableError should only be seen when the API server is unavailable long enough to fail all the retries.
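
To illustrate the accounting being described, a simplified sketch with assumed names, an illustrative retry limit, and assumed thread-local storage for the backoff interval: every failure bumps the per-thread counter, so routine "410 Gone" ERROR notices arriving every 40-50 minutes would eventually trip the limit unless something on the success path resets the counter.

MAX_POD_WATCH_RETRIES = 10 # illustrative limit

def note_pod_watch_failure(error)
  Thread.current[:pod_watch_retry_count] ||= 0
  Thread.current[:pod_watch_retry_count] += 1
  if Thread.current[:pod_watch_retry_count] > MAX_POD_WATCH_RETRIES
    # Only an API server that stays unhealthy long enough to exhaust every
    # retry should land here and force a fluentd restart.
    # Fluent::UnrecoverableError is provided by fluentd (fluent/error).
    raise Fluent::UnrecoverableError, "giving up watching pods: #{error.message}"
  end
  # Back off before retrying; see #219 for resetting the backoff interval on success.
  sleep(Thread.current[:pod_watch_retry_backoff_interval] || 1)
end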

@qingling128 (Contributor):

@mszabo - Ah, so it sounds like the 410 Gone ERROR is a recoverable error? I thought it was not a recoverable error. If it's recoverable, we can do 1382745 instead.

@mszabo (Author) commented Mar 11, 2020:

@qingling128 Yes, it's recoverable; we just need to refresh the resourceVersion. The commit you mentioned could also help the situation, but get_pods_and_start_watcher is a better place to reset the counter in my opinion; I think the test is wrong here. Anyway, I am happy with 1382745 as well, and the tests can stay as they are with it.

@mszabo (Author) commented Mar 17, 2020:

I removed the counter resets from this PR, as they are handled in 1382745.

@jcantrill merged commit f67dee0 into fabric8io:master on Mar 18, 2020
@qingling128 (Contributor):

FYI, I set up some experiments that ran overnight, and verified that the latest commit (which includes this fix) has better performance than 2.4.4.

The two commit hashes of the branches in comparison:

Commit hash of lingshi-2-4--4 branch:

lingshi@lingshi-dev:~/repo/fluent-plugin-kubernetes_metadata_filter$ git log -n 1
commit 70e2a15873153c1130a4e49e52c971138dbf7cf8 (HEAD -> lingshi-2-4--4, tag: v2.4.4, origin/lingshi-2-4--4)
Author: Jeff Cantrill <[email protected]>
Date:   Thu Mar 12 16:47:29 2020 -0400

    v2.4.4

Commit hash of lingshi-add-pod-ip branch:

$ git log -n 1
commit d3832d32425a4fb4cf9f0158f39e0e0d79930f33 (HEAD -> lingshi-add-pod-ip, origin/lingshi-add-pod-ip)
Author: Maxim Makarov <[email protected]>
Date:   Fri Mar 20 17:47:53 2020 +0300

    Add pod IP to kubernetes metadata (#218)

    * Add pod IP to kubernetes metadata

CPU comparison: [chart]

Memory comparison: [chart]

Fluentd log entry processing capability comparison: [chart]

Retries still work: [chart]

No container restarts: [chart]

Thanks @mszabo and @jcantrill !

One thing I noticed is that pod_watch_retry_backoff_interval did not get reset to 0 once the retry succeeded (not related to this PR), and I sent a fix: #219

@jcantrill (Contributor):

@qingling128 great work!

@qingling128 (Contributor):

Thank you!

I've verified the performance again as shown below after #219 got merged.

lingshi-70e2a1-244 is the 2.4.4 hash. lingshi-faff00 is the latest hash in the master branch.

CPU: [chart]

Memory: [chart]

Log throughput: [chart]

Retries are functional: [chart]

@jcantrill - Now that it looks all good, could you help create a new release when you get a sec?

Successfully merging this pull request may close these issues:

High request rate towards Kubernetes API server