This repository was archived by the owner on Nov 21, 2023. It is now read-only.

Bug 1302264 - Add support for new-style gzip-compressed objects #88

Merged: 1 commit merged into mozilla:master from whd:s3_gzip on Jan 16, 2017

Conversation

@whd (Member) commented on Nov 8, 2016

Requires mozilla/telemetry-tools#6.

I've made it the responsibility of the code that returns the S3 byte stream to apply the streaming gzip wrapper, because that code generally has the easiest access to the object's content encoding.

The implementation of parse_heka_message is mocked in the tests in this repo, but the gzip parsing code is tested in telemetry-tools.

The only piece without tests is the conditional movement of the payload from Payload to Fields[submission] in the new infra, which I have added a special case for and tested manually.
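The approach can be sketched in a few lines; `maybe_gunzip` and its signature are illustrative here, not the PR's actual helper:

```python
import gzip
import io

def maybe_gunzip(fileobj, content_encoding):
    """Wrap a raw byte stream in a streaming gzip reader when the object's
    Content-Encoding marks it as gzip-compressed; otherwise return the
    stream unchanged."""
    if content_encoding == 'gzip':
        return gzip.GzipFile(fileobj=fileobj, mode='rb')
    return fileobj

# A gzip-compressed payload round-trips through the wrapper.
raw = io.BytesIO(gzip.compress(b'{"payload": 1}'))
assert maybe_gunzip(raw, 'gzip').read() == b'{"payload": 1}'
```

Because `GzipFile` reads from the underlying file object lazily, the caller can stream a large object without holding the whole decompressed body in memory.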



@vitillo (Contributor) commented on Nov 9, 2016

@maurodoglio r?

@maurodoglio (Contributor) commented:

Hey @whd, it looks like the tests are failing in CI, could you please fix them?

# Special case: the submission field (bytes) replaces the top level
# Payload in the hindsight-based infra
if name[0] == 'submission':
    result.update(json.loads(field.value_bytes[0].decode('utf-8')))
maurodoglio (Contributor) commented on this code:

Someone with some knowledge of heka should review these 4 lines. Maybe @mreid-moz ?

@@ -36,7 +38,12 @@ def get_key(self, key):
     try:
         # get_key must return a file-like object because that's what's
         # required by parse_heka_message
-        return bucket.Object(key).get()['Body']
+        s3object = bucket.Object(key).get()
+        if s3object['ResponseMetadata']['HTTPHeaders'].get(
A Contributor commented:

According to the boto3 docs there should be a ContentEncoding key in the returned dictionary

A Contributor commented:

It would be great to have a test for this condition.

@whd (Member, Author) commented:

I've updated the code to use the top-level key for content-encoding instead of the low-level one.
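In boto3, `Object.get()` returns a dict with a top-level `ContentEncoding` key alongside `Body`, so the check can be sketched like this (`body_from_s3_response` is a hypothetical helper name, and the response here is a stand-in dict rather than a real boto3 call):

```python
import gzip
import io

def body_from_s3_response(response):
    """Return the response Body, wrapped in a streaming gzip reader when
    the top-level ContentEncoding key (rather than the low-level
    ResponseMetadata headers) says the object is gzip-compressed."""
    body = response['Body']
    if response.get('ContentEncoding') == 'gzip':
        return gzip.GzipFile(fileobj=body, mode='rb')
    return body

# Stand-in for bucket.Object(key).get(), so the sketch runs without boto3.
fake_response = {
    'Body': io.BytesIO(gzip.compress(b'heka framed record')),
    'ContentEncoding': 'gzip',
}
assert body_from_s3_response(fake_response).read() == b'heka framed record'
```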

As with telemetry-batch-view, the test harness here uses an in-memory implementation of the S3Store (InMemoryStore), which makes it difficult to write a test that actually exercises the code above. I could add a test like test_get_gzip_key that applies the streaming wrapper inside the test logic and confirms the data written to disk is decompressed, but that would essentially be a subset of what the telemetry-tools test covers, and to me it feels like confirming that Python can gzip and gunzip a file. It could also require changing the S3Store API to support specifying a content encoding when uploading an object, which I wanted to avoid. One way to avoid changing the API would be to infer content encoding from, e.g., the file extension in InMemoryStore, but that reverts to the same gzip-gunzip cycle.

@whd (Member, Author) commented on Nov 14, 2016

@maurodoglio the travis build fails because it requires mozilla/telemetry-tools#6 to be merged first. I'll have @mreid-moz review that because he has some knowledge of heka.

@mreid-moz (Contributor) commented:

Reviewed 1 of 3 files at r1.
Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


moztelemetry/heka_message_parser.py, line 42 at r1 (raw file):

Previously, maurodoglio (Mauro Doglio) wrote…

Someone with some knowledge of heka should review these 4 lines. Maybe @mreid-moz ?

This looks fine to me. It could slightly change the semantics if there's a field in the `submission`/`payload` with the same name as one of the heka fields. Previously: the payload fields would be overwritten by any heka field with the same name. Now: which field wins out will be based on the order in which the fields are encountered.

In the telemetry use case, I don't think this is a concern, since we should never encounter a message where a field of the same name exists in both the submission and a heka message field.
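The ordering concern can be illustrated with plain dict merges (the field names here are hypothetical, chosen only to show a collision):

```python
# 'clientId' exists both as a heka message field and inside the submission.
heka_fields = {'docType': 'main', 'clientId': 'from-heka-field'}
submission = {'clientId': 'from-submission'}

# Merging the submission last means its value wins on a collision...
merged = {**heka_fields, **submission}
assert merged['clientId'] == 'from-submission'

# ...while merging the heka fields last reverses the outcome.
merged = {**submission, **heka_fields}
assert merged['clientId'] == 'from-heka-field'
```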



@whd (Member, Author) commented on Nov 15, 2016

I've come across an issue while testing this PR a bit more that requires investigation / discussion, so this PR should not be merged until it has been resolved. It's problematic enough that I will file a separate bug about it tomorrow, but the gist of it is that the client, the new infra, or both occasionally produce some large numbers that don't fit into doubles, and ujson doesn't like that very much. It is masked in the current infra by mpx/lua-cjson#37. See also ultrajson/ultrajson#49.
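The underlying problem can be demonstrated with a value at the 64-bit boundary (the specific numbers the pipeline produced are not given here):

```python
# 2**63 - 1 is a common 64-bit integer boundary; the nearest IEEE-754
# double is 2**63, so any JSON parser that routes all numbers through
# doubles silently corrupts the value.
n = 9223372036854775807            # 2**63 - 1
assert float(n) != n               # the nearest double is 2**63
assert int(float(n)) == 9223372036854775808
```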

@whd (Member, Author) commented on Jan 4, 2017

I believe I've resolved the above issues, as follows:

Fall back to the standard python json parser for cases where ujson fails.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1326107 and @mreid-moz's above comment, perform the payload/submission merge in known order due to the residual NULLs.
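A fallback of this shape can be sketched as follows (`loads_with_fallback` is a hypothetical name, and the stand-in `picky_loads` simulates ujson rejecting an out-of-range number, so the sketch runs without ujson installed):

```python
import json

def loads_with_fallback(s, fast_loads=None):
    """Parse with a fast parser (e.g. ujson.loads) when available, falling
    back to the standard json module for inputs the fast parser rejects,
    such as integers too large to fit exactly in a double."""
    if fast_loads is not None:
        try:
            return fast_loads(s)
        except ValueError:
            pass
    return json.loads(s)

def picky_loads(s):
    # Stand-in for ujson.loads refusing an out-of-range number.
    raise ValueError('Value is too big!')

big = '{"id": 9223372036854775807}'
assert loads_with_fallback(big, fast_loads=picky_loads)['id'] == 9223372036854775807
assert loads_with_fallback('{"id": 1}', fast_loads=json.loads)['id'] == 1
```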

@mreid-moz can you re-r? This still requires mozilla/telemetry-tools#6.

@coveralls: Coverage increased (+0.6%) to 63.442% when pulling db505f6 on whd:s3_gzip into 93c180a on mozilla:master.


@coveralls: Coverage increased (+0.3%) to 63.182% when pulling 8e6b85f on whd:s3_gzip into 93c180a on mozilla:master.

@whd (Member, Author) commented on Jan 14, 2017

I've imported/rebased the changes I had from mozilla/telemetry-tools#6 here. I also added a UTF-8 check per feedback in mozilla/telemetry-batch-view#159, and added a helper JSON parsing method since json.loads was being called (and failing) in multiple places.

I put the streaming gzip wrapper in a util/ subdirectory since it's not strictly related to heka, but it can be moved elsewhere.

@maurodoglio / @mreid-moz can you re-r?

@maurodoglio (Contributor) left a comment:

Thanks @whd!

@mreid-moz (Contributor) commented:

:lgtm:


Reviewed 9 of 9 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions.



@mreid-moz (Contributor) commented:

Looks good to me too.

5 participants