This repository was archived by the owner on Nov 21, 2023. It is now read-only.

Bug 1302264 - Add support for new-style gzip-compressed objects #88

Merged: 1 commit merged into mozilla:master from whd:s3_gzip on Jan 16, 2017

Conversation

@whd (Member) commented on Nov 8, 2016

Requires mozilla/telemetry-tools#6.

I've made it the responsibility of the code that returns the S3 byte stream to apply the streaming gzip wrapper, because that code generally has the easiest access to the object's content encoding.

The implementation of parse_heka_message is mocked in the tests in this repo, but the gzip parsing code is tested in telemetry-tools.

The only piece without tests is the conditional movement of the payload from Payload to Fields[submission] in the new infra, which I have added a special case for and tested manually.
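The approach can be sketched in a few lines; `maybe_gunzip` and its signature are illustrative here, not the PR's actual helper:

```python
import gzip
import io

def maybe_gunzip(fileobj, content_encoding):
    """Wrap a raw byte stream in a streaming gzip reader when the object's
    Content-Encoding marks it as gzip-compressed; otherwise return the
    stream unchanged."""
    if content_encoding == 'gzip':
        return gzip.GzipFile(fileobj=fileobj, mode='rb')
    return fileobj

# A gzip-compressed payload round-trips through the wrapper.
raw = io.BytesIO(gzip.compress(b'{"payload": 1}'))
assert maybe_gunzip(raw, 'gzip').read() == b'{"payload": 1}'
```

Because `GzipFile` reads from the underlying file object lazily, the caller can stream a large object without holding the whole decompressed body in memory.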



@vitillo (Contributor) commented on Nov 9, 2016

@maurodoglio r?

@maurodoglio (Contributor) commented:

Hey @whd, it looks like the tests are failing in CI, could you please fix them?

# Special case: the submission field (bytes) replaces the top level
# Payload in the hindsight-based infra
if name[0] == 'submission':
    result.update(json.loads(field.value_bytes[0].decode('utf-8')))
maurodoglio (Contributor) commented on this code:

Someone with some knowledge of heka should review these 4 lines. Maybe @mreid-moz ?

@@ -36,7 +38,12 @@ def get_key(self, key):
     try:
         # get_key must return a file-like object because that's what's
         # required by parse_heka_message
-        return bucket.Object(key).get()['Body']
+        s3object = bucket.Object(key).get()
+        if s3object['ResponseMetadata']['HTTPHeaders'].get(
A Contributor commented:

According to the boto3 docs there should be a ContentEncoding key in the returned dictionary

A Contributor commented:

It would be great to have a test for this condition.

@whd (Member, Author) commented:

I've updated the code to use the top-level key for content-encoding instead of the low-level one.
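In boto3, `Object.get()` returns a dict with a top-level `ContentEncoding` key alongside `Body`, so the check can be sketched like this (`body_from_s3_response` is a hypothetical helper name, and the response here is a stand-in dict rather than a real boto3 call):

```python
import gzip
import io

def body_from_s3_response(response):
    """Return the response Body, wrapped in a streaming gzip reader when
    the top-level ContentEncoding key (rather than the low-level
    ResponseMetadata headers) says the object is gzip-compressed."""
    body = response['Body']
    if response.get('ContentEncoding') == 'gzip':
        return gzip.GzipFile(fileobj=body, mode='rb')
    return body

# Stand-in for bucket.Object(key).get(), so the sketch runs without boto3.
fake_response = {
    'Body': io.BytesIO(gzip.compress(b'heka framed record')),
    'ContentEncoding': 'gzip',
}
assert body_from_s3_response(fake_response).read() == b'heka framed record'
```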

As with telemetry-batch-view, the test harness here uses an in-memory implementation of the S3Store (InMemoryStore), which makes it difficult to write a test that actually exercises the code above. I could add a test like test_get_gzip_key that applies the streaming wrapper inside the test logic and confirms the data written to disk is decompressed, but that would essentially be a subset of what the telemetry-tools test covers, and to me it feels like confirming that Python can gzip and gunzip a file. It could also require changing the S3Store API to support specifying a content encoding when uploading an object, which I wanted to avoid. One way to avoid changing the API would be to infer content encoding from, e.g., the file extension in InMemoryStore, but that reverts to the same gzip-gunzip cycle.

@whd (Member, Author) commented on Nov 14, 2016

@maurodoglio the travis build fails because it requires mozilla/telemetry-tools#6 to be merged first. I'll have @mreid-moz review that because he has some knowledge of heka.

@mreid-moz (Contributor) commented:

Reviewed 1 of 3 files at r1.
Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


moztelemetry/heka_message_parser.py, line 42 at r1 (raw file):

Previously, maurodoglio (Mauro Doglio) wrote…

Someone with some knowledge of heka should review these 4 lines. Maybe @mreid-moz ?

This looks fine to me. It could slightly change the semantics if there's a field in the `submission`/`payload` with the same name as one of the heka fields. Previously: the payload fields would be overwritten by any heka field with the same name. Now: which field wins out will be based on the order in which the fields are encountered.

In the telemetry use case, I don't think this is a concern, since we should never encounter a message where a field of the same name exists in both the submission and a heka message field.
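The ordering concern can be illustrated with plain dict merges (the field names here are hypothetical, chosen only to show a collision):

```python
# 'clientId' exists both as a heka message field and inside the submission.
heka_fields = {'docType': 'main', 'clientId': 'from-heka-field'}
submission = {'clientId': 'from-submission'}

# Merging the submission last means its value wins on a collision...
merged = {**heka_fields, **submission}
assert merged['clientId'] == 'from-submission'

# ...while merging the heka fields last reverses the outcome.
merged = {**submission, **heka_fields}
assert merged['clientId'] == 'from-heka-field'
```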



@whd (Member, Author) commented on Nov 15, 2016

I've come across an issue while testing this PR a bit more that requires investigation / discussion, so this PR should not be merged until it has been resolved. It's problematic enough that I will file a separate bug about it tomorrow, but the gist of it is that the client, the new infra, or both occasionally produce some large numbers that don't fit into doubles, and ujson doesn't like that very much. It is masked in the current infra by mpx/lua-cjson#37. See also ultrajson/ultrajson#49.
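The underlying problem can be demonstrated with a value at the 64-bit boundary (the specific numbers the pipeline produced are not given here):

```python
# 2**63 - 1 is a common 64-bit integer boundary; the nearest IEEE-754
# double is 2**63, so any JSON parser that routes all numbers through
# doubles silently corrupts the value.
n = 9223372036854775807            # 2**63 - 1
assert float(n) != n               # the nearest double is 2**63
assert int(float(n)) == 9223372036854775808
```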

@whd (Member, Author) commented on Jan 4, 2017

I believe I've resolved the above issues, as follows:

Fall back to the standard python json parser for cases where ujson fails.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1326107 and @mreid-moz's above comment, perform the payload/submission merge in known order due to the residual NULLs.
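A fallback of this shape can be sketched as follows (`loads_with_fallback` is a hypothetical name, and the stand-in `picky_loads` simulates ujson rejecting an out-of-range number, so the sketch runs without ujson installed):

```python
import json

def loads_with_fallback(s, fast_loads=None):
    """Parse with a fast parser (e.g. ujson.loads) when available, falling
    back to the standard json module for inputs the fast parser rejects,
    such as integers too large to fit exactly in a double."""
    if fast_loads is not None:
        try:
            return fast_loads(s)
        except ValueError:
            pass
    return json.loads(s)

def picky_loads(s):
    # Stand-in for ujson.loads refusing an out-of-range number.
    raise ValueError('Value is too big!')

big = '{"id": 9223372036854775807}'
assert loads_with_fallback(big, fast_loads=picky_loads)['id'] == 9223372036854775807
assert loads_with_fallback('{"id": 1}', fast_loads=json.loads)['id'] == 1
```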

@mreid-moz can you re-r? This still requires mozilla/telemetry-tools#6.

@coveralls: Coverage increased (+0.6%) to 63.442% when pulling db505f6 on whd:s3_gzip into 93c180a on mozilla:master.


@coveralls: Coverage increased (+0.3%) to 63.182% when pulling 8e6b85f on whd:s3_gzip into 93c180a on mozilla:master.

@whd (Member, Author) commented on Jan 14, 2017

I've imported/rebased the changes I had from mozilla/telemetry-tools#6 here. I also added a UTF-8 check per feedback in mozilla/telemetry-batch-view#159, and added a helper JSON parsing method since json.loads was being called (and failing) in multiple places.

I put the streaming gzip wrapper in a util/ subdirectory since it's not strictly related to heka, but it can be moved elsewhere.

@maurodoglio / @mreid-moz can you re-r?

@maurodoglio (Contributor) left a comment:

Thanks @whd!

@mreid-moz (Contributor) commented:

:lgtm:


Reviewed 9 of 9 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions.



@mreid-moz (Contributor) commented:

Looks good to me too.

5 participants