Bug 1302264 - Add support for new-style gzip-compressed objects #88
Conversation
@maurodoglio r?
Hey @whd, it looks like the tests are failing in CI, could you please fix them?
# Special case: the submission field (bytes) replaces the top level
# Payload in the hindsight-based infra
if name[0] == 'submission':
    result.update(json.loads(field.value_bytes[0].decode('utf-8')))
Someone with some knowledge of heka should review these 4 lines. Maybe @mreid-moz?
@@ -36,7 +38,12 @@ def get_key(self, key):
     try:
         # get_key must return a file-like object because that's what's
         # required by parse_heka_message
-        return bucket.Object(key).get()['Body']
+        s3object = bucket.Object(key).get()
+        if s3object['ResponseMetadata']['HTTPHeaders'].get(
According to the boto3 docs there should be a ContentEncoding key in the returned dictionary.
It would be great to have a test for this condition.
I've updated the code to use the top-level key for content-encoding instead of the low-level one.

Like with telemetry-batch-view, the testing harness here creates an implementation of the S3Store (InMemoryStore) for testing that makes adding a test that actually uses the code above difficult. I can add a test like test_get_gzip_key that adds the streaming wrapper within the testing logic and confirms the data written to disk is decompressed, but that's essentially a subset of what the test in telemetry-tools does, and to me feels like confirming that python can gzip and gunzip a file. It would also potentially involve changing the S3Store API to support specifying a content-encoding when uploading an object, which is something I didn't want to do. An option to avoid changing the API would be to use e.g. the file extension to determine content-encoding for the InMemoryStore, but that reverts to the same gzip-gunzip cycle.
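For illustration, here is a minimal sketch of how the updated get_key could look when reading the top-level key; self.bucket and the streaming_gzip_wrapper helper are assumptions for the sketch, not the merged code:

```python
def get_key(self, key):
    # get_key must return a file-like object because that's what's
    # required by parse_heka_message
    s3object = self.bucket.Object(key).get()
    body = s3object['Body']
    # boto3 exposes the object's Content-Encoding as a top-level
    # 'ContentEncoding' key on the get() response when it was set at upload
    if s3object.get('ContentEncoding') == 'gzip':
        body = streaming_gzip_wrapper(body)  # hypothetical helper, see below
    return body
```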
@maurodoglio the Travis build fails because it requires mozilla/telemetry-tools#6 to be merged first. I'll have @mreid-moz review that because he has some knowledge of heka.
Reviewed 1 of 3 files at r1. moztelemetry/heka_message_parser.py, line 42 at r1 (raw file):
I've come across an issue while testing this PR a bit more that requires investigation/discussion, so this PR should not be merged until it has been resolved. It's problematic enough that I will file a separate bug about it tomorrow, but the gist of it is that the client, the new infra, or both occasionally produce large numbers that don't fit into doubles, and ujson doesn't like that very much. It is masked in the current infra by mpx/lua-cjson#37. See also ultrajson/ultrajson#49.
I believe I've resolved the above issues, as follows:

- Fall back to the standard python json parser for cases where ujson fails.
- Per https://bugzilla.mozilla.org/show_bug.cgi?id=1326107 and @mreid-moz's above comment, perform the payload/submission merge in known order due to the residual NULLs.

@mreid-moz can you re-r? This still requires mozilla/telemetry-tools#6.
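A minimal sketch of what that fallback could look like; the helper name _parse_json and the exact exception handling are assumptions, not necessarily the code as merged:

```python
import json

import ujson


def _parse_json(raw):
    """Parse a JSON document with ujson, falling back to the standard
    library parser when ujson rejects the input (e.g. integers that do
    not fit into a double)."""
    try:
        return ujson.loads(raw)
    except ValueError:
        return json.loads(raw)
```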
I've imported/rebased the changes I had from mozilla/telemetry-tools/pull/6 here. I also added a UTF8 check per feedback in mozilla/telemetry-batch-view/pull/159 and added a helper json parsing method. I put the streaming gzip wrapper in a util/ subdirectory since it's not strictly related to heka, but it can be moved elsewhere. @maurodoglio / @mreid-moz can you re-r?
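A streaming gzip wrapper along these lines can be built on zlib's incremental decompressor. The sketch below is illustrative only (the class name and chunk size are assumptions) and is not the util/ code referenced above:

```python
import zlib


class GzipStreamReader(object):
    """Minimal file-like wrapper that decompresses a gzip stream
    incrementally, so the whole object never needs to be in memory."""

    def __init__(self, fileobj, chunk_size=2 ** 16):
        self.fileobj = fileobj
        self.chunk_size = chunk_size
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        self.decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self.buffer = b''

    def read(self, size=-1):
        # Pull compressed chunks until we have `size` decompressed bytes
        # (or the underlying stream is exhausted).
        while size < 0 or len(self.buffer) < size:
            chunk = self.fileobj.read(self.chunk_size)
            if not chunk:
                break
            self.buffer += self.decompressor.decompress(chunk)
        if size < 0:
            data, self.buffer = self.buffer, b''
        else:
            data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data
```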
Thanks @whd!
Reviewed 9 of 9 files at r2. Comments from Reviewable
Looks good to me too.
Requires mozilla/telemetry-tools#6.
I've made it the responsibility of the code that is returning the s3 byte stream to apply the streaming gzip wrapper, because that code generally has the easiest access to the object's content-encoding.
The implementation of parse_heka_message is mocked in the tests in this repo, but the gzip parsing code is tested in telemetry-tools. The only piece without tests is the conditional movement of the payload from Payload to Fields[submission] in the new infra, which I have added a special case for and tested manually.
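To illustrate the merge order being relied on (a sketch only; message stands for a decoded heka protobuf message and _parse_json for a fallback JSON helper, both assumptions rather than the actual parser code):

```python
# Payload is parsed first; the new-style submission field, when present,
# is applied on top, so its keys take precedence in the merged result.
result = {}
if message.payload:
    result.update(_parse_json(message.payload))
for field in message.fields:
    if field.name == 'submission':
        result.update(_parse_json(field.value_bytes[0].decode('utf-8')))
```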