Use fixed column order in baseline_clients_last_seen #7119
Conversation
Dry run passed, so I think it's OK.
The main caveat here is that from now on, when new fields get added to the daily tables, they'll need to be explicitly added to these tables. Previously, propagation happened automatically.
Not sure if we want to (or can) add some checks to make sure fields get propagated correctly, since this has been the assumption for these datasets until now: https://docs.telemetry.mozilla.org/datasets/bigquery/clients_last_seen/reference.html?highlight=clients_daily#introduction
One way could be to use the SQL generation to fill out the fields. We were already more or less adding fields to last_seen explicitly because of the schema.yaml. This is an uncommon and non-blocking failure case since it only affects stage dry runs, so it's not urgent. I'll create a ticket for this and think more about it.
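One possible shape for such a propagation check, as a minimal sketch only: compare the column names of a daily table with those of the corresponding last_seen table and report anything that was not carried over. The function name, the `ignore` parameter, and the sample column lists are hypothetical and are not part of bigquery-etl.

```python
def unpropagated_fields(daily_columns, last_seen_columns, ignore=()):
    """Return daily-table columns that are absent from the last_seen table.

    The result keeps the daily table's column order. `ignore` lists columns
    that are intentionally not propagated.
    """
    present = set(last_seen_columns)
    skipped = set(ignore)
    return [c for c in daily_columns if c not in present and c not in skipped]


# Hypothetical column lists, for illustration only.
daily = ["submission_date", "client_id", "isp", "app_build", "new_field"]
last_seen = ["submission_date", "client_id", "days_seen_bits", "isp", "app_build"]

missing = unpropagated_fields(daily, last_seen)
print(missing)
```

A check like this could fail CI when the list is non-empty, which would keep propagation explicit while still catching fields that were forgotten.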
I lean towards having it explicit; it just makes my life easier (it's easier to troubleshoot too, and fewer things break downstream when new things get added, since you then have to intentionally add them to this layer). Just my 2 cents!
I'm good with this but we can discuss more widely if needed too (I know there are differing opinions)
Created https://mozilla-hub.atlassian.net/browse/DENG-7977. I'll hold off on this for now, but I'll try to get back to it soon.
I'm confused, because there's currently a hard-coded |
Personally I'd be inclined to hardcode the column lists so that changes to these heavily-relied-upon tables are made more intentionally.
If we're going to the effort of fixing column orders in this query, there are a couple other cases it'd be good to fix up:
- The init query has `submission_date` after the `*_bits` columns. I'd recommend we have `submission_date` be the first column in the init query like it is in the main query (IMO partition columns should always be the first column in the schema).
- The `_previous` CTE has the `*_bits` columns in a different order than the init query and the `_current` CTE. Luckily, because the `IF(_current.client_id IS NOT NULL, _current, _previous).* REPLACE (...)` expression is replacing those `*_bits` columns with explicit references to the associated columns from `_current` and `_previous`, the end result is correct. However, if those `*_bits` columns weren't being replaced like that, this would silently be doing the wrong thing (similar to unioning two queries with different column orders), so IMO it would be good to make the `*_bits` column order in `_previous` match the other two cases.
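The union analogy in the comment above can be reproduced in a small sketch. This uses SQLite from Python's standard library rather than BigQuery, and the table names and values are hypothetical; it only illustrates that set operations match columns by position, not by name, and does not model the `IF(...).*` struct expression itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Same columns, but the two *_bits columns are declared in a different order.
    CREATE TABLE current_rows (client_id TEXT, days_seen_bits INTEGER, days_active_bits INTEGER);
    CREATE TABLE previous_rows (client_id TEXT, days_active_bits INTEGER, days_seen_bits INTEGER);
    INSERT INTO current_rows (client_id, days_seen_bits, days_active_bits) VALUES ('a', 1, 2);
    INSERT INTO previous_rows (client_id, days_seen_bits, days_active_bits) VALUES ('b', 10, 20);
    """
)

# SELECT * relies on declaration order, so the second query's values are
# swapped relative to the first query's column names. No error is raised.
by_position = conn.execute(
    "SELECT * FROM current_rows UNION ALL SELECT * FROM previous_rows ORDER BY client_id"
).fetchall()

# Listing the columns explicitly, in the same order on both sides, fixes it.
columns = "client_id, days_seen_bits, days_active_bits"
explicit = conn.execute(
    f"SELECT {columns} FROM current_rows UNION ALL "
    f"SELECT {columns} FROM previous_rows ORDER BY client_id"
).fetchall()

print(by_position)
print(explicit)
```

In the first result, client `b` ends up with its `days_active_bits` value under the `days_seen_bits` position, which is the kind of silent mismatch a consistent column order guards against.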
isp,
app_build,
app_channel,
app_display_version,
architecture,
device_manufacturer,
device_model,
telemetry_sdk_build,
first_seen_date,
is_new_profile,
It'd be good if this query's column order matched the `schema.yaml` column order (also in the `_previous` CTE).
Current order:
isp,
app_build,
app_channel,
app_display_version,
architecture,
device_manufacturer,
device_model,
telemetry_sdk_build,
first_seen_date,
is_new_profile,
Suggested order:
app_build,
app_channel,
app_display_version,
architecture,
device_manufacturer,
device_model,
telemetry_sdk_build,
first_seen_date,
is_new_profile,
isp,
Or alternatively, the `schema.yaml` could be updated to match this column order.
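A comparison like the one requested here could be automated. The following is a rough sketch, not existing bigquery-etl tooling: the function name is hypothetical, and the two lists simply restate the orders from the suggestion above, assuming the suggested order reflects `schema.yaml`.

```python
from itertools import zip_longest


def first_order_mismatch(query_columns, schema_columns):
    """Return (position, query_column, schema_column) for the first position
    where the two orders differ, or None if they are identical.

    A missing column on either side shows up as None in the tuple.
    """
    pairs = zip_longest(query_columns, schema_columns)
    for position, (query_col, schema_col) in enumerate(pairs):
        if query_col != schema_col:
            return position, query_col, schema_col
    return None


shared = [
    "app_build", "app_channel", "app_display_version", "architecture",
    "device_manufacturer", "device_model", "telemetry_sdk_build",
    "first_seen_date", "is_new_profile",
]
query_order = ["isp"] + shared    # order in the query under review
schema_order = shared + ["isp"]   # order assumed to match schema.yaml

mismatch = first_order_mismatch(query_order, schema_order)
print(mismatch)
```

Reporting only the first differing position keeps the message short; once that column is moved, rerunning the check surfaces the next difference, if any.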
The prod deploys also run |
Description
The difference in column order is causing the focus queries to fail when dry running in https://app.circleci.com/pipelines/github/mozilla/bigquery-etl/46413/workflows/1a25c214-59e6-42f7-8e30-834ec5d10098/jobs/542162
The failing line is `IF(_current.client_id IS NOT NULL, _current, _previous).*`. Apparently column order matters for this expression, so I'm adding all the columns to both CTEs in the same order.
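One way to keep both CTEs in the same order is to render their select lists from a single ordered definition, in the spirit of the SQL generation idea raised in the conversation. This is only a sketch: the helper names, the source table names, and the column subset are hypothetical, and the actual query is written by hand in this PR.

```python
# Hypothetical subset of columns; the single source of truth for ordering.
SHARED_COLUMNS = ("submission_date", "client_id", "days_seen_bits", "isp")


def select_list(columns, indent="    "):
    """Render one column per line, comma separated, in the given order."""
    return ",\n".join(f"{indent}{column}" for column in columns)


def render_cte(name, source_table, columns):
    """Render a CTE that selects the given columns from source_table."""
    return (
        f"{name} AS (\n"
        f"  SELECT\n{select_list(columns)}\n"
        f"  FROM\n    {source_table}\n"
        f")"
    )


current_cte = render_cte("_current", "daily_table", SHARED_COLUMNS)
previous_cte = render_cte("_previous", "last_seen_table", SHARED_COLUMNS)
print(current_cte)
print(previous_cte)
```

Because both CTEs are rendered from `SHARED_COLUMNS`, adding or reordering a column in one place changes both sides together.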