Skip to content

Use fixed column order in baseline_clients_last_seen #7119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

BenWu
Copy link
Contributor

@BenWu BenWu commented Feb 28, 2025

Description

The difference in column order is causing the focus queries to fail when dry running in https://app.circleci.com/pipelines/github/mozilla/bigquery-etl/46413/workflows/1a25c214-59e6-42f7-8e30-834ec5d10098/jobs/542162

[{'code': 400, 'errors': [{'message': 'No matching signature for function IF\n  Argument types: BOOL, STRUCT<days_seen_bits INT64, days_active_bits INT64, days_created_profile_bits INT64, ...>, STRUCT<days_seen_bits INT64, days_active_bits INT64, days_created_profile_bits INT64, ...>\n  Signature: IF(BOOL, T1, T1)...

The line is IF(_current.client_id IS NOT NULL, _current, _previous).*. Apparently column order matters for this so I'm adding all the columns to both CTEs in the same order

Reviewer, please follow this checklist

@BenWu
Copy link
Contributor Author

BenWu commented Feb 28, 2025

dry run passed so I think it's ok

Copy link
Collaborator

@scholtzan scholtzan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main caveat here is that from now on when new fields get added to the daily tables, they'll need to be explicitly added to these tables. Before propagation was happening automatically.
Not sure if we want/can add some checks to make sure fields get propagated correctly, since this has been the assumption for these dataset until now: https://docs.telemetry.mozilla.org/datasets/bigquery/clients_last_seen/reference.html?highlight=clients_daily#introduction

@BenWu
Copy link
Contributor Author

BenWu commented Feb 28, 2025

One way could be to use the sql generation to fill out the fields. We already kind of were explicitly add fields to last_seen because of the schema.yaml. This is an uncommon and non-blocking failure case since it's just stage dry runs so it's not urgent. I'll create and ticket for this and think more about it

@kwindau kwindau self-requested a review February 28, 2025 21:52
@kwindau
Copy link
Contributor

kwindau commented Feb 28, 2025

The main caveat here is that from now on when new fields get added to the daily tables, they'll need to be explicitly added to these tables. Before propagation was happening automatically. Not sure if we want/can add some checks to make sure fields get propagated correctly, since this has been the assumption for these dataset until now: https://docs.telemetry.mozilla.org/datasets/bigquery/clients_last_seen/reference.html?highlight=clients_daily#introduction

I lean towards having it explicit, just makes my life easier (easier to troubleshoot too, less things break downstream when new things get added, since you then have to intentionally add to this layer). Just my 2 cents!

Copy link
Contributor

@kwindau kwindau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this but we can discuss more widely if needed too (I know there are differing opinions)

@BenWu
Copy link
Contributor Author

BenWu commented Mar 3, 2025

Created https://mozilla-hub.atlassian.net/browse/DENG-7977. I'll hold off on this for now but I'll try to get back to it soon

@sean-rose
Copy link
Contributor

The main caveat here is that from now on when new fields get added to the daily tables, they'll need to be explicitly added to these tables. Before propagation was happening automatically.

I'm confused, because there's currently a hard-coded schema.yaml for baseline_clients_last_seen_v1. Or is that schema.yaml being auto-augmented before deployment based on dry-running the (init) query?

@sean-rose
Copy link
Contributor

Personally I'd be inclined to hardcode the column lists so that changes to these heavily-relied-upon tables are made more intentionally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to the effort of fixing column orders in this query, there are a couple other cases it'd be good to fix up:

  • The init query has submission_date after the *_bits columns. I'd recommend we have submission_date be the first column in the init query like it is in the main query (IMO partition columns should always be the first column in the schema).
  • The _previous CTE has the *_bits columns in a different order than the init query and the _current CTE. Luckily because the IF(_current.client_id IS NOT NULL, _current, _previous).* REPLACE (...) expression is replacing those *_bits columns with explicit references to the associated columns from _current and _previous the end result is correct. However, if those *_bits columns weren't being replaced like that this would silently be doing the wrong thing (similar to unioning two queries with different column orders), so IMO it would be good to make the *_bits column order in _previous match the other two cases.

Comment on lines +60 to +69
isp,
app_build,
app_channel,
app_display_version,
architecture,
device_manufacturer,
device_model,
telemetry_sdk_build,
first_seen_date,
is_new_profile,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good if this query's column order matched the schema.yaml column order (also in the _previous CTE):

Suggested change
isp,
app_build,
app_channel,
app_display_version,
architecture,
device_manufacturer,
device_model,
telemetry_sdk_build,
first_seen_date,
is_new_profile,
app_build,
app_channel,
app_display_version,
architecture,
device_manufacturer,
device_model,
telemetry_sdk_build,
first_seen_date,
is_new_profile,
isp,

Or alternatively, the schema.yaml could be updated to match this column order.

@BenWu
Copy link
Contributor Author

BenWu commented Mar 11, 2025

I'm confused, because there's currently a hard-coded schema.yaml for baseline_clients_last_seen_v1. Or is that schema.yaml being auto-augmented before deployment based on dry-running the (init) query?

The prod deploys also run bqetl query schema update so the schema gets updated based on the dry run schema (not using init query by default). The clients_last_seen query references itself though and I'm not sure what that changes in both prod and stage deploys. It's possible we would need to update the schema.yaml anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants