Some ETLs and views may be silently unioning data incorrectly #7461

Open
data-sync-user opened this issue May 16, 2025 · 3 comments
Comments

@data-sync-user
Collaborator

While investigating an error relating to unioning ping data, [~accountid:6047cd5cd7f56e0071965b2d] noticed that the type and enrollment columns in the ping_info.experiments[].value.extra struct are in different orders in various pings. When such pings are unioned as-is, BigQuery won’t complain because the column types are compatible, so data could silently end up in the wrong column of the union output for some pings.
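
To make the failure mode concrete, here is a minimal sketch in plain Python rather than BigQuery. It assumes, as described above, that the union matches struct fields by position and takes the output field names from the first input. The helper name union_all_by_position and the sample values are hypothetical.

```python
# Toy model of positional unioning; this is not BigQuery code.
# Each "extra" struct is an ordered list of (field_name, value) pairs.

def union_all_by_position(first_rows, second_rows):
    """Combine rows the way a positional union would: output field names
    come from the first input, and values are matched by position only.
    Assumes first_rows is non-empty."""
    names = [name for name, _ in first_rows[0]]
    combined = []
    for row in list(first_rows) + list(second_rows):
        values = [value for _, value in row]
        combined.append(dict(zip(names, values)))
    return combined

# Hypothetical values: ping A stores (type, enrollment), ping B the reverse.
ping_a = [[("type", "nimbus"), ("enrollment", "abc123")]]
ping_b = [[("enrollment", "def456"), ("type", "nimbus")]]

result = union_all_by_position(ping_a, ping_b)
for row in result:
    print(row)
```

In this toy model the second row comes out with the enrollment value under type and the type value under enrollment, and nothing raises an error because both fields are strings.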

This has been manually worked around in a couple of cases recently (bigquery-etl#6878, bigquery-etl#6887), but there may be other such cases we don’t yet know about.

It’s possible the Schema.generate_compatible_select_expression() method (code) could be used to help with this situation (it’s currently used for unioning pings in the Glean app ping views).
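
As a rough illustration of the general idea (selecting struct fields by name so that both sides of a union share one field order), here is a small sketch. It is not the implementation of Schema.generate_compatible_select_expression(); the function struct_select_by_name is hypothetical, and it only handles a flat struct, whereas the struct in this issue is nested inside an array and would need extra handling.

```python
def struct_select_by_name(struct_path, target_field_order, alias):
    """Build a SQL expression that rebuilds a struct with its fields in
    target_field_order, selecting each field by name instead of position."""
    fields = ", ".join(
        f"{struct_path}.{field} AS {field}" for field in target_field_order
    )
    return f"STRUCT({fields}) AS {alias}"

# Hypothetical usage: force (type, enrollment) order for a struct named extra.
expression = struct_select_by_name("extra", ["type", "enrollment"], "extra")
print(expression)
```

Applying the same target order to every branch of a union would keep the values under the right names regardless of each source table's field order.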

┆Issue is synchronized with this Jira Bug

@data-sync-user
Collaborator Author

➤ Sean Rose commented:

#6916 (https://github.com/mozilla/bigquery-etl/pull/6916) will allow us to programmatically check for possibly incorrect unions.

Here are the results of a test run of that code:

@data-sync-user
Collaborator Author

➤ Ben Wu commented:

Did you run this on the generated SQL? I would expect more of these errors in the generated SQL.

@data-sync-user
Collaborator Author

➤ Sean Rose commented:

Yes, I ran it on everything in the sql directory on the private-generated-sql (https://github.com/mozilla/private-bigquery-etl/tree/private-generated-sql/sql) branch. The only blind spot I’m aware of is the 26 cases where I didn’t have the necessary permissions on the tables/views being selected from and got a 403 Forbidden error from BigQuery.
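
For readers unfamiliar with what such a check involves, the sketch below compares two schemas and reports records that hold the same field names in a different order. It is not the code from #6916; the function find_order_mismatches and the sample schemas are hypothetical, and the schemas are assumed to be lists of dicts with name, type, and optional nested fields, loosely following BigQuery's JSON schema layout.

```python
def find_order_mismatches(left_fields, right_fields, path=""):
    """Return paths of records whose two schemas hold the same field names
    in a different order, recursing into nested RECORD fields."""
    mismatches = []
    left_names = [f["name"] for f in left_fields]
    right_names = [f["name"] for f in right_fields]
    if left_names != right_names and sorted(left_names) == sorted(right_names):
        mismatches.append(path or "<top level>")
    right_by_name = {f["name"]: f for f in right_fields}
    for field in left_fields:
        other = right_by_name.get(field["name"])
        if other and field.get("fields") and other.get("fields"):
            child_path = f"{path}.{field['name']}" if path else field["name"]
            mismatches.extend(
                find_order_mismatches(field["fields"], other["fields"], child_path)
            )
    return mismatches

# Hypothetical schemas: the same struct with its two fields swapped.
schema_a = [{"name": "extra", "type": "RECORD", "fields": [
    {"name": "type", "type": "STRING"},
    {"name": "enrollment", "type": "STRING"},
]}]
schema_b = [{"name": "extra", "type": "RECORD", "fields": [
    {"name": "enrollment", "type": "STRING"},
    {"name": "type", "type": "STRING"},
]}]

print(find_order_mismatches(schema_a, schema_b))
```

A real check would also need to fetch each table's schema, which is where permission errors like the 403 responses mentioned above would leave gaps.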
