Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catalog permission syncs severely degrades dashboard import performance #32993

Open
2 of 3 tasks
arafoperata opened this issue Apr 3, 2025 · 15 comments · May be fixed by #33000
Open
2 of 3 tasks

Catalog permission syncs severely degrades dashboard import performance #32993

arafoperata opened this issue Apr 3, 2025 · 15 comments · May be fixed by #33000
Assignees
Labels
#bug:performance Performance bugs dashboard:import Related to importing dashboards

Comments

@arafoperata
Copy link

Bug description

The addition of catalogs in Superset 4.1.0 has resulted in significantly worse dashboard import performance. We're seeing dashboard import time increase from <10 seconds to >3 minutes.

We're currently using Postgres (RDS) to serve embedded dashboards in our multi-tenanted application. As we are multi-tenanted, we have one catalog/database per tenant.

When importing dashboards, Superset now iterates through all catalog names to sync permissions. As we have hundreds of catalogs/dbs, this sync adds a couple of minutes per import.

We need to import a default set of dashboards for each tenant (with their own unique set of datasets pointing to their individual databases). The sync permission step needs to run for each dashboard import.

This results in NxN syncs (in our case, N=200, so on the order of 40k syncs). So, the time to upload dashboards for all our tenants has exploded from 30-60 minutes to over 10 hours.

Expected Behavior

I see there already exists a flag to disable catalog discovery - allow_multi_catalog. I propose that if this flag is set to false, then Superset should only sync permissions for the default catalog.

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
@dosubot dosubot bot added #bug:performance Performance bugs dashboard:import Related to importing dashboards labels Apr 3, 2025
@sadpandajoe sadpandajoe added the review:checkpoint Last PR reviewed during the daily review standup label Apr 3, 2025
@sadpandajoe
Copy link
Member

@betodealmeida and/or @Vitor-Avila I think recently did some performance improvements for catalogs. Not sure if this was covered, if it was, it might be something that is already fixed or can be cherried in a future release.

@betodealmeida
Copy link
Member

betodealmeida commented Apr 3, 2025

I see there already exists a flag to disable catalog discovery - allow_multi_catalog. I propose that if this flag is set to false, then Superset should only sync permissions for the default catalog.

I've had some security concerns about this in the past, since even if you have allow_multi_catalog disabled people can still query across catalogs in SQL Lab. But we do (1) parse the SQL and (2) check if the query is across catalogs, so if we don't create the permissions for non-default catalogs it should still prevent the user from running the query.

Let me do some testing and make sure it's safe to skip the permission creation. If so, then it's easy to run the catalog permission creation only for the default catalog when multi-catalog is off.

@Vitor-Avila
Copy link
Contributor

Vitor-Avila commented Apr 3, 2025

@betodealmeida we have the CATALOGS_SIMPLIFIED_MIGRATION config. I wonder if it makes sense to have something similar that would be applied to imports. The permissions can still be created after the import is executed through the Sync Permissions button, but it would unblock the import performance

@Vitor-Avila
Copy link
Contributor

another option (if it's not the case yet) is to rely on SYNC_DB_PERMISSIONS_IN_ASYNC_MODE to create the perms via celery and not hold the import (not sure if currently that's what happens)

@betodealmeida
Copy link
Member

@Vitor-Avila I'm trying to think if we want to support the use case where (1) multi-catalog is disabled but (2) the admin still wants to manage permissions for non-default catalogs. Maybe for BigQuery this is a valid use case?

@Vitor-Avila
Copy link
Contributor

@betodealmeida even if multi_catalog is disabled end-users can still go to SQL Lab and run cross-DB selects, right? So that would be applicable to BQ, Snowflake, PSQL...

But regardless, it's a half-check because if the user saves the query as a virtual dataset, it gets mapped to the default catalog and then they can query it.

@betodealmeida
Copy link
Member

@betodealmeida even if multi_catalog is disabled end-users can still go to SQL Lab and run cross-DB selects, right? So that would be applicable to BQ, Snowflake, PSQL...

They can, but if they don't have DB-level access it won't work. For example, if I only have catalog access to projectA, this will fail:

SELECT * FROM projectB.some_schema.some_dataset

But regardless, it's a half-check because if the user saves the query as a virtual dataset, it gets mapped to the default catalog and then they can query it.

Hmmm, would it work? If so we should fix it.

@Vitor-Avila
Copy link
Contributor

I see.. so basically we check cross-DB queries even if multi_catalog is disabled.

Hmmm, would it work? If so we should fix it.

Last time I tested it did, because once you create a dataset it gets a catalog perm, and that's the catalog we'll validate if you have access to or not (as opposed to the actual catalog(s) being queried).

@betodealmeida
Copy link
Member

Last time I tested it did, because once you create a dataset it gets a catalog perm, and that's the catalog we'll validate if you have access to or not (as opposed to the actual catalog(s) being queried).

OK, I'll create a ticket and investigate. Thanks!

@betodealmeida
Copy link
Member

@Vitor-Avila one thing different here is that Postgres (and RDS, which is what @arafoperata is using) does not allow cross-catalog queries. Unlike BigQuery, this syntax is invalid:

SELECT * FROM other_database.public.table

For databases like this we could have an attribute allow_cross_catalog_queries set to false, and when that's the case we don't create permissions for non-default catalogs when multi-catalog is disabled.

@betodealmeida betodealmeida linked a pull request Apr 4, 2025 that will close this issue
9 tasks
@betodealmeida
Copy link
Member

Tentative solution here: https://github.com/apache/superset/pull/33000/files#diff-8eeb4739ec1b8988500d5a48abf1de2624ce370a4283a14f9ecebd2e081e3a48R200-R214

@betodealmeida
Copy link
Member

@arafoperata can you confirm if RDS supports cross-database queries? We found some answers on StackOverflow saying yes, but it would be nice to have a definitive source.

@arafoperata
Copy link
Author

@arafoperata can you confirm if RDS supports cross-database queries? We found some answers on StackOverflow saying yes, but it would be nice to have a definitive source.

@betodealmeida, it depends on the underlying database engine. Since we use RDS Postgres, which does not natively support cross-database (cross-catalog) queries, we can’t do something like SELECT * FROM other_database.schema.table.

SQL Server on RDS does support cross-database queries, since it’s a native feature of SQL Server.

I’d say allow_cross_catalog_queries should depend on the capabilities of the underlying engine as RDS is just a managed wrapper around those engines.

@Vitor-Avila
Copy link
Contributor

thanks for the additional context, @arafoperata 🙏

@rusackas rusackas removed the review:checkpoint Last PR reviewed during the daily review standup label Apr 7, 2025
@arafoperata
Copy link
Author

No worries, thanks for actioning this so quickly @betodealmeida , @Vitor-Avila

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
#bug:performance Performance bugs dashboard:import Related to importing dashboards
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants