Fetch table metadata in parallel and asynchronously and add them to a queue for faster processing #322
Merged
abhisheknath2011
merged 8 commits into
linkedin:main
from
abhisheknath2011:parallel_metadata_fetch
May 21, 2025
… queue for faster processing
teamurko
reviewed
May 15, 2025
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/JobsScheduler.java
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/OperationTasksBuilder.java
… refactoring and cleanup
teamurko
reviewed
May 21, 2025
Thanks @abhisheknath2011, generally looks good
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/JobsScheduler.java
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/OperationTasksBuilder.java
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/JobsScheduler.java
…and we don't have replication task in OSS repo
teamurko
previously approved these changes
May 21, 2025
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/JobsScheduler.java
teamurko
approved these changes
May 21, 2025
17 tasks
Summary
[Issue] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.
For each job type, the jobs scheduler fetches table metadata as a first step before it starts submitting jobs. Metadata discovery takes around 1.5 to 2 hours for 30k+ tables, and this will grow as more tables are onboarded to OpenHouse. As a result, job submission is delayed. To accelerate job submission, table metadata discovery needs to be faster, so this PR fetches table metadata in parallel and asynchronously and adds the fetched metadata to a queue for faster processing. The changes made in this PR are listed below.
… onAfterTerminate, which ensures that all the parallel threads have finished and been shut down. Based on this signal, the queue consumer terminates.
Note: As this feature is configurable, we should be able to enable it and roll it out by job type.
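The producer/consumer flow described above can be illustrated with a minimal sketch. This is not the code from this PR: it uses only JDK classes, and the class name ParallelMetadataFetchSketch, the fetchMetadata stub, and the string "metadata" values are hypothetical placeholders. A whenComplete callback on the combined futures stands in for the termination signal that tells the queue consumer when to stop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ParallelMetadataFetchSketch {

  // Hypothetical stand-in for a remote table metadata lookup.
  static String fetchMetadata(String tableName) {
    return "metadata:" + tableName;
  }

  // Fetches metadata for all tables in parallel, pushes each result onto a queue,
  // and drains the queue on the calling thread until the producers have terminated.
  static List<String> fetchAndSubmit(List<String> tableNames, int parallelism)
      throws InterruptedException {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    AtomicBoolean producersDone = new AtomicBoolean(false);
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);

    // Producers: one asynchronous fetch per table.
    List<CompletableFuture<Void>> fetches = new ArrayList<>();
    for (String name : tableNames) {
      fetches.add(CompletableFuture.runAsync(() -> queue.add(fetchMetadata(name)), pool));
    }

    // Termination signal: runs once every fetch has completed (normally or with an
    // error), shuts the pool down, and lets the consumer know nothing more will arrive.
    CompletableFuture.allOf(fetches.toArray(new CompletableFuture[0]))
        .whenComplete((ignored, error) -> {
          pool.shutdown();
          producersDone.set(true);
        });

    // Consumer: handle items as they arrive instead of waiting for the full list.
    List<String> submitted = new ArrayList<>();
    while (true) {
      String metadata = queue.poll(50, TimeUnit.MILLISECONDS);
      if (metadata != null) {
        submitted.add(metadata); // a real scheduler would submit a job here
      } else if (producersDone.get() && queue.isEmpty()) {
        break; // producers have finished and the queue is drained
      }
    }
    return submitted;
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> tables = new ArrayList<>();
    for (int i = 0; i < 30; i++) {
      tables.add("db.table_" + i);
    }
    List<String> submitted = fetchAndSubmit(tables, 4);
    System.out.println("fetched and submitted: " + submitted.size());
  }
}
```

The consumer exits only when the termination flag is set and the queue is empty, so items enqueued just before the signal are still processed.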
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
Tested on Docker, printing the output for validation only. The fetched metadata count and the submitted job count match at 30 (see the end of the log).
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.