feat: log disk space usage info, warn if close to exhaustion #603

Open
wants to merge 1 commit into
base: main
from

Contributor

@booxter booxter commented Jun 9, 2025

Closes: #525

Signed-off-by: Ihar Hrachyshka [email protected]
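The diff itself is not shown in this conversation. As a rough illustration of the behavior named in the title (log disk usage, warn when space is nearly exhausted), here is a minimal sketch using the standard library's `shutil.disk_usage`. The function name `log_disk_usage` and the 10% threshold are hypothetical and are not taken from the PR.

```python
import logging
import shutil

logger = logging.getLogger(__name__)

# Hypothetical threshold (not from the PR): warn when under 10% of the disk is free.
LOW_SPACE_FRACTION = 0.10


def log_disk_usage(path: str, low_fraction: float = LOW_SPACE_FRACTION) -> bool:
    """Log usage of the filesystem holding `path`.

    Returns True when the free fraction is below `low_fraction`.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    gib = 1024**3
    logger.info(
        "Disk usage for %s: %.1f GiB used, %.1f GiB free of %.1f GiB",
        path,
        usage.used / gib,
        usage.free / gib,
        usage.total / gib,
    )
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    low = free_fraction < low_fraction
    if low:
        logger.warning(
            "Low disk space on %s: only %.1f%% free; saving a checkpoint may fail",
            path,
            free_fraction * 100,
        )
    return low
```

A check like this could be called before each checkpoint save, for example `log_disk_usage(output_dir)`, where `output_dir` stands in for whatever directory the training run writes to.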

@mergify mergify bot added the ci-failure label Jun 9, 2025
@booxter booxter force-pushed the checkpoint-file-size-checks branch from 3a85dce to 6bb93f8 Compare June 9, 2025 22:42
@mergify mergify bot removed the ci-failure label Jun 9, 2025
@booxter booxter force-pushed the checkpoint-file-size-checks branch 2 times, most recently from 9fe17b0 to 2db72cf Compare June 9, 2025 23:01
@instructlab instructlab deleted a comment from github-actions bot Jun 9, 2025
E2E (NVIDIA L40S x4) (python 3.11) workflow launched on this PR: View run

e2e workflow succeeded on this PR: View run, congrats!

Contributor Author

booxter commented Jun 10, 2025

@booxter booxter marked this pull request as ready for review June 10, 2025 13:42
@booxter booxter requested a review from fynnsu June 10, 2025 13:57
Contributor Author

booxter commented Jun 10, 2025

This will conflict with #605, so one or the other will need a rebase, depending on which merges first.

Contributor Author

booxter commented Jun 18, 2025

@fynnsu Charlie is on PTO until Monday and I don't have much time to land this, so could you please take a look at this PR? It's in line with your suggestion in the issue. Let me know what should be improved here.

Collaborator

@fynnsu fynnsu left a comment

Looks good to me. I just added a small comment on the placement of the check, but with the "early warning steps" the existing code should also work.

@mergify mergify bot added the one-approval label Jun 20, 2025
Collaborator

fynnsu commented Jun 20, 2025

#525

@booxter booxter force-pushed the checkpoint-file-size-checks branch from 2db72cf to 3742584 Compare June 20, 2025 16:13
@mergify mergify bot added the ci-failure label Jun 20, 2025
Contributor

mergify bot commented Jun 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @booxter please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 20, 2025
@booxter booxter force-pushed the checkpoint-file-size-checks branch from 3742584 to 8e3f509 Compare June 20, 2025 17:05
@booxter booxter force-pushed the checkpoint-file-size-checks branch from 8e3f509 to cdbb538 Compare June 20, 2025 21:50
@booxter booxter added hold and removed hold labels Jun 23, 2025
Contributor Author

booxter commented Jun 24, 2025

@Mergifyio rebase

Contributor

mergify bot commented Jun 24, 2025

rebase

✅ Branch has been successfully rebased

@booxter booxter force-pushed the checkpoint-file-size-checks branch from cdbb538 to 25f75df Compare June 24, 2025 15:13
Contributor Author

booxter commented Jun 25, 2025

@Mergifyio rebase

Not sure why the AWS creds fail for smoke...

Contributor

mergify bot commented Jun 25, 2025

rebase

☑️ Nothing to do, the required conditions are not met

  • any of:
    • #commits > 1 [📌 rebase requirement]
    • #commits-behind > 0 [📌 rebase requirement]
    • -linear-history [📌 rebase requirement]
  • -closed [📌 rebase requirement]
  • -conflict [📌 rebase requirement]
  • queue-position = -1 [📌 rebase requirement]

Development

Successfully merging this pull request may close these issues.

Improve error logging during training when disk space is not enough
2 participants