Skip to content

Use gix for cargo package #15534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Byron
Copy link
Member

@Byron Byron commented May 16, 2025

This should also help fixing these spurious "cannot package because some excluded file is untracked" issues.

Tasks

  • step-by-step conversion of vcs.rs
  • use proper feature toggle
  • cleanup final check by myself
  • move split & rename into its own commit. Probably squash all changes except for the gix upgrade.
  • upgrade to a gix release including various improvements GitoxideLabs/gitoxide#2016

Notes for the Reviewer

Related to GitoxideLabs/gitoxide#106.

@rustbot rustbot added A-git Area: anything dealing with git Command-package labels May 16, 2025
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for kicking of this!

Regardless the approach taken here. I wonder if it is possible to avoid checking repo state completely. In PathSource::list_files we already walked through it. And maybe we can attach extra info in PathEntry. Of course it'll slow down list_files a bit, but with that we could perhaps resolve #14941?.

It is just a bit relevant to this PR though. We can still merge this when ready.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about time I get started with this - this setup here is super-hacky, but it's designed to allow a very step-wise conversion, hopefully allowing to work on it 15 minutes at a time.

Regarding #14941, I hope with #14955 I have found a representative issue. Right now I can't tell how that should be solved, so I'd probably go for having a gitoxide alternative for cargo package here and tackle the related issues another time. Ideally, the new implementation will be a bit faster and I hope I can test that on the aws-sdk-rust repository.

Copy link
Member Author

@Byron Byron May 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't resist to profile the aws-sdk-rust publishing step in Cargo and here is a picture of the run so far.

Screenshot 2025-05-17 at 12 54 31 Screenshot 2025-05-17 at 12 55 45

I don't know if this will ever finish, and I don't know how they can even get this published, maybe it just takes a very long time.

The culprit, maybe among other things, may be repo.status_file() calls for which there isn't an equivalent in gitoxide anyway. Thus I think this will look very differently when this PR is done and maybe that's an avenue for aws-sdk-rust to publish much faster.

I will definitely test that once it's working and share results :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops I linked to the wrong issue. Was meant to say #14955

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that dirty_files_outside_pkg_root is probably the reason for the individual git_status_file calls.
The gitoxide implementation can certainly be backported to the git2 one if there is demand, so potential gains could be available sooner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I realize that even with these improvements, and without --no-vcs-info or similar it will still do one git-status costing about 1s each for each of these 404 crates, wasting nearly 7 minutes. Nonetheless, it's probably acceptable compared to the over three ours that it took otherwise (which would be equivalent to ~10800 git status checks).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think #15416 is also properly fixed now, which also means the output is different in 2 tests when using the gitoxide implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Really appreciate the effort here!

We are probably not going to backports, as the backporting process involves a bit more work, and usually reserved for some regression that has no way out.

Also, given the cargo team has agreed that the dirtiness check is best effort and informational only, I think once the PR is ready we can merge it without FCP.

Thanks again, really a great job!

@Byron Byron force-pushed the gix-status-for-cargo-package branch 3 times, most recently from 7a409fd to f5d1c6b Compare May 17, 2025 19:11
@rustbot

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. label May 18, 2025
@Byron Byron force-pushed the gix-status-for-cargo-package branch from f5d1c6b to 39f3f02 Compare May 18, 2025 07:27
@Byron Byron force-pushed the gix-status-for-cargo-package branch from 39f3f02 to 071475b Compare May 18, 2025 09:11
@rustbot rustbot added the A-infrastructure Area: infrastructure around the cargo repo, ci, releases, etc. label May 18, 2025
@Byron Byron force-pushed the gix-status-for-cargo-package branch from 071475b to 7105c4f Compare May 18, 2025 10:20
Byron added 2 commits May 18, 2025 15:16
It contains fixes for making `cargo` tests pass.
This should also help fixing these spurious "cannot package because
some excluded file is untracked" issues.
While at it, adjust two test expectations to deal with improved detail
of status checks.
@Byron Byron force-pushed the gix-status-for-cargo-package branch from fde1e43 to b79bd00 Compare May 18, 2025 16:28
Copy link
Member Author

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready for a first review. Only gitoxide needs a release which is planned for the 22nd, which is when this PR becomes formally mergeable as it can point to the new release instead of the Git repository.

Especially after RustConf where I was quite apologetic and feared gitoxide could be removed again for not making progress, I thought maybe we should instead consider stabilizing the well-working parts earlier, such as in:

  • nightly with opt-in
  • stable with opt-out
  • stable, with git2 codepath removed.

cargo package would be candidate for it as the new implementation is more correct and as fast as it can be.

What do you think?

@@ -49,7 +49,7 @@ flate2 = { version = "1.1.1", default-features = false, features = ["zlib-rs"] }
git2 = "0.20.0"
git2-curl = "0.21.0"
# When updating this, also see if `gix-transport` further down needs updating or some auth-related tests will fail.
gix = { version = "0.71.0", default-features = false, features = ["blocking-http-transport-curl", "progress-tree", "parallel", "dirwalk"] }
gix = { git = "https://github.com/GitoxideLabs/gitoxide", branch = "improvements", version = "0.72.1", default-features = false, features = ["blocking-http-transport-curl", "progress-tree", "parallel", "dirwalk", "status"] }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be set to the latest release on the 22nd or 23rd.

@@ -450,7 +450,11 @@ fn prepare_archive(
let src_files = src.list_files(pkg)?;

// Check (git) repository state, getting the current commit hash.
let vcs_info = vcs::check_repo_state(pkg, &src_files, ws, &opts)?;
let vcs_info = if ws.gctx().cli_unstable().gitoxide.is_some() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping both worlds neatly separated would allow for a staged rollout:

  • nightly and opt-in
  • stable and opt-out
  • stable (with git2 codepath removed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would lean towards making it insta-stable, as I mentioned the check is a best-effort, though I'll put this in Cargo team meeting agenda next week!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much like this approach

#13696

Copy link
Member

@weihanglo weihanglo May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid wasting efforts, let's hold off feature-gate until we hear back from the team :)

use std::collections::HashSet;
use std::path::{Path, PathBuf};
use tracing::debug;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the structure and comments as similar as I could, compared to git2, so it's quite straightforward to compare.

Comment on lines +67 to +68
// TODO: Either remove this whole block, or have a test.
// It's hard to have no Cargo.toml here?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me how this could possibly happen - there is no test for it either. From looking at git blame it wasn't obvious what motivated this check.
Probably nothing wrong with keeping it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe only some parts of the package is under git? Like src is a git repo.

Copy link
Member

@weihanglo weihanglo May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When merge, I'll recommend we split rename file into a single commit, so Git knows better to track it.

Comment on lines +67 to +68
// TODO: Either remove this whole block, or have a test.
// It's hard to have no Cargo.toml here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe only some parts of the package is under git? Like src is a git repo.

Comment on lines +294 to +297
if relative_package_root.is_some_and(|pkg_root| !rel_path.starts_with(pkg_root)) {
dirty_files_outside_of_package_root.push(path);
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity. If we adopt the same optimization in git2 version, how the performance would look like?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe only some parts of the package is under git? Like src is a git repo.

It seemed that what the code really was testing is if the file exists, and not if it was tracked by Git. Thus I made it an explicit 'file exists' check, which seemed strange as Cargo won't function without such a manifest. Removing the whole block doesn't fail CI either, so it's a bit unclear why it's there. There is still no tool I know that could do forensics and find everything that Git knows about it, that would certainly be interesting.

Copy link
Member Author

@Byron Byron May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity. If we adopt the same optimization in git2 version, how the performance would look like?

The 'architecture' changes here could all be applied to the git2 implementation, with the same effects. Performance problems should be solved then.

Edit: If the team decides not to fast-track gix here, then I think the AWS team could trivially contribute a port of this to solve their problems. Of course, I vote for gix as I believe it's more correct when dealing with ignored files, which should solve a problem I am having when releasing gitoxide 😅 (it claims an ignored file is dirty, despite it being ignored).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are mostly doing the same thing except using gitoxide. Do I understand correctly?
Apart from the dirty_files_outside_of_package_root work, have you found any other stuff to optmize?

Copy link
Member Author

@Byron Byron May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The driver of these changes is the lack of single-file APIs, and I noticed no other (previously unknown) optimisation opportunities.
tldr; I don't think there are simple optimisations left, except for maybe the N^2 portion, but even there it's hard to make it show up in a profiler.

But now that I am thinking about it, there might be one: I a workspace like aws-sdk-rust I believe this code runs once for each crate, meaning there will be a full git status each time. It would certainly require more complex changes to be able to cache these between the publish calls. Each one takes a little less than a second there, so it's quite some time in aggregate.
But before that, I'd probably optimize the fetch crates stage to run only once, it takes most of the time.

@weihanglo
Copy link
Member

While the PR is not ready to merge yet, I am kicking off the process to get a consensus on how we want to merge this.

@rfcbot fcp merge

Proposal: Switch to gix for cargo package VCS dirtiness check completely, without any fallback to libgit2.

Reasons:

@rfcbot
Copy link
Collaborator

rfcbot commented May 20, 2025

Team member @weihanglo has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period An FCP proposal has started, but not yet signed off. disposition-merge FCP with intent to merge labels May 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-git Area: anything dealing with git A-infrastructure Area: infrastructure around the cargo repo, ci, releases, etc. Command-package disposition-merge FCP with intent to merge proposed-final-comment-period An FCP proposal has started, but not yet signed off. S-waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. T-cargo Team: Cargo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants