-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Use gix
for cargo package
#15534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Use gix
for cargo package
#15534
Conversation
src/cargo/ops/cargo_package/vcs.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for kicking of this!
Regardless the approach taken here. I wonder if it is possible to avoid checking repo state completely. In PathSource::list_files
we already walked through it. And maybe we can attach extra info in PathEntry
. Of course it'll slow down list_files
a bit, but with that we could perhaps resolve #14941?.
It is just a bit relevant to this PR though. We can still merge this when ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's about time I get started with this - this setup here is super-hacky, but it's designed to allow a very step-wise conversion, hopefully allowing to work on it 15 minutes at a time.
Regarding #14941, I hope with #14955 I have found a representative issue. Right now I can't tell how that should be solved, so I'd probably go for having a gitoxide
alternative for cargo package
here and tackle the related issues another time. Ideally, the new implementation will be a bit faster and I hope I can test that on the aws-sdk-rust
repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't resist to profile the aws-sdk-rust
publishing step in Cargo and here is a picture of the run so far.


I don't know if this will ever finish, and I don't know how they can even get this published, maybe it just takes a very long time.
The culprit, maybe among other things, may be repo.status_file()
calls for which there isn't an equivalent in gitoxide
anyway. Thus I think this will look very differently when this PR is done and maybe that's an avenue for aws-sdk-rust
to publish much faster.
I will definitely test that once it's working and share results :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops I linked to the wrong issue. Was meant to say #14955
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that dirty_files_outside_pkg_root
is probably the reason for the individual git_status_file
calls.
The gitoxide
implementation can certainly be backported to the git2
one if there is demand, so potential gains could be available sooner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I realize that even with these improvements, and without --no-vcs-info
or similar it will still do one git-status costing about 1s each for each of these 404 crates, wasting nearly 7 minutes. Nonetheless, it's probably acceptable compared to the over three ours that it took otherwise (which would be equivalent to ~10800 git status checks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think #15416 is also properly fixed now, which also means the output is different in 2 tests when using the gitoxide
implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Really appreciate the effort here!
We are probably not going to backports, as the backporting process involves a bit more work, and usually reserved for some regression that has no way out.
Also, given the cargo team has agreed that the dirtiness check is best effort and informational only, I think once the PR is ready we can merge it without FCP.
Thanks again, really a great job!
7a409fd
to
f5d1c6b
Compare
This comment has been minimized.
This comment has been minimized.
f5d1c6b
to
39f3f02
Compare
39f3f02
to
071475b
Compare
071475b
to
7105c4f
Compare
It contains fixes for making `cargo` tests pass.
This should also help fixing these spurious "cannot package because some excluded file is untracked" issues.
7105c4f
to
0e60ce6
Compare
While at it, adjust two test expectations to deal with improved detail of status checks.
fde1e43
to
b79bd00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ready for a first review. Only gitoxide
needs a release which is planned for the 22nd, which is when this PR becomes formally mergeable as it can point to the new release instead of the Git repository.
Especially after RustConf where I was quite apologetic and feared gitoxide
could be removed again for not making progress, I thought maybe we should instead consider stabilizing the well-working parts earlier, such as in:
- nightly with opt-in
- stable with opt-out
- stable, with
git2
codepath removed.
cargo package
would be candidate for it as the new implementation is more correct and as fast as it can be.
What do you think?
@@ -49,7 +49,7 @@ flate2 = { version = "1.1.1", default-features = false, features = ["zlib-rs"] } | |||
git2 = "0.20.0" | |||
git2-curl = "0.21.0" | |||
# When updating this, also see if `gix-transport` further down needs updating or some auth-related tests will fail. | |||
gix = { version = "0.71.0", default-features = false, features = ["blocking-http-transport-curl", "progress-tree", "parallel", "dirwalk"] } | |||
gix = { git = "https://github.com/GitoxideLabs/gitoxide", branch = "improvements", version = "0.72.1", default-features = false, features = ["blocking-http-transport-curl", "progress-tree", "parallel", "dirwalk", "status"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be set to the latest release on the 22nd or 23rd.
@@ -450,7 +450,11 @@ fn prepare_archive( | |||
let src_files = src.list_files(pkg)?; | |||
|
|||
// Check (git) repository state, getting the current commit hash. | |||
let vcs_info = vcs::check_repo_state(pkg, &src_files, ws, &opts)?; | |||
let vcs_info = if ws.gctx().cli_unstable().gitoxide.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping both worlds neatly separated would allow for a staged rollout:
- nightly and opt-in
- stable and opt-out
- stable (with
git2
codepath removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would lean towards making it insta-stable, as I mentioned the check is a best-effort, though I'll put this in Cargo team meeting agenda next week!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much like this approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid wasting efforts, let's hold off feature-gate until we hear back from the team :)
use std::collections::HashSet; | ||
use std::path::{Path, PathBuf}; | ||
use tracing::debug; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the structure and comments as similar as I could, compared to git2
, so it's quite straightforward to compare.
// TODO: Either remove this whole block, or have a test. | ||
// It's hard to have no Cargo.toml here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me how this could possibly happen - there is no test for it either. From looking at git blame
it wasn't obvious what motivated this check.
Probably nothing wrong with keeping it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe only some parts of the package is under git? Like src
is a git repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When merge, I'll recommend we split rename file into a single commit, so Git knows better to track it.
// TODO: Either remove this whole block, or have a test. | ||
// It's hard to have no Cargo.toml here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe only some parts of the package is under git? Like src
is a git repo.
if relative_package_root.is_some_and(|pkg_root| !rel_path.starts_with(pkg_root)) { | ||
dirty_files_outside_of_package_root.push(path); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity. If we adopt the same optimization in git2 version, how the performance would look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe only some parts of the package is under git? Like src is a git repo.
It seemed that what the code really was testing is if the file exists, and not if it was tracked by Git. Thus I made it an explicit 'file exists' check, which seemed strange as Cargo won't function without such a manifest. Removing the whole block doesn't fail CI either, so it's a bit unclear why it's there. There is still no tool I know that could do forensics and find everything that Git knows about it, that would certainly be interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity. If we adopt the same optimization in git2 version, how the performance would look like?
The 'architecture' changes here could all be applied to the git2
implementation, with the same effects. Performance problems should be solved then.
Edit: If the team decides not to fast-track gix
here, then I think the AWS team could trivially contribute a port of this to solve their problems. Of course, I vote for gix
as I believe it's more correct when dealing with ignored files, which should solve a problem I am having when releasing gitoxide
😅 (it claims an ignored file is dirty, despite it being ignored).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are mostly doing the same thing except using gitoxide. Do I understand correctly?
Apart from the dirty_files_outside_of_package_root
work, have you found any other stuff to optmize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The driver of these changes is the lack of single-file APIs, and I noticed no other (previously unknown) optimisation opportunities.
tldr; I don't think there are simple optimisations left, except for maybe the N^2 portion, but even there it's hard to make it show up in a profiler.
But now that I am thinking about it, there might be one: I a workspace like aws-sdk-rust
I believe this code runs once for each crate, meaning there will be a full git status
each time. It would certainly require more complex changes to be able to cache these between the publish calls. Each one takes a little less than a second there, so it's quite some time in aggregate.
But before that, I'd probably optimize the fetch crates
stage to run only once, it takes most of the time.
While the PR is not ready to merge yet, I am kicking off the process to get a consensus on how we want to merge this. @rfcbot fcp merge Proposal: Switch to gix for Reasons:
|
Team member @weihanglo has proposed to merge this. The next step is review by the rest of the tagged team members: No concerns currently listed. Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
This should also help fixing these spurious "cannot package because some excluded file is untracked" issues.
Tasks
vcs.rs
cleanupfinal check by myselfNotes for the Reviewer
cargo package
on1.81.0
#14955.gitoxide
to the people, to start on nightly, and then switch it over for everyone with an emergency-switch to usegit2
, to finally remove thegit2
codepath.Related to GitoxideLabs/gitoxide#106.