Tracking Issue for garbage collection #12633

Open
ehuss opened this issue Sep 7, 2023 · 15 comments
Labels
A-caching Area: caching of dependencies, repositories, and build artifacts C-tracking-issue Category: A tracking issue for something unstable. Command-clean S-waiting-on-feedback Status: An implemented feature is waiting on community feedback for bugs or design concerns. Z-gc Nightly: garbage collection

Comments

@ehuss
Contributor

ehuss commented Sep 7, 2023

Summary

Original proposal: https://hackmd.io/@rust-cargo-team/SJT-p_rL2
Implementation: #12634
Documentation: https://doc.rust-lang.org/nightly/cargo/reference/unstable.html#gc
Issues: Z-gc Nightly: garbage collection

The -Zgc flag enables garbage collection for deleting old, unused files in cargo's cache.

Status

Unresolved Issues

Future Extensions

No response

About tracking issues

Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

@ehuss ehuss added C-tracking-issue Category: A tracking issue for something unstable. S-waiting-on-feedback Status: An implemented feature is waiting on community feedback for bugs or design concerns. labels Sep 7, 2023
@ehuss ehuss moved this to In Progress in Cargo Roadmap Sep 8, 2023
@epage
Contributor

epage commented Sep 11, 2023

For when we get to target/:

From @bjorn3 at https://hachyderm.io/@bjorn3/111047792430714997

The incr comp cache is GCed on every compilation by only copying artifacts that were used for the current compilation session to the new incr comp cache dir and then removing the old incr comp cache dir entirely. In effect it is a maximally eager semi-space garbage collector. This is not suitable for cargo at all because alternating between cargo build -p foo and cargo build -p bar is guaranteed to rebuild things every time as each build would delete the artifacts of the other build.
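The behaviour described above can be sketched with a toy model. This is not rustc's or cargo's code; `EagerCache` and `build` are hypothetical names, and artifacts are reduced to plain strings. The sketch only shows why a collector that keeps nothing but the last session's artifacts forces rebuilds when two builds alternate.

```rust
use std::collections::BTreeSet;

/// Toy model of a cache that keeps only what the latest session used.
/// `EagerCache` and `build` are hypothetical names for illustration.
#[derive(Default)]
struct EagerCache {
    artifacts: BTreeSet<String>,
}

impl EagerCache {
    /// Runs one build session that needs `needed` artifacts.
    /// Returns how many of them had to be rebuilt (cache misses).
    fn build(&mut self, needed: &[&str]) -> usize {
        // "New space": only artifacts touched by this session survive.
        let mut next = BTreeSet::new();
        let mut rebuilt = 0;
        for name in needed {
            if !self.artifacts.contains(*name) {
                rebuilt += 1; // not in the old space, so it must be rebuilt
            }
            next.insert(name.to_string());
        }
        // Replacing the old set models deleting the old cache directory.
        self.artifacts = next;
        rebuilt
    }
}
```

Alternating `build(&["foo", "common"])` and `build(&["bar", "common"])` on the same cache reports a miss for `foo` or `bar` on every call, because each session evicts the artifact the other one needs.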

@epage
Contributor

epage commented Sep 11, 2023

Quick scan of brew

  • autoremove
  • cleanup
    • "Removes all downloads more than 120 days old. This can be adjusted with HOMEBREW_CLEANUP_MAX_AGE_DAYS."
  • HOMEBREW_CLEANUP_PERIODIC_FULL_DAYS (default 30 days)

One complaint that came up (on Mastodon) was: "'brew cleanup has not been run in 30 days, running now' ... and then proceeds to run an interminable process in the middle of you attempting to do something else."

@epage
Contributor

epage commented Sep 12, 2023

From https://hachyderm.io/@[email protected]/111048319933010803

Also: global vs. local? Just keep all things local. Errant cache hits pulling from a non-scoped global cache are a disaster. Errant cache hits from the same local cache are a "whoops." For shared artifacts (e.g. a singular tar downloaded from the registry), global is fine, and should just grow unbounded. (Seriously: nobody cares.)

pnpm, bun, npm, yarn: package cache is intended to be shared global immutable, unbounded growth.

I'm assuming our global package cache is to help with CI, but we should probably explicitly document our priority use cases.

@epage epage added Command-clean A-caching Area: caching of dependencies, repositories, and build artifacts labels Sep 20, 2023
@mcclure

This comment was marked as resolved.

@epage

This comment was marked as resolved.

@mcclure

This comment was marked as resolved.

bors added a commit that referenced this issue Nov 11, 2023
Add cache garbage collection

### What does this PR try to resolve?

This introduces a new garbage collection system which can track the last time files were used in cargo's global cache, and delete old, unused files either automatically or manually.

### How should we test and review this PR?

This is broken up into a large number of commits, and each commit should have a short overview of what it does. I am breaking some of these out into separate PRs as well (unfortunately GitHub doesn't really support stacked pull requests). I expect to reduce the size of this PR if those other PRs are accepted.

I would first review `unstable.md` to give you an idea of what the user side of this looks like. I would then skim over each commit message to give an overview of all the changes. The core change is the introduction of the `GlobalCacheTracker` which is an interface to a sqlite database which is used for tracking the timestamps.

### Additional information

I think the interface for this will almost certainly change over time. This is just a stab to create a starting point where we can start testing and discussing what actual user flags should be exposed. This is also intended to start the process of getting experience using sqlite, and getting some testing in real-world environments to see how things might fail.

I'd like to ask for the review to not focus too much on bikeshedding flag names and options. I expect them to change, so this is by no means a concrete proposal for where it will end up. For example, the options are very granular, and I would like to have fewer options. However, it isn't clear how that might best work. The size-tracking options almost certainly need to change, but I do not know exactly what the use cases for size-tracking are, so that will need some discussion with people who are interested in that.

I decided to place the gc commands in cargo's `cargo clean` command because I would like to have a single place for users to go for deleting cache artifacts. It may be possible that they get moved to another command, however introducing new subcommands is quite difficult (due to shadowing existing third-party commands). Other options might be `cargo gc`, `cargo maintenance`, `cargo cache`, etc. But there are existing extensions that those names would interfere with.

There are also more directions to go in the future. For example, we could add a `cargo clean info` subcommand which could be used for querying cache information (like the sizes and such). There is also the rest of the steps in the original proposal at https://hackmd.io/U_k79wk7SkCQ8_dJgIXwJg for rolling out sqlite support.

See #12633 for the tracking issue
@ehuss ehuss added the Z-gc Nightly: garbage collection label Nov 28, 2023
@ehuss
Contributor Author

ehuss commented Feb 20, 2024

In https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/Stabilizing.20global.20cache.20tracking/near/422500781 I am proposing to stabilize just the recording of the cache data as a first step. This doesn't enable automatic or manual gc.

bors added a commit that referenced this issue Feb 27, 2024
Stabilize global cache data tracking.

This stabilizes the global cache last-use data tracking. This does not stabilize automatic or manual gc.

Tracking issue: #12633

## Motivation

The intent is to start getting cargo to collect data so that when we do stabilize automatic gc, there will be a wider range of cargo versions that will be updating the data so the user is less likely to see cache misses due to an over-aggressive gc.

Additionally, this should give us more exposure and time to respond to any problems, such as filesystem issues.

## What is stabilized?

Cargo will now automatically create and update an SQLite database, located at `$CARGO_HOME/.global-cache`. This database tracks timestamps of the last time cargo touched an index, `.crate` file, extracted crate `src` directory, git clone, or git checkout. The schema for this database is [here](https://github.com/rust-lang/cargo/blob/a7e93479261432593cb70aea5099ed02dfd08cf5/src/cargo/core/global_cache_tracker.rs#L233-L307).

Cargo updates this file on any command that needs to touch any of those on-disk caches.
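As a rough illustration of what "tracking last-use timestamps" means, here is a minimal in-memory sketch. It is not cargo's `GlobalCacheTracker` and does not use SQLite; `LastUse`, `Kind`, `mark_used`, and `last_used` are hypothetical names, and the rule that a timestamp never moves backwards is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// The kinds of cache entries named in the text above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Kind {
    Index,
    CrateFile,
    ExtractedSrc,
    GitClone,
    GitCheckout,
}

/// In-memory stand-in for the on-disk database (hypothetical type).
#[derive(Default)]
struct LastUse {
    // (kind, entry name) -> seconds since the Unix epoch of the last use
    seen: HashMap<(Kind, String), u64>,
}

impl LastUse {
    /// Records that `name` was used at time `now`.
    /// Assumption of this sketch: a timestamp never moves backwards.
    fn mark_used(&mut self, kind: Kind, name: &str, now: u64) {
        let slot = self.seen.entry((kind, name.to_string())).or_insert(now);
        *slot = (*slot).max(now);
    }

    /// Returns the last recorded use, or `None` if the entry was never seen.
    fn last_used(&self, kind: Kind, name: &str) -> Option<u64> {
        self.seen.get(&(kind, name.to_string())).copied()
    }
}
```

A later cleanup pass would only need to compare `last_used` against the current time; entries that were never recorded have no timestamp at all, which is why collecting data early matters.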

The testsuite for this feature is located in [`global_cache_tracker.rs`](https://github.com/rust-lang/cargo/blob/a7e93479261432593cb70aea5099ed02dfd08cf5/tests/testsuite/global_cache_tracker.rs).

## Stabilization risks

There are some risks to stabilizing, since it commits us to staying compatible with the current design.

The concerns I can think of with stabilizing:

This commits us to using the database schema in the current design.

The code is designed to support both backwards and forwards compatible extensions, so I think it should be fairly flexible. Worst case, if we need to make changes that are fundamentally incompatible, then we can switch to a different database filename or tracking approach.

There are certain kinds of errors that are ignored if cargo fails to save the tracking data (see [`is_silent_error`](https://github.com/rust-lang/cargo/blob/64ccff290fe20e2aa7c04b9c71460a7fd962ea61/src/cargo/core/global_cache_tracker.rs#L1796-L1813)).

The silent errors are only shown with --verbose. This should help deal with read-only filesystem mounts and other issues. Non-silent errors always show just a warning. I don't know if that will be sufficient to avoid problems.

I did a fair bit of testing of performance, and there is a bench suite for this code, but we don't know if there will be pathological problems in the real world. It also incurs an overhead that all builds will have to pay for.

I've done my best to ensure that this should be reliable when used on network or unusual filesystems, but I think those are still a high-risk category. SQLite should be configured to accommodate these cases, as well as the extensive locking code (which has already been enabled).

A call for public testing was announced in December at https://blog.rust-lang.org/2023/12/11/cargo-cache-cleaning.html. At this time, I don't see any issues in https://github.com/rust-lang/cargo/labels/Z-gc that should block this step.
@airstrike

I would wager most rust users associate the words "garbage collection" with memory rather than cached files that have gone stale. It's unfortunate that the term is being overloaded here

@ssokolow

ssokolow commented Apr 22, 2024

> I would wager most rust users associate the words "garbage collection" with memory rather than cached files that have gone stale. It's unfortunate that the term is being overloaded here

It depends. Some of us are familiar with the git gc command.

@epage
Contributor

epage commented Apr 23, 2024

The feature is in development and how we present it to the user is not yet decided. In #13060, we are exploring how to present it in the CLI, including looking at prior art from other tools.

@cessen

cessen commented Apr 23, 2024

Thanks for redirecting me to the naming discussion here @epage

I think(?) one of the distinguishing characteristics of garbage collection implementations (whether they be for memory, git, nix, etc.) is that they remove things that are "unreachable" in some sense, and thus can be confidently disposed of as not used. That particular characteristic is specifically not true of this feature, as discussed in #13176.

Having said that, in practice I'm skeptical that calling this feature "garbage collection" is actually going to confuse people. Nevertheless, it does seem like one of those "might as well be more accurate" kind of situations, so I would suggest calling it "cache cleaning" or similar.

@ehuss
Contributor Author

ehuss commented Jul 22, 2024

I have proposed to stabilize the automatic side of this feature in #14287.

github-merge-queue bot pushed a commit that referenced this issue Apr 27, 2025
This proposes to stabilize automatic garbage collection of Cargo's
global cache data in the cargo home directory.

### What is being stabilized?

This PR stabilizes automatic garbage collection, which is triggered at
most once per day by default. This automatic gc will delete old, unused
files in cargo's home directory.

It will delete files that need to be downloaded from the network after 3
months, and files that can be generated without network access after 1
month. These thresholds are intended to balance the intent of reducing
cargo's disk usage versus deleting too often forcing cargo to do extra
work when files are missing.
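The two rules above (run at most once per day, delete by age and origin) can be sketched as follows. This is not cargo's implementation: `Origin`, `max_age`, `should_delete`, and `gc_due` are hypothetical names, and treating "1 month" as 30 days and "3 months" as 90 days is an assumption of this sketch.

```rust
use std::time::Duration;

const DAY: Duration = Duration::from_secs(24 * 60 * 60);

/// How a cache entry can be recreated if it is deleted.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Origin {
    /// Must be fetched from the network again.
    Downloaded,
    /// Can be regenerated locally without network access.
    Regenerable,
}

/// Maximum time an entry may stay unused (assumed 90 and 30 days).
fn max_age(origin: Origin) -> Duration {
    match origin {
        Origin::Downloaded => DAY * 90,
        Origin::Regenerable => DAY * 30,
    }
}

/// True if an entry unused for `unused_for` is old enough to delete.
fn should_delete(origin: Origin, unused_for: Duration) -> bool {
    unused_for > max_age(origin)
}

/// True if automatic gc should run: never run before, or last run at least a day ago.
fn gc_due(since_last_gc: Option<Duration>) -> bool {
    since_last_gc.map_or(true, |elapsed| elapsed >= DAY)
}
```

With these assumed thresholds, an extracted source directory unused for 45 days would be deleted, while the `.crate` file it came from would be kept until it has gone unused for more than 90 days.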

Tracking of the last-use data is stored in a sqlite database in the
cargo home directory. Cargo updates timestamps in that database whenever
it accesses a file in the cache. This part is already stabilized.

This PR also stabilizes the `gc.auto.frequency` configuration option.
The primary reason a user may want to set it is to disable gc with the
value "never", should the need arise.

When gc is initiated, and there are files to delete, there will be a
progress bar while it is deleting them. The progress bar will disappear
when it finishes. If the user runs with `-v` verbose option, then cargo
will also display which files it deletes.

If there is an error while cleaning, cargo will only display a warning,
and otherwise continue.

### What is not being stabilized?

The manual garbage collection option (via `cargo clean gc`) is not
proposed to be stabilized at this time. That still needs some design
work. This is tracked in
#13060.

Additionally, there are several low-level config options currently
implemented which define the thresholds for when it will delete files. I
think these options are probably too low-level and specific. This is
tracked in #13061.

Garbage collection of build artifacts is not yet implemented, and
tracked in #13136.

### Background

This feature is tracked in
#12633 and was implemented in a
variety of PRs, primarily #12634.

The tests for this feature are located in
https://github.com/rust-lang/cargo/blob/master/tests/testsuite/global_cache_tracker.rs.

Cargo started tracking the last-use data on stable via
#13492 in 1.78 which was released
2024-05-02. This PR is proposing to stabilize automatic deletion in 1.82
which will be released on 2024-10-17.

### Risks

Users who frequently use versions of Rust older than 1.78 will not have
the last-use data tracking updated. If they infrequently use 1.78 or
newer, and use the same cache files, then the last-use tracking will
only be updated by the newer versions. If that time frame is more than 1
month (or 3 months for downloaded data), then cargo will delete files
that the older versions are still using. This means the next time they
run the older version, it will have to re-download or re-extract the
files.

The effects of deleting cache data in environments where cargo's cache
is modified by external tools are not fully known. For example, CI
caching systems may save and restore cargo's cache. Similarly, things
like Docker images that try to save the cache in a layer, or mount the
cache in a read-only filesystem may have undesirable interactions.

The once-a-day performance hit might be noticeable to some people. I've
been using this for several months, and almost never notice it. However,
slower systems, or situations where there is a lot of data to delete
might take a while (on the order of seconds hopefully).
@tgrushka

tgrushka commented May 16, 2025

New (1.3 year or so) Rust developer here. I see this issue is still open. There was a merge of #14287. Hope I can still post a question on this.

I have no knowledge of internals, so please bear with my question if it's completely naive.

Why is this garbage collection by date/timestamp complexity necessary for cargo to clean artifacts automatically? Am I missing something?

In my mind:

  • the Rust compilation unit is the crate
  • Cargo knows what version of a crate is being built (doesn't it?) and what platform it is building it for
    • and possibly its source repo -- whether crates.io, git, path, etc. -- but maybe this part doesn't matter, because it's the crate name, version, and platform that really "counts" for the project, right? e.g. if I override a dep with a particular version, that "replaces" that version in my project as far as Cargo is concerned?
  • Each time Cargo builds an artifact, it is put in a hashed structure under target/** -- so Cargo obviously hashes the artifact somehow, but I have no idea what info this hash represents (the contents of the binary I assume?):
├── deps
│   ├── adler2-e884df5f7f1a4eb1.d
│   ├── ahash-323a9839de8403ad.d
│   ├── aho_corasick-015c2a7e4d760308.d
...
│   ├── libserde_derive-70b20053052b9820.dylib
│   ├── libserde_json-04aeac63bb2c54fb.rlib
│   ├── libserde_json-04aeac63bb2c54fb.rmeta
│   ├── libserde_json-ee256ea891812ead.rmeta
│   ├── libserde_path_to_error-fcc0061e09c623b4.rmeta
...
  • see the 3 libserde_json- files with hashes for example -- leading me to believe the hashes are for the contents of those binaries, and a cargo tree | grep serde_json confirms to me that there's only one version of serde_json in my project, but 3 artifacts.

Maybe it would be a lot of additional work/extra feature, but couldn't Cargo somehow keep track of what crate + version + platform each of these artifacts was compiled for?

Then, the next time Cargo builds that crate and version(s) + platform(s) for that project (maybe this wouldn't work for a shared Cargo directory), knowing its dependency tree, Cargo would automatically delete the old hashed artifact(s) when it builds the new one?

  • EDIT: Maybe it could work for a shared dir, if Cargo "hashed" the source repo, version, and platform of each crate it builds. The worst case scenario in my mind would be two projects depending on the same crate and version + platform but from a git repo main branch, but then you'd want it to rebuild anyway. If in different repos, or different git refs, the "hash" Cargo generates for that combo of crate + version + platform + source would be different.

I guess this would require a file to store the hashes and version combos, but then there's a .fingerprint directory with all these other metadata files:

dep-lib-serde_json	lib-serde_json
invoked.timestamp	lib-serde_json.json

So maybe the necessary info is already there? Of course, unlinking old files would add more build time, but how much? A whole day of cargo build on my project takes about 30 seconds to a minute to clean, with several GB of files. I wonder what's creating so very much cruft?

What am I missing? Isn't trying to determine what should be deleted by access time a brittle and possibly over-complicated way of achieving something that Cargo could itself already be smart enough to solve without this extra guesswork? Couldn't Cargo just sweep itself during compile -- with an optional flag or perhaps per-project or global setting?

EDIT: Oh yeah, if I bump versions, the old artifacts for the old versions would no longer get swept.

So yes, I see the usefulness of timestamp garbage collection as well.

But I don't bump my versions much in one day working a project, if at all, and yet so very much cruft -- GB and GB -- builds up -- maybe from crates that generate a lot of code or bindings or something. No idea. But I would think there could be some way for Cargo to track the crate + version + platform + source hash of each artifact, and if those match, just delete the old files?

@weihanglo
Member

There are different ways the build cache directory can fill up. #5026 (comment) is about toolchain upgrades. The original hackmd note for this feature also called out some unresolved questions (search for “deps” in it). The #7150 mega tracking issue has more details. One of the current plans is to split the intermediate build artifact directory out from the final artifact directory, which would give us more room to restructure the layout so that it is more reasonable and easier to track.

> (the contents of the binary I assume?):

No. The hash is determined before any compilation. For what the hashes track, see https://doc.rust-lang.org/nightly/nightly-rustc/cargo/core/compiler/fingerprint/index.html.
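To make "determined before any compilation" concrete, here is a minimal sketch of hashing build inputs rather than build outputs. It is not Cargo's algorithm: `UnitInputs`, its fields, and `suffix` are hypothetical, and the field list is only an assumed subset of what the linked fingerprint documentation describes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Inputs that are known before the compiler runs (hypothetical subset).
#[derive(Hash)]
struct UnitInputs<'a> {
    name: &'a str,
    version: &'a str,
    features: &'a [&'a str],
    profile: &'a str,
    rustc_version: &'a str,
}

/// Builds a file-name suffix from the inputs alone; no compiled bytes are read.
fn suffix(inputs: &UnitInputs) -> String {
    let mut hasher = DefaultHasher::new();
    inputs.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```

Under this model, one crate version built with two different profiles or feature sets yields two different suffixes, which is one way a single `serde_json` version can end up with several artifacts in `deps`.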

> …but couldn't Cargo somehow keep track of what crate + version + platform each of these artifacts was compiled for?

Theoretically yes, with aid from rustc. In reality, rustc can emit more files than Cargo knows about.

> and if those match, just delete the old files?

Even if Cargo already tracked everything, it still couldn't simply do that by default, because the build cache directory could be shared between different workspaces. Switching between different toolchain versions also isn't an uncommon workflow, so that needs to be considered too.

@epage
Contributor

epage commented May 17, 2025

> New (1.3 year or so) Rust developer here. I see this issue is still open. There was a merge of #14287. Hope I can still post a question on this.
>
> Why is this garbage collection by date/timestamp complexity necessary for cargo to clean artifacts automatically? Am I missing something?

For context, there are multiple resources to garbage collect. What has been stabilized is GC of user-wide resources like the Index, .crate files that were downloaded, and the source extracted from .crate files. We have not done any work on GCing the contents of target/ yet. @weihanglo pointed out different degrees of doing this.

> Then, the next time Cargo builds that crate and version(s) + platform(s) for that project (maybe this wouldn't work for a shared Cargo directory), knowing its dependency tree, Cargo would automatically delete the old hashed artifact(s) when it builds the new one?

The compiler clears out anything from the incremental compilation cache that wasn't used in the current compilation. This doesn't work well for Cargo because users can switch between different feature combinations that would affect compilation. Maybe crate source + name + version could be used to clean up, but that can also make it harder to jump back and forth between versions. There are times I wish we had different caches of local code so I could more easily jump back and forth between commits.
