This issue was moved to a discussion.

Some ideas #9

Closed
safinaskar opened this issue Mar 9, 2025 · 1 comment
Comments

@safinaskar

Thanks for this project!

Some ideas (note that I have only read the blog post and didn't dig further):

  • It may be a good idea to fully replicate git's CLI, at least as an option. This would help spread the project.
  • Migrate away from SHA1. It is broken. It is one very unfortunate design mistake of git's. Also, you should change hashes regularly anyway: https://valerieaurora.org/hash.html . (Well, actually migrating away from SHA1 will likely break GitHub compatibility, so of course it makes sense to support SHA1 for now. But please support other hashes, too. Don't repeat git's mistake: git originally hardcoded SHA1 everywhere.)
  • In the past I spent a lot of time researching CDC-and-deduplication. My findings are here: casync decompresses x1.5 faster than borg on same config (and other benchmarks) borgbackup/borg#7674 . A short overview of FOSS solutions is here: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl . In short, existing solutions are under-optimized, and there is a lot of low-hanging fruit here. I was able to very easily create a very small program in Rust that beats existing deduplication solutions by a wide margin (but my program doesn't use CDC). So I suggest reading my ideas and comparing the speed of your solution with other solutions.
  • Patch-based merging seems to be a killer feature (assuming it works well). So, I suggest making it the main ad strategy. Linux devs often maintain their patchsets as a series of patch files, not as git branches, exactly because git merging doesn't work well. So, reach out to Linux devs and tell them about your tool. In particular, the number two person in Linux, Greg KH, maintainer of the stable Linux trees, stores his stable trees as a series of patch files in git (aaaah!). Here he describes his workflow: http://www.kroah.com/log/blog/2019/08/14/patch-workflow-with-mutt-2019/ . The key parts are these: "The stable kernel tree, while under development, is kept as a series of patches that need to be applied to the previous release. This series of patches is maintained by using a tool called (quilt)... Anyway, the stable patches are kept in a quilt series in a repository that is kept under version control in git (complex, yeah, sorry.) That queue can always be found (here)". The same applies to a lot of Debian packages. For example, gcc (and lots of other Debian packages) is, again, maintained as patches stored in git. See here: https://salsa.debian.org/toolchain-team/gcc/-/tree/gcc-14-debian/debian/patches . I think this is, again, because of git merge and git rebase problems. So, promote xit as a tool that solves all these problems. Of course, it helps if you are CLI-compatible with git.
  • "If the first byte is 0, it is uncompressed; if it is 1, it is zlib-compressed". I suggest moving to zstd, it is better in every way (faster and smaller). Also, zstd may be good at compressing binary files (at least I hope zstd doesn't make them significantly larger). "While xit has compression support, it currently disables it even for text files". Try zstd -0: it is fast enough, while giving substantial compression for text files. If it is too slow, try lz4, which is even faster.
  • "Want to find the descendent(s) of a commit? Uhhh...well, you can't". As pointed out on lobsters, you can see descendants: https://lobste.rs/s/mltpfg/xit_is_coming#c_cnwsps . (But I understand your point, i.e. you argue that we need a separate data structure for this)

Feel free to ask any questions.

Also: even if you implement all of these, I still do not plan to use xit. (I'm not trying to insult you, I'm just trying to be honest here about my motivations.)

Also, there is a discussion of your project here: https://lobste.rs/s/mltpfg/xit_is_coming . If you want, I can give you an invite.

@radarroark
Owner

It may be a good idea to fully replicate git's CLI.

Right now I'm trying to reach a nice middle ground between being too different from git's CLI (making switching harder) and being too similar (repeating git's design mistakes). So far, xit differs from git's CLI only in areas where the git CLI is clearly badly designed:

  1. xit doesn't implement the overloaded checkout command, instead opting for switch and restore (which git itself added later)
  2. xit doesn't implement the poorly-named --mixed, --hard, and --soft flags for reset, instead making them separate top level commands: reset, reset-dir, and reset-add
  3. in tag, branch, remote, and config, xit uses the same subcommands: add, rm, and list. In git there is a ton of unnecessary inconsistency here

It will never be a design goal to fully replicate git's CLI to make xit a drop-in replacement for git, because that would require the commands to replicate git's behavior perfectly.

Migrate away from SHA1.

Agreed, and xit already supports SHA256 as well. I'm even running the main tests with it enabled to make sure it works.

However, it is not currently enabled in the CLI tool. The only thing left to do is to make it work with git hosts that don't support SHA256 (i.e., pretty much all of them). I will probably have to maintain local maps of SHA256 -> SHA1 and SHA1 -> SHA256 oids, so that I can send/receive SHA1s over the wire and translate them to SHA256 locally.
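
A minimal sketch of what such a two-way oid map could look like, in Python rather than xit's own Zig. `OidMap`, `git_blob_oid`, `to_wire`, and `to_local` are hypothetical names for illustration, not xit's API. The sketch only covers blobs: trees and commits embed the oids of the objects they reference, so translating them would also require rewriting their contents, which is not shown here.

```python
import hashlib


def git_blob_oid(content: bytes, algo: str) -> str:
    """Hash a blob the way git does: a 'blob <size>\\0' header, then the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.new(algo, header + content).hexdigest()


class OidMap:
    """Hypothetical two-way map between SHA-1 and SHA-256 oids (blobs only)."""

    def __init__(self) -> None:
        self._sha1_to_sha256: dict[str, str] = {}
        self._sha256_to_sha1: dict[str, str] = {}

    def add_blob(self, content: bytes) -> tuple[str, str]:
        """Record both oids for a blob and return (sha1, sha256)."""
        sha1 = git_blob_oid(content, "sha1")
        sha256 = git_blob_oid(content, "sha256")
        self._sha1_to_sha256[sha1] = sha256
        self._sha256_to_sha1[sha256] = sha1
        return sha1, sha256

    def to_wire(self, sha256: str) -> str:
        """Local SHA-256 oid -> SHA-1 oid to send to a SHA-1-only host."""
        return self._sha256_to_sha1[sha256]

    def to_local(self, sha1: str) -> str:
        """SHA-1 oid received over the wire -> local SHA-256 oid."""
        return self._sha1_to_sha256[sha1]
```

Both directions are plain dictionary lookups, so the cost of the scheme is mostly in keeping the two maps persistent and in sync with the object store.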

In the past I spent a lot of time researching CDC-and-deduplication.

I'll take a look, thank you for this.
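
For readers unfamiliar with the term, here is a rough sketch of content-defined chunking (CDC) plus deduplication. It uses a simple gear-style rolling hash; it is not xit's chunker, nor casync's or borg's, and `chunk_data`, `dedup_store`, the mask, and the 1 KB / 4 KB bounds are all assumptions chosen for illustration.

```python
import hashlib
import random

# Fixed pseudo-random table for the gear hash (seeded so runs are repeatable).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

# Cut when the top 11 bits of the rolling hash are zero (about one in 2048 positions).
CUT_MASK = 0xFFE00000


def chunk_data(data: bytes, min_size: int = 1024, max_size: int = 4096) -> list[bytes]:
    """Split data at content-defined boundaries, bounded by min_size and max_size."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if size < min_size:
            continue
        if (h & CUT_MASK) == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than min_size
    return chunks


def dedup_store(store: dict[str, bytes], data: bytes) -> list[str]:
    """Store each unique chunk once, keyed by its SHA-256; return the chunk id list."""
    ids = []
    for c in chunk_data(data):
        key = hashlib.sha256(c).hexdigest()
        store.setdefault(key, c)
        ids.append(key)
    return ids
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, an insertion near the start of a file tends to disturb only the chunks around it, so later chunks can still be shared with the previous version.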

Patch-based merging seems to be a killer feature (assuming it works well). So, I suggest making it the main ad strategy.

Yeah it's certainly the most unique thing about xit right now. The main downside is that patch generation takes a long time for repos with large histories. I just changed it so this is an optional step that you enable with xit patch on, so you at least don't need to wait for it to happen during the clone. If it is disabled, merge/cherry-pick falls back to a git-style three-way merge.

It already is the most prominent thing in the readme after git compatibility, but I also don't want to over-promise this early. It needs a lot more real world testing to validate it.

I suggest moving to zstd, it is better in every way (faster and smaller).

Are you sure it is always better than zlib? When I was researching, I found this issue which suggests it's actually worse than zlib for small blobs (less than 4KB). Right now, xit makes chunks that are 1-4 KB in size (yes, I should make them larger for really large files...it's on my list).

Even if that problem can be resolved, I also need to wait for the Zig standard library to add a zstd compressor. Last I checked, it only had a decompressor.

(But I understand your point, i.e. you argue that we need a separate data structure for this)

Right, and in reality any deficit in git's data structures could be solved in theory by adding "just another file" somewhere in your .git dir. My main critique in db.md is that each of these is a special-case solution. Another example is the reftable, which is yet another ad hoc format for storing one particular part of git's state (the refs) in a more efficient format. With a general purpose database, inventing all of these disparate formats wouldn't be necessary.
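
As an illustration of the kind of extra data structure being discussed, here is a Python sketch of a child index: commits only record their parents, so finding descendants needs the reverse mapping. The graph shape and the names `build_child_index` and `descendants` are assumptions for the example, not xit's or git's actual storage layout.

```python
from collections import defaultdict, deque


def build_child_index(parents_of: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert a commit -> parents mapping into a parent -> children index."""
    children: dict[str, set[str]] = defaultdict(set)
    for commit, parents in parents_of.items():
        for parent in parents:
            children[parent].add(commit)
    return children


def descendants(children: dict[str, set[str]], commit: str) -> set[str]:
    """Return every commit reachable from `commit` by following child links."""
    seen: set[str] = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical history: a <- b <- c, a branch d off b, and e merging c and d.
history = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["c", "d"]}
index = build_child_index(history)
print(sorted(descendants(index, "b")))  # ['c', 'd', 'e']
```

Without the index, answering the same question means walking back from every ref and checking whether the commit is an ancestor, which is why the index either has to live in a special-purpose file or fall out of a general-purpose database.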

Repository owner locked and limited conversation to collaborators Mar 10, 2025
@radarroark radarroark converted this issue into discussion #10 Mar 10, 2025