Introduce zfs rewrite subcommand #17246
I've tried to find some kernel APIs to wire this to, but found that plenty of Linux file systems each implement their own IOCTLs for similar purposes. I did the same, except that I picked the IOCTL number almost arbitrarily, since ZFS seems quite rough in this area. I am open to any better ideas before this is committed. |
This looks amazing! Not having to sift through half a dozen shell scripts every time this comes up to see what currently handles the most edge cases correctly is very much appreciated. Especially with RaidZ expansion, being able to direct users to run a built-in command instead of debating what script to send them to would be very nice. Also being able to reliably rewrite a live dataset while it's in use without having to worry about skipped files or mtime conflicts would make the whole process much less of a hassle. With the only thing to really worry about being snapshots/space usage this seems as close to perfect as reasonably possible (without diving deep into internals and messing with snapshot immutability). Bravo! |
thank you. Fixes one of the biggest problems with ZFS. Is there a way to suspend the process? It might be nice to have it run only during off hours. |
It does one file at a time, and should be killable in between. Signal handling within one huge file can probably be added. Though the question of the process restart is on the user. I didn't plan to go that deep into the area within this PR. |
I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file system level, that might be the case here as well. |
This will likely be a good use case for GNU Parallel. |
It can take a directory as an argument and there are some recursive functions and iterators in the code so piping find into it should not be necessary. That avoids some userspace file handling overhead, but it still has to go through the contents of each directory one file at a time. I also don't see any parallel execution or threading (though I'm not too familiar with ZFS internals, maybe some of the primitives used here run asynchronously?). Whether doing parallelism in userspace by just calling it for many files/directories at once or not it should have the required locking to just run in the background and be significantly more elegant than the CP + mtime (or potentially userspace hash) check to make sure files didn't change during the copy process avoiding one of the potential pitfalls of existing solutions. |
I haven't benchmarked it deeply yet, but unless the files are tiny, I don't expect a major need for parallelism. The kernel code should handle up to 16MB at a time, and it lets ZFS do read-ahead and write-back on top of that, so there will be quite a lot in the pipeline to saturate the disks and/or the system, especially if there is some compression/checksumming/encryption. And without the need to copy data to/from user-space, the single thread will not be doing too much; I think mostly decompression from ARC. A bunch of small files on a wide HDD pool may indeed suffer from read latency, I suspect, but that we can optimize/parallelize in user-space all day long. |
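If small files on a wide HDD pool do turn out to be latency-bound, the user-space parallelism mentioned above could look roughly like the sketch below. It assumes that one `zfs rewrite -r` process per entry directly under a directory is a sensible unit of work; `rewrite_in_parallel` and its `run` parameter are hypothetical names, and only the `-r` flag shown elsewhere in this thread is used.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def rewrite_in_parallel(root, workers=4, run=subprocess.run):
    """Run one `zfs rewrite -r` per entry directly under root.

    `run` is injectable so the logic can be dry-run without ZFS.
    Returns a dict mapping each path to its exit status.
    """
    targets = sorted(os.path.join(root, name) for name in os.listdir(root))

    def one(path):
        # Each call just waits for the kernel to do the rewrite,
        # so threads suffice; no file data passes through Python.
        result = run(["zfs", "rewrite", "-r", path])
        return path, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, targets))
```

Whether this helps at all depends on the pool layout; for large files a single instance may already saturate the disks.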
I gave this a quick test. It's very fast and does exactly what it says 👍
I can already see people writing scripts that go through every dataset, setting the optimal compression, recordsize, etc, and zfs rewrite-ing them. |
Cool! Though the recordsize is one of the things it can't change, since that would require a real byte-level copy, not just marking existing blocks dirty. I am not sure it can be done under load in general. At least it would be much more complicated. |
Umm this is basically same as doing send | recv, isn't it? I mean, in a way, this is already possible to do without any changes, isn't it? Recv will even respect a lower recordsize, if I'm not mistaken - at least when receiving into a pool without large blocks support, it has to do that. I'm thinking whether we can do better, in the original sense of ZFS "better", meaning "automagic" - what do you think of using snapshots, send|recv, in a loop with ever decreasing delta size and then when the delta isn't decreasing anymore, we could swap those datasets and use (perhaps slightly modified) It'd be even cooler if it could coalesce smaller blocks into larger ones, but that potentially implies performance problems with write amplification, I would say if the app writes in smaller chunks that it gets onto disk in such smaller chunks, it's probably for the best to leave them that way. For any practical use-case I could think of though, I would definitely appreciate the ability to split the blocks of a dataset using smaller If there's a way how to make |
send recv has the huge downside of requiring 2x the space, even if you do the delta size thing since it has to send the entire dataset at least once and old data can't be deleted until the new dataset is complete.
Isn't this exactly what rewrite does? Change the options, run it and all the blocks are changed in the background. Without an application even seeing a change to the file. And unlike send recv it only needs a few MB of extra space. Edit: with the only real exception being record size, but recv also solves that only partially at best and it doesn't look like there's a reasonable way to work around that in a wholly transparent fashion. |
(force-pushed from d23a371 to c5f4413)
Which release is this game changing enhancement likely to land in? |
@stuartthebruce So far it hasn't landed even in master, so anybody who wants to speed it up is welcome to test and comment. In general though, when completed, there is no reason why, aside from 2.4.0, it can't be ported back to some 2.3.x release of the time. |
Good to know there are no obvious blockers from including in a future 2.3.x. Once this hits master I will help by setting up a test system with 1/2PB of 10^9 small files to see if I can break it. Is there any reason to think the code will be sensitive to Linux vs FreeBSD? |
The IOCTL interfaces of the kernels are obviously slightly different, requiring OS-specific shims, as with most other VFS-related code. But that does not seem to be a big problem, as Tony confirmed it works on Linux too from the first try. |
Since this introduces a new IOCTL API, I'd appreciate some feedback before it hits master in case some desired functionality might require API changes aside of the |
OK, I will see if I can find some time this next week to stress test. |
I started testing with a copy of the following home directories on an otherwise idle RL9 test system initially running version 2.3.1,

    [root@zfsarchive1 ~]# zfs list
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    jbod17              16.8T   910T   136K  /jbod17
    jbod17/cal           478G   910T   473G  /jbod17/cal
    jbod17/dqr          4.06T   910T  3.97T  /jbod17/dqr
    jbod17/grb.exttrig  4.99T   910T  4.98T  /jbod17/grb.exttrig
    jbod17/idq          2.51T   910T  2.51T  /jbod17/idq
    jbod17/pe.o4        4.73T   910T  4.63T  /jbod17/pe.o4

and then updated to,

    [root@zfsarchive1 zfs]# git log
    commit c5f4413446609a2fd3c4b817f7ad7ebceb915991 (HEAD -> rewrite, origin/rewrite)
    Author: Alexander Motin
    Date:   Tue Apr 15 16:07:05 2025 -0400

        Introduce zfs rewrite subcommand

        This allows to rewrite content of specified file(s) as-is without
    ...

The initial test is to remove compression from one of these datasets with 2M files,

    [root@zfsarchive1 ~]# zfs get space jbod17/cal
    NAME        PROPERTY              VALUE       SOURCE
    jbod17/cal  name                  jbod17/cal  -
    jbod17/cal  available             910T        -
    jbod17/cal  used                  478G        -
    jbod17/cal  usedbysnapshots       4.67G       -
    jbod17/cal  usedbydataset         473G        -
    jbod17/cal  usedbyrefreservation  0B          -
    jbod17/cal  usedbychildren        0B          -
    [root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
    NAME        PROPERTY       VALUE  SOURCE
    jbod17/cal  compression    on     inherited from jbod17
    jbod17/cal  compressratio  1.14x  -
    [root@zfsarchive1 ~]# df -i /jbod17/cal
    Filesystem            Inodes   IUsed         IFree IUse% Mounted on
    jbod17/cal     1954605040973 1985661 1954603055312    1% /jbod17/cal
    [root@zfsarchive1 ~]# set compression=off jbod17/cal
    [root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal
    ...
|
This is currently processing the files at ~100MB/s,

    [root@zfsarchive1 ~]# zpool iostat jbod17 5
                  capacity     operations     bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    jbod17      20.8T  1.12P  7.28K    367  84.3M   103M
    jbod17      20.8T  1.12P  7.28K    494  70.4M   181M
    jbod17      20.8T  1.12P  7.41K    387  87.1M  84.7M
    jbod17      20.8T  1.12P  9.09K    440  91.8M   117M
    jbod17      20.8T  1.12P  7.21K    434  75.4M   110M
    jbod17      20.8T  1.12P  12.8K    515   137M   133M
    jbod17      20.8T  1.12P  18.5K    996   200M   280M

with the zfs process using ~15% of a cpu-core

    [root@zfsarchive1 ~]# top
    top - 16:25:06 up 13 min,  3 users,  load average: 2.62, 2.41, 1.85
    Tasks: 2395 total,   2 running, 2393 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.0 us,  1.7 sy,  0.0 ni, 97.5 id,  0.7 wa,  0.1 hi,  0.0 si,  0.0 st
    MiB Mem : 385934.4 total, 306918.4 free,  78558.8 used,   3688.4 buff/cache
    MiB Swap:      0.0 total,      0.0 free,      0.0 used. 307375.6 avail Mem

        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      56137 root      20   0   16352   5120   5120 D  14.1   0.0   0:54.92 zfs
      41603 root       0 -20       0      0      0 S   1.6   0.0   0:03.67 z_wr_int_9
      42828 root       0 -20       0      0      0 S   1.6   0.0   0:03.61 z_wr_int_0

and

    Samples: 95K of event 'cycles:P', 4000 Hz, Event count (approx.): 46282639308 lost: 0/0 drop: 0/0
      Children      Self  Shared Object  Symbol
    +   76.99%     0.72%  [kernel]       [k] taskq_thread
    +   57.02%     0.30%  [kernel]       [k] zio_execute
    +   47.12%     0.25%  [kernel]       [k] _raw_spin_lock_irqsave
    +   46.17%    46.07%  [kernel]       [k] native_queued_spin_lock_slowpath
    +   39.39%     0.00%  [kernel]       [k] ret_from_fork
    +   39.38%     0.00%  [kernel]       [k] kthread
    +   35.26%     0.59%  [kernel]       [k] zio_done
    +   30.81%     0.26%  [kernel]       [k] taskq_dispatch_ent
    +   12.70%     0.07%  [kernel]       [k] zio_vdev_io_start
    +   12.65%     0.08%  [kernel]       [k] zio_nowait
    +   10.83%     0.02%  [kernel]       [k] vdev_mirror_io_start
    +    7.65%     0.01%  [kernel]       [k] entry_SYSCALL_64_after_hwframe
    +    7.64%     0.04%  [kernel]       [k] do_syscall_64
    +    7.61%     0.03%  [kernel]       [k] vdev_raidz_io_start
    +    7.50%     0.04%  [kernel]       [k] zio_write_compress
    +    7.29%     0.02%  [kernel]       [k] zfs_lz4_compress
    ...
|
That finished without crashing,

    [root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal

    real    176m32.645s
    user    0m3.518s
    sys     8m56.518s

And resulted in,

    [root@zfsarchive1 ~]# zfs get space jbod17/cal
    NAME        PROPERTY              VALUE       SOURCE
    jbod17/cal  name                  jbod17/cal  -
    jbod17/cal  available             910T        -
    jbod17/cal  used                  950G        -
    jbod17/cal  usedbysnapshots       477G        -
    jbod17/cal  usedbydataset         473G        -
    jbod17/cal  usedbyrefreservation  0B          -
    jbod17/cal  usedbychildren        0B          -
    [root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
    NAME        PROPERTY       VALUE  SOURCE
    jbod17/cal  compression    on     inherited from jbod17
    jbod17/cal  compressratio  1.14x  -
    [root@zfsarchive1 ~]# df -i /jbod17/cal
    Filesystem            Inodes   IUsed         IFree IUse% Mounted on
    jbod17/cal     1953610773572 1985661 1953608787911    1% /jbod17/cal

After destroying all of the unmodified snapshots,

    [root@zfsarchive1 ~]# zfs list -H -o name -t snapshot jbod17/cal | xargs --no-run-if-empty -n1 zfs destroy
    [root@zfsarchive1 ~]# zfs get space jbod17/cal
    NAME        PROPERTY              VALUE       SOURCE
    jbod17/cal  name                  jbod17/cal  -
    jbod17/cal  available             910T        -
    jbod17/cal  used                  473G        -
    jbod17/cal  usedbysnapshots       0B          -
    jbod17/cal  usedbydataset         473G        -
    jbod17/cal  usedbyrefreservation  0B          -
    jbod17/cal  usedbychildren        0B          -
    [root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
    NAME        PROPERTY       VALUE  SOURCE
    jbod17/cal  compression    on     inherited from jbod17
    jbod17/cal  compressratio  1.14x  -

Which appears to have been a big NOOP. It appears I fat fingered something and didn't actually hit return on the command to set

    [root@zfsarchive1 ~]# zfs set compression=off jbod17/cal
    [root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
    NAME        PROPERTY       VALUE  SOURCE
    jbod17/cal  compression    off    local
    jbod17/cal  compressratio  1.14x  -
    [root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal
|
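A no-op run like the one above can be caught by reading the property back before starting a multi-hour rewrite. The following is a minimal sketch, assuming `zfs get -H -o value` is used to print just the value; `get_value` and `property_matches` are hypothetical helper names, not part of this PR.

```python
import subprocess


def get_value(dataset, prop):
    """Return the current value of a ZFS property as a string."""
    out = subprocess.run(
        ["zfs", "get", "-H", "-o", "value", prop, dataset],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def property_matches(dataset, prop, expected, getter=get_value):
    """True only if the property already has the value we intend to rewrite with."""
    return getter(dataset, prop) == expected
```

A wrapper script could call `property_matches("jbod17/cal", "compression", "off")` and refuse to start `zfs rewrite` when it returns false.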
It ran faster this time and actually removed compression,

    [root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal

    real    48m48.924s
    user    0m2.229s
    sys     6m59.283s
    [root@zfsarchive1 ~]# zfs get space jbod17/cal
    NAME        PROPERTY              VALUE       SOURCE
    jbod17/cal  name                  jbod17/cal  -
    jbod17/cal  available             910T        -
    jbod17/cal  used                  534G        -
    jbod17/cal  usedbysnapshots       0B          -
    jbod17/cal  usedbydataset         534G        -
    jbod17/cal  usedbyrefreservation  0B          -
    jbod17/cal  usedbychildren        0B          -
    [root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
    NAME        PROPERTY       VALUE  SOURCE
    jbod17/cal  compression    off    local
    jbod17/cal  compressratio  1.00x  -
|
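To compare runs, the `real` value printed by `time` can be converted into an average rate. This is only a rough sketch: it assumes the 473G `usedbydataset` figure from before the run approximates the amount of data processed, and `parse_real` and `avg_mib_per_s` are hypothetical helpers.

```python
import re


def parse_real(text):
    """Convert a `time` value such as '48m48.924s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if m is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def avg_mib_per_s(gib, real):
    """Average rate in MiB/s for `gib` GiB processed in `real` wall-clock time."""
    return gib * 1024 / parse_real(real)
```

With these figures, the 48m48.924s run works out to roughly 165 MiB/s and the earlier 176m32.645s run to roughly 46 MiB/s. The two runs are not directly comparable, because the first one rewrote the data with compression still enabled and with snapshots present.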
Running parallel instances on the other datasets in this test pool appears to have increased the aggregate performance,

    [root@zfsarchive1 ~]# zpool iostat 5
                  capacity     operations     bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    jbod17      21.0T  1.12P  1.21K    156  15.2M  20.1M
    jbod17      21.0T  1.12P  49.4K  3.25K  1.11G  1.40G
    jbod17      21.0T  1.12P  52.7K  2.89K  1.15G  1.26G
    jbod17      21.0T  1.12P  47.6K  3.22K  1.16G  1.43G
    jbod17      21.1T  1.12P  39.5K  2.87K  1.01G  1.24G
    jbod17      21.1T  1.12P  8.55K    626   109M   203M
    jbod17      21.1T  1.12P  8.87K    735   117M   196M
    jbod17      21.1T  1.12P  13.0K  1.16K   182M   417M
    jbod17      21.1T  1.12P  38.5K  4.42K   717M  1.04G
    jbod17      21.1T  1.12P  57.4K  6.41K  1.12G  1.66G
    jbod17      21.1T  1.12P  58.1K  6.45K  1.13G  1.67G
    jbod17      21.1T  1.12P  55.6K  5.14K  1.11G  1.44G
    jbod17      21.1T  1.12P  46.3K  3.75K  1.24G  1.67G
|
Do I understand correctly that while it is safe to interrupt and restart a rewrite there is no state preserved for a restart to resume where it last left off? |
Right. |
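Since no resume state is kept, a restartable run has to track progress in user-space. Below is a minimal sketch of one way to do that, assuming a plain-text state file with one finished path per line (kept outside the tree being rewritten) and one `zfs rewrite` call per file; `rewrite_resumable`, `run` and `should_stop` are hypothetical names.

```python
import os
import subprocess


def rewrite_resumable(root, state_file, run=subprocess.run,
                      should_stop=lambda: False):
    """Rewrite files under root one at a time, logging each finished path.

    Paths already listed in state_file are skipped, so a restarted run
    continues after the last completed file. Returns False if stopped early.
    """
    done = set()
    if os.path.exists(state_file):
        with open(state_file) as f:
            done = {line.rstrip("\n") for line in f}

    with open(state_file, "a") as log:
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                path = os.path.join(dirpath, name)
                if path in done or os.path.islink(path):
                    continue
                if should_stop():  # e.g. the off-hours window has ended
                    return False
                if run(["zfs", "rewrite", path]).returncode == 0:
                    log.write(path + "\n")
                    log.flush()
    return True
```

The sketch does not handle file names containing newlines, and it does not revisit files that were modified after being logged. Starting one process per file also adds overhead compared with a single recursive invocation.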
Since none of the timestamps returned by stat() are updated by zfs rewrite, are there any internal ZFS timestamps that indicate the last time a file (or a block of data in a file) was written? While I am at it, how about last read, e.g., by zfs send or zpool scrub? |
I believe the creation time (or at least birth txg) of a record can be queried with zdb and the right combination of -d -b and -v flags. And since records are atomic and rewritten on modification that should be what you're looking for. Whether that's convenient or practical is another question. Reads are (to my knowledge) not logged at record granularity. EDIT: per file those metrics should show up with less -dd and -vv spamming, but I'm pretty sure at least some of those are what's reported to stat() anyway. |
I would like to suggest an RFE: zfs rewrite could accept time ranges, so that a recursive scan processes only files with records in that range. For example: after changing the compression algorithm for new files and confirming the in situ performance, retroactively changing old files by rewriting those last written before XXX; or a periodic defrag daemon that rewrites files last written before YYY; or backing out a change by rewriting files last written after ZZZ. |
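Until something like this exists, a user-space approximation could select files by modification time, which zfs rewrite leaves untouched. The sketch below is only a stand-in for the idea: mtime is per file rather than per record, and it says nothing about when the blocks were last rewritten. `files_modified_before` is a hypothetical helper.

```python
import os
import stat


def files_modified_before(root, cutoff_epoch):
    """Yield regular files under root whose mtime is older than cutoff_epoch."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Skip symlinks and other non-regular entries.
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff_epoch:
                yield path
```

The resulting paths could then be passed to `zfs rewrite` in batches.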
@stuartthebruce To specify any times we'd at the very least need #16853 to land first. |
Added some tests (not yet tested ;)) and an opaque arg field to the IOCTL's structure to make it even more future-proof. |
(force-pushed from 1e50c20 to 12b0e14)
@tonyhutter Added and passed. |
This allows to rewrite content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also since it is protected by normal range locks, it can be done under any other load. Also it does not affect file's modification time or other properties.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
You'll want to add some input validation to the |
Values beyond the file size are not illegal there. Kernel will rewrite only what is actually there. Can just add a check for non-numeric value, if you prefer. |
We should error out if a user is trying to seek/rewrite past the end of the file. Maybe they accidentally typed in the wrong offset? Also, I'm wondering if we should not allow |
I don't think we should, considering the code was planned to work under concurrent load. We should not fail if the file just got truncated. We achieved our goal by doing nothing.
It may be a weird combination, but again not illegal. I am actually verifying it in the test, just because I can. |
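The behaviour argued for above (a range past end-of-file is not an error, it simply covers whatever is there) amounts to clamping the request to the current file size. The sketch below only illustrates that arithmetic and is not the kernel code; `effective_range` is a hypothetical name, and treating `length=None` as "to end of file" is a convention chosen for this example.

```python
def effective_range(file_size, offset=0, length=None):
    """Return (start, nbytes) actually covered by a rewrite request.

    length=None means "to end of file" in this sketch.
    """
    if offset < 0 or (length is not None and length < 0):
        raise ValueError("offset and length must be non-negative")
    start = min(offset, file_size)
    end = file_size if length is None else min(offset + length, file_size)
    return start, max(0, end - start)
```

A request that starts beyond the end, for instance after a concurrent truncate, yields zero bytes to rewrite rather than a failure.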
These have now finished after 90 hours without any obvious problems,

    [root@zfsarchive1 ~]# zfs list
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    jbod17              16.8T   910T   136K  /jbod17
    jbod17/cal           534G   910T   534G  /jbod17/cal
    jbod17/dqr          4.06T   910T  3.97T  /jbod17/dqr
    jbod17/grb.exttrig  4.99T   910T  4.98T  /jbod17/grb.exttrig
    jbod17/idq          2.51T   910T  2.51T  /jbod17/idq
    jbod17/pe.o4        4.73T   910T  4.63T  /jbod17/pe.o4
    [root@zfsarchive1 ~]# parallel 'zfs set compression=off {} && time zfs rewrite -r /{}' ::: jbod17/dqr jbod17/grb.exttrig jbod17/idq jbod17/pe.o4

    real    616m48.074s
    user    0m18.245s
    sys     52m49.744s

    real    1188m48.153s
    user    0m14.149s
    sys     59m4.946s

    real    1544m41.673s
    user    0m34.218s
    sys     77m23.340s

    real    5414m12.459s
    user    2m19.411s
    sys     135m19.536s
    [root@zfsarchive1 ~]# zfs list
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    jbod17              39.6T   887T   136K  /jbod17
    jbod17/cal           534G   887T   534G  /jbod17/cal
    jbod17/dqr          8.63T   887T  4.69T  /jbod17/dqr
    jbod17/grb.exttrig  11.4T   887T  6.40T  /jbod17/grb.exttrig
    jbod17/idq          9.13T   887T  6.62T  /jbod17/idq
    jbod17/pe.o4        9.95T   887T  5.23T  /jbod17/pe.o4
|
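The jump in USED relative to REFER is what one would expect when existing snapshots keep referencing the pre-rewrite blocks. A back-of-the-envelope check with the jbod17/dqr figures above, assuming no child datasets or refreservation so that USED minus REFER approximates the space held by snapshots (`snapshot_held` is a hypothetical helper):

```python
def snapshot_held(used, refer):
    """Space held only by snapshots, assuming no children or refreservation."""
    return round(used - refer, 2)


before = snapshot_held(4.06, 3.97)  # T held by snapshots before the rewrite
after = snapshot_held(8.63, 4.69)   # T held by snapshots after the rewrite
```

Here `before` is 0.09 and `after` is 3.94, which is close to the 3.97T the dataset referenced before the rewrite. That space is released only once those snapshots are destroyed.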
Motivation and Context
For years users have been asking for the ability to rebalance a pool after vdev addition, defragment randomly written files, change some properties of already written files, etc. The closest option would be to either copy and rename a file or send/receive/rename the dataset. Unfortunately all of those options have some downsides.
Description
This change introduces a new zfs rewrite subcommand that allows rewriting the content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also, since it is protected by normal range locks, it can be done under any other load. Also, it does not affect the file's modification time or other properties.
How Has This Been Tested?
Manually tested it on FreeBSD. Linux-specific code is not yet tested.