Introduce zfs rewrite subcommand #17246

amotin · 2025-04-15T20:33:49Z

Motivation and Context

For years users were asking for an ability to re-balance pool after vdev addition, de-fragment randomly written files, change some properties for already written files, etc. The closest option would be to either copy and rename a file or send/receive/rename the dataset. Unfortunately all of those options have some downsides.

Description

This change introduces new zfs rewrite subcommand, that allows to rewrite content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also since it is protected by normal range range locks, it can be done under any other load. Also it does not affect file's modification time or other properties.

How Has This Been Tested?

Manually tested it on FreeBSD. Linux-specific code is not yet tested.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin · 2025-04-15T20:53:03Z

I've tried to find some kernel APIs to wire this to, but found that plenty of Linux file systems each implement their own IOCTL's for similar purposes. I did the same, except the IOCTL number I took almost arbitrary, since ZFS seems quite rough in this area. I am open to any better ideas before this is committed.

module/zfs/zfs_vnops.c

HPPinata · 2025-04-15T23:34:16Z

This looks amazing! Not having to sift through half a dozen shell scripts every time this comes up to see what currently handles the most edge cases correctly is very much appreciated. Especially with RaidZ expansion, being able to direct users to run a built-in command instead of debating what script to send them to would be very nice.

Also being able to reliably rewrite a live dataset while it's in use without having to worry about skipped files or mtime conflicts would make the whole process much less of a hassle. With the only thing to really worry about being snapshots/space usage this seems as close to perfect as reasonably possible (without diving deep into internals and messing with snapshot immutability). Bravo!

clhedrick · 2025-04-16T14:20:08Z

thank you. Fixes one of the biggest problems with ZFS.

Is there a way to suspend the process? It might be nice to have it run only during off hours.

amotin · 2025-04-16T14:41:06Z

Is there a way to suspend the process? It might be nice to have it run only during off hours.

It does one file at a time, and should be killable in between. Signal handling within one huge file can probably be added. Though the question of the process restart is on the user. I didn't plan to go that deep into the area within this PR.

clhedrick · 2025-04-16T20:14:16Z

I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense how long this is gong to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file system level, that migth be the case here as well.

stuartthebruce · 2025-04-16T20:29:35Z

I guess you could feed it with a "find" command.

This will likely be a good use case for GNU Parallel.

HPPinata · 2025-04-16T20:49:48Z

I couldn't find documentation in the files changed, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense how long this is gong to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file system level, that migth be the case here as well.

It can take a directory as an argument and there are some recursive functions and iterators in the code so piping find into it should not be necessary. That avoids some userspace file handling overhead, but it still has to go through the contents of each directory one file at a time. I also don't see any parallel execution or threading (though I'm not too familiar with ZFS internals, maybe some of the primitives used here run asynchronously?).

Whether doing parallelism in userspace by just calling it for many files/directories at once or not it should have the required locking to just run in the background and be significantly more elegant than the CP + mtime (or potentially userspace hash) check to make sure files didn't change during the copy process avoiding one of the potential pitfalls of existing solutions.

amotin · 2025-04-16T21:26:27Z

I haven't benchmarked it deep yet, but unless the files are tiny, I don't expect there is a major need for parallelism. The code in kernel should handle up to 16MB at a time, plus allows ZFS to do read-ahead and write-back on top of that, so there will be quite a lot in the pipeline to saturate the disks and/or the system, especially if there is some compression/checksuming/encryption. And without need to copy data to/from user-space, the only thread will not be doing too much, I think mostly a decompression from ARC. Bunch of small files on a wide HDD pool I suspect may indeed suffer from read latency, but that in user-space we can optimize/parallelize all day long.

tonyhutter · 2025-04-16T21:31:18Z

I gave this a quick test. It's very fast and does exactly what it says 👍

# Copy ZFS source workspace to pool with compression=off
$ time cp -a ~/zfs /tank2

real	0m0.600s
user	0m0.032s
sys	0m0.519s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  893M  8.4G  10% /tank2


# Set compression to 'gzip' and rewrite
$ sudo ./zfs set compression=gzip tank2
$ time sudo ./zfs rewrite -r /tank2

real	0m2.272s
user	0m0.005s
sys	0m0.005s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  402M  8.9G   5% /tank2


# Set compression to 'lz4' and rewrite
$ sudo ./zfs set compression=lz4 tank2
$ time sudo ./zfs rewrite -r /tank2
real	0m1.947s
user	0m0.002s
sys	0m0.010s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  456M  8.8G   5% /tank2


# Set compression to 'zstd' and rewrite
$ sudo ./zfs set compression=zstd tank2
$ time sudo ./zfs rewrite -r /tank2

real	0m0.616s
user	0m0.003s
sys	0m0.006s

$ df -h /tank2
Filesystem      Size  Used Avail Use% Mounted on
tank2           9.3G  366M  8.9G   4% /tank2

I can already see people writing scripts that go though every dataset, setting the optimal compression, recordsize, etc, and zfs rewrite-ing them.

amotin · 2025-04-16T21:42:51Z

Cool! Though the recordsize is one of things it can't change, since it would requite real byte-level copy, not just marking existing blocks dirty. I am not sure it can be done under the load in general. At least it would be much more complicated.

snajpa · 2025-04-17T16:58:22Z

Umm this is basically same as doing send | recv, isn't it? I mean, in a way, this is already possible to do without any changes, isn't it? Recv will even respect a lower recordsize, if I'm not mistaken - at least when receiving into a pool without large blocks support, it has to do that.

I'm thinking whether we can do better, in the original sense of ZFS "better", meaning "automagic" - what do you think of using snapshots, send|recv, in a loop with ever decreasing delta size and then when the delta isn't decreasing anymore, we could swap those datasets and use (perhaps slightly modified) zfs_resume_fs transparently to the userspace... that way we would get transparent migration into a dataset with different options, that would scratch some itches for people, wouldn't it?

It'd be even cooler if it could coalesce smaller blocks into larger ones, but that potentially implies performance problems with write amplification, I would say if the app writes in smaler chunks that it gets onto disk in such smaller chunks, it's probably for the best to leave them that way. For any practical use-case I could think of though, I would definitely appreciate the ability to split the blocks of a dataset using smaller recordsize.

If there's a way how to make zfs rewrite more automagical, I think it's at least worth considering.

HPPinata · 2025-04-17T17:15:06Z

send recv has the huge downside of requiring 2x the space, even if you do the delta size thing since it has to send the entire dataset at least once and old data can't be deleted until the new dataset is complete.
Also recv doesn't increase block sizes, it only splits them if they are larger than the other pool supports (and iirc. there have even been some issues with that).
Also that idea sounds a lot more complex than simply walking the directory tree and iterating through the files to mark their records as dirty to cause a rewrite.

we would get transparent migration into a dataset with different options, that would scratch some itches for people, wouldn't it?

Isn't this exactly what rewrite does? Change the options, run it and all the blocks are changed in the background. Without an application even seeing a change to the file. And unlike send recv it only needs a few MB of extra space.

Edit: with the only real exception being record size, but recv also solves that only partially at best and it doesn't look like there's a reasonable way to work around that in a wholly transparent fashion.

amotin · 2025-04-19T01:40:34Z

Added -x flag to not cross mount points.
Added signal handling in kernel.
Added man page.

stuartthebruce · 2025-04-19T23:20:48Z

Which release is this game changing enhancement likely to land in?

amotin · 2025-04-20T00:33:04Z

@stuartthebruce So far it haven't landed even in master, so anybody who want to speed it up is welcome to test and comment. In general though, when completed, there is no reason why aside of 2.4.0 it can't be ported back to some 2.3.x of the time.

stuartthebruce · 2025-04-20T00:41:01Z

@stuartthebruce So far it haven't landed even in master, so anybody who want to speed it up is welcome to test and comment. In general though, when completed, there is no reason why aside of 2.4.0 it can't be ported back to some 2.3.x of the time.

Good to know there are no obvious blockers from including in a future 2.3.x. Once this hits master I will help by setting up a test system with 1/2PB of 10^9 small files to see if I can break it. Is there any reason to think the code will be sensitive to Linux vs FreeBSD?

amotin · 2025-04-20T00:46:04Z

Is there any reason to think the code will be sensitive to Linux vs FreeBSD?

IOCTL interface of the kernels is obviously slightly different, requiring OS-specific shims, as with most of other VFS-related code. But seems like not a big problem, as Tony confirmed it works on Linux too from the first try.

amotin · 2025-04-20T00:50:37Z

Once this hits master

Since this introduces new IOCTL API, I'd appreciate some feedback before it hit master in case some desired functionality might require API changes aside of the flags field I already reserved for later extensions. I was thinking about some options to not rewrite in some cases, but didn't want to pollute the code until I am convinced it is required.

stuartthebruce · 2025-04-20T01:00:02Z

Since this introduces new IOCTL API, I'd appreciate some feedback before it hit master in case some desired functionality might require API changes aside of the flags field I already reserved for later extensions. I was thinking about some options to not rewrite in some cases, but didn't want to pollute the code until I am convinced it is required.

OK, I will see if I can find some time this next week to stress test.

stuartthebruce · 2025-04-26T23:32:07Z

OK, I will see if I can find some time this next week to stress test.

I started testing with a copy of the following home directories on an otherwise idle RL9 test system initially running version 2.3.1,

[root@zfsarchive1 ~]# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
jbod17              16.8T   910T   136K  /jbod17
jbod17/cal           478G   910T   473G  /jbod17/cal
jbod17/dqr          4.06T   910T  3.97T  /jbod17/dqr
jbod17/grb.exttrig  4.99T   910T  4.98T  /jbod17/grb.exttrig
jbod17/idq          2.51T   910T  2.51T  /jbod17/idq
jbod17/pe.o4        4.73T   910T  4.63T  /jbod17/pe.o4

and then updated to,

[root@zfsarchive1 zfs]# git log
commit c5f4413446609a2fd3c4b817f7ad7ebceb915991 (HEAD -> rewrite, origin/rewrite)
Author: Alexander Motin 
Date:   Tue Apr 15 16:07:05 2025 -0400

    Introduce zfs rewrite subcommand
    
    This allows to rewrite content of specified file(s) as-is without
...

The initial test is to remove compression from one of these datasets with 2M files,

[root@zfsarchive1 ~]# zfs get space jbod17/cal
NAME        PROPERTY              VALUE          SOURCE
jbod17/cal  name                  jbod17/cal     -
jbod17/cal  available             910T           -
jbod17/cal  used                  478G           -
jbod17/cal  usedbysnapshots       4.67G          -
jbod17/cal  usedbydataset         473G           -
jbod17/cal  usedbyrefreservation  0B             -
jbod17/cal  usedbychildren        0B             -

[root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
NAME        PROPERTY       VALUE           SOURCE
jbod17/cal  compression    on              inherited from jbod17
jbod17/cal  compressratio  1.14x           -

[root@zfsarchive1 ~]# df -i /jbod17/cal
Filesystem            Inodes   IUsed         IFree IUse% Mounted on
jbod17/cal     1954605040973 1985661 1954603055312    1% /jbod17/cal

[root@zfsarchive1 ~]# set compression=off jbod17/cal

[root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal
...

This is currently processing the files at ~100MB/s,

[root@zfsarchive1 ~]# zpool iostat jbod17 5
              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
jbod17      20.8T  1.12P  7.28K    367  84.3M   103M
jbod17      20.8T  1.12P  7.28K    494  70.4M   181M
jbod17      20.8T  1.12P  7.41K    387  87.1M  84.7M
jbod17      20.8T  1.12P  9.09K    440  91.8M   117M
jbod17      20.8T  1.12P  7.21K    434  75.4M   110M
jbod17      20.8T  1.12P  12.8K    515   137M   133M
jbod17      20.8T  1.12P  18.5K    996   200M   280M

with the zfs process using ~15% of a cpu-core

[root@zfsarchive1 ~]# top

top - 16:25:06 up 13 min,  3 users,  load average: 2.62, 2.41, 1.85
Tasks: 2395 total,   2 running, 2393 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.7 sy,  0.0 ni, 97.5 id,  0.7 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem : 385934.4 total, 306918.4 free,  78558.8 used,   3688.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 307375.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                 
  56137 root      20   0   16352   5120   5120 D  14.1   0.0   0:54.92 zfs                                                                                                                                                     
  41603 root       0 -20       0      0      0 S   1.6   0.0   0:03.67 z_wr_int_9                                                                                                                                              
  42828 root       0 -20       0      0      0 S   1.6   0.0   0:03.61 z_wr_int_0

and perf showing,

Samples: 95K of event 'cycles:P', 4000 Hz, Event count (approx.): 46282639308 lost: 0/0 drop: 0/0
  Children      Self  Shared Object               Symbol
+   76.99%     0.72%  [kernel]                    [k] taskq_thread
+   57.02%     0.30%  [kernel]                    [k] zio_execute
+   47.12%     0.25%  [kernel]                    [k] _raw_spin_lock_irqsave
+   46.17%    46.07%  [kernel]                    [k] native_queued_spin_lock_slowpath
+   39.39%     0.00%  [kernel]                    [k] ret_from_fork
+   39.38%     0.00%  [kernel]                    [k] kthread
+   35.26%     0.59%  [kernel]                    [k] zio_done
+   30.81%     0.26%  [kernel]                    [k] taskq_dispatch_ent
+   12.70%     0.07%  [kernel]                    [k] zio_vdev_io_start
+   12.65%     0.08%  [kernel]                    [k] zio_nowait
+   10.83%     0.02%  [kernel]                    [k] vdev_mirror_io_start
+    7.65%     0.01%  [kernel]                    [k] entry_SYSCALL_64_after_hwframe
+    7.64%     0.04%  [kernel]                    [k] do_syscall_64
+    7.61%     0.03%  [kernel]                    [k] vdev_raidz_io_start
+    7.50%     0.04%  [kernel]                    [k] zio_write_compress
+    7.29%     0.02%  [kernel]                    [k] zfs_lz4_compress
...

stuartthebruce · 2025-04-27T19:33:33Z

I started testing with a copy of the following home directories on an otherwise idle RL9 test system initially running version 2.3.1,

That finished without crashing,

[root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal
                                                            
real    176m32.645s
user    0m3.518s
sys     8m56.518s

And resulted in,

[root@zfsarchive1 ~]# zfs get space jbod17/cal
NAME        PROPERTY              VALUE          SOURCE
jbod17/cal  name                  jbod17/cal     -
jbod17/cal  available             910T           -
jbod17/cal  used                  950G           -
jbod17/cal  usedbysnapshots       477G           -
jbod17/cal  usedbydataset         473G           -
jbod17/cal  usedbyrefreservation  0B             -
jbod17/cal  usedbychildren        0B             -

[root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
NAME        PROPERTY       VALUE           SOURCE
jbod17/cal  compression    on              inherited from jbod17
jbod17/cal  compressratio  1.14x           -

[root@zfsarchive1 ~]# df -i /jbod17/cal
Filesystem            Inodes   IUsed         IFree IUse% Mounted on
jbod17/cal     1953610773572 1985661 1953608787911    1% /jbod17/cal

After destroying all of the unmodified snapshots,

[root@zfsarchive1 ~]# zfs list -H -o name -t snapshot jbod17/cal  | xargs --no-run-if-empty -n1 zfs destroy

[root@zfsarchive1 ~]# zfs get space jbod17/cal
NAME        PROPERTY              VALUE          SOURCE
jbod17/cal  name                  jbod17/cal     -
jbod17/cal  available             910T           -
jbod17/cal  used                  473G           -
jbod17/cal  usedbysnapshots       0B             -
jbod17/cal  usedbydataset         473G           -
jbod17/cal  usedbyrefreservation  0B             -
jbod17/cal  usedbychildren        0B             -

[root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
NAME        PROPERTY       VALUE           SOURCE
jbod17/cal  compression    on              inherited from jbod17
jbod17/cal  compressratio  1.14x           -

Which appears to have been a big NOOP. It appears I fat fingered something and didn't actually hit return on the command to set compression=off, which is confirmed from zpool history, so trying again, albeit this time with no snapshots in this dataset,

[root@zfsarchive1 ~]# zfs set compression=off jbod17/cal

[root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
NAME        PROPERTY       VALUE           SOURCE
jbod17/cal  compression    off             local
jbod17/cal  compressratio  1.14x           -

[root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal

stuartthebruce · 2025-04-27T21:22:00Z

Which appears to have been a big NOOP. It appears I fat fingered something and didn't actually hit return on the command to set compression=off, which is confirmed from zpool history, so trying again, albeit this time with no snapshots in this dataset,

It ran faster this time and actually removed compression,

[root@zfsarchive1 ~]# time zfs rewrite -r /jbod17/cal
                                                                                         

real    48m48.924s
user    0m2.229s
sys     6m59.283s

[root@zfsarchive1 ~]# zfs get space jbod17/cal
NAME        PROPERTY              VALUE          SOURCE
jbod17/cal  name                  jbod17/cal     -
jbod17/cal  available             910T           -
jbod17/cal  used                  534G           -
jbod17/cal  usedbysnapshots       0B             -
jbod17/cal  usedbydataset         534G           -
jbod17/cal  usedbyrefreservation  0B             -
jbod17/cal  usedbychildren        0B             -

[root@zfsarchive1 ~]# zfs get compression,compressratio jbod17/cal
NAME        PROPERTY       VALUE           SOURCE
jbod17/cal  compression    off             local
jbod17/cal  compressratio  1.00x           -

stuartthebruce · 2025-04-27T21:33:44Z

Running parallel instances on the other datasets in this test pool appears to have increased the aggregate performance,

[root@zfsarchive1 ~]# zpool iostat 5
              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
jbod17      21.0T  1.12P  1.21K    156  15.2M  20.1M
jbod17      21.0T  1.12P  49.4K  3.25K  1.11G  1.40G
jbod17      21.0T  1.12P  52.7K  2.89K  1.15G  1.26G
jbod17      21.0T  1.12P  47.6K  3.22K  1.16G  1.43G
jbod17      21.1T  1.12P  39.5K  2.87K  1.01G  1.24G
jbod17      21.1T  1.12P  8.55K    626   109M   203M
jbod17      21.1T  1.12P  8.87K    735   117M   196M
jbod17      21.1T  1.12P  13.0K  1.16K   182M   417M
jbod17      21.1T  1.12P  38.5K  4.42K   717M  1.04G
jbod17      21.1T  1.12P  57.4K  6.41K  1.12G  1.66G
jbod17      21.1T  1.12P  58.1K  6.45K  1.13G  1.67G
jbod17      21.1T  1.12P  55.6K  5.14K  1.11G  1.44G
jbod17      21.1T  1.12P  46.3K  3.75K  1.24G  1.67G

stuartthebruce · 2025-04-27T21:35:35Z

Do I understand correctly that while it is safe to interrupt and restart a rewrite there is no state preserved for a restart to resume where it last left off?

module/zfs/zfs_vnops.c

amotin · 2025-04-28T01:57:51Z

Do I understand correctly that while it is safe to interrupt and restart a rewrite there is no state preserved for a restart to resume where it last left off?

Right.

stuartthebruce · 2025-04-28T20:18:09Z

Do I understand correctly that while it is safe to interrupt and restart a rewrite there is no state preserved for a restart to resume where it last left off?

Right.

Since none of the timstamps returned by stat() are updated be zfs rewrite are there any internal ZFS timestamps that indicate the last time a file (or a block of data in a file) was last written?

While I am at it, how about last read, e.g., by zfs send or zpool scrub ?

HPPinata · 2025-04-28T20:28:50Z

Do I understand correctly that while it is safe to interrupt and restart a rewrite there is no state preserved for a restart to resume where it last left off?

Right.

Since none of the timstamps returned by stat() are updated be zfs rewrite are there any internal ZFS timestamps that indicate the last time a file (or a block of data in a file) was last written?

While I am at it, how about last read, e.g., by zfs send or zpool scrub ?

I believe the creation time (or at least birth txg) of a record can be queried with zdb and the right combination of -d -b and -v flags. And since records are atomic and rewritten on modification that should be what you're looking for. Whether that's convenient or practical is another question. Reads are (to my knowledge) not logged at record granularity.

EDIT: per file those metrics should show up with less -dd and -vv spamming, but I'm pretty sure at least some of those are what's reported to stat() anyway.

stuartthebruce · 2025-04-28T20:46:09Z

I believe the creation time (or at least birth txg) of a record can be queried with zdb and the right combination of -d -b and -v flags.

I would like to suggest an interesting RFE for zfs rewrite would be to accept time ranges to allow recursive scanning to process only files with records in that range. For example: after changing compression algorithm for new files and confirming from in situ performance to retroactively change old files by re-writing those last written before XXX, or a periodic defrag daemon that rewrites files last written before YYY, or wanting to back out a change and re-write files last written after ZZZ.

amotin · 2025-04-29T18:31:47Z

@stuartthebruce To specify any times we'd at very least need #16853 to land first.

amotin · 2025-04-29T18:54:01Z

Added some tests (not yet tested ;)) and opaque arg field into the IOCTLs structure to make it even more future-proof.

amotin · 2025-05-01T01:10:18Z

Test cases would be good

@tonyhutter Added and passed.

cmd/zfs/zfs_main.c

This allows to rewrite content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also since it is protected by normal range range locks, it can be done under any other load. Also it does not affect file's modification time or other properties. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.

tonyhutter · 2025-05-01T17:06:02Z

You'll want to add some input validation to the zfs rewrite test case. I was able to pass strings instead of numbers to -o|-l and was also able to pass offsets/lengths beyond the file size.

amotin · 2025-05-01T17:16:27Z

You'll want to add some input validation to the zfs rewrite test case. I was able to pass strings instead of numbers to -o|-l and was also able to pass offsets/lengths beyond the file size.

Values beyond the file size are not illegal there. Kernel will rewrite only what is actually there. Can just add a check for non-numeric value, if you prefer.

tonyhutter · 2025-05-01T17:46:47Z

Values beyond the file size are not illegal there. Kernel will rewrite only what is actually there.

We should error out if a user is trying to seek/rewrite pass the end of the file. Maybe they accidentally typed in the wrong offset? dd will print an error if you pass a skip value past the end of an input file, for example.

Also, I'm wondering if we should not allow -r with -o|-l. It's likely a mistake if someone is doing a recursive rewrite with -o|-l, and we should give that feedback to the user. We may also want to error out if they pass [-o|-l] with a directory.

amotin · 2025-05-01T17:52:31Z

Values beyond the file size are not illegal there. Kernel will rewrite only what is actually there.

We should error out if a user is trying to seek/rewrite pass the end of the file. Maybe they accidentally typed in the wrong offset? dd will print an error if you pass a skip value past the end of an input file, for example.

I don't think we should, considering the code was planned to work under concurrent load. We should not fail if the file just got truncated. We achieved our goal by doing nothing.

Also, I'm wondering if we should not allow -r with -o|-l. It's likely a mistake if someone is doing a recursive rewrite with -o|-l, and we should give that feedback to the user. We may also want to error out if they pass [-o|-l] with a directory.

It may be a weird combination, but again not illegal. I am actually verifying it in the test, just because I can.

stuartthebruce · 2025-05-01T17:56:58Z

Running parallel instances on the other datasets in this test pool appears to have increased the aggregate performance,

These have now finished after 90 hours without any obvious problems,

[root@zfsarchive1 ~]# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
jbod17              16.8T   910T   136K  /jbod17
jbod17/cal           534G   910T   534G  /jbod17/cal
jbod17/dqr          4.06T   910T  3.97T  /jbod17/dqr
jbod17/grb.exttrig  4.99T   910T  4.98T  /jbod17/grb.exttrig
jbod17/idq          2.51T   910T  2.51T  /jbod17/idq
jbod17/pe.o4        4.73T   910T  4.63T  /jbod17/pe.o4

[root@zfsarchive1 ~]# parallel 'zfs set compression=off {} && time zfs rewrite -r /{}' ::: jbod17/dqr jbod17/grb.exttrig jbod17/idq jbod17/pe.o4                                                                    

real    616m48.074s
user    0m18.245s
sys     52m49.744s

real    1188m48.153s
user    0m14.149s
sys     59m4.946s

real    1544m41.673s
user    0m34.218s
sys     77m23.340s

real    5414m12.459s
user    2m19.411s
sys     135m19.536s

[root@zfsarchive1 ~]# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
jbod17              39.6T   887T   136K  /jbod17
jbod17/cal           534G   887T   534G  /jbod17/cal
jbod17/dqr          8.63T   887T  4.69T  /jbod17/dqr
jbod17/grb.exttrig  11.4T   887T  6.40T  /jbod17/grb.exttrig
jbod17/idq          9.13T   887T  6.62T  /jbod17/idq
jbod17/pe.o4        9.95T   887T  5.23T  /jbod17/pe.o4

github-actions bot added the Status: Work in Progress Not yet ready for general review label Apr 15, 2025

amotin force-pushed the rewrite branch from e6a1719 to eec53cf Compare April 15, 2025 20:48

gmelikov reviewed Apr 15, 2025

View reviewed changes

module/zfs/zfs_vnops.c Outdated Show resolved Hide resolved

amotin force-pushed the rewrite branch from eec53cf to 90451a8 Compare April 16, 2025 02:28

amotin added the Status: Design Review Needed Architecture or design is under discussion label Apr 16, 2025

amotin force-pushed the rewrite branch from 90451a8 to a846f69 Compare April 19, 2025 01:36

amotin force-pushed the rewrite branch 4 times, most recently from d23a371 to c5f4413 Compare April 19, 2025 22:49

amotin marked this pull request as ready for review April 20, 2025 20:39

adamdmoss reviewed Apr 27, 2025

View reviewed changes

module/zfs/zfs_vnops.c Show resolved Hide resolved

amotin force-pushed the rewrite branch from 4bc5586 to bc0afc1 Compare April 29, 2025 18:40

amotin removed the Status: Design Review Needed Architecture or design is under discussion label Apr 29, 2025

amotin force-pushed the rewrite branch 7 times, most recently from 1e50c20 to 12b0e14 Compare April 30, 2025 18:40

tonyhutter reviewed May 1, 2025

View reviewed changes

cmd/zfs/zfs_main.c Outdated Show resolved Hide resolved

amotin force-pushed the rewrite branch from 12b0e14 to 35cbff7 Compare May 1, 2025 16:30

Introduce zfs rewrite subcommand #17246

Are you sure you want to change the base?

Introduce zfs rewrite subcommand #17246

Conversation

amotin commented Apr 15, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

amotin commented Apr 15, 2025

HPPinata commented Apr 15, 2025

clhedrick commented Apr 16, 2025

amotin commented Apr 16, 2025 • edited Loading

clhedrick commented Apr 16, 2025

stuartthebruce commented Apr 16, 2025

HPPinata commented Apr 16, 2025

amotin commented Apr 16, 2025

tonyhutter commented Apr 16, 2025 • edited Loading

amotin commented Apr 16, 2025

snajpa commented Apr 17, 2025 • edited Loading

HPPinata commented Apr 17, 2025 • edited Loading

amotin commented Apr 19, 2025

stuartthebruce commented Apr 19, 2025

amotin commented Apr 20, 2025

stuartthebruce commented Apr 20, 2025

amotin commented Apr 20, 2025

amotin commented Apr 20, 2025

stuartthebruce commented Apr 20, 2025

stuartthebruce commented Apr 26, 2025

stuartthebruce commented Apr 27, 2025

stuartthebruce commented Apr 27, 2025

stuartthebruce commented Apr 27, 2025

stuartthebruce commented Apr 27, 2025

amotin commented Apr 28, 2025

stuartthebruce commented Apr 28, 2025

HPPinata commented Apr 28, 2025 • edited Loading

stuartthebruce commented Apr 28, 2025

amotin commented Apr 29, 2025

amotin commented Apr 29, 2025

amotin commented May 1, 2025

tonyhutter commented May 1, 2025

amotin commented May 1, 2025

tonyhutter commented May 1, 2025

amotin commented May 1, 2025

stuartthebruce commented May 1, 2025

amotin commented Apr 16, 2025 •

edited

Loading

tonyhutter commented Apr 16, 2025 •

edited

Loading

snajpa commented Apr 17, 2025 •

edited

Loading

HPPinata commented Apr 17, 2025 •

edited

Loading

HPPinata commented Apr 28, 2025 •

edited

Loading