Skip to content

Implement dynamic gang header sizes #17004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

pcd1193182
Copy link
Contributor

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Motivation and Context

ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.

Description

The ZDB changes are first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix of a bug that's been around since 2009; ZDB gang block header printing has been broken since then. The problem is that if you do a zio_read of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.

The way dynamic sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that fact, when reading, claiming, or freeing, the allocated size of a gang header must be able to be determined from no other information than the vdev it was allocated on. We reuse the existing vdev_gang_header_asize for this. We take this minimum space, and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Much of the remaining changes are just tweaking the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: A 128k allocation that gangs on a system with 64k sectors will, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.

How Has This Been Tested?

In addition to the ZTS and zloop, extensive manual tests were performed, verifying that 1) the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed ashift testing occured, as well as the full compatibility matrix of old pools, new pools without the feature enabled running against old and new code.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

zfeature_register(SPA_FEATURE_DYNAMIC_GANG_HEADER,
"com.klarasystems:dynamic_gang_header", "dynamic_gang_header",
"Support for dynamically sized gang headers",
ZFEATURE_FLAG_ACTIVATE_ON_ENABLE, ZFEATURE_TYPE_BOOLEAN, NULL,
Copy link
Member

@amotin amotin Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I also thought about this gang block issue, it also immediately became obvious that it would require read-incompatible pool feature. I am very unhappy about it. It would require updates to at least GRUB and FreeBSD loader to not break booting or careful compatibility settings from every user and distribution. The "activate on enable" makes it even worse, while I bet in many cases it could be avoided, for example for any pool with at least one ashift=9 vdev.

Copy link
Contributor Author

@pcd1193182 pcd1193182 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACTIVATE_ON_ENABLE can certainly be avoided. If nothing else, we could have a check in zio_write_gang_block that only activates it if we get allocations back that let us use a header of size > 512. In theory, we could modify the feature refcount for every single gang block created and freed, but that feels like a bit much, since each time you need to create a dmu_tx_t, assign it to the io's txg, etc. Doing it once you actually write a new-style gang block for the first time and never deactivating it should mean that for most pools, it never gets activated, and for those with vdevs with 512 byte sectors, it also wouldn't get activated.

read-incompatible, on the other hand, is definitely unavoidable. For now at least, people can avoid enabling the feature on their root pools, so the bootloader issue isn't as pressing.

Copy link
Member

@amotin amotin Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"avoid activating the feature" means never running zpool upgrade for existing pools, or setting right compatibility property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out a slightly bigger hammer than expected is needed here, so we definitely don't want to update this for every single gang header. It turns out that 1) I forgot that this featureflag needed the MOS flag and 2) MOS features basically have to be updated from syncing context. This means we have to spin out a synctask to do the job for us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "activate on enabled" is likely risky here, because that would mean someone without upgraded bootcode would work fine until one day they birthed a gang block, and now it doesn't boot.

Explicitly requiring a zpool upgrade to enable the feature feels far safer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"avoid activating the feature" means never running zpool upgrade for existing pools, or setting right compatibility property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.

I think this is possible with zhack, but doing it as a real feature (zpool set feature@name=disabled $pool) does seem useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest version the feature isn't ACTIVATE_ON_ENABLE anymore. In addition, even when it was, you would still need a zpool upgrade to enable the feature before it would get enabled (and therefore activated), right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, although most people blindly zpool upgrade

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 2ff3989 to 9843def Compare February 4, 2025 21:13
@pcd1193182
Copy link
Contributor Author

Here's a prototype of the changes to do variable leaf size allocations and use those for gang allocations.

It has three primary parts. The first is changes to the allocation code to support allocation ranges, instead of specific sizes. The second is the concept of a "preallocated" write zio, which uses the normal write code path but skips parts of it since the allocation has already been done. The third part is changes to the gang creation code to take advantage of both of the above.

I also added some tests for gang related code, since the test suite doesn't have that much in the way of those.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach to variable size allocation, though code may use some polish. But in pre-allocated writes you've lost me. I haven't looked very close yet, but if we allocate space upfront, I wonder why it can't be implemented with existing rewrite zios?

More comments below, but I also wondering if we could implement variable size allocations first within existing gang blocks without introducing the new pool feature. I think it could help significantly by itself without breaking compatibility.

Comment on lines 1776 to 1889
offset = metaslab_block_picker(rt, cursor, size, max_size,
metaslab_df_max_search, found_size);
if (max_size != size && offset == -1) {
align = size & -size;
cursor = &msp->ms_lbas[highbit64(align) - 1];
offset = metaslab_block_picker(rt, cursor, max_size,
max_size, metaslab_df_max_search, found_size);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this part. First you are looking into the area of maximum-sized requests, but allowing for shorter one, which is OK. But then you are looking in the area of shorter requests strictly for a large one, which makes no sense to me. If we don't know what we'll get allocated, all this allocation optimization logic does not make sense. I am generally not sure all this magic with areas makes much sense these days and does not just produce more fragmentation and head seeks, but this added code I just don't understand.

module/zfs/zio.c Outdated
Comment on lines 3227 to 3247
/*
* De-gang case: We got an allocation big enough to
* satisfy the original allocation. Just do that
* instead of ganging.
*/
for (int d = 0; d < BP_GET_NDVAS(zio->io_bp); d++) {
dva_t *dva = &zio->io_bp->blk_dva[d];
metaslab_unalloc_dva(spa,
dva, txg);
metaslab_group_alloc_decrement(spa,
DVA_GET_VDEV(dva), pio, flags,
pio->io_allocator, B_FALSE);
}
metaslab_trace_move(&cio_list, &pio->io_alloc_list);
metaslab_group_alloc_increment_all(spa, bp, pio, flags,
pio->io_allocator);
if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
metaslab_class_throttle_unreserve(mc,
gbh_copies - copies, pio->io_allocator,
pio);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my #17020 I am removing this allocator/throttling accounting. We could merge that first before complicating this.

module/zfs/zio.c Outdated
Comment on lines 3248 to 3264
zio->io_bp = pio->io_bp;
*zio->io_bp = zio->io_bp_orig = *bp;

if (zio->io_abd != NULL)
abd_free(zio->io_abd);
zio->io_orig_abd = zio->io_abd = pio->io_abd;

zio->io_size = zio->io_orig_size = allocated_size;
zio->io_lsize = allocated_size;
zio->io_done = NULL;
zio->io_orig_pipeline = zio->io_pipeline =
(zio->io_pipeline & ~ZIO_GANG_STAGES) |
ZIO_STAGE_WRITE_COMPRESS | ZIO_STAGE_DVA_ALLOCATE;
zio->io_flags |= ZIO_FLAG_PREALLOCATED;
zp.zp_type = zp.zp_storage_type = pio->io_prop.zp_type;
zp.zp_level = pio->io_prop.zp_level;
zio->io_prop = zp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are gang blocks with only one child illegal? Because this mess with live zio conversion from rewrite to regular write IMHO should be. Alternatively I think could be cleaner to replace the rewrite zio pipeline with ZIO_INTERLOCK_STAGES and let it quietly complete without doing anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One-child gang blocks aren't illegal, but they are pretty gross. I didn't think this code was too bad, but I agree it's definitely not very clean. I'm fine stripping that out if people don't think it has value. I think just replacing it with the INTERLOCK stages will probably break the throttle accounting, but that will all change once your updated space accounting lands, so you may have a better idea about than I do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about this more, it might actually work fine with a few tweaks. Let me try it out and see.

@pcd1193182
Copy link
Contributor Author

Rather than implementing disabling features, which is a little tricky, I am playing with a lighter-weight solution here. This PR adds the idea of a feature that isn't enabled by upgrade by default. You can still enable it manually, and if it's in your compat file it will be enabled. There are some quirks here; I'm not sure exactly how to handle the code in upgrade that prints all un-enabled features; should it display this one, even though it won't be added by upgrade? Should it skip this one, even though it isn't enabled? For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features.

@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Mar 8, 2025
@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from f85a90b to 7a9513d Compare April 1, 2025 20:31
@pcd1193182
Copy link
Contributor Author

This is a rebased version of the changes; the first commit is the entirety of #17111 squashed down. I haven't run the tests or anything, just trying to keep this from getting too far out of sync with master.

@allanjude
Copy link
Contributor

re:

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read?

@pcd1193182
Copy link
Contributor Author

Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read?

I believe we only track the TXG at which the feature was enabled. Because we use the small gang header unless we can't fit the necessary leaves, we can't just rely on that. A gang block created after the feature was enabled might use the small size, in order to avoid activating the feature unnecessarily.

If we had activation_txg as well, we would also need to tweak the code to always use the larger header once the feature is active. Not a big change, but worth noting.

When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.

We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.

Signed-off-by: Paul Dagnelie <[email protected]>
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Work in Progress Not yet ready for general review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants