
[float8] add _auto_filter_for_recipe for float8 training #1319


Open · wants to merge 10 commits into main

Conversation

danielvegamyhre (Contributor) commented Jun 18, 2025

Fixes #1207

Problem

  • float8 rowwise + vanilla TP in torchtitan showed flat performance relative to bfloat16 (see float8 rowwise vanilla TP low throughput #1207).
  • The root-cause analysis in #1207 found that the attention.wk and attention.wv layers were so small that float8 rowwise conversion caused a ~40% slowdown for those layers, nullifying the perf gains from float8 rowwise conversion on the larger linears.
  • This is because the default filter_fqns for float8 model conversion work well for the float8 tensorwise recipe but poorly for the float8 rowwise recipe.

Solution

This has been a footgun for various users as well (including Poolside), so I created an "auto filter" (pytorch/ao#2410) that automatically filters Linears for a given float8 recipe by checking the following criteria:

  1. dims not divisible by 16 (a hardware requirement for float8)
  2. dim sizes below recipe-specific thresholds that may result in worse perf, using simple heuristics based on the recipe perf tables referenced above
  3. the module's FQN matches one of the user-defined filter_fqns

It prevents users from hitting this common footgun while preserving the flexibility to define model-specific filter FQNs (a rough sketch of these checks is shown below).
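
For intuition, here is a minimal sketch of what such a filter function could look like. The helper name make_example_filter_fn and the threshold values are illustrative assumptions for this example only; the actual logic and thresholds live in torchao's _auto_filter_for_recipe (pytorch/ao#2410).

import torch.nn as nn

# Hypothetical per-recipe (K, N) thresholds; the real values come from the
# recipe perf tables referenced above. These numbers are placeholders.
_EXAMPLE_KN_THRESHOLDS = {
    "tensorwise": (512, 512),
    "rowwise": (2048, 2048),
}

def make_example_filter_fn(recipe_name: str, filter_fqns: list[str]):
    """Return a filter_fn(module, fqn) -> bool; True means "convert this Linear"."""
    min_k, min_n = _EXAMPLE_KN_THRESHOLDS[recipe_name]

    def filter_fn(mod: nn.Module, fqn: str) -> bool:
        if not isinstance(mod, nn.Linear):
            return False
        k, n = mod.in_features, mod.out_features
        # 1. dims must be divisible by 16 (float8 hardware requirement)
        if k % 16 != 0 or n % 16 != 0:
            return False
        # 2. skip layers too small to benefit for this recipe
        if k < min_k or n < min_n:
            return False
        # 3. skip any module whose FQN matches a user-defined filter_fqn
        if any(f in fqn for f in filter_fqns):
            return False
        return True

    return filter_fn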

Results

Benchmarks show a ~10% TPS improvement for TP and ~15% TPS improvement for async TP (over bf16 TP baseline).

Benchmark setup: Llama3 70B on 256 H100s with FSDP=32, TP=8, torch.compile, full AC, local batch size 16.

danielvegamyhre (Contributor, Author) commented:

cc @tianyu @vkuzo for review + thoughts on if this would be useful to add as the default module filter for float8 in torchtitan

tianyu-l (Contributor) left a comment:

Sounds good to me. Thank you for the studies and efforts!

Let's also modify the helper message to reflect this change:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/config_manager.py#L504

@@ -25,9 +24,9 @@ class Float8Converter(ModelConverter):
     def __init__(self, job_config: JobConfig, parallel_dims: ParallelDims):
         self.enabled = False

-        float8_config: Float8 = job_config.float8
+        self.float8_config: Float8 = job_config.float8
Contributor:
Having both self.float8_config and self.config sounds confusing.
Can we define self.filter_fn in __init__() so that we don't need self.float8_config or self.filter_fqns?

danielvegamyhre (Contributor, Author) replied Jun 27, 2025:

Makes sense, updated.

from torchao.float8 import _auto_filter_for_recipe

# Mutates the model inplace replacing instances of nn.Linear with Float8Linear
filter_fn = _auto_filter_for_recipe(
tianyu-l (Contributor) commented Jun 25, 2025:

How about MX quantization? Would it also suffer from this issue / benefit from auto filtering?

danielvegamyhre (Contributor, Author) replied:

Probably, but we don't have finalized perf numbers to reference to make an autofilter function for it (like the one added here https://github.com/pytorch/ao/pull/2312/files). We should add an auto filter option like this for mxfp8 once we can though.

vkuzo (Contributor) commented Jun 25, 2025:

I think it's better to have this off by default and make it easy to enable, to keep the defaults dead simple. Some challenges with this filtering are that it is not aware of M, it is not aware of the underlying hardware, and it will behave unexpectedly on the debug model. How about we just make this easy to enable and add documentation recommending enabling it?

danielvegamyhre (Contributor, Author) commented Jun 27, 2025:

> I think it's better to have this off by default and make it easy to enable, to keep the defaults dead simple. Some challenges with this filtering are that it is not aware of M, it is not aware of the underlying hardware, and it will behave unexpectedly on the debug model. How about we just make this easy to enable and add documentation recommending enabling it?

Makes sense. How about this API to enable the auto filter:

torchtitan/train.py ... --float8.filter_fqns="auto_filter"

toml:

[float8]
filter_fqns = ["auto_filter"]

What do you think? This string could theoretically be part of a FQN but I think it's unlikely and we could document it clearly.

docs/float8.md Outdated
@@ -17,6 +17,8 @@ CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_trai
* `--float8.enable_fsdp_float8_all_gather`: cast `Float8Linear.weight` from high precision to float8 before FSDP all-gather so we can communicate in float8 to save bandwidth.
* `--float8.precompute_float8_dynamic_scale_for_fsdp` (optional): communicate AMAX/scales efficiently in a single all-reduce for all parameters instead of doing many small all-reduce for each parameter.
* `--float8.force_recompute_fp8_weight_in_bwd` (optional): force recomputation of fp8 weights during backward pass, preventing unsharded fp8 weights from being saved for backward.
* `--float8.filter_fqns="..."` (optional): a comma separated list of fully qualified names of modules not to convert to float8 training. Example: `--float8.filter_fqns="attention.wk,attention.wv"`. You can determine which layers to convert by looking at the microbenchmarks in the [performance section](https://github.com/pytorch/ao/tree/main/torchao/float8#performance) of the torchao documentation for the float8 recipe you're using.
* **Auto-filter**: use `--float8.filter_fqns="auto_filter"` to enable automatic module filtering, which will automatically not convert linear layers that are not large enough to benefit from float8 training. The thresholds for conversion are based on microbenchmarks measured on NVIDIA H100 GPUs. For best performance, you should still manually filter out layers that are too small to benefit from float8 training.
Contributor:
nit 1: would be good to enable the user to filter out module foo and then filter out other modules with the auto filter
nit 2: would be good to make the flag name more specific, for example auto_filter_low_kn instead of auto_filter. I guess this applies to torchao as well, sorry for not catching in initial review.

danielvegamyhre (Contributor, Author) replied:

> nit 2: would be good to make the flag name more specific, for example auto_filter_low_kn instead of auto_filter. I guess this applies to torchao as well, sorry for not catching in initial review.

Made the name more explicit: auto_filter_small_kn

> nit 1: would be good to enable the user to filter out module foo and then filter out other modules with the auto filter

I agree. I updated it so the API is to include the "auto_filter_small_kn" flag as one of the FQNs rather than the only one; the remaining FQNs specified are still processed as usual for filtering (see the sketch below).
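
To illustrate the resulting behavior, here is a rough sketch of how the converter could dispatch on the sentinel FQN. Only the sentinel name and the import of _auto_filter_for_recipe come from the discussion above; the build_filter_fn helper, the assumed signature of _auto_filter_for_recipe (recipe name plus remaining FQNs), and the fallback filter are illustrative assumptions, not the exact torchtitan code.

from torchao.float8 import _auto_filter_for_recipe

AUTO_FILTER_FLAG = "auto_filter_small_kn"

def build_filter_fn(recipe_name: str, filter_fqns: list[str]):
    # If the sentinel is present, let the auto filter handle the small-K/N
    # checks and pass the remaining user FQNs through so they are still
    # excluded as usual.
    if AUTO_FILTER_FLAG in filter_fqns:
        user_fqns = [f for f in filter_fqns if f != AUTO_FILTER_FLAG]
        return _auto_filter_for_recipe(recipe_name, filter_fqns=user_fqns)

    # Otherwise fall back to plain name-based filtering.
    def filter_fn(mod, fqn: str) -> bool:
        return not any(f in fqn for f in filter_fqns)

    return filter_fn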
