Can FP8 GEMM be enabled via module hooks instead of module swapping? #1887

zigzagcai opened this issue Mar 14, 2025 · 7 comments

Hi developers,

Thanks for such a great project!

I want to integrate torchao FP8 GEMM into our training framework. However, in our framework the linear layers are defined in customized modules (where we implement Tensor Parallel or ZeRO3 weight parallel), so it is hard to swap them directly with torchao's Float8Linear.

So, can FP8 GEMM be enabled in a friendlier way, such as via module hooks? Module swapping is not very flexible for our setup.

@danielvegamyhre (Contributor) commented Mar 14, 2025

For float8 training, module swap is currently the only supported method.

However, you could implement a module hook which uses our float8 quantization primitives. Then in the forward/backward pass, any mm/matmul ops operating on these float8 tensors would be handled by torch._scaled_mm, which dispatches to cublas (for tensorwise scales) or cutlass (for rowwise scales) for the actual GEMMs.
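To make that concrete, here is a minimal sketch of what the float8 GEMM path boils down to, using plain PyTorch. The `to_float8` / `fp8_linear` helpers are hypothetical names (torchao's float8 quantization primitives would replace the hand-rolled casting), and the exact `torch._scaled_mm` signature and return type vary across PyTorch releases, so treat this as illustrative only:

```python
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

def to_float8(t: torch.Tensor):
    # Tensorwise dynamic scaling: map the tensor's max magnitude onto the e4m3 range.
    scale = E4M3_MAX / t.abs().max().clamp(min=1e-12)
    t_fp8 = (t * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    # _scaled_mm expects the dequantization scale (reciprocal of the multiplier above).
    return t_fp8, scale.reciprocal().float()

def fp8_linear(x: torch.Tensor, w: torch.Tensor, bias=None):
    # x: (M, K) activations, w: (N, K) weight, both cast to float8 on the fly.
    # Assumes a recent PyTorch where _scaled_mm returns a single tensor and the
    # shapes satisfy its alignment requirements (dims divisible by 16).
    x_fp8, x_scale = to_float8(x)
    w_fp8, w_scale = to_float8(w)
    # The second operand must be column-major, hence the transpose of the row-major weight.
    out = torch._scaled_mm(
        x_fp8, w_fp8.t(),
        scale_a=x_scale, scale_b=w_scale,
        out_dtype=torch.bfloat16,
    )
    return out if bias is None else out + bias
```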

We could potentially provide such a hook in torchao, though it is not currently planned. It could be useful for use cases like this. @vkuzo @drisspg any thoughts on this?

@vkuzo (Contributor) commented Mar 14, 2025

We started with module swapping because it's the easiest way to iterate on performance/accuracy/usability. If we can cover more important use cases with alternate UX options, I'm in favor.

@zigzagcai, a couple of questions for you.

  1. are you using torch.compile / are you open to using torch.compile
  2. could you share a pointer to your code if you have it

From what you have shared so far, a tensor subclass weight wrapper sounds like the right UX, but would be good to see the callsites to confirm that.
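To illustrate what such a weight wrapper could look like, here is a rough, illustrative sketch (not a torchao class) of a tensor subclass that intercepts F.linear on the wrapped weight and reroutes it through a float8 matmul such as the `fp8_linear` helper sketched above; autograd, serialization, and distributed details are ignored:

```python
import torch
import torch.nn.functional as F

class Float8WeightWrapper(torch.Tensor):
    # Illustrative only: wraps a high-precision weight and reroutes F.linear
    # through a float8 GEMM (e.g. an fp8_linear helper as sketched earlier).

    @staticmethod
    def __new__(cls, weight: torch.Tensor):
        return torch.Tensor._make_wrapper_subclass(
            cls, weight.shape, dtype=weight.dtype, device=weight.device,
            requires_grad=weight.requires_grad,
        )

    def __init__(self, weight: torch.Tensor):
        self._weight = weight

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is F.linear:
            x, w = args[0], args[1]
            bias = args[2] if len(args) > 2 else kwargs.get("bias")
            return fp8_linear(x, w._weight, bias)  # the float8 path
        # Every other op falls back to the default behavior.
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)
```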

@zigzagcai (Author) commented Mar 17, 2025

> @zigzagcai, a couple of questions for you.
>
>   1. are you using torch.compile / are you open to using torch.compile
>   2. could you share a pointer to your code if you have it

Hi @vkuzo @danielvegamyhre

Thanks for your follow-up!

  1. We are open to using torch.compile.
  2. The code of our ZeRO3 weight parallel implementation is here: https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L171-L315

The basic idea of our ZeRO3 weight parallel implementation: in WPFusedDenseFunc, we all-gather the weights in the forward pass, then all-gather the weights again and reduce-scatter the gradients in the backward pass (roughly as sketched below). We then apply this customized autograd function in https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L532-L678
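For readers skimming the thread, here is a heavily simplified, illustrative sketch of that pattern. This is not the actual WPFusedDenseFunc; the class name, dim-0 weight sharding, and process-group plumbing are assumptions, and communication/computation overlap is omitted:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

class ZeRO3LinearSketch(torch.autograd.Function):
    # Illustrative only -- assumes the weight is sharded along dim 0 and that
    # out_features divides evenly by the world size.

    @staticmethod
    def forward(ctx, x, weight_shard, bias, process_group):
        world_size = dist.get_world_size(process_group)
        full_weight = weight_shard.new_empty(
            (weight_shard.shape[0] * world_size,) + tuple(weight_shard.shape[1:])
        )
        # All-gather the full weight for the forward GEMM.
        dist.all_gather_into_tensor(full_weight, weight_shard, group=process_group)
        ctx.save_for_backward(x, weight_shard)
        ctx.process_group = process_group
        return F.linear(x, full_weight, bias)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight_shard = ctx.saved_tensors
        pg = ctx.process_group
        world_size = dist.get_world_size(pg)
        # All-gather the weight again for the input-gradient GEMM.
        full_weight = weight_shard.new_empty(
            (weight_shard.shape[0] * world_size,) + tuple(weight_shard.shape[1:])
        )
        dist.all_gather_into_tensor(full_weight, weight_shard, group=pg)
        grad_x = grad_out @ full_weight
        # Full weight gradient, then reduce-scatter it back to a per-rank shard.
        grad_out_2d = grad_out.reshape(-1, grad_out.shape[-1])
        x_2d = x.reshape(-1, x.shape[-1])
        grad_w_full = grad_out_2d.t() @ x_2d
        grad_w_shard = torch.empty_like(weight_shard)
        dist.reduce_scatter_tensor(grad_w_shard, grad_w_full, group=pg)
        grad_bias = grad_out_2d.sum(0) if ctx.needs_input_grad[2] else None
        return grad_x, grad_w_shard, grad_bias, None
```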

So, I just wonder how we could integrate torchao FP8 with our customized ZeRO3 weight parallel implementation?

@vkuzo (Contributor) commented Mar 18, 2025

Thanks, @zigzagcai. I think a tensor subclass weight wrapper is promising for your use case.

We have a prototype feature for quantized training with int8 with a tensor subclass weight wrapping UX here: https://github.com/pytorch/ao/tree/main/torchao/prototype/quantized_training#int8-mixed-precision-training . Would you be up for trying this and seeing if it works with your use case? If yes, that would be great signal for adding an option for this UX for float8.

@zigzagcai (Author) commented
> We have a prototype feature for quantized training with int8 with a tensor subclass weight wrapping UX here: https://github.com/pytorch/ao/tree/main/torchao/prototype/quantized_training#int8-mixed-precision-training . Would you be up for trying this and seeing if it works with your use case?

Thanks @vkuzo!
Could you please share a pointer to the tensor subclass wrapper?
And I see it is for INT8 training, so how could it be adjusted to FP8 training?

@vkuzo (Contributor) commented Mar 18, 2025

> Could you please share a pointer to the tensor subclass wrapper?

Yes, when you do the following:

quantize_(model, int8_mixed_precision_training())

Then the weights of torch.nn.Linear modules will be swapped with Int8MixedPrecisionTrainingLinearWeight (code: https://github.com/pytorch/ao/blob/main/torchao/prototype/quantized_training/int8_mixed_precision.py#L300C22-L300C60). You may need to adjust filter_fn to apply this to your custom linear modules, something like

quantize_(model, int8_mixed_precision_training(), filter_fn=lambda mod, fqn: isinstance(mod, YourCustomModule))

> And I see it is for INT8 training, so how could it be adjusted to FP8 training?

Oh, I'm just asking if you're up for trying the int8 training feature to see if it already works for the way you set up your custom linear modules. If it does, that will make it easy for us to see if we can add an equivalent float8 UX in the future. If not, we'd love to learn what didn't work, which would help us brainstorm how we can solve your issue.
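For completeness, a self-contained version of the snippet above. The import paths follow the linked prototype README, and YourCustomLinear is a placeholder for your framework's own module class, so adjust both as needed:

```python
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_mixed_precision_training

class YourCustomLinear(nn.Linear):
    # Placeholder for a framework-specific linear (e.g. a TP/ZeRO3 wrapper).
    pass

model = nn.Sequential(
    YourCustomLinear(1024, 1024),
    nn.ReLU(),
    YourCustomLinear(1024, 1024),
)

# Restrict the weight swap to the custom linear class instead of plain nn.Linear.
quantize_(
    model,
    int8_mixed_precision_training(),
    filter_fn=lambda mod, fqn: isinstance(mod, YourCustomLinear),
)
```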

@zigzagcai (Author) commented
> Oh, I'm just asking if you're up for trying the int8 training feature to see if it already works for the way you set up your custom linear modules. […]

Thank you @vkuzo!
I will give it a try!
