
Commit f49f9bd

[fbgemm_gpu] Add more docs scaffolding for GenAI
- Add more docs scaffolding for GenAI
1 parent 23fe369 commit f49f9bd

12 files changed (+121 / -22 lines)


.github/scripts/fbgemm_gpu_docs.bash

+8

@@ -82,6 +82,14 @@ build_fbgemm_gpu_docs () {
   # shellcheck disable=SC2086
   print_exec conda env config vars set ${env_prefix} SPHINX_LINT=1

+  # print_exec mkdir -p ../fbgemm_gpu/experimental/
+  # print_exec cp -r ../experimental/gen_ai/gen_ai ../fbgemm_gpu/experimental/ || return 1
+  # print_exec cp -r ../experimental/gemm/triton_gemm ../fbgemm_gpu/experimental/ || return 1
+  # print_exec ls -la ../fbgemm_gpu
+  # print_exec ls -la ../fbgemm_gpu/experimental/
+  # print_exec ls -la ../fbgemm_gpu/experimental/gen_ai
+  # print_exec ls -la ../fbgemm_gpu/experimental/triton_gemm
+
   # shellcheck disable=SC2086
   if print_exec conda run ${env_prefix} make clean doxygen; then
     echo "[DOCS] Doxygen build passed"

README.md

+1 / -1

@@ -19,7 +19,7 @@ documentation.
   applications. Please see [the documentation](fbgemm_gpu/README.md) for more
   information.

-* **FBGEMM_GPU GenAI**: A collection of PyTorch GPU operator libraries that are
+* **FBGEMM GenAI**: A collection of PyTorch GPU operator libraries that are
   designed for generative AI applications, such as FP8 row-wise quantization and
   collective communications. Please see [the documentation](fbgemm_gpu/README.md)
   for more information.

fbgemm_gpu/docs/Doxyfile.in

+2 / -1

@@ -2343,7 +2343,8 @@ SEARCH_INCLUDES = YES
 INCLUDE_PATH = "../codegen" \
                "../include" \
                "../src" \
-               "../../include/fbgemm"
+               "../../include/fbgemm" \
+               "../experimental/gen_ai/src"

 # You can use the INCLUDE_FILE_PATTERNS tag to specify one or more wildcard
 # patterns (like *.h and *.hpp) to filter out the header-files in the

fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst

+17

@@ -26,9 +26,26 @@ Follow the instructions to set up the Conda environment:

 #. :ref:`fbgemm-gpu.build.setup.env`
 #. :ref:`fbgemm-gpu.build.setup.cuda`
+#. :ref:`fbgemm-gpu.build.setup.cutlass`
 #. :ref:`fbgemm-gpu.build.setup.tools.install`
 #. :ref:`fbgemm-gpu.build.setup.pytorch.install`

+.. _fbgemm-gpu.build.setup.cutlass:
+
+Install CUTLASS
+~~~~~~~~~~~~~~~
+
+CUTLASS should already be available in the FBGEMM repository as a git
+submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
+are already added to the CMake configuration:
+
+.. code:: cmake
+
+  set(THIRDPARTY ${FBGEMM}/external)
+
+  ${THIRDPARTY}/cutlass/include
+  ${THIRDPARTY}/cutlass/tools/util/include
+

 Other Pre-Build Setup
 ---------------------

fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst

+3 / -3

@@ -30,7 +30,7 @@ Follow the instructions for setting up the runtime environment:


 Install the FBGEMM GenAI Package
------------------------------
+--------------------------------

 Install through PyTorch PIP
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -43,11 +43,11 @@ PyTorch PIP is the preferred channel for installing FBGEMM GenAI:

   # CUDA Nightly
   pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
-  pip install --pre fbgemm-genai --index-url https://download.pytorch.org/whl/nightly/cu126/
+  pip install --pre fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/nightly/cu126/

   # CUDA Release
   pip install torch --index-url https://download.pytorch.org/whl/cu126/
-  pip install fbgemm-genai --index-url https://download.pytorch.org/whl/cu126/
+  pip install fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/cu126/

   # Test the installation
   python -c "import torch; import fbgemm_gpu.experimental.gen_ai"
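
Editorial note: the one-line import above is the only check this page ships. A slightly stronger post-install sanity check is sketched below (not part of this commit); it assumes the CUDA fbgemm-gpu-genai wheel and the quantize_fp8_per_row operator referenced on the new Overview page.

    # Hedged post-install check; assumes a CUDA build of fbgemm-gpu-genai.
    import torch
    import fbgemm_gpu.experimental.gen_ai  # noqa: F401  # loads the GenAI extension, registering its ops

    # quantize_fp8_per_row is one of the ops used on the new Overview page;
    # hasattr() resolves the op on torch.ops.fbgemm without calling it.
    print("quantize_fp8_per_row registered:", hasattr(torch.ops.fbgemm, "quantize_fp8_per_row"))
    print("CUDA available:", torch.cuda.is_available())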

fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst

+1 / -1

@@ -6,7 +6,7 @@ benchmarks (in the ``fbgemm_gpu/experimental/gen_ai/bench/`` directory) provide
 good examples on how to use FBGEMM GenAI operators.

 Set Up the FBGEMM GenAI Test Environment
----------------------------------------
+----------------------------------------

 After an environment is available from building / installing the FBGEMM GenAI
 package, additional packages need to be installed for tests to run correctly:

fbgemm_gpu/docs/src/fbgemm_genai/index.rst

+12

@@ -16,3 +16,15 @@ FP8 row-wise quantization and collective communications.
    development/BuildInstructions
    development/InstallationInstructions
    development/TestInstructions
+
+.. toctree::
+   :maxdepth: 2
+   :caption: FBGEMM GenAI Overview
+
+   overview/Overview
+
+.. toctree::
+   :maxdepth: 2
+   :caption: FBGEMM GenAI Python API
+
+   python-api/quantize_ops
fbgemm_gpu/docs/src/fbgemm_genai/overview/Overview.rst

+48

@@ -0,0 +1,48 @@
+FBGEMM GenAI Overview
+=====================
+
+High Level Overview
+-------------------
+
+FBGEMM FP8 rowwise quantization kernels have been officially adopted in the
+[Llama3.1 release](https://fb.workplace.com/groups/221503021668016/permalink/1900301927121442/).
+FP8 has been applied across the Llama3 models at 8B, 70B, and 405B.
+Notably, for the 405B model, FP8 enables inference on a single node,
+achieving a 2x throughput improvement over the baseline BF16 running on two
+nodes with pipeline parallelism. Externally, it has been mentioned in the
+[Llama3 paper](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/) and
+[repo](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/models/llama/quantize_impls.py), and by [HuggingFace](https://huggingface.co/docs/transformers/main/quantization/fbgemm_fp8), [vLLM](https://blog.vllm.ai/2024/07/23/llama31.html), and [TensorRT-LLM](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms/).
+
+FBGEMM GenAI FP8 supports a variety of configurations:
+
+* GEMM Operators: {CUTLASS, CK, Triton} x {BF16, FP8} x {tensor-wise, row-wise, block-wise} x {Nvidia H100, AMD MI300x}.
+* High/Low Precision Conversion Kernels: (FP32 / BF16 <-> FP8) with scaling options {tensor-wise, row-wise, block-wise} across hardware platforms {Nvidia H100, AMD MI300x} and programming options of {Triton, CUDA/HIP}.
+
+Besides FP8 support, FBGEMM GenAI operators also support:
+
+* Customized AllReduce communications (reduces latency for small message sizes).
+* GQA: optimized specifically for decoding cases, as detailed in PyTorch's blog on [INT4 decoding](https://pytorch.org/blog/int4-decoding/).
+* KV cache quantizations.
+* Rotary Positional Embedding (RoPE).
+
+FP8 Core API Functions
+----------------------
+
+.. code:: python
+
+  # Rowwise quantize (channel-wise) the weight from BF16 to FP8
+  wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
+  # Rowwise quantize (token-wise) the activation from BF16 to FP8
+  xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(
+      x, num_tokens, activation_scale_ub
+  )
+  # Rowwise-quantized GEMM with FP8 inputs and BF16 output
+  y = torch.ops.fbgemm.f8f8bf16_rowwise(
+      xq,
+      wq,
+      x_scale,
+      w_scale,
+      use_fast_accum=True,
+  )
+
+See :ref:`genai-quantize-ops-stable-api` for more details.
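
Editorial note: to make the FP8 API block above concrete, here is a minimal end-to-end sketch (not part of the commit). The tensor shapes, the single-argument call to quantize_fp8_per_row, and the H100-class device are assumptions layered on top of the snippet above.

    # Hedged sketch of the rowwise FP8 path from the Overview above.
    # Assumes an FP8-capable GPU (e.g. H100) and a CUDA build of FBGEMM GenAI.
    import torch
    import fbgemm_gpu.experimental.gen_ai  # noqa: F401  # registers the torch.ops.fbgemm GenAI ops

    M, N, K = 128, 256, 512
    x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # activations
    w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")  # weights

    # Row-wise (per-channel) FP8 quantization of the weight.
    wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
    # Row-wise (per-token) FP8 quantization of the activation; the
    # num_tokens / activation_scale_ub arguments from the snippet are
    # assumed optional and omitted here.
    xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(x)

    # FP8 x FP8 -> BF16 rowwise GEMM; y approximates x @ w.T with shape [M, N].
    y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale, use_fast_accum=True)
    print(y.shape, y.dtype)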
fbgemm_gpu/docs/src/fbgemm_genai/python-api/quantize_ops.rst

+16

@@ -0,0 +1,16 @@
+Quantization Operators
+======================
+
+.. automodule:: fbgemm_gpu
+
+.. _genai-quantize-ops-stable-api:
+
+Stable API
+----------
+
+.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.int4_row_quantize
+
+.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.quantize_int4_preshuffle
+
+Other API
+---------
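
Editorial note: the two autofunction targets above are what the Sphinx build will pull docstrings from, so a cheap way to vet this page before building the docs is to confirm the dotted paths resolve. A minimal sketch (not part of the commit), assuming an importable fbgemm_gpu build with the experimental gen_ai module:

    # Hedged check that the autofunction targets above resolve.
    import inspect

    from fbgemm_gpu.experimental.gen_ai.quantize import (
        int4_row_quantize,
        quantize_int4_preshuffle,
    )

    # Sphinx autofunction renders these signatures and docstrings; printing them
    # locally catches a broken dotted path early.
    for fn in (int4_row_quantize, quantize_int4_preshuffle):
        print(fn.__name__, inspect.signature(fn))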

fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst

-15

@@ -151,21 +151,6 @@ cuDNN package for the given CUDA version:
   wget -q "${cudnn_url}" -O cudnn.tar.xz
   tar -xvf cudnn.tar.xz

-Install CUTLASS
-~~~~~~~~~~~~~~~
-
-This section is only applicable to building the experimental FBGEMM_GPU GenAI
-module. CUTLASS should be already be available in the repository as a git
-submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
-are already added to the CMake configuration:
-
-.. code:: cmake
-
-  set(THIRDPARTY ${FBGEMM}/external)
-
-  ${THIRDPARTY}/cutlass/include
-  ${THIRDPARTY}/cutlass/tools/util/include
-

 .. _fbgemm-gpu.build.setup.rocm:

fbgemm_gpu/docs/src/index.rst

+12

@@ -121,3 +121,15 @@ Table of Contents
    fbgemm_genai/development/BuildInstructions
    fbgemm_genai/development/InstallationInstructions
    fbgemm_genai/development/TestInstructions
+
+.. toctree::
+   :maxdepth: 1
+   :caption: FBGEMM GenAI Overview
+
+   fbgemm_genai/overview/Overview
+
+.. toctree::
+   :maxdepth: 1
+   :caption: FBGEMM GenAI Python API
+
+   fbgemm_genai/python-api/quantize_ops

fbgemm_gpu/fbgemm_gpu/__init__.py

+1 / -1

@@ -56,7 +56,7 @@ def _load_library(filename: str, no_throw: bool = False) -> None:
     "experimental/gen_ai/fbgemm_gpu_experimental_gen_ai",
 ]

-# NOTE: While FBGEMM_GPU GenAI is not available for ROCm yet, we would like to
+# NOTE: While FBGEMM GenAI is not available for ROCm yet, we would like to
 # be able to install the existing CUDA variant of the package onto ROCm systems,
 # so that we can at least use the Triton GEMM libraries from experimental/gemm.
 # But loading fbgemm_gpu package will trigger load-checking the .SO file for the
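
Editorial note: the NOTE edited here relies on the no_throw parameter of _load_library shown in the hunk header. A minimal illustration of that guarded-loading pattern is sketched below; the body is an assumption for clarity, not FBGEMM's actual implementation.

    # Hedged illustration of a no_throw shared-library loader; the logging and
    # error handling here are assumptions, not FBGEMM's real code.
    import torch

    def _load_library(filename: str, no_throw: bool = False) -> None:
        """Load a .so of registered ops, optionally swallowing failures.

        With no_throw=True, a missing or incompatible library (e.g. the CUDA-only
        GenAI extension on a ROCm system) is skipped instead of aborting the
        whole fbgemm_gpu import.
        """
        try:
            torch.ops.load_library(filename)
        except Exception as error:  # noqa: BLE001
            if no_throw:
                print(f"[fbgemm_gpu] skipped loading {filename}: {error}")
            else:
                raise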
