
[fbgemm_gpu] Add more docs scaffolding for GenAI #3944


Open · wants to merge 1 commit into base: main
8 changes: 8 additions & 0 deletions .github/scripts/fbgemm_gpu_docs.bash
@@ -82,6 +82,14 @@ build_fbgemm_gpu_docs () {
# shellcheck disable=SC2086
print_exec conda env config vars set ${env_prefix} SPHINX_LINT=1

# print_exec mkdir -p ../fbgemm_gpu/experimental/
# print_exec cp -r ../experimental/gen_ai/gen_ai ../fbgemm_gpu/experimental/ || return 1
# print_exec cp -r ../experimental/gemm/triton_gemm ../fbgemm_gpu/experimental/ || return 1
# print_exec ls -la ../fbgemm_gpu
# print_exec ls -la ../fbgemm_gpu/experimental/
# print_exec ls -la ../fbgemm_gpu/experimental/gen_ai
# print_exec ls -la ../fbgemm_gpu/experimental/triton_gemm

# shellcheck disable=SC2086
if print_exec conda run ${env_prefix} make clean doxygen; then
echo "[DOCS] Doxygen build passed"
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@ documentation.
applications. Please see [the documentation](fbgemm_gpu/README.md) for more
information.

* **FBGEMM_GPU GenAI**: A collection of PyTorch GPU operator libraries that are
* **FBGEMM GenAI**: A collection of PyTorch GPU operator libraries that are
designed for generative AI applications, such as FP8 row-wise quantization and
collective communications. Please see [the documentation](fbgemm_gpu/README.md)
for more information.
3 changes: 2 additions & 1 deletion fbgemm_gpu/docs/Doxyfile.in
@@ -2343,7 +2343,8 @@ SEARCH_INCLUDES = YES
INCLUDE_PATH = "../codegen" \
"../include" \
"../src" \
"../../include/fbgemm"
"../../include/fbgemm" \
"../experimental/gen_ai/src"

# You can use the INCLUDE_FILE_PATTERNS tag to specify one or more wildcard
# patterns (like *.h and *.hpp) to filter out the header-files in the
17 changes: 17 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/development/BuildInstructions.rst
@@ -26,9 +26,26 @@ Follow the instructions to set up the Conda environment:

#. :ref:`fbgemm-gpu.build.setup.env`
#. :ref:`fbgemm-gpu.build.setup.cuda`
#. :ref:`fbgemm-gpu.build.setup.cutlass`
#. :ref:`fbgemm-gpu.build.setup.tools.install`
#. :ref:`fbgemm-gpu.build.setup.pytorch.install`

.. _fbgemm-gpu.build.setup.cutlass:

Install CUTLASS
~~~~~~~~~~~~~~~

CUTLASS should already be available in the FBGEMM repository as a git
submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
are already added to the CMake configuration:

.. code:: cmake

    set(THIRDPARTY ${FBGEMM}/external)

    ${THIRDPARTY}/cutlass/include
    ${THIRDPARTY}/cutlass/tools/util/include


Other Pre-Build Setup
---------------------
fbgemm_gpu/docs/src/fbgemm_genai/development/InstallationInstructions.rst
@@ -30,7 +30,7 @@ Follow the instructions for setting up the runtime environment:


Install the FBGEMM GenAI Package
------------------------------
--------------------------------

Install through PyTorch PIP
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -43,11 +43,11 @@ PyTorch PIP is the preferred channel for installing FBGEMM GenAI:

# CUDA Nightly
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
pip install --pre fbgemm-genai --index-url https://download.pytorch.org/whl/nightly/cu126/
pip install --pre fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/nightly/cu126/

# CUDA Release
pip install torch --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-genai --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/cu126/

# Test the installation
python -c "import torch; import fbgemm_gpu.experimental.gen_ai"
fbgemm_gpu/docs/src/fbgemm_genai/development/TestInstructions.rst
@@ -6,7 +6,7 @@ benchmarks (in the ``fbgemm_gpu/experimental/gen_ai/bench/`` directory) provide
good examples on how to use FBGEMM GenAI operators.

Set Up the FBGEMM GenAI Test Environment
---------------------------------------
----------------------------------------

After an environment is available from building / installing the FBGEMM GenAI
package, additional packages need to be installed for tests to run correctly:
12 changes: 12 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/index.rst
@@ -16,3 +16,15 @@ FP8 row-wise quantization and collective communications.
development/BuildInstructions
development/InstallationInstructions
development/TestInstructions

.. toctree::
    :maxdepth: 2
    :caption: FBGEMM GenAI Overview

    overview/Overview

.. toctree::
    :maxdepth: 2
    :caption: FBGEMM GenAI Python API

    python-api/quantize_ops
48 changes: 48 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/overview/Overview.rst
@@ -0,0 +1,48 @@
FBGEMM GenAI Overview
=====================

High Level Overview
-------------------

FBGEMM FP8 rowwise quantization kernels have been officially adopted in the
`Llama3.1 release <https://fb.workplace.com/groups/221503021668016/permalink/1900301927121442/>`__.
FP8 has been applied across the Llama3 8B, 70B, and 405B models. Notably, for
the 405B model, FP8 enables inference on a single node, achieving a 2x
throughput improvement over the BF16 baseline running on two nodes with
pipeline parallelism. Externally, it has been mentioned in the
`Llama3 paper <https://ai.meta.com/research/publications/the-llama-3-herd-of-models/>`__ and
`repo <https://github.com/meta-llama/llama-stack/blob/main/llama_stack/models/llama/quantize_impls.py>`__,
`HuggingFace <https://huggingface.co/docs/transformers/main/quantization/fbgemm_fp8>`__,
`vLLM <https://blog.vllm.ai/2024/07/23/llama31.html>`__, and
`TensorRT-LLM <https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms/>`__.

FBGEMM GenAI FP8 supports a variety of configurations:

* GEMM Operators: {CUTLASS, CK, Triton} x {BF16, FP8} x {tensor-wise, row-wise, block-wise} x {Nvidia H100, AMD MI300x}.
* High/low Precision Conversion Kernels: (FP32 / BF16 <-> FP8) with scaling options {tensor-wise, row-wise, block-wise} across hardware platforms {Nvidia H100, AMD MI300x} and programming options of {Triton, CUDA/HIP}.

Besides FP8 support, FBGEMM GenAI operators also support:

* Customized AllReduce communications (reduce latency for small message sizes).
* GQA: optimized specifically for decoding cases, as detailed in PyTorch's blog on `INT4 decoding <https://pytorch.org/blog/int4-decoding/>`__.
* KV cache quantizations.
* Rotary Positional Embedding (RoPE).

FP8 Core API Functions
----------------------

.. code:: python

    # Rowwise quantize (channel wise) the weight from BF16 to FP8
    wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
    # Rowwise quantize the activation (token wise) from BF16 to FP8
    xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(
        x, num_tokens, activation_scale_ub
    )
    # Rowwise quantized GEMM with FP8 inputs and BF16 output
    y = torch.ops.fbgemm.f8f8bf16_rowwise(
        xq,
        wq,
        x_scale,
        w_scale,
        use_fast_accum=True,
    )

See :ref:`genai-quantize-ops-stable-api` for more details.
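
For reference, below is a minimal end-to-end sketch of the flow above. The
tensor shapes and the CUDA device placement are illustrative assumptions, and
the import of ``fbgemm_gpu.experimental.gen_ai`` is assumed to register the
``torch.ops.fbgemm`` GenAI operators:

.. code:: python

    import torch
    import fbgemm_gpu.experimental.gen_ai  # assumed to register the GenAI ops

    # Illustrative shapes: M tokens, K input features, N output features
    M, N, K = 16, 4096, 4096
    x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # activations
    w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")  # weights

    # Rowwise quantize weight and activation from BF16 to FP8
    wq, w_scale = torch.ops.fbgemm.quantize_fp8_per_row(w)
    xq, x_scale = torch.ops.fbgemm.quantize_fp8_per_row(x)

    # FP8 x FP8 -> BF16 rowwise GEMM; y has shape (M, N)
    y = torch.ops.fbgemm.f8f8bf16_rowwise(
        xq, wq, x_scale, w_scale, use_fast_accum=True
    )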
16 changes: 16 additions & 0 deletions fbgemm_gpu/docs/src/fbgemm_genai/python-api/quantize_ops.rst
@@ -0,0 +1,16 @@
Quantization Operators
======================

.. automodule:: fbgemm_gpu

.. _genai-quantize-ops-stable-api:

Stable API
----------

.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.int4_row_quantize

.. autofunction:: fbgemm_gpu.experimental.gen_ai.quantize.quantize_int4_preshuffle
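
Below is a hypothetical usage sketch; the exact signatures, including the
assumed ``group_size`` argument and the shapes of the returned tensors, should
be checked against the generated API reference above:

.. code:: python

    import torch

    from fbgemm_gpu.experimental.gen_ai.quantize import (
        int4_row_quantize,
        quantize_int4_preshuffle,
    )

    w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

    # Assumed call pattern: rowwise INT4 quantization with per-group scales
    wq, scales = int4_row_quantize(w, group_size=128)

    # Assumed call pattern: INT4 quantization plus a layout preshuffle
    # tailored for the GEMM kernels
    wq_shuffled, shuffled_scales = quantize_int4_preshuffle(w, group_size=128)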

Other API
---------
15 changes: 0 additions & 15 deletions fbgemm_gpu/docs/src/fbgemm_gpu/development/BuildInstructions.rst
@@ -151,21 +151,6 @@ cuDNN package for the given CUDA version:
wget -q "${cudnn_url}" -O cudnn.tar.xz
tar -xvf cudnn.tar.xz

Install CUTLASS
~~~~~~~~~~~~~~~

This section is only applicable to building the experimental FBGEMM_GPU GenAI
module. CUTLASS should be already be available in the repository as a git
submodule (see :ref:`fbgemm-gpu.build.prepare`). The following include paths
are already added to the CMake configuration:

.. code:: cmake

set(THIRDPARTY ${FBGEMM}/external)

${THIRDPARTY}/cutlass/include
${THIRDPARTY}/cutlass/tools/util/include


.. _fbgemm-gpu.build.setup.rocm:

12 changes: 12 additions & 0 deletions fbgemm_gpu/docs/src/index.rst
@@ -121,3 +121,15 @@ Table of Contents
fbgemm_genai/development/BuildInstructions
fbgemm_genai/development/InstallationInstructions
fbgemm_genai/development/TestInstructions

.. toctree::
    :maxdepth: 1
    :caption: FBGEMM GenAI Overview

    fbgemm_genai/overview/Overview

.. toctree::
    :maxdepth: 1
    :caption: FBGEMM GenAI Python API

    fbgemm_genai/python-api/quantize_ops
2 changes: 1 addition & 1 deletion fbgemm_gpu/fbgemm_gpu/__init__.py
@@ -56,7 +56,7 @@ def _load_library(filename: str, no_throw: bool = False) -> None:
"experimental/gen_ai/fbgemm_gpu_experimental_gen_ai",
]

# NOTE: While FBGEMM_GPU GenAI is not available for ROCm yet, we would like to
# NOTE: While FBGEMM GenAI is not available for ROCm yet, we would like to
# be able to install the existing CUDA variant of the package onto ROCm systems,
# so that we can at least use the Triton GEMM libraries from experimental/gemm.
# But loading fbgemm_gpu package will trigger load-checking the .SO file for the