
This change adds support for Intel Gaudi HPUs. #7275


Open: wants to merge 7 commits into main
Conversation

emascarenhas

Several configuration files are provided in the examples directory for use with Gaudi.

LLaMA-Factory features and optimizations, including inference, training (SFT, DPO, etc.), LoRA fine-tuning, and distributed training with DeepSpeed and DDP, are working. Please see the README for details.

Co-authored-by: Yaser Afshar [email protected]
Co-authored-by: Edward Mascarenhas [email protected]
Co-authored-by: Jianhong-Zhang [email protected]
Co-authored-by: Wenbin Chen [email protected]
Co-authored-by: Voas, Tanner [email protected]

Owner

@hiyouga hiyouga left a comment


Thanks for your contribution; please review the comments.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 12, 2025
@emascarenhas emascarenhas requested a review from hiyouga March 14, 2025 14:31
@ehartford

Hello, I was wondering about the status of this. Is it safe to run a training job with this branch on a Gaudi cluster?

@emascarenhas
Author

> Hello, I was wondering about the status of this? Is it safe to run a training job with this branch on Gaudi cluster?

Yes, you can. Please use requirements-gaudi.txt to install the requirements, in addition to checking out the branch/PR. There are also YAML files in the examples/train_lora directory with _gaudi.yaml suffixes, which should work out of the box.
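For orientation, a LoRA SFT recipe in that directory generally follows the usual LLaMA-Factory layout. The sketch below is hypothetical: the keys mirror the common examples/train_lora recipes, but the actual values in the _gaudi.yaml files may differ.

```yaml
# Hypothetical sketch of a train_lora recipe in the LLaMA-Factory style;
# values are illustrative placeholders, not the actual _gaudi.yaml contents.
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: identity
template: qwen2_vl
cutoff_len: 1024
output_dir: saves/qwen2vl-lora-gaudi
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```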

@ehartford

I tried to run a job in the Docker image:

cd docker/docker-hpu/
docker compose up -d
docker compose exec llamafactory bash
root@271f5f9d6340:/app# llamafactory-cli train train-qwen25.yaml
Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 5, in <module>
    from llamafactory.cli import main
  File "/app/src/llamafactory/__init__.py", line 43, in <module>
    from .extras.env import VERSION
  File "/app/src/llamafactory/extras/env.py", line 28, in <module>
    from .misc import is_torch_hpu_available
  File "/app/src/llamafactory/extras/misc.py", line 40, in <module>
    @lru_cache(maxsize=None)
NameError: name 'lru_cache' is not defined

@emascarenhas
Author

emascarenhas commented Mar 21, 2025

@ehartford,
Sorry about that. The last two merge commits introduced a couple of changes that are causing this issue, and another related to transformers v4.49.0.
This specific issue is due to the deletion of the "from functools import lru_cache" line in src/llamafactory/extras/misc.py.
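To illustrate the failure mode: a bare `@lru_cache(maxsize=None)` decorator raises the NameError shown in the traceback the moment the module is imported, unless `lru_cache` has been imported from functools first. The function body below is a hypothetical stand-in, not the actual code from misc.py.

```python
# Restoring this import is the fix; without it, the decorator line below
# raises "NameError: name 'lru_cache' is not defined" at import time,
# which is exactly what the traceback above shows.
from functools import lru_cache

@lru_cache(maxsize=None)  # cache the result so the probe runs only once
def is_torch_hpu_available() -> bool:
    """Hypothetical stand-in for the real check in extras/misc.py."""
    try:
        import habana_frameworks.torch  # noqa: F401  # present only on Gaudi hosts
        return True
    except ImportError:
        return False

print(is_torch_hpu_available())
```

Because the result is cached, repeated calls return the memoized value without re-running the import probe.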

Could you use the code at this commit in this branch for now?
https://github.com/hiyouga/LLaMA-Factory/tree/a16e3d47a29fb6d3d1b4afad9afca1d91f5b97c3

I will be pushing a commit to fix this after doing some more testing.

This set of instructions should work. Feel free to email me directly at [email protected] and we could also resolve other issues you may encounter.

cd to the LLaMA-Factory directory
git checkout a16e3d4
Load the Docker container as you normally would
pip install -e .
pip install -r requirements-gaudi.txt
llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_gaudi.yaml
