
This change adds support for Intel Gaudi HPUs. #7275


Open: wants to merge 7 commits into main
Conversation

emascarenhas

Several configuration files are provided in the examples directory for use with Gaudi.

LLaMA-Factory features and optimizations, including inference, training (SFT, DPO, etc.), LoRA fine-tuning, and distributed training with DeepSpeed and DDP, are working. Please see the README for details.

Co-authored-by: Yaser Afshar [email protected]
Co-authored-by: Edward Mascarenhas [email protected]
Co-authored-by: Jianhong-Zhang [email protected]
Co-authored-by: Wenbin Chen [email protected]
Co-authored-by: Voas, Tanner [email protected]

Owner

@hiyouga hiyouga left a comment


Thanks for your contribution; please review the comments.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 12, 2025
@emascarenhas emascarenhas requested a review from hiyouga March 14, 2025 14:31
@ehartford

Hello, I was wondering about the status of this. Is it safe to run a training job with this branch on a Gaudi cluster?

@emascarenhas
Author

> Hello, I was wondering about the status of this? Is it safe to run a training job with this branch on Gaudi cluster?

Yes, you can. Please use requirements-gaudi.txt to install the requirements, in addition to checking out the branch/PR. There are also YAML files in the examples/train_lora directory with _gaudi.yaml suffixes, which should work out of the box.
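For orientation, a LoRA SFT recipe in that directory generally follows the usual LLaMA-Factory layout. The sketch below is hypothetical: the keys mirror the common examples/train_lora recipes, but the actual values in the _gaudi.yaml files may differ.

```yaml
# Hypothetical sketch of a train_lora recipe in the LLaMA-Factory style;
# values are illustrative placeholders, not the actual _gaudi.yaml contents.
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: identity
template: qwen2_vl
cutoff_len: 1024
output_dir: saves/qwen2vl-lora-gaudi
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```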

@ehartford

I tried to run a job in the Docker image:

cd docker/docker-hpu/
docker compose up -d
docker compose exec llamafactory bash
root@271f5f9d6340:/app# llamafactory-cli train train-qwen25.yaml
Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 5, in <module>
    from llamafactory.cli import main
  File "/app/src/llamafactory/__init__.py", line 43, in <module>
    from .extras.env import VERSION
  File "/app/src/llamafactory/extras/env.py", line 28, in <module>
    from .misc import is_torch_hpu_available
  File "/app/src/llamafactory/extras/misc.py", line 40, in <module>
    @lru_cache(maxsize=None)
NameError: name 'lru_cache' is not defined

@emascarenhas
Author

emascarenhas commented Mar 21, 2025

@ehartford,
Sorry about that. The last two merge commits introduced a couple of changes that are causing this issue, and another related to transformers v4.49.0.
This specific issue is due to the deletion of the "from functools import lru_cache" line in src/llamafactory/extras/misc.py.
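To illustrate the failure mode: a bare `@lru_cache(maxsize=None)` decorator raises the NameError shown in the traceback the moment the module is imported, unless `lru_cache` has been imported from functools first. The function body below is a hypothetical stand-in, not the actual code from misc.py.

```python
# Restoring this import is the fix; without it, the decorator line below
# raises "NameError: name 'lru_cache' is not defined" at import time,
# which is exactly what the traceback above shows.
from functools import lru_cache

@lru_cache(maxsize=None)  # cache the result so the probe runs only once
def is_torch_hpu_available() -> bool:
    """Hypothetical stand-in for the real check in extras/misc.py."""
    try:
        import habana_frameworks.torch  # noqa: F401  # present only on Gaudi hosts
        return True
    except ImportError:
        return False

print(is_torch_hpu_available())
```

Because the result is cached, repeated calls return the memoized value without re-running the import probe.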

Could you use the code at this commit in this branch for now?
https://github.com/hiyouga/LLaMA-Factory/tree/a16e3d47a29fb6d3d1b4afad9afca1d91f5b97c3

I will be pushing a commit to fix this after doing some more testing.

This set of instructions should work. Feel free to email me directly at [email protected] and we could also resolve other issues you may encounter.

cd to the LLaMA-Factory directory
git checkout a16e3d4
Load the Docker container as you normally would
pip install -e .
pip install -r requirements-gaudi.txt
llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_gaudi.yaml
