Describe the bug
Currently, Qwen3 embedding models (introduced in PR #627) only support CUDA devices with flash attention enabled. When trying to use these models on CPU or Metal devices, users get the following error:
"Qwen3 is only supported on Cuda devices in fp16 with flash attention enabled"
Expected behavior
Qwen3 embedding models should be usable on CPU and Metal devices for inference, similar to other embedding models in TEI.
Environment
- Device: CPU or Metal (macOS)
- Model: Any Qwen3 embedding model (e.g., Qwen/Qwen3-Embedding-0.6B)
Proposed solution
Implement a CPU-compatible Qwen3Model alongside the existing FlashQwen3Model to enable Qwen3 embedding inference on non-CUDA devices.
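The dispatch could follow the pattern TEI already uses for other architectures: pick the flash-attention variant only when running on CUDA with flash attention available, and fall back to a standard implementation otherwise. A minimal self-contained sketch of that selection logic (the names `Qwen3Backend` and `select_qwen3_backend` are illustrative, not actual TEI APIs):

```rust
// Illustrative sketch only; simplified stand-ins for TEI's real device
// and model types.
#[derive(Debug, PartialEq)]
enum Device {
    Cuda,
    Cpu,
    Metal,
}

#[derive(Debug, PartialEq)]
enum Qwen3Backend {
    /// FlashQwen3Model: fp16 + flash attention, CUDA only.
    Flash,
    /// Proposed Qwen3Model: standard attention, runs on CPU/Metal.
    Standard,
}

/// Prefer the flash-attention model on CUDA; fall back to the
/// standard (non-flash) implementation everywhere else.
fn select_qwen3_backend(device: &Device, flash_available: bool) -> Qwen3Backend {
    match device {
        Device::Cuda if flash_available => Qwen3Backend::Flash,
        _ => Qwen3Backend::Standard,
    }
}

fn main() {
    assert_eq!(select_qwen3_backend(&Device::Cuda, true), Qwen3Backend::Flash);
    assert_eq!(select_qwen3_backend(&Device::Cpu, false), Qwen3Backend::Standard);
    assert_eq!(select_qwen3_backend(&Device::Metal, false), Qwen3Backend::Standard);
}
```

With a fallback like this in place, loading Qwen/Qwen3-Embedding-0.6B on CPU or Metal would construct the standard model instead of returning the hard error quoted above.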
Additional context
This limitation significantly reduces the accessibility of Qwen3 models for users without CUDA GPUs. Adding CPU support would make these models available to a much broader user base.