Qwen3 models only support CUDA devices with flash attention #630

Closed
@randomm

Description

Describe the bug

Qwen3 embedding models (introduced in PR #627) currently only support CUDA devices with flash attention enabled. Attempting to load these models on CPU or Metal devices fails with the following error:

"Qwen3 is only supported on Cuda devices in fp16 with flash attention enabled"

Expected behavior

Qwen3 embedding models should be usable on CPU and Metal devices for inference, similar to other embedding models in TEI.

Environment

  • Device: CPU or Metal (macOS)
  • Model: Any Qwen3 embedding model (e.g., Qwen/Qwen3-Embedding-0.6B)

Proposed solution

Implement a CPU-compatible Qwen3Model alongside the existing FlashQwen3Model to enable Qwen3 embedding inference on non-CUDA devices.
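Since TEI is written in Rust, here is a minimal, self-contained sketch of the dispatch shape this proposal implies. Everything in it is a hypothetical simplification, not TEI's actual internals: the `Device` enum, `EmbeddingModel` trait, and `load_qwen3` function are stand-ins; only the `FlashQwen3Model` and proposed `Qwen3Model` names come from this issue.

```rust
// Hypothetical sketch: select the flash-attention variant only when it is
// actually usable, instead of erroring out on CPU/Metal. All names besides
// FlashQwen3Model/Qwen3Model are illustrative stand-ins, not TEI's API.

#[derive(Debug, Clone, Copy)]
enum Device {
    Cuda,
    Metal,
    Cpu,
}

trait EmbeddingModel {
    fn embed(&self, input: &str) -> Vec<f32>;
}

/// Existing CUDA + flash-attention path (per this issue / PR #627).
struct FlashQwen3Model;

/// Proposed device-agnostic path using standard attention.
struct Qwen3Model;

impl EmbeddingModel for FlashQwen3Model {
    fn embed(&self, _input: &str) -> Vec<f32> {
        vec![0.0; 1024] // placeholder: real flash-attention forward pass elided
    }
}

impl EmbeddingModel for Qwen3Model {
    fn embed(&self, _input: &str) -> Vec<f32> {
        vec![0.0; 1024] // placeholder: standard attention, runs on any device
    }
}

/// Fall back to the device-agnostic model whenever flash attention
/// is unavailable, rather than returning an error.
fn load_qwen3(device: Device, flash_attention: bool) -> Box<dyn EmbeddingModel> {
    match (device, flash_attention) {
        (Device::Cuda, true) => Box::new(FlashQwen3Model),
        _ => Box::new(Qwen3Model),
    }
}

fn main() {
    // CUDA + flash attention keeps the existing fast path...
    let _flash = load_qwen3(Device::Cuda, true);
    // ...while CPU and Metal now fall back instead of erroring.
    for device in [Device::Cpu, Device::Metal] {
        let model = load_qwen3(device, false);
        println!("{:?}: embedding dim = {}", device, model.embed("hello").len());
    }
}
```

The point is simply that flash attention becomes a load-time optimization rather than a hard requirement, matching the fallback behavior other embedding models in TEI already have.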

Additional context

This limitation significantly reduces the accessibility of Qwen3 models for users without CUDA GPUs. Adding CPU support would make these models available to a much broader user base.
