
[Model] support modernbert #16648


Merged
merged 14 commits, Apr 16, 2025

Conversation

xsank
Contributor

@xsank xsank commented Apr 15, 2025

Support ModernBERT; tests pass on Alibaba-NLP/gte-reranker-modernbert-base.

FIX #11347

Example:

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "Alibaba-NLP/gte-reranker-modernbert-base",
  "text_1": ["what is the capital of China?","how to implement quick sort in python?","how to implement quick sort in python?"],
  "text_2": ["Beijing","Introduction of quick sort","The weather is nice today"]
}'
{"id":"score-808110e90ba2472f904721ccd20034ce","object":"list","created":1744705317,"model":"Alibaba-NLP/gte-reranker-modernbert-base","data":[{"index":0,"object":"score","score":0.89453125},{"index":1,"object":"score","score":0.92138671875},{"index":2,"object":"score","score":0.1572265625}],"usage":{"prompt_tokens":43,"total_tokens":43,"completion_tokens":0,"prompt_tokens_details":null}}

Signed-off-by: xsank [email protected]

Signed-off-by: 唯勤 <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: 唯勤 <[email protected]>
@DarkLight1337
Member

Thanks for adding this model! To get the model to pass CI, please add this model to the test files as mentioned here: https://docs.vllm.ai/en/latest/contributing/model/tests.html

唯勤 added 2 commits April 15, 2025 18:01
Signed-off-by: 唯勤 <[email protected]>
Signed-off-by: 唯勤 <[email protected]>
@xsank
Contributor Author

xsank commented Apr 15, 2025

@DarkLight1337, I do not know why the lint-and-deploy stage failed, nor can I find any useful information in the detailed log...

@DarkLight1337
Member

Retrying

@xsank xsank changed the title from "[Mdoel] support modernbert" to "[Model] support modernbert" on Apr 16, 2025
唯勤 added 2 commits April 16, 2025 12:48
唯勤 added 3 commits April 16, 2025 15:07
Signed-off-by: 唯勤 <[email protected]>
Signed-off-by: 唯勤 <[email protected]>
@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 16, 2025
唯勤 added 3 commits April 16, 2025 16:36
Member

@DarkLight1337 DarkLight1337 left a comment


Thanks for adding this model!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 16, 2025 10:04
@xsank
Contributor Author

xsank commented Apr 17, 2025

hi @xsank, thanks for adding this. What are the options to run the ModernBERT model? If I look in https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base/tree/main, there is model.safetensors (tensorflow) and onnx.

Will this PR also work with https://huggingface.co/answerdotai/ModernBERT-base?

In fact, this commit supports the common ModernBertForSequenceClassification architecture, so you could give it a try. If you find anything that doesn't work well, please tell me.
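Any checkpoint whose config lists ModernBertForSequenceClassification should load the same way; a minimal offline sketch (using the same score API that appears later in this thread):

from vllm import LLM

# Minimal offline sketch with the score task (illustrative; same API as the test below).
llm = LLM(model="Alibaba-NLP/gte-reranker-modernbert-base", task="score")
(out,) = llm.score(["what is the capital of China?"], ["Beijing"])
print(out.outputs.score)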

@geraldstanje

geraldstanje commented Apr 17, 2025

@xsank @lionelvillard I fine-tuned answerdotai/ModernBERT-base as a binary text classifier - does that fall under ModernBertForSequenceClassification?
edit: I see "architectures": ["ModernBertForMaskedLM"] in https://huggingface.co/answerdotai/ModernBERT-base/blob/main/config.json#L4 - so I assume we need another PR to support ModernBertForMaskedLM?

When you support the "ModernBertForMaskedLM" architecture, will you need to call softmax on the returned logits?

lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
@xsank
Contributor Author

xsank commented Apr 18, 2025

@xsank @lionelvillard I fine-tuned answerdotai/ModernBERT-base as a binary text classifier - does that fall under ModernBertForSequenceClassification? edit: I see "architectures": ["ModernBertForMaskedLM"] in https://huggingface.co/answerdotai/ModernBERT-base/blob/main/config.json#L4 - so I assume we need another PR to support ModernBertForMaskedLM?

When you support the "ModernBertForMaskedLM" architecture, will you need to call softmax on the returned logits?

It seems the ModernBERT series models have some small differences; let me see how to support all the features in the same class. I'm sorry, I made a mistake: this PR only supports part of them, and there is still some work to do.

@geraldstanje

@xsank thanks for your reply - could you also add it for the ModernBertForMaskedLM architecture? That would be amazing.

yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: 唯勤 <[email protected]>
Co-authored-by: 唯勤 <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
@mhillebrand

@xsank What about the /pooling endpoint for classification?

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: 唯勤 <[email protected]>
Co-authored-by: 唯勤 <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: 唯勤 <[email protected]>
Co-authored-by: 唯勤 <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
@geraldstanje

Hi @xsank, can you also add ModernBertForMaskedLM?

@geraldstanje

Hi @mhillebrand, so pooling is needed to support /classify, and it has not been added to V0 and V1?

@DarkLight1337
Member

Yes, /classify is for classification models, which are a subset of pooling models.

@DarkLight1337
Member

You might need to set --task classify to use the model as a classification model though.
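For example, a minimal offline sketch (assuming the LLM.classify pooling API; adjust to your vLLM version):

from vllm import LLM

# Hedged sketch: load the fine-tuned checkpoint as a classification model.
llm = LLM(model="/model", task="classify")
(result,) = llm.classify(["I love machine learning"])
print(result.outputs.probs)  # raw per-class scores, as in the /classify response shown later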

@geraldstanje

geraldstanje commented May 24, 2025

@DarkLight1337 yes, /classify works now for online serving when using the OpenAI vLLM container from the nightly build.

But when I try to use /score I get this error - any idea?
server logs:

--task score is not supported by the V1 Engine. Falling back to V0.
...
INFO 05-23 19:56:56 [logger.py:42] Received request score-fca6853bd4ea46e5b73dd334a86afd19-0: prompt: 'I love machine learning[SEP]', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.

server reply to the client:

{"object":"error","message":"pooled_data should be a scalar score","type":"BadRequestError","param":null,"code":400}%

@DarkLight1337
Member

Can you show the command you used to serve the model?

@geraldstanje

geraldstanje commented May 24, 2025

@DarkLight1337
model serving

docker run --gpus all \
  -v $(pwd)/modernbert:/model \
  -p 8000:8000 \
  --ipc=host \
  --entrypoint python3 \
  public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:1068556b2ca6c136000fa48db7d62ce1b5250dea \
  -m vllm.entrypoints.openai.api_server --model /model --task score

client

curl -s -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{"text_1": ["I love machine learning"], "text_2": ["foo"]}'

Full server logs (including startup):

WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin tpu function's return value is None
INFO 05-23 22:03:15 [__init__.py:220] Platform plugin cuda loaded.
WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-23 22:03:15 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-23 22:03:15 [__init__.py:246] Automatically detected platform cuda.
INFO 05-23 22:03:19 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 05-23 22:03:19 [__init__.py:32] name=lora_filesystem_resolver, value=vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 05-23 22:03:19 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 05-23 22:03:19 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-23 22:03:19 [__init__.py:44] plugin lora_filesystem_resolver loaded.
INFO 05-23 22:03:20 [api_server.py:1289] vLLM API server version 0.9.1.dev36+g1068556b2
INFO 05-23 22:03:20 [cli_args.py:300] non-default args: {'model': '/model', 'task': 'score'}
INFO 05-23 22:03:20 [config.py:3116] Downcasting torch.float32 to torch.float16.
WARNING 05-23 22:03:30 [arg_utils.py:1591] --task score is not supported by the V1 Engine. Falling back to V0. 
INFO 05-23 22:03:30 [api_server.py:257] Started engine process with PID 43
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin tpu function's return value is None
INFO 05-23 22:03:34 [__init__.py:220] Platform plugin cuda loaded.
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-23 22:03:34 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-23 22:03:34 [__init__.py:246] Automatically detected platform cuda.
INFO 05-23 22:03:37 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 05-23 22:03:37 [__init__.py:32] name=lora_filesystem_resolver, value=vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 05-23 22:03:37 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 05-23 22:03:37 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-23 22:03:37 [__init__.py:44] plugin lora_filesystem_resolver loaded.
INFO 05-23 22:03:37 [llm_engine.py:240] Initializing a V0 LLM engine (v0.9.1.dev36+g1068556b2) with config: model='/model', speculative_config=None, tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=True, 
INFO 05-23 22:03:38 [cuda.py:291] Using Flash Attention backend.
INFO 05-23 22:03:38 [parallel_state.py:1101] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 05-23 22:03:38 [model_runner.py:1170] Starting to load model /model...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.95s/it]

INFO 05-23 22:03:40 [default_loader.py:280] Loading weights took 2.07 seconds
INFO 05-23 22:03:41 [model_runner.py:1202] Model loading took 0.7647 GiB and 2.226047 seconds
INFO 05-23 22:03:41 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-23 22:03:41 [launcher.py:28] Available routes are:
INFO 05-23 22:03:41 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /health, Methods: GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /load, Methods: GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /ping, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /ping, Methods: GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /version, Methods: GET
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /classify, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /score, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-23 22:03:41 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 05-23 22:04:09 [logger.py:42] Received request score-383b6a952fc948d59c1510b843dd232e-0: prompt: 'I love machine learning[SEP]foo', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 05-23 22:04:09 [engine.py:313] Added request score-383b6a952fc948d59c1510b843dd232e-0.
INFO 05-23 22:04:09 [metrics.py:486] Avg prompt throughput: 0.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     172.17.0.1:50324 - "POST /score HTTP/1.1" 400 Bad Request
INFO 05-23 22:04:19 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

@DarkLight1337
Member

@xsank can you help look into this?

@geraldstanje

geraldstanje commented May 24, 2025

@DarkLight1337 when using /classify, is there a way to add a post-processing step that applies a threshold before deciding the class?

@DarkLight1337
Member

No; to achieve that I suggest getting the logits/probabilities and processing them yourself.

@geraldstanje

geraldstanje commented May 24, 2025

You mean using the probs field from the classify output?

{
  "id": "classify-d72c9764b4cd476081c7d336a26ed724",
  "object": "list",
  "created": 1748063869,
  "model": "/model",
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [
        5.93359375,
        -6.9453125
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

@DarkLight1337
Member

Yes

@geraldstanje

geraldstanje commented May 24, 2025

@DarkLight1337 an alternative would be to add a classifyAndApplyThreshold function to vLLM and rebuild the container...
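For reference, the client-side post-processing could look roughly like this (a hedged sketch; the probs values are copied from the /classify response above and the 0.5 threshold is only illustrative):

import math

# Softmax over the raw per-class scores returned in the "probs" field, then threshold.
probs = [5.93359375, -6.9453125]
exps = [math.exp(x - max(probs)) for x in probs]
softmax = [e / sum(exps) for e in exps]
predicted = 0 if softmax[0] >= 0.5 else 1  # replace 0.5 with a custom threshold
print(softmax, predicted)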

@noooop
Contributor

noooop commented Jun 18, 2025

@xsank

The results of the Alibaba-NLP/gte-reranker-modernbert-base model seem to differ slightly from sentence-transformers.

import pytest
import torch
from sentence_transformers import CrossEncoder

from vllm import LLM

model_name = "Alibaba-NLP/gte-reranker-modernbert-base"

st_model = CrossEncoder(model_name,
                        model_kwargs={"torch_dtype": torch.float32})
vllm_model = LLM(model_name, task="score", dtype="float32")

sentences = [
    ("ping", "pong"),
    ("ping", "pong" * 16),
    ("ping", "pong" * 24),
    ("ping", "pong" * 32),
    ("ping", "pong" * 48),
    ("ping", "pong" * 64),
    ("ping", "pong" * 128),
]

st_scores = st_model.predict(sentences)

texts_1 = [x[0] for x in sentences]
texts_2 = [x[1] for x in sentences]
outputs = vllm_model.score(texts_1, texts_2)
vllm_scores = [output.outputs.score for output in outputs]


def test_close(s1, s2):
    return float(s1) == pytest.approx(float(s2), rel=0.01)

print(
    [test_close(st_scores[i], vllm_scores[i]) for i in range(len(st_scores))])

Output:

[True, True, True, True, False, False, False]

Labels: documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues.

[New Model]: answerdotai/ModernBERT-large
6 participants