[BFCL] add support for microsoft/Phi-4-mini-instruct #967
Conversation
Thanks a lot for the PR @RobotSail!
I made a few changes to the parsing logic in phi_fc.py. Namely, Phi-4-mini-instruct-FC is not doing a great job at instruction following. I think the lack of an official function-calling guide for Phi models might be causing all these troubles :/
- Improve handling of the parallel tool call scenario. Sometimes the model emits the parallel calls without wrapping them in a list, like `{"name": "calculate_sales_tax", "arguments": {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}}` (see the sketch after this list).
- Normally we expect the tool call to be wrapped in tags, like `<|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]<|/tool_call|>`; however, many times the closing tag is missing, like `<|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]`. In such cases, I still honored it as a valid tool call because the opening tag is present and the model response ends at the end of the tool call.
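For illustration, a minimal sketch of the normalization idea from the first bullet (the helper name is hypothetical; the actual parsing lives in `phi_fc.py`):

```python
import json

def _normalize_tool_calls(raw: str) -> list:
    """Hypothetical sketch: coerce model output into a list of tool calls.

    Handles both a proper JSON list and bare comma-separated objects such as
    '{"name": ...}, {"name": ...}' that were not wrapped in a list.
    """
    raw = raw.strip()
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Bare parallel calls: wrap them in brackets and retry.
        parsed = json.loads(f"[{raw}]")
    # A single dict is treated as one call; a list is taken as-is.
    return parsed if isinstance(parsed, list) else [parsed]
```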
With these changes, Phi-4-mini-instruct-FC can achieve 0.85 on Simple, 0.71 on Parallel, and 0.865 on Multiple (non-live AST).
Would also love to know what you think :D
berkeley-function-call-leaderboard/bfcl/model_handler/local_inference/phi_fc.py
Btw, does this PR plan to add support for Phi-4 as well? We could retire the support for the old models.
Thank you so much for taking a look at this PR @HuanzhiMao!
Unfortunately, the Phi-4 model doesn't officially support function calling; it's currently limited to Phi-4-mini. This PR was testing Phi-4-mini-instruct, but we can add tests for Phi-4-mini as well.
Thank you so much for fixing up the function-calling logic.
Yes, I observed this as well. Originally I was handling it, but it seemed like a failure scenario, since the model is designed to emit the closing tags and that is what MCP clients would rely on when parsing. However, if you think it makes sense to handle it, then I think that's perfectly fine. Thanks again for all the time you spent looking at my PR; I greatly appreciate it!
Hey, really appreciate the kind words! But I think the credit here goes to @RobotSail — I don't recall contributing to this one myself 😅
```python
def _is_tool_call_response_format(input: str) -> bool:
    """
    This is a helper method to detect if the tool call extracted by `_extract_tool_calls` is actually a tool call.
    This is important because the model might return a dictionary that looks like a tool call, but is not. It sometimes returns the function document.
    """
```
This is really smart
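For readers following along, a hypothetical sketch of what such a format check might look like (this is not the PR's actual code; the key names are assumed from the examples above):

```python
def _looks_like_tool_call(entry: object) -> bool:
    """Hypothetical sketch of the format check, not the PR's actual code.

    A genuine tool call carries exactly a string "name" and a dict
    "arguments"; a function *document* echoed back by the model typically
    carries extra schema keys such as "description" or "parameters".
    """
    return (
        isinstance(entry, dict)
        and set(entry.keys()) == {"name", "arguments"}
        and isinstance(entry["name"], str)
        and isinstance(entry["arguments"], dict)
    )
```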
```python
pattern = r"<\|tool_call\|>(.*?)<\|/tool_call\|>"
matches = re.findall(pattern, input_string, re.DOTALL)

# Often the model will miss the closing `<|/tool_call|>`, e.g.:
# <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]
# Since `<|tool_call|>` is still present, we consider this a valid case
```
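A sketch of how the fallback branch could look (function and variable names are assumed, not the PR's actual code):

```python
import re
from typing import Optional

def _extract_tool_call_body(input_string: str) -> Optional[str]:
    # Happy path: both the opening and closing tags are present.
    pattern = r"<\|tool_call\|>(.*?)<\|/tool_call\|>"
    matches = re.findall(pattern, input_string, re.DOTALL)
    if matches:
        return matches[0].strip()
    # Fallback: the opening tag is present but the closing tag is missing;
    # take everything after <|tool_call|> to the end of the response.
    open_only = re.search(r"<\|tool_call\|>(.*)\Z", input_string, re.DOTALL)
    if open_only:
        return open_only.group(1).strip()
    return None
```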
I do agree with this logic in general, and I believe Llama 3 also behaves this way. More generally (and not related to this PR at all) I wonder how MCP clients will handle all of this logic across different models & behaviors. Seems like a massive headache!
Yea, this scenario is a bit tricky in my opinion. I see how people could argue both ways. In such cases, the important thing for a benchmark like BFCL is to set a consistent standard (do the right thing) and follow it.
After careful consideration, I agree with you that, from a parsing perspective, this should be a failure scenario since the closing tags are expected but not supplied. The model is not following instructions.
@HuanzhiMao Either way works. I can see the other perspective too: if we can parse out the noise the model generated, is the underlying function call still correct?
When I have some time this week, I'll also update this PR to include results for Phi-4-mini (and maybe Phi-4 if I have time to get to it). I think it will be interesting to see the differences.
That would be awesome. Thank you! I will leave this PR open for now.
Sorry, that's a typo; I tagged the wrong person 😅
I see. If this PR is not handling Phi-4, then we should remove these lines (such as here, and here). They could mislead people into thinking that Phi-4 is supported when it actually is not.
This PR introduces support for the newly released Phi-4-mini-instruct model from Microsoft:
The results for this were initially evaluated against f81063; however, the model had a few issues, so this PR was developed after rebasing on top of d0299e.
The results obtained from this model are as follows:
Note: It seems that Phi-4-mini-instruct on vLLM currently has a bug with batched inference when using vLLM==0.7.3 and torch==2.5.1. This PR tested results using vLLM==0.7.3 and torch==2.6.0, which seemed to work fine.
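For anyone reproducing the batched-inference behavior, a minimal smoke test using the standard vLLM offline API (the prompts are illustrative; pin the versions noted above):

```python
from vllm import LLM, SamplingParams

# Illustrative batched-inference check; run once with torch==2.5.1 and once
# with torch==2.6.0 (both on vllm==0.7.3) to compare behavior.
llm = LLM(model="microsoft/Phi-4-mini-instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["What is 2 + 2?", "Name the capital of France."]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```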