
[BFCL] add support for microsoft/Phi-4-mini-instruct #967

Open · wants to merge 3 commits into base: main
Conversation

@RobotSail commented Mar 29, 2025

This PR introduces support for the newly released Phi-4-mini-instruct model from Microsoft.

The results were initially evaluated against f81063; however, the model had a few issues, so this PR was developed after rebasing on top of d0299e.

The results obtained from this model are as follows:

+--------+----------------------------------------+---------------+--------------------+---------------------+------------+------------------+-------------+---------------+
|   Rank | Model                                  | Overall Acc   | Non-Live AST Acc   | Non-Live Exec Acc   | Live Acc   | Multi Turn Acc   | Relevance   | Irrelevance   |
+========+========================================+===============+====================+=====================+============+==================+=============+===============+
|      1 | Phi-4-mini-instruct (Function Calling) | 31.02%        | 2.92%              | 63.63%              | 53.84%     | 0.00%            | 38.89%      | 90.39%        |
+--------+----------------------------------------+---------------+--------------------+---------------------+------------+------------------+-------------+---------------+

Note: Phi-4-mini-instruct on vLLM currently appears to have a bug with batched inference when using vLLM==0.7.3 and torch==2.5.1. The results in this PR were produced with vLLM==0.7.3 and torch==2.6.0, which worked fine.

@RobotSail changed the title from "add support for microsoft/phi-4, microsoft/Phi-4-mini-instruct" to "[WIP] add support for microsoft/phi-4, microsoft/Phi-4-mini-instruct" on Mar 29, 2025
@RobotSail changed the title to "[BCFL] [WIP] add support for microsoft/phi-4, microsoft/Phi-4-mini-instruct" on Mar 29, 2025
@RobotSail changed the title to "[BFCL] [WIP] add support for microsoft/phi-4, microsoft/Phi-4-mini-instruct" on Mar 29, 2025
@RobotSail changed the title to "[BFCL] add support for microsoft/phi-4, microsoft/Phi-4-mini-instruct" on Mar 31, 2025
@RobotSail changed the title to "[BFCL] add support for microsoft/Phi-4-mini-instruct" on Mar 31, 2025
@HuanzhiMao (Collaborator) left a comment

Thanks a lot for the PR @RobotSail !

I made a few changes to the parsing logic in phi_fc.py. Namely, Phi-4-mini-instruct-FC is not doing a great job at instruction following. I think the lack of an official function calling guide for Phi models might be causing all these troubles :/

  1. Improved handling of the parallel tool call scenario. Sometimes the model emits the parallel calls without wrapping them in a list, like {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}}.

  2. Normally we expect the tool call to be wrapped in tags, like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]<|/tool_call|>; however, the closing tag is often missing, like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]. In such cases, I still honored it as a valid tool call because the starting tag is present and the model response ended at the end of the tool call.
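The lenient extraction described in the two points above can be sketched roughly as follows. This is a hypothetical illustration, not the actual phi_fc.py code from this PR; the function and constant names are made up for the sketch:

```python
import json
import re

TOOL_CALL_OPEN = "<|tool_call|>"
TOOL_CALL_CLOSE = "<|/tool_call|>"


def extract_tool_calls(response: str) -> list:
    """Leniently pull tool calls out of a model response.

    Tolerates both observed failure modes:
      1. parallel calls emitted as bare comma-separated objects
         instead of a JSON list, and
      2. a missing closing <|/tool_call|> tag (accepted because the
         opening tag is present and the response ends with the call).
    """
    # Match up to the closing tag if present, otherwise to end of string.
    pattern = (
        re.escape(TOOL_CALL_OPEN)
        + r"(.*?)(?:"
        + re.escape(TOOL_CALL_CLOSE)
        + r"|$)"
    )
    match = re.search(pattern, response, re.DOTALL)
    if not match:
        return []
    payload = match.group(1).strip()
    # Failure mode 1: bare objects without the enclosing list.
    if not payload.startswith("["):
        payload = "[" + payload + "]"
    try:
        calls = json.loads(payload)
    except json.JSONDecodeError:
        return []
    # Keep only dicts that look like real tool calls.
    return [
        c for c in calls
        if isinstance(c, dict) and "name" in c and "arguments" in c
    ]
```

Wrapping the bare objects in brackets before `json.loads` is what recovers the unlisted parallel calls; the `|$` alternative in the regex is what tolerates the missing closing tag.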

With these changes, Phi-4-mini-instruct-FC can achieve 0.85 on Simple, 0.71 on Parallel, and 0.865 on Multiple (non-live ast).

Would also love to know what you think :D

@HuanzhiMao (Collaborator)

Btw, does this PR plan to add support for the phi-4 model (non-mini version)? If so, I believe the phi.py needs to be updated because phi-4 uses a chat template different from the one currently implemented there.

We could retire support for the old Phi-3 series models to keep the codebase simple.

@HuanzhiMao added the BFCL-New Model (Add New Model to BFCL) label on Apr 5, 2025
@RobotSail (Author)

Thank you so much for taking a look at this PR @HuanzhiMao !

> Btw, does this PR plan to add support for the phi-4 model

Unfortunately, the Phi-4 model doesn't officially support function calling; it's currently limited to Phi-4-mini.

This PR was testing Phi-4-mini-instruct, but we can also add tests for Phi-4-mini as well.

@RobotSail (Author)

Thank you so much for fixing up the function-calling logic.

> Normally we expect the tool call to be wrapped in the tags (like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]<|/tool_call|>), however, many times the closing tag might be missing (like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]). In such cases, I still honored it as a valid tool call because the starting tag is present, and the model response ended at the end of the tool call.

Yes, I observed this as well. Originally I treated it as a failure scenario, since the model is designed to emit the closing tag and that is what MCP clients would rely on when parsing. However, if you think it makes sense to handle it, then I think that's perfectly fine.

Thanks again for all of the time you spent looking at my PR, I greatly appreciate it!

@cedricvidal (Contributor)

> Thanks a lot for the PR @cedricvidal !
>
> I made a few changes to the parsing logic in phi_fc.py. Namely, Phi-4-mini-instruct-FC is not doing a great job at instruction following. I think the lack of an official function calling guide for Phi models might be causing all these troubles :/
>
>   1. Improved handling of the parallel tool call scenario. Sometimes the model emits the parallel calls without wrapping them in a list, like {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}}.
>
>   2. Normally we expect the tool call to be wrapped in tags, like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]<|/tool_call|>; however, the closing tag is often missing, like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]. In such cases, I still honored it as a valid tool call because the starting tag is present and the model response ended at the end of the tool call.
>
> With these changes, Phi-4-mini-instruct-FC can achieve 0.85 on Simple, 0.71 on Parallel, and 0.865 on Multiple (non-live AST).
>
> Would also love to know what you think :D

Hey, really appreciate the kind words! But I think the credit here goes to @RobotSail — I don’t recall contributing to this one myself 😅

def _is_tool_call_response_format(input: str) -> bool:
    """
    This is a helper method to detect if the tool call extracted by `_extract_tool_calls` is actually a tool call.
    This is important because the model might return a dictionary that looks like a tool call, but is not. It sometimes returns the function document.
    """
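A shape check along the lines this helper describes might look roughly as follows. This is an illustrative sketch, not the code from this PR; the choice of distinguishing keys ("parameters" marking a function document) is my own assumption:

```python
def is_tool_call_shape(candidate: object) -> bool:
    """Heuristic: a real tool call carries a string "name" and a dict
    "arguments"; a function document (a JSON schema) typically carries
    keys such as "description" or "parameters" instead."""
    return (
        isinstance(candidate, dict)
        and isinstance(candidate.get("name"), str)
        and isinstance(candidate.get("arguments"), dict)
        and "parameters" not in candidate
    )
```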
@RobotSail (Author):

This is really smart

pattern = r"<\|tool_call\|>(.*?)<\|/tool_call\|>"
matches = re.findall(pattern, input_string, re.DOTALL)

# Often the model will omit the closing `<|/tool_call|>`: <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]
# Since `<|tool_call|>` is still present, we consider this a valid case
@RobotSail (Author):

I do agree with this logic in general, and I believe Llama 3 also behaves this way. More generally (and not related to this PR at all) I wonder how MCP clients will handle all of this logic across different models & behaviors. Seems like a massive headache!

@HuanzhiMao (Collaborator):

Yea, this scenario is a bit tricky in my opinion; I can see how people could argue both ways. In such cases, the important thing for a benchmark like BFCL is to set a consistent standard (do the right thing) and follow it.

After careful consideration, I agree with you that, from a parsing perspective, this should be a failure scenario: the closing tags are expected but not supplied, so the model is not following instructions.

@RobotSail (Author):

@HuanzhiMao Either way works. I can see the other perspective: if we can parse out the noise the model generated, is the underlying function call correct?

When I have some time this week I'll also update this PR to include results for Phi-4-mini as well (and maybe Phi-4 if I have time to get to it). I think it will be interesting to see the differences.

@HuanzhiMao (Collaborator):

That would be awesome. Thank you! I will leave this PR open for now.

@HuanzhiMao (Collaborator)

> > Thanks a lot for the PR @cedricvidal !
> > I made a few changes to the parsing logic in phi_fc.py. Namely, Phi-4-mini-instruct-FC is not doing a great job at instruction following. I think the lack of an official function calling guide for Phi models might be causing all these troubles :/
> >
> >   1. Improved handling of the parallel tool call scenario. Sometimes the model emits the parallel calls without wrapping them in a list, like {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}}.
> >   2. Normally we expect the tool call to be wrapped in tags, like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]<|/tool_call|>; however, the closing tag is often missing, like <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]. In such cases, I still honored it as a valid tool call because the starting tag is present and the model response ended at the end of the tool call.
> >
> > With these changes, Phi-4-mini-instruct-FC can achieve 0.85 on Simple, 0.71 on Parallel, and 0.865 on Multiple (non-live AST).
> > Would also love to know what you think :D
>
> Hey, really appreciate the kind words! But I think the credit here goes to @RobotSail — I don’t recall contributing to this one myself 😅

Sorry, that's a typo; I tagged the wrong person 😅

@HuanzhiMao (Collaborator)

> Thank you so much for taking a look at this PR @HuanzhiMao !
>
> > Btw, does this PR plan to add support for the phi-4 model
>
> Unfortunately, the Phi-4 model doesn't officially support function calling; it's currently limited to Phi-4-mini.
>
> This PR was testing Phi-4-mini-instruct, but we can also add tests for Phi-4-mini as well.

I see. If this PR is not handling phi-4, then we should remove these lines (such as here, and here). They could mislead people into thinking phi-4 is supported when it actually is not.
