[BFCL] add support for microsoft/Phi-4-mini-instruct #967
Conversation
Thanks a lot for the PR @RobotSail!
I made a few changes to the parsing logic in phi_fc.py. Namely, Phi-4-mini-instruct-FC is not doing a great job at instruction following. I think the lack of an official function-calling guide for Phi models might be causing all these troubles :/
- Improve handling of the parallel tool call scenario. Sometimes the model emits the parallel calls without wrapping them in a list, like `{"name": "calculate_sales_tax", "arguments": {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}}, {"name": "calculate_sales_tax", "arguments": {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}}` (see the sketch after this list).
- Normally we expect the tool call to be wrapped in tags, like `<|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]<|/tool_call|>`; however, many times the closing tag is missing, like `<|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]`. In such cases, I still honored it as a valid tool call because the opening tag is present and the model response ends at the end of the tool call.
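For illustration, a minimal sketch of the normalization idea from the first bullet (the helper name is hypothetical; the actual parsing lives in `phi_fc.py`):

```python
import json

def _normalize_tool_calls(raw: str) -> list:
    """Hypothetical sketch: coerce model output into a list of tool calls.

    Handles both a proper JSON list and bare comma-separated objects such as
    '{"name": ...}, {"name": ...}' that were not wrapped in a list.
    """
    raw = raw.strip()
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Bare parallel calls: wrap them in brackets and retry.
        parsed = json.loads(f"[{raw}]")
    # A single dict is treated as one call; a list is taken as-is.
    return parsed if isinstance(parsed, list) else [parsed]
```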
With these changes, Phi-4-mini-instruct-FC can achieve 0.85 on Simple, 0.71 on Parallel, and 0.865 on Multiple (non-live AST).
Would also love to know what you think :D
berkeley-function-call-leaderboard/bfcl/model_handler/local_inference/phi_fc.py
Btw, does this PR plan to add support for Phi-4 as well? We could retire the support for the old models.
Thank you so much for taking a look at this PR @HuanzhiMao!
Unfortunately, the Phi-4 model doesn't officially support function calling; it's currently limited to Phi-4-mini. This PR was testing Phi-4-mini-instruct, but we can add tests for Phi-4-mini as well.
Thank you so much for fixing up the function-calling logic.
Yes, I observed this as well. Originally I was handling it, but it seemed like a failure scenario, since the model is designed to emit the closing tags and that is what MCP clients would rely on when parsing. However, if you think it makes sense to handle it, then I think that's perfectly fine. Thanks again for all the time you spent looking at my PR; I greatly appreciate it!
Hey, really appreciate the kind words! But I think the credit here goes to @RobotSail — I don't recall contributing to this one myself 😅
```python
def _is_tool_call_response_format(input: str) -> bool:
    """
    This is a helper method to detect if the tool call extracted by `_extract_tool_calls` is actually a tool call.
    This is important because the model might return a dictionary that looks like a tool call, but is not. It sometimes returns the function document.
    """
```
This is really smart
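For readers following along, a hypothetical sketch of what such a format check might look like (this is not the PR's actual code; the key names are assumed from the examples above):

```python
def _looks_like_tool_call(entry: object) -> bool:
    """Hypothetical sketch of the format check, not the PR's actual code.

    A genuine tool call carries exactly a string "name" and a dict
    "arguments"; a function *document* echoed back by the model typically
    carries extra schema keys such as "description" or "parameters".
    """
    return (
        isinstance(entry, dict)
        and set(entry.keys()) == {"name", "arguments"}
        and isinstance(entry["name"], str)
        and isinstance(entry["arguments"], dict)
    )
```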
```python
pattern = r"<\|tool_call\|>(.*?)<\|/tool_call\|>"
matches = re.findall(pattern, input_string, re.DOTALL)

# Often the model will miss the closing `<|/tool_call|>`, e.g.:
# <|tool_call|>[{"name": "calculate_final_velocity", "arguments": {"height": 150, "initial_velocity": 0}}]
# Since `<|tool_call|>` is still present, we consider this a valid case
```
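A sketch of how the fallback branch could look (function and variable names are assumed, not the PR's actual code):

```python
import re
from typing import Optional

def _extract_tool_call_body(input_string: str) -> Optional[str]:
    # Happy path: both the opening and closing tags are present.
    pattern = r"<\|tool_call\|>(.*?)<\|/tool_call\|>"
    matches = re.findall(pattern, input_string, re.DOTALL)
    if matches:
        return matches[0].strip()
    # Fallback: the opening tag is present but the closing tag is missing;
    # take everything after <|tool_call|> to the end of the response.
    open_only = re.search(r"<\|tool_call\|>(.*)\Z", input_string, re.DOTALL)
    if open_only:
        return open_only.group(1).strip()
    return None
```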
I do agree with this logic in general, and I believe Llama 3 also behaves this way. More generally (and not related to this PR at all) I wonder how MCP clients will handle all of this logic across different models & behaviors. Seems like a massive headache!
Yea, this scenario is a bit tricky in my opinion. I see how people could argue both ways. In such cases, the important thing for a benchmark like BFCL is to set a consistent standard (do the right thing) and follow it.
After careful consideration, I agree with you that, from a parsing perspective, this should be a failure scenario since the closing tags are expected but not supplied. The model is not following instructions.
@HuanzhiMao Either way works. I can see the other perspective too: if we can parse out the noise the model generated, is the underlying function call still correct?
When I have some time this week, I'll also update this PR to include results for Phi-4-mini (and maybe Phi-4 if I have time to get to it). I think it will be interesting to see the differences.
That would be awesome. Thank you! I will leave this PR open for now.
Sorry, that's a typo; I tagged the wrong person 😅
I see. If this PR is not handling Phi-4, then we should remove these lines (such as here, and here). They could mislead people into thinking that Phi-4 is supported when it actually is not.
This PR introduces support for the newly released Phi-4-mini-instruct model from Microsoft:
The results for this were initially evaluated against f81063; however, the model had a few issues, so this PR was developed after rebasing on top of d0299e.
The results obtained from this model are as follows:
Note: It seems that Phi-4-mini-instruct on vLLM currently has a bug with batched inference when using vLLM==0.7.3 and torch==2.5.1. This PR tested results using vLLM==0.7.3 and torch==2.6.0, which seemed to work fine.
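For anyone reproducing the batched-inference behavior, a minimal smoke test using the standard vLLM offline API (the prompts are illustrative; pin the versions noted above):

```python
from vllm import LLM, SamplingParams

# Illustrative batched-inference check; run once with torch==2.5.1 and once
# with torch==2.6.0 (both on vllm==0.7.3) to compare behavior.
llm = LLM(model="microsoft/Phi-4-mini-instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["What is 2 + 2?", "Name the capital of France."]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```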