
[BFCL] Retire Executable Categories from Leaderboard #943


Merged
merged 14 commits on Apr 10, 2025
7 changes: 0 additions & 7 deletions berkeley-function-call-leaderboard/.env.example
@@ -1,4 +1,3 @@
# [OPTIONAL] Required for LLM generation step
# Provide the API key for the model(s) you intend to use
OPENAI_API_KEY=sk-XXXXXX
MISTRAL_API_KEY=
@@ -22,12 +21,6 @@ AWS_SECRET_ACCESS_KEY=
DATABRICKS_API_KEY=
DATABRICKS_AZURE_ENDPOINT_URL=

# [OPTIONAL] Required for evaluation of `exec` test group
RAPID_API_KEY=
EXCHANGERATE_API_KEY=
OMDB_API_KEY=
GEOCODE_API_KEY=

# [OPTIONAL] For local vllm/sglang server configuration
# Defaults to localhost port 1053 if not provided
VLLM_ENDPOINT=localhost
6 changes: 6 additions & 0 deletions berkeley-function-call-leaderboard/CHANGELOG.md
@@ -2,6 +2,12 @@

All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.

- [Apr 9, 2025] [#943](https://github.com/ShishirPatil/gorilla/pull/943): Retire the executable categories from the leaderboard. The following categories will be excluded from the evaluation pipeline:
- `rest`
- `exec_simple`
- `exec_parallel`
- `exec_multiple`
- `exec_parallel_multiple`
- [Apr 9, 2025] [#972](https://github.com/ShishirPatil/gorilla/pull/972): Add the following new models to the leaderboard:
- `Salesforce/Llama-xLAM-2-70b-fc-r`
- `Salesforce/Llama-xLAM-2-8b-fc-r`
1 change: 0 additions & 1 deletion berkeley-function-call-leaderboard/CONTRIBUTING.md
@@ -21,7 +21,6 @@ berkeley-function-call-leaderboard/
│ ├── constants/ # Global constants and configuration values
│ ├── eval_checker/ # Evaluation modules
│ │ ├── ast_eval/ # AST-based evaluation
│ │ ├── executable_eval/ # Evaluation by execution
│ │ ├── multi_turn_eval/ # Multi-turn evaluation
│ ├── model_handler/ # All model-specific handlers
│ │ ├── local_inference/ # Handlers for locally-hosted models
31 changes: 1 addition & 30 deletions berkeley-function-call-leaderboard/README.md
@@ -9,7 +9,6 @@
- [Basic Installation](#basic-installation)
- [Extra Dependencies for Self-Hosted Models](#extra-dependencies-for-self-hosted-models)
- [Setting up Environment Variables](#setting-up-environment-variables)
- [API Keys for Executable Test Categories](#api-keys-for-executable-test-categories)
- [Running Evaluations](#running-evaluations)
- [Generating LLM Responses](#generating-llm-responses)
- [Selecting Models and Test Categories](#selecting-models-and-test-categories)
@@ -19,7 +18,6 @@
- [For Pre-existing OpenAI-compatible Endpoints](#for-pre-existing-openai-compatible-endpoints)
- [(Alternate) Script Execution for Generation](#alternate-script-execution-for-generation)
- [Evaluating Generated Responses](#evaluating-generated-responses)
- [(Optional) API Sanity Check](#optional-api-sanity-check)
- [Output Structure](#output-structure)
- [(Optional) WandB Evaluation Logging](#optional-wandb-evaluation-logging)
- [(Alternate) Script Execution for Evaluation](#alternate-script-execution-for-evaluation)
@@ -93,27 +91,6 @@ cp .env.example .env

If you are running any proprietary models, make sure the model API keys are included in your `.env` file. Models such as GPT, Claude, Mistral, Gemini, and Nova require them.

### API Keys for Executable Test Categories

If you want to run executable test categories, you must provide API keys. Add the keys to your `.env` file, so that the placeholder values used in questions/params/answers can be replaced with real data.
There are 4 API keys to include:

1. RAPID-API Key: <https://rapidapi.com/hub>

- Yahoo Finance: <https://rapidapi.com/sparior/api/yahoo-finance15>
- Real Time Amazon Data: <https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data>
- Urban Dictionary: <https://rapidapi.com/community/api/urban-dictionary>
- Covid 19: <https://rapidapi.com/api-sports/api/covid-193>
- Time zone by Location: <https://rapidapi.com/BertoldVdb/api/timezone-by-location>

All the Rapid APIs we use have a free tier. You need to **subscribe** to those API providers in order to set up the executable test environment, but it will be _free of charge_!

2. Exchange Rate API: <https://www.exchangerate-api.com>
3. OMDB API: <http://www.omdbapi.com/apikey.aspx>
4. Geocode API: <https://geocode.maps.co/>

The evaluation script will automatically search for dataset files in the default `./data/` directory and replace the placeholder values with the actual API keys you provided in the `.env` file.
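
For context, the retired `apply_function_credential_config` helper (whose import and calls are removed elsewhere in this PR) performed the substitution described above roughly like the sketch below; the placeholder strings and function name here are illustrative, not the actual implementation.

```python
import os
from pathlib import Path

# Hypothetical placeholder-to-env-var mapping; the real dataset files
# define their own placeholder strings.
PLACEHOLDERS = {
    "YOUR-RAPID-API-KEY": "RAPID_API_KEY",
    "YOUR-EXCHANGERATE-API-KEY": "EXCHANGERATE_API_KEY",
    "YOUR-OMDB-API-KEY": "OMDB_API_KEY",
    "YOUR-GEOCODE-API-KEY": "GEOCODE_API_KEY",
}

def apply_credentials(data_dir: Path) -> None:
    """Replace placeholders in every dataset file with keys from the environment."""
    for path in data_dir.glob("*.json"):
        text = path.read_text()
        for placeholder, env_var in PLACEHOLDERS.items():
            key = os.getenv(env_var)
            if key:
                text = text.replace(placeholder, key)
        path.write_text(text)
```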

---

## Running Evaluations
@@ -128,7 +105,7 @@ The evaluation script will automatically search for dataset files in the default
You can provide multiple models or test categories by separating them with commas. For example:

```bash
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category parallel,multiple,exec_simple
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category simple,parallel,multiple,multi_turn
```

#### Output and Logging
@@ -199,12 +176,6 @@ If in the previous step you stored the model responses in a custom directory, yo
> Note: Unevaluated test categories will be marked as `N/A` in the evaluation result CSV files.
> For summary columns (e.g., `Overall Acc`, `Non_Live Overall Acc`, `Live Overall Acc`, and `Multi Turn Overall Acc`), the score reported will treat all unevaluated categories as 0 during calculation.

> For executable categories, if the API Keys are not provided, the evaluation process will skip those categories and treat them as if they were not evaluated.
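
To make the summary-score behavior above concrete, here is a minimal sketch, assuming an unweighted mean over categories (the actual score script may weight categories differently):

```python
def overall_acc(scores: dict[str, float], categories: list[str]) -> float:
    """Mean accuracy over all categories; unevaluated ones (N/A) count as 0."""
    return sum(scores.get(cat, 0.0) for cat in categories) / len(categories)

# Two categories evaluated, one left unevaluated (N/A):
print(overall_acc({"simple": 0.90, "parallel": 0.80}, ["simple", "parallel", "multiple"]))
# 0.566..., not 0.85, because `multiple` is treated as 0
```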

#### (Optional) API Sanity Check

If any of your test categories involve executable tests (e.g., category name contains `exec` or `rest`), you can set the `--api-sanity-check` flag (or `-c` for short) to have the evaluation process perform a sanity check on all REST API endpoints involved. If any of them are not behaving as expected, you will be alerted in the console; the evaluation process will continue regardless.
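
The behavior described above amounts to a non-fatal ping of each endpoint. A minimal sketch (the endpoint list and function name are hypothetical; the retired implementation may have differed):

```python
import requests

def api_sanity_check(endpoints: list[str]) -> None:
    """Ping each REST endpoint; warn on failure, but never abort the evaluation."""
    for url in endpoints:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code != 200:
                print(f"❗️ Sanity check: {url} returned HTTP {resp.status_code}")
        except requests.RequestException as exc:
            print(f"❗️ Sanity check: {url} unreachable: {exc}")
    # The evaluation proceeds regardless of the results above.
```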

#### Output Structure

Evaluation scores are stored in `./score/`, mirroring the structure of `./result/`: `score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json`
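
For example, a score file path could be derived like this (a sketch of the layout described above; the helper name is illustrative):

```python
from pathlib import Path

def score_file(score_dir: Path, model_name: str, category: str) -> Path:
    # Mirrors ./result/: score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json
    return score_dir / model_name / f"BFCL_v3_{category}_score.json"

print(score_file(Path("./score"), "gpt-4o-2024-11-20-FC", "simple"))
# score/gpt-4o-2024-11-20-FC/BFCL_v3_simple_score.json
```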
24 changes: 3 additions & 21 deletions berkeley-function-call-leaderboard/TEST_CATEGORIES.md
@@ -1,8 +1,8 @@
## Specifying Test Categories
# Specifying Test Categories

When running tests, you can use the optional `--test-category` parameter to define which categories of tests to execute. You can provide multiple categories by separating them with commas. If no category is specified, all available tests will run by default.

### Available Test Groups
## Available Test Groups

You can specify a broad category (test group) to run multiple related tests at once (you can also use the `bfcl test-categories` command to see this list); a sketch of how a group expands into individual categories follows the list:

@@ -12,13 +12,10 @@ You can specify a broad category (test group) to run multiple related tests at o
- `single_turn`: All single-turn test categories.
- `live`: All user-contributed live test categories.
- `non_live`: All not-user-contributed test categories (the opposite of `live`).
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python_ast`: Python Abstract Syntax Tree tests.
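
Conceptually, each group above expands into a set of individual categories before the run starts. A minimal sketch (the actual mapping lives in `bfcl/constants/category_mapping.py`; the constant and group contents shown here are illustrative):

```python
# Illustrative group-to-category expansion; not the real mapping.
TEST_GROUPS: dict[str, list[str]] = {
    "python_ast": ["simple", "multiple", "parallel", "parallel_multiple"],
    "multi_turn": [
        "multi_turn_base",  # assumed name; see the individual categories below
        "multi_turn_miss_func",
        "multi_turn_miss_param",
        "multi_turn_long_context",
    ],
}

def expand(args: list[str]) -> list[str]:
    """Expand group names into individual categories; pass plain categories through."""
    out: list[str] = []
    for arg in args:
        out.extend(TEST_GROUPS.get(arg, [arg]))
    return out
```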

### Available Individual Test Categories
## Available Individual Test Categories

If you prefer more granular control, you can specify individual categories:

@@ -28,11 +25,6 @@ If you prefer more granular control, you can specify individual categories:
- `parallel_multiple`: Multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `exec_simple`: Executable function calls.
- `exec_parallel`: Executable multiple function calls in parallel.
- `exec_multiple`: Executable multiple function calls in sequence.
- `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
- `rest`: REST API function calls.
- `irrelevance`: Function calls with irrelevant function documentation.
- `live_simple`: User-contributed simple function calls.
- `live_multiple`: User-contributed multiple function calls in sequence.
@@ -44,13 +36,3 @@ If you prefer more granular control, you can specify individual categories:
- `multi_turn_miss_func`: Multi-turn function calls with missing function.
- `multi_turn_miss_param`: Multi-turn function calls with missing parameter.
- `multi_turn_long_context`: Multi-turn function calls with long context.

### Important Notes on REST API Testing

If you intend to run the following categories or groups—`all`, `single_turn`, `non_live`, `executable`, `python`, or `rest`—ensure that you have configured your REST API keys in the `.env` file. These categories test the model’s output against real-world APIs.

If you prefer not to provide REST API keys, select a test category that does not involve executable tests.

### API Sanity Checks

By adding the `--api-sanity-check` (or `-c`) flag, the evaluation process will perform preliminary REST API endpoint checks whenever executable test categories (those whose names contain `exec`) are included. If any endpoints fail to respond as expected, they will be flagged, but the testing will continue regardless.
8 changes: 1 addition & 7 deletions berkeley-function-call-leaderboard/bfcl/__main__.py
@@ -226,12 +226,6 @@ def evaluate(
help="A list of test categories to run the evaluation on.",
callback=handle_multiple_input
),
api_sanity_check: bool = typer.Option(
False,
"--api-sanity-check",
"-c",
help="Perform the REST API status sanity check before running the evaluation.",
),
result_dir: str = typer.Option(
None,
"--result-dir",
@@ -248,7 +242,7 @@
"""

load_dotenv(dotenv_path=DOTENV_PATH, verbose=True, override=True) # Load the .env file
evaluation_main(model, test_category, api_sanity_check, result_dir, score_dir)
evaluation_main(model, test_category, result_dir, score_dir)


@cli.command()

This file was deleted.

@@ -1,11 +1,9 @@
import argparse
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy

from bfcl._apply_function_credential_config import apply_function_credential_config
from bfcl.constants.category_mapping import (
MULTI_TURN_FUNC_DOC_FILE_MAPPING,
TEST_FILE_MAPPING,
Expand All @@ -21,8 +19,6 @@
from bfcl.model_handler.handler_map import HANDLER_MAP
from bfcl.model_handler.model_style import ModelStyle
from bfcl.utils import (
check_api_key_supplied,
is_executable,
is_multi_turn,
parse_test_category_argument,
sort_key,
@@ -70,67 +66,35 @@ def build_handler(model_name, temperature):

def get_involved_test_entries(test_category_args, run_ids):
all_test_file_paths, all_test_categories, all_test_entries_involved = [], [], []
api_key_supplied = check_api_key_supplied()
skipped_categories = []
rest_test_credential_applied = False

if run_ids:
with open(TEST_IDS_TO_GENERATE_PATH) as f:
test_ids_to_generate = json.load(f)
for category, test_ids in test_ids_to_generate.items():
if len(test_ids) == 0:
continue
test_file_path = TEST_FILE_MAPPING[category]

is_exec = is_executable(category)

# Skip executable test category if api key is not provided in the .env file
if is_exec and not api_key_supplied:
skipped_categories.append(category)
continue

# Apply function credential config if any of the test categories are executable
if is_exec and not rest_test_credential_applied:
apply_function_credential_config(input_path=PROMPT_PATH)
rest_test_credential_applied = True

all_test_categories.append(category)
all_test_file_paths.append(test_file_path)
all_test_entries_involved.extend(
[
entry
for entry in load_file(PROMPT_PATH / test_file_path)
if entry["id"] in test_ids
]
)
all_test_categories.append(category)
all_test_file_paths.append(test_file_path)

else:
all_test_file_paths, all_test_categories = parse_test_category_argument(test_category_args)
# Make a copy here since we are removing list elements inside the for loop
for test_category, file_to_open in zip(
all_test_categories[:], all_test_file_paths[:]
):
is_exec = is_executable(test_category)

# Skip executable test category and remove corresponding files if API key is not provided in the .env file
if is_exec and not api_key_supplied:
all_test_categories.remove(test_category)
all_test_file_paths.remove(file_to_open)
skipped_categories.append(test_category)
continue

# Apply function credential config if any of the test categories are executable
if is_exec and not rest_test_credential_applied:
apply_function_credential_config(input_path=PROMPT_PATH)
rest_test_credential_applied = True

all_test_entries_involved.extend(load_file(PROMPT_PATH / file_to_open))

return (
all_test_file_paths,
all_test_categories,
all_test_entries_involved,
skipped_categories,
)


@@ -141,8 +105,6 @@ def collect_test_cases(
model_result_dir = args.result_dir / model_name_dir

existing_result = []
existing_ids = []

for test_category, file_to_open in zip(all_test_categories, all_test_file_paths):

result_file_path = model_result_dir / file_to_open.replace(".json", "_result.json")
@@ -307,7 +269,6 @@ def main(args):
all_test_file_paths,
all_test_categories,
all_test_entries_involved,
skipped_categories,
) = get_involved_test_entries(args.test_category, args.run_ids)

print(f"Generating results for {args.model}")
@@ -316,13 +277,6 @@
else:
print(f"Running full test cases for categories: {all_test_categories}.")

if len(skipped_categories) > 0:
print("----------")
print(
f"❗️ Note: The following executable test category entries will be skipped because they require API Keys to be provided in the .env file: {skipped_categories}.\n Please refer to the README.md 'API Keys for Executable Test Categories' section for details.\n The model response for other categories will still be generated."
)
print("----------")

if args.result_dir is not None:
args.result_dir = PROJECT_ROOT / args.result_dir
else: