
[BFCL] Retire Executable Categories from Leaderboard #943


Merged
merged 14 commits on Apr 10, 2025
7 changes: 0 additions & 7 deletions berkeley-function-call-leaderboard/.env.example
@@ -1,4 +1,3 @@
# [OPTIONAL] Required for LLM generation step
# Provide the API key for the model(s) you intend to use
OPENAI_API_KEY=sk-XXXXXX
MISTRAL_API_KEY=
@@ -22,12 +21,6 @@ AWS_SECRET_ACCESS_KEY=
DATABRICKS_API_KEY=
DATABRICKS_AZURE_ENDPOINT_URL=

# [OPTIONAL] Required for evaluation of `exec` test group
RAPID_API_KEY=
EXCHANGERATE_API_KEY=
OMDB_API_KEY=
GEOCODE_API_KEY=

# [OPTIONAL] For local vllm/sglang server configuration
# Defaults to localhost port 1053 if not provided
VLLM_ENDPOINT=localhost
6 changes: 6 additions & 0 deletions berkeley-function-call-leaderboard/CHANGELOG.md
@@ -2,6 +2,12 @@

All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.

- [Apr 9, 2025] [#943](https://github.com/ShishirPatil/gorilla/pull/943): Retire the executable categories from the leaderboard. The following categories will be excluded from the evaluation pipeline:
- `rest`
- `exec_simple`
- `exec_parallel`
- `exec_multiple`
- `exec_parallel_multiple`
- [Apr 9, 2025] [#972](https://github.com/ShishirPatil/gorilla/pull/972): Add the following new models to the leaderboard:
- `Salesforce/Llama-xLAM-2-70b-fc-r`
- `Salesforce/Llama-xLAM-2-8b-fc-r`
1 change: 0 additions & 1 deletion berkeley-function-call-leaderboard/CONTRIBUTING.md
@@ -21,7 +21,6 @@ berkeley-function-call-leaderboard/
│ ├── constants/ # Global constants and configuration values
│ ├── eval_checker/ # Evaluation modules
│ │ ├── ast_eval/ # AST-based evaluation
│ │ ├── executable_eval/ # Evaluation by execution
│ │ ├── multi_turn_eval/ # Multi-turn evaluation
│ ├── model_handler/ # All model-specific handlers
│ │ ├── local_inference/ # Handlers for locally-hosted models
31 changes: 1 addition & 30 deletions berkeley-function-call-leaderboard/README.md
@@ -9,7 +9,6 @@
- [Basic Installation](#basic-installation)
- [Extra Dependencies for Self-Hosted Models](#extra-dependencies-for-self-hosted-models)
- [Setting up Environment Variables](#setting-up-environment-variables)
- [API Keys for Executable Test Categories](#api-keys-for-executable-test-categories)
- [Running Evaluations](#running-evaluations)
- [Generating LLM Responses](#generating-llm-responses)
- [Selecting Models and Test Categories](#selecting-models-and-test-categories)
@@ -19,7 +18,6 @@
- [For Pre-existing OpenAI-compatible Endpoints](#for-pre-existing-openai-compatible-endpoints)
- [(Alternate) Script Execution for Generation](#alternate-script-execution-for-generation)
- [Evaluating Generated Responses](#evaluating-generated-responses)
- [(Optional) API Sanity Check](#optional-api-sanity-check)
- [Output Structure](#output-structure)
- [(Optional) WandB Evaluation Logging](#optional-wandb-evaluation-logging)
- [(Alternate) Script Execution for Evaluation](#alternate-script-execution-for-evaluation)
@@ -93,27 +91,6 @@ cp .env.example .env

If you are running any proprietary models, make sure the model API keys are included in your `.env` file. Models such as GPT, Claude, Mistral, Gemini, and Nova require them.

### API Keys for Executable Test Categories

If you want to run executable test categories, you must provide API keys. Add the keys to your `.env` file, so that the placeholder values used in questions/params/answers can be replaced with real data.
There are 4 API keys to include:

1. RAPID-API Key: <https://rapidapi.com/hub>

- Yahoo Finance: <https://rapidapi.com/sparior/api/yahoo-finance15>
- Real Time Amazon Data: <https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data>
- Urban Dictionary: <https://rapidapi.com/community/api/urban-dictionary>
- Covid 19: <https://rapidapi.com/api-sports/api/covid-193>
- Time zone by Location: <https://rapidapi.com/BertoldVdb/api/timezone-by-location>

All the Rapid APIs we use have a free tier. You need to **subscribe** to those API providers in order to set up the executable test environment, but it will be _free of charge_!

2. Exchange Rate API: <https://www.exchangerate-api.com>
3. OMDB API: <http://www.omdbapi.com/apikey.aspx>
4. Geocode API: <https://geocode.maps.co/>

The evaluation script will automatically search for dataset files in the default `./data/` directory and replace the placeholder values with the actual API keys you provided in the `.env` file.
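
For context, the retired `apply_function_credential_config` helper (whose import and calls are removed elsewhere in this PR) performed the substitution described above roughly like the sketch below; the placeholder strings and function name here are illustrative, not the actual implementation.

```python
import os
from pathlib import Path

# Hypothetical placeholder-to-env-var mapping; the real dataset files
# define their own placeholder strings.
PLACEHOLDERS = {
    "YOUR-RAPID-API-KEY": "RAPID_API_KEY",
    "YOUR-EXCHANGERATE-API-KEY": "EXCHANGERATE_API_KEY",
    "YOUR-OMDB-API-KEY": "OMDB_API_KEY",
    "YOUR-GEOCODE-API-KEY": "GEOCODE_API_KEY",
}

def apply_credentials(data_dir: Path) -> None:
    """Replace placeholders in every dataset file with keys from the environment."""
    for path in data_dir.glob("*.json"):
        text = path.read_text()
        for placeholder, env_var in PLACEHOLDERS.items():
            key = os.getenv(env_var)
            if key:
                text = text.replace(placeholder, key)
        path.write_text(text)
```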

---

## Running Evaluations
@@ -128,7 +105,7 @@ The evaluation script will automatically search for dataset files in the default
You can provide multiple models or test categories by separating them with commas. For example:

```bash
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category parallel,multiple,exec_simple
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category simple,parallel,multiple,multi_turn
```

#### Output and Logging
@@ -199,12 +176,6 @@ If in the previous step you stored the model responses in a custom directory, yo
> Note: Unevaluated test categories will be marked as `N/A` in the evaluation result CSV files.
> For summary columns (e.g., `Overall Acc`, `Non_Live Overall Acc`, `Live Overall Acc`, and `Multi Turn Overall Acc`), the score reported will treat all unevaluated categories as 0 during calculation.

> For executable categories, if the API Keys are not provided, the evaluation process will skip those categories and treat them as if they were not evaluated.
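
To make the summary-score behavior above concrete, here is a minimal sketch, assuming an unweighted mean over categories (the actual score script may weight categories differently):

```python
def overall_acc(scores: dict[str, float], categories: list[str]) -> float:
    """Mean accuracy over all categories; unevaluated ones (N/A) count as 0."""
    return sum(scores.get(cat, 0.0) for cat in categories) / len(categories)

# Two categories evaluated, one left unevaluated (N/A):
print(overall_acc({"simple": 0.90, "parallel": 0.80}, ["simple", "parallel", "multiple"]))
# 0.566..., not 0.85, because `multiple` is treated as 0
```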

#### (Optional) API Sanity Check

If any of your test categories involve executable tests (e.g., category name contains `exec` or `rest`), you can set the `--api-sanity-check` flag (or `-c` for short) to have the evaluation process perform a sanity check on all REST API endpoints involved. If any of them are not behaving as expected, you will be alerted in the console; the evaluation process will continue regardless.
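
The behavior described above amounts to a non-fatal ping of each endpoint. A minimal sketch (the endpoint list and function name are hypothetical; the retired implementation may have differed):

```python
import requests

def api_sanity_check(endpoints: list[str]) -> None:
    """Ping each REST endpoint; warn on failure, but never abort the evaluation."""
    for url in endpoints:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code != 200:
                print(f"❗️ Sanity check: {url} returned HTTP {resp.status_code}")
        except requests.RequestException as exc:
            print(f"❗️ Sanity check: {url} unreachable: {exc}")
    # The evaluation proceeds regardless of the results above.
```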

#### Output Structure

Evaluation scores are stored in `./score/`, mirroring the structure of `./result/`: `score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json`
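
For example, a score file path could be derived like this (a sketch of the layout described above; the helper name is illustrative):

```python
from pathlib import Path

def score_file(score_dir: Path, model_name: str, category: str) -> Path:
    # Mirrors ./result/: score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json
    return score_dir / model_name / f"BFCL_v3_{category}_score.json"

print(score_file(Path("./score"), "gpt-4o-2024-11-20-FC", "simple"))
# score/gpt-4o-2024-11-20-FC/BFCL_v3_simple_score.json
```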
24 changes: 3 additions & 21 deletions berkeley-function-call-leaderboard/TEST_CATEGORIES.md
@@ -1,8 +1,8 @@
## Specifying Test Categories
# Specifying Test Categories

When running tests, you can use the optional `--test-category` parameter to define which categories of tests to execute. You can provide multiple categories by separating them with commas. If no category is specified, all available tests will run by default.

### Available Test Groups
## Available Test Groups

You can specify a broad category (test group) to run multiple related tests at once (you can also use the `bfcl test-categories` command to see this list); a sketch of how a group expands into individual categories follows the list:

@@ -12,13 +12,10 @@ You can specify a broad category (test group) to run multiple related tests at o
- `single_turn`: All single-turn test categories.
- `live`: All user-contributed live test categories.
- `non_live`: All not-user-contributed test categories (the opposite of `live`).
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python_ast`: Python Abstract Syntax Tree tests.
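
Conceptually, each group above expands into a set of individual categories before the run starts. A minimal sketch (the actual mapping lives in `bfcl/constants/category_mapping.py`; the constant and group contents shown here are illustrative):

```python
# Illustrative group-to-category expansion; not the real mapping.
TEST_GROUPS: dict[str, list[str]] = {
    "python_ast": ["simple", "multiple", "parallel", "parallel_multiple"],
    "multi_turn": [
        "multi_turn_base",  # assumed name; see the individual categories below
        "multi_turn_miss_func",
        "multi_turn_miss_param",
        "multi_turn_long_context",
    ],
}

def expand(args: list[str]) -> list[str]:
    """Expand group names into individual categories; pass plain categories through."""
    out: list[str] = []
    for arg in args:
        out.extend(TEST_GROUPS.get(arg, [arg]))
    return out
```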

### Available Individual Test Categories
## Available Individual Test Categories

If you prefer more granular control, you can specify individual categories:

@@ -28,11 +25,6 @@ If you prefer more granular control, you can specify individual categories:
- `parallel_multiple`: Multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `exec_simple`: Executable function calls.
- `exec_parallel`: Executable multiple function calls in parallel.
- `exec_multiple`: Executable multiple function calls in sequence.
- `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
- `rest`: REST API function calls.
- `irrelevance`: Function calls with irrelevant function documentation.
- `live_simple`: User-contributed simple function calls.
- `live_multiple`: User-contributed multiple function calls in sequence.
@@ -44,13 +36,3 @@ If you prefer more granular control, you can specify individual categories:
- `multi_turn_miss_func`: Multi-turn function calls with missing function.
- `multi_turn_miss_param`: Multi-turn function calls with missing parameter.
- `multi_turn_long_context`: Multi-turn function calls with long context.

### Important Notes on REST API Testing

If you intend to run the following categories or groups—`all`, `single_turn`, `non_live`, `executable`, `python`, or `rest`—ensure that you have configured your REST API keys in the `.env` file. These categories test the model’s output against real-world APIs.

If you prefer not to provide REST API keys, select a test category that does not involve executable tests.

### API Sanity Checks

By adding the `--api-sanity-check` (or `-c`) flag, the evaluation process will perform preliminary REST API endpoint checks whenever executable test categories (those whose names contain `exec`) are included. If any endpoints fail to respond as expected, they will be flagged, but the testing will continue regardless.
8 changes: 1 addition & 7 deletions berkeley-function-call-leaderboard/bfcl/__main__.py
@@ -226,12 +226,6 @@ def evaluate(
help="A list of test categories to run the evaluation on.",
callback=handle_multiple_input
),
api_sanity_check: bool = typer.Option(
False,
"--api-sanity-check",
"-c",
help="Perform the REST API status sanity check before running the evaluation.",
),
result_dir: str = typer.Option(
None,
"--result-dir",
@@ -248,7 +242,7 @@
"""

load_dotenv(dotenv_path=DOTENV_PATH, verbose=True, override=True) # Load the .env file
evaluation_main(model, test_category, api_sanity_check, result_dir, score_dir)
evaluation_main(model, test_category, result_dir, score_dir)


@cli.command()

This file was deleted.

@@ -1,11 +1,9 @@
import argparse
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy

from bfcl._apply_function_credential_config import apply_function_credential_config
from bfcl.constants.category_mapping import (
MULTI_TURN_FUNC_DOC_FILE_MAPPING,
TEST_FILE_MAPPING,
Expand All @@ -21,8 +19,6 @@
from bfcl.model_handler.handler_map import HANDLER_MAP
from bfcl.model_handler.model_style import ModelStyle
from bfcl.utils import (
check_api_key_supplied,
is_executable,
is_multi_turn,
parse_test_category_argument,
sort_key,
@@ -70,67 +66,35 @@ def build_handler(model_name, temperature):

def get_involved_test_entries(test_category_args, run_ids):
all_test_file_paths, all_test_categories, all_test_entries_involved = [], [], []
api_key_supplied = check_api_key_supplied()
skipped_categories = []
rest_test_credential_applied = False

if run_ids:
with open(TEST_IDS_TO_GENERATE_PATH) as f:
test_ids_to_generate = json.load(f)
for category, test_ids in test_ids_to_generate.items():
if len(test_ids) == 0:
continue
test_file_path = TEST_FILE_MAPPING[category]

is_exec = is_executable(category)

# Skip executable test category if api key is not provided in the .env file
if is_exec and not api_key_supplied:
skipped_categories.append(category)
continue

# Apply function credential config if any of the test categories are executable
if is_exec and not rest_test_credential_applied:
apply_function_credential_config(input_path=PROMPT_PATH)
rest_test_credential_applied = True

all_test_categories.append(category)
all_test_file_paths.append(test_file_path)
all_test_entries_involved.extend(
[
entry
for entry in load_file(PROMPT_PATH / test_file_path)
if entry["id"] in test_ids
]
)
all_test_categories.append(category)
all_test_file_paths.append(test_file_path)

else:
all_test_file_paths, all_test_categories = parse_test_category_argument(test_category_args)
# Make a copy here since we are removing list elements inside the for loop
for test_category, file_to_open in zip(
all_test_categories[:], all_test_file_paths[:]
):
is_exec = is_executable(test_category)

# Skip executable test category and remove corresponding files if API key is not provided in the .env file
if is_exec and not api_key_supplied:
all_test_categories.remove(test_category)
all_test_file_paths.remove(file_to_open)
skipped_categories.append(test_category)
continue

# Apply function credential config if any of the test categories are executable
if is_exec and not rest_test_credential_applied:
apply_function_credential_config(input_path=PROMPT_PATH)
rest_test_credential_applied = True

all_test_entries_involved.extend(load_file(PROMPT_PATH / file_to_open))

return (
all_test_file_paths,
all_test_categories,
all_test_entries_involved,
skipped_categories,
)


@@ -141,8 +105,6 @@ def collect_test_cases(
model_result_dir = args.result_dir / model_name_dir

existing_result = []
existing_ids = []

for test_category, file_to_open in zip(all_test_categories, all_test_file_paths):

result_file_path = model_result_dir / file_to_open.replace(".json", "_result.json")
@@ -307,7 +269,6 @@ def main(args):
all_test_file_paths,
all_test_categories,
all_test_entries_involved,
skipped_categories,
) = get_involved_test_entries(args.test_category, args.run_ids)

print(f"Generating results for {args.model}")
@@ -316,13 +277,6 @@
else:
print(f"Running full test cases for categories: {all_test_categories}.")

if len(skipped_categories) > 0:
print("----------")
print(
f"❗️ Note: The following executable test category entries will be skipped because they require API Keys to be provided in the .env file: {skipped_categories}.\n Please refer to the README.md 'API Keys for Executable Test Categories' section for details.\n The model response for other categories will still be generated."
)
print("----------")

if args.result_dir is not None:
args.result_dir = PROJECT_ROOT / args.result_dir
else: