
LLMs and Planning

This repository contains the code for three papers:

  1. The code in the 'plan-bench' subdirectory accompanies the paper "PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change".
  2. The code in the 'llm_planning_analysis' subdirectory accompanies the paper "On the Planning Abilities of Large Language Models--A Critical Investigation".
  3. NEW: The 'llm_planning_analysis' subdirectory also contains the code for the paper "Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1".

PlanBench Static Test Set Leaderboard

The leaderboard below shows the performance of models on the PlanBench static test set with zero-shot prompting. See the llm_planning_analysis/results/ folder for the detailed result files; the Blocksworld Hard results are in the results/backprompting/ folder.

| Model Name | Model Type | Blocksworld (NL, 600 instances) | Mystery Blocksworld (NL, 600 instances) | Randomized Mystery Blocksworld (NL, 600 instances) | Blocksworld Hard (PDDL, 110 instances) |
|---|---|---|---|---|---|
| Deepseek R1 | LRM | 99.1% | 43.3% | 25.8% | 53.6% |
| o1-preview | LRM | 97.8% | 52.8% | 37.3% | 23.65% |
| o1-mini | LRM | 56.6% | 19.1% | 3.5% | 10% |
| Claude-3.5 Sonnet | LLM | 54.8% | 0% | - | - |
| GPT-4o | LLM | 35.5% | 0% | - | - |
| LLaMA-3.1 405B | LLM | 62.6% | 0.8% | - | - |
| Claude 3 Opus | LLM | 59.3% | 0% | - | - |
| LLaMA-3 70B | LLM | 34.16% | 0% | - | - |
| GPT-4 | LLM | 34.6% | 0% | - | - |
| Gemini 1.5 Pro | LLM | 23.8% | - | - | - |

Note: LLM = Large Language Model, LRM = Large Reasoning Model, NL = Natural Language Prompting, PDDL = Planning Domain Definition Language Prompting
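For reference, the sketch below shows how a leaderboard accuracy could be recomputed from one of the result files. It is a minimal illustration only: the assumed JSON layout (a top-level `instances` list with a boolean `correct` flag per instance) and the example file path are hypothetical, so consult the actual files under llm_planning_analysis/results/ for the real schema.

```python
# Minimal sketch: recompute a zero-shot accuracy from a result file.
# NOTE: the JSON schema ("instances" list with a boolean "correct" flag) and the
# example path below are assumptions for illustration, not the repo's actual format.
import json
from pathlib import Path


def zero_shot_accuracy(result_file: Path) -> float:
    """Return the fraction of test instances marked correct in a result file."""
    with result_file.open() as f:
        data = json.load(f)
    instances = data["instances"]  # hypothetical field name
    if not instances:
        return 0.0
    solved = sum(1 for inst in instances if inst.get("correct"))  # hypothetical flag
    return solved / len(instances)


if __name__ == "__main__":
    # Illustrative path; real result files live under llm_planning_analysis/results/.
    path = Path("llm_planning_analysis/results/blocksworld/example_model.json")
    print(f"Accuracy: {zero_shot_accuracy(path):.1%}")
```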

Submitting to the Leaderboard

To submit results for a new model, please open a pull request with the corresponding result file; the leaderboard will then be updated.

Citation(s)

PlanBench - NeurIPS 2023 Datasets and Benchmarks Track:

@article{valmeekam2023planbench,
  title={{PlanBench}: An extensible benchmark for evaluating large language models on planning and reasoning about change},
  author={Valmeekam, Karthik and Marquez, Matthew and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={38975--38987},
  year={2023}
}

On the Planning Abilities of Large Language Models - NeurIPS 2023 Spotlight:

@article{valmeekam2023planning,
  title={On the planning abilities of large language models-a critical investigation},
  author={Valmeekam, Karthik and Marquez, Matthew and Sreedharan, Sarath and Kambhampati, Subbarao},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={75993--76005},
  year={2023}
}

Planning in Strawberry Fields - A version to appear in TMLR:

@article{valmeekam2024planning,
  title={Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1},
  author={Valmeekam, Karthik and Stechly, Kaya and Gundawar, Atharva and Kambhampati, Subbarao},
  journal={arXiv preprint arXiv:2410.02162},
  year={2024}
}

Star History

Star History Chart