šŸ‘Øā€šŸ’» Awesome Code Benchmark

Awesome PRs Welcome

A comprehensive review of code-domain benchmarks for LLM research.

News

  • 🔥🔥 [2025-03-29] We have crawled all articles related to code benchmarks from the past five years.
  • 🔥🔥 [2025-03-17] We added Code Version (version-specific code generation) benchmarks.
  • 🔥🔥 [2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.

🚀 Top Code Benchmark

Code Completion & Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
HumanEval Evaluating Large Language Models Trained on Code Arxiv 2021/07 Github 🤗Dataset
MBPP Program Synthesis with Large Language Models Arxiv 2021/08 🤗Dataset
APPS Measuring Coding Challenge Competence With APPS NeurIPS 2021 Github šŸ¤—Dataset
CodeContests Competition-Level Code Generation with AlphaCode Science 2022 Github
MultiPL-E MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation TSE 2023 Github 🤗Dataset
MCoNaLa MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages EACL 2023 Findings Github 🤗Dataset
LCC LongCoder: A Long-Range Pre-trained Language Model for Code Completion ICML 2023 Github
ReCode ReCode: Robustness Evaluation of Code Generation Models ACL 2023 Github
CodeClarQA Python Code Generation by Asking Clarification Questions ACL 2023 Github Dataset
EvalPlus Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation NeurIPS 2023 Github 🤗Dataset
CrossCodeEval CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion NeurIPS 2023 Github
PLPilot PLPilot: Benchmark an Automated Programming Language Design Framework Enabled by LLMs NeurIPS 2023 Github
Buggy-HumanEval & Buggy-FixEval Large Language Models of Code Fail at Completing Code with Potential Bugs NeurIPS 2023 Github
ODEX Execution-Based Evaluation for Open-Domain Code Generation EMNLP 2023 Findings Github
- A Static Evaluation of Code Completion by Large Language Models ACL Industry Track 2023
GenCodeSearchNet GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding GenBench Workshop 2023 Github 🤗Dataset
HumanEval-X CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X SIGKDD 2023 Github
Stack-Repo RepoFusion: Training Code Models to Understand Your Repository Arxiv 2023/06 Github 🤗Dataset
COCO COCO: Testing Code Generation Systems via Concretized Instructions Arxiv 2023/08 Github
ML-Bench ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code Arxiv 2023/11 Github 🤗Dataset
RepoBench RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems ICLR 2024
StudentEval StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024 Findings Github 🤗Dataset
DevEval DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories ACL 2024 Github 🤗Dataset
LongBench LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding ACL 2024 Github 🤗Dataset
CommitPack & HumanEvalPack OctoPack: Instruction Tuning Code Large Language Models ICLR 2024 Github
BioCoder BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models Bioinformatics, July 2024 Github 🤗Dataset
MT-Bench-101 MT-Bench: How Good are LLMs at Multi-turn Question Answering ACL 2024 Github
CoderEval CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models ICSE 2024 Github
APPS+ StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ACL 2024 Github
OOP OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models ACL 2024 Findings Github 🤗Dataset
L2CEval L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models TACL 2024
ICE-Score ICE-Score: Instructing Large Language Models to Evaluate Code EACL 2024 Findings Github
HumanExtension Exploring Language Model's Code Generation Ability with Auxiliary Functions NAACL 2024 Findings Github
R2E-Eval1 R2E: Turning Any GitHub Repository into a Programming Agent Test Environment ICML 2024 Github
InfiBench InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models NeurIPS 2024 Github 🌐Website
CodeBenchGen CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks Arxiv 2024/04 Github
HALLUCODE Exploring and Evaluating Hallucinations in LLM-Powered Code Generation Arxiv 2024/04
LeetCodeEval A Performance Study of LLM-Generated Code on Leetcode EASE 2024
RobustAPI Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation AAAI 2024 Github 🤗Dataset
PYCOMMITS Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing ICLR 2024 Github
EvoEval Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM COLM 2024 Github
LLM4Decompile LLM4Decompile: Decompiling Binary Code with Large Language Models EMNLP 2024 Github
CodeAgentBench CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges ACL 2024
SAFIM Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks ICLR 2024 Github 🤗Dataset
MultiNL-H Improving Natural Language Capability of Code Large Language Model Arxiv 2024/01 Github
X-HumanEval-X Exploring Multi-Lingual Bias of Large Code Models in Code Generation Arxiv 2024/04
CatCoder Enhancing Repository-Level Code Generation with Integrated Contextual Information Arxiv 2024/06
AICoderEval AICoderEval: Improving AI Domain Code Generation of Large Language Models Arxiv 2024/06 🤗Dataset
ConCodeEval ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages Arxiv 2024/07
RealWorld-Bench What's Wrong with Your Code Generated by Large Language Models? An Extensive Study Arxiv 2024/07
AssertionBench AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation NAACL 2025 Github
CodeHalu CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification (Hallucination benchmark) AAAI 2025 Github
REval Evaluating Large Language Models with Runtime Behavior of Program Execution ICSE 2025 Github 📊LeaderBoard
BigCodeBench BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions ICLR 2025 Github 📊LeaderBoard
EvoCodeBench EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories NeurIPS 2024 Github 🤗Dataset
DynaCode DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation Arxiv 2025/03

Details of Code Completion & Code Generation Benchmarks

  • HumanEval: hand-written Python problems for function-level code completion, scored with pass@k (see the evaluation sketch after this list)
  • MBPP: text -> code; code generation
  • APPS: a benchmark for code generation from natural language specifications
  • EvalPlus: extends the HumanEval and MBPP benchmarks
  • MultiPL-E: extends the HumanEval and MBPP benchmarks to 18 languages
  • CodeClarQA: pairs of natural language descriptions and code, augmented with synthetic clarification questions and answers
  • DevEval: repo-level code generation
  • BigCodeBench: complete Split & Instruct Split
  • DynaCode: a dynamic complexity-aware code benchmark
  • Stack-Repo: repo-level code completion in Java
  • StudentEval: a benchmark of student-written prompts for code generation evaluation
  • MCoNaLa: code generation from multiple natural languages
  • LCC: long code context code completion
  • RepoBench: repo-level code auto-completion
  • ReCode: a comprehensive robustness evaluation benchmark for code generation
  • LongBench: a bilingual, multitask benchmark for long context understanding
  • CommitPack & HumanEvalPack: multilingual code editing and understanding benchmarks based on Git commits and HumanEval extensions
  • COCO: instruction-level robustness benchmark for code generation
  • ODEX: open-domain, execution-based natural language to code generation
  • BioCoder: bioinformatics code generation
  • CrossCodeEval: diverse and multilingual benchmark for cross-file code completion
  • Buggy-HumanEval & Buggy-FixEval: buggy code completion
  • MT-Bench-101: Multi-turn question answering
  • ML-Bench: repo-level ML task solving benchmark using real-world code
  • PLPilot: Benchmark automating programming language design tasks
  • CoderEval: pragmatic code generation
  • MultiNL-H: multilingual NL-to-code benchmark with keyword-guided generation
  • APPS+: enhanced version of the APPS dataset, designed for reinforcement learning in code generation
  • OOP: object-oriented programming evaluation benchmark of Python programs
  • -: static analysis-based evaluation framework for LLM code completions using ASTs
  • L2CEval: multilingual, multi-task NL-to-code benchmark including semantic parsing, math reasoning and Python programming.
  • ICE-Score: an evaluation metric for code quality without test cases or references
  • HumanExtension: auxiliary-function-based code generation benchmark
  • R2E-Eval1: repo-level programming agent benchmark from GitHub repos for evaluating static and interactive code generation systems
  • REval: evaluates code LLMs' reasoning and consistency with runtime behavior
  • InfiBench: free-form question-answering benchmark comprising 234 high-quality Stack Overflow questions across 15 programming languages
  • RobustAPI: Java API misuse benchmark from Stack Overflow for evaluating LLM code robustness
  • EvoCodeBench: evolving Python code generation benchmark from real GitHub commits
  • CodeBenchGen: scalable Python code generation benchmark built from GitHub functions and docstrings with execution-based evaluation
  • HALLUCODE: evaluate LLMs' ability to recognize and mitigate hallucinations in code generation
  • LeetCodeEval: Leetcode-based benchmark for comparing LLM vs. human code performance
  • X-HumanEval-X: exploring Multi-Lingual Bias of Large Code Models in Code Generation
  • CodeHalu: systematically evaluate code hallucinations in LLMs through execution-based verification
  • PYCOMMITS: multi-round Python code editing benchmark from real commit histories
  • CodeContests: complex programming task
  • EvoEval: a comprehensive evaluation of LLMs' coding abilities across diverse domains
  • LLM4Decompile: benchmark for evaluating binary-to-C decompilation on real-world open-source binaries
  • CatCoder: a framework for repo-level code generation in statically typed languages using code and type context
  • AICoderEval: AI task-specific code generation benchmark for LLMs in NLP, CV, and multimodal learning
  • CodeAgentBench: repo-level code generation benchmark with tool-integrated agents for real-world tasks
  • AssertionBench: assertion generation for hardware design verification
  • SAFIM: syntax-aware code completion benchmark focusing on code blocks and conditional expressions
  • GenCodeSearchNet: benchmark for evaluating LLM generalization in programming language understanding across tasks and languages
  • ConCodeEval: benchmark for assessing LLMs' understanding of code constraints in domain-specific languages like JSON and YAML

  • HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
  • CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
  • XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
  • Fine-tuning Language Models for Joint Rewriting and Completion of Code with Potential Bugs (ACL 2024 Findings)
  • PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs https://aclanthology.org/2024.findings-emnlp.996.pdf
  • ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code (It covers multiple aspects, including tasks such as code generation, code completion, API recommendation, and test case generation, and aims to comprehensively evaluate the performance of large language models in complex code scenarios.)
  • JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models
  • HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent https://arxiv.org/abs/2406.00215
  • CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios https://arxiv.org/abs/2403.19287
  • CodeScore: Evaluating Code Generation by Learning Code Execution (MBPP-ET)
  • CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models
  • Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation
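
As a concrete illustration of how execution-based benchmarks in this family are typically scored, the sketch below loads HumanEval from the Hugging Face Hub and computes pass@1 with the code_eval metric from the evaluate library. This is a minimal, unofficial sketch: it assumes the openai_humaneval dataset layout (prompt, test, entry_point fields) and a placeholder generate(prompt) function standing in for your model; harnesses such as EvalPlus add stronger sandboxing and extra tests.

```python
# Minimal pass@1 evaluation sketch for HumanEval-style benchmarks.
# Assumes `datasets` and `evaluate` are installed; `generate(prompt)` is a
# placeholder for your model's completion function.
import os

from datasets import load_dataset
from evaluate import load as load_metric

os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval executes model code; opt in explicitly

def generate(prompt: str) -> str:
    """Placeholder model call: return a completion for the given function signature."""
    raise NotImplementedError

problems = load_dataset("openai_humaneval", split="test")
code_eval = load_metric("code_eval")

predictions, references = [], []
for task in problems:
    completion = generate(task["prompt"])
    # A candidate program = prompt (signature + docstring) + model completion.
    predictions.append([task["prompt"] + completion])
    # The reference is the unit-test body plus a call to its check() entry point.
    references.append(task["test"] + f"\ncheck({task['entry_point']})")

pass_at_k, _ = code_eval.compute(references=references, predictions=predictions, k=[1])
print(pass_at_k)  # e.g. {'pass@1': 0.42}
```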

Code Efficiency

Benchmark Paper Date Github Dataset & Website & LeaderBoard
EvalPerf Evaluating Language Models for Efficient Code Generation COLM 2024 Github 🤗Dataset
EffiBench EffiBench: Benchmarking the Efficiency of Automatically Generated Code NeurIPS 2024 Github
Mercury Mercury: A Code Efficiency Benchmark for Code Large Language Models NeurIPS 2024 Github 🤗Dataset
ECCO ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? EMNLP 2024 Github 🤗Dataset
PIE Learning Performance-Improving Code Edits ICLR 2024 Github 🌐Website
ENAMEL How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark ICLR 2025 Github 🤗Dataset
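
The efficiency benchmarks above go beyond functional correctness and compare the runtime of generated code against reference solutions. The snippet below is a minimal, unofficial sketch of that idea: it times a candidate and a reference implementation on the same inputs with time.perf_counter and reports a runtime ratio. Harnesses such as EvalPerf, EffiBench, and ENAMEL control for hardware noise, input scaling, and timeouts far more carefully.

```python
# Minimal sketch: compare runtime of a candidate solution against a reference.
# Both callables are assumed to be functionally correct on the given inputs.
import time
from statistics import median
from typing import Any, Callable, Iterable

def median_time(fn: Callable[[Any], Any], inputs: Iterable[Any], repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) over several full passes."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        timings.append(time.perf_counter() - start)
    return median(timings)

def reference_sort(xs):   # reference solution
    return sorted(xs)

def candidate_sort(xs):   # model-generated solution under test (illustrative)
    return sorted(xs, reverse=True)[::-1]

inputs = [list(range(2_000, 0, -1)) for _ in range(50)]
ref_t = median_time(reference_sort, inputs)
cand_t = median_time(candidate_sort, inputs)
print(f"candidate/reference runtime ratio: {cand_t / ref_t:.2f}x")
```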

CodeFix & Bug-Fix

Benchmark Paper Date Github Dataset & Website & LeaderBoard
HumanEvalFix OctoPack: Instruction Tuning Code Large Language Models Arxiv 2023/08 Github 🤗Dataset
Socratic-Debugging Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations BEA 2023 Github
TFix & ManySStuBs4J & TSSB-3M Towards Low-Resource Automatic Program Repair with Meta-Learning and Pretrained Language Models EMNLP 2023 Github
SWT-Bench SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents NeurIPS 2024 Github 🌐Website
SWE-bench SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024 Github 🌐Website
GitBug-Java GitBug-Java: A Reproducible Benchmark of Recent Java Bugs MSR 2024 Github 🌐Website 🤗Dataset
GitBug-Actions GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions ICSE 2024 Demo Github ▶️Video
RepoBugs When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done? ICSE 2024 Track
RepoFixEval RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing Openreview 2024
DebugBench DebugBench: Evaluating Debugging Capability of Large Language Models ACL 2024 Github 🤗Dataset
Multi-Bug Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging EMNLP 2024 Findings Github
Coffee-Gym Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code EMNLP 2024 🤗Dataset
INTERVENOR INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing ACL 2024 Findings Github
StatType-SO ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs TOSEM 2024 Github
LiveCodeBench LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code ICLR 2025 Github 🌐Website 🤗Dataset 📊LeaderBoard
SWE-bench Multimodal SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? ICLR 2025 Github 🌐Website 🤗Dataset

Details of CodeFix & Bug-Fix

  • HumanEvalFix: code repair capabilities
  • SWT-Bench: evaluating LLMs on test generation for real-world software issues
  • SWE-bench: evaluating whether LLMs can resolve real-world GitHub issues (a minimal patch-and-test loop is sketched after this list)
  • SWE-bench Multimodal: Evaluate LLMs on their ability to fix bugs in visual, user-facing JavaScript software
  • GitBug-Java: automatic program repair and fault localization of Java bugs
  • GitBug-Actions: constructing reproducible bug-fix benchmarks using GitHub Actions
  • LiveCodeBench: dynamic benchmark for contamination-free evaluation of LLMs from real-world platforms
  • RepoBugs: repo-level bug-fix benchmark for evaluating LLM-based program repair with full context
  • RepoFixEval: repository-level program repair benchmark for evaluating LLMs on issue discovery, fault localization, and code fixing
  • DebugBench: evaluating LLMs' debugging capabilities across various bug categories and types
  • Multi-Bug: a dataset for evaluating LLMs on multi-bug code debugging tasks
  • Socratic-Debugging: evaluating LLMs on interactive, dialogue-based bug fixing
  • Coffee-Gym: interactive benchmark environment for evaluating LLMs on NL-guided code repair
  • INTERVENOR: interactive code repair benchmark with multi-turn learner-teacher dialogue
  • TFix & ManySStuBs4J & TSSB-3M: evaluating automatic program repair in JavaScript, Java, and Python
  • StatType-SO: benchmark for resolving imports and types in incomplete Stack Overflow code snippets
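
Repository-level repair benchmarks such as SWE-bench are typically scored by applying a model-generated patch to the project and re-running the tests that reproduce the issue. The sketch below shows that apply-and-test loop in its simplest form; it is an unofficial illustration (the real SWE-bench harness pins environments per repository and distinguishes fail-to-pass from pass-to-pass tests), and repo_dir, patch_file, and the pytest selection are placeholders.

```python
# Minimal apply-and-test sketch for repository-level bug-fix benchmarks.
# Placeholders: repo_dir points at a checked-out project, patch_file is the
# model-generated unified diff, test_ids are the tests that reproduce the bug.
import subprocess

def resolves_issue(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    # 1. Apply the candidate patch; a patch that does not apply counts as a failure.
    applied = subprocess.run(
        ["git", "apply", "--whitespace=fix", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        print("patch failed to apply:", applied.stderr.strip())
        return False
    # 2. Re-run the failing tests; the issue counts as resolved only if they now pass.
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True, timeout=900,
    )
    return tests.returncode == 0

# Example (hypothetical paths):
# resolves_issue("workdir/astropy", "workdir/patch.diff",
#                ["astropy/io/tests/test_fits.py::test_header"])
```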

Code Reasoning & Understanding

Benchmark Paper Date Github Dataset & Website & LeaderBoard
CRUXEval CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution Arxiv 2024/01 Github 📊LeaderBoard
Poor-CodeSumEval How Effectively Do Code Language Models Understand Poor-Readability Code? ASE 2024 Github 🤗Dataset
A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools TOSEM 2024 Github
CodeMMLU CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs ICLR 2025 Github 🤗Dataset 📊LeaderBoard 🌐Website
CodeJudge-Eval CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? COLING 2025 Github

Details of Code Reasoning & Understanding

  • CRUXEval: code reasoning, understanding, and execution, framed as input/output prediction over short Python functions (see the output-prediction sketch after this list)
  • CodeMMLU: code understanding and comprehension
  • CodeQueries: A Dataset of Semantic Queries over Code
  • A Benchmark for Testing the Capabilities of LLMs in Assessing the Quality of Multiple-choice Questions in Introductory Programming Education
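
CRUXEval-style evaluation asks the model to predict what a short function returns for a given input, then checks that prediction by actually executing the code. Below is a minimal, unofficial sketch of the output-prediction direction; the function, input, and predicted output are illustrative, and a production harness would sandbox the exec call.

```python
# Minimal sketch of output-prediction scoring: execute the function on the
# benchmark input and compare the true output (by repr) with the model's
# predicted output string.
def output_prediction_correct(code: str, fn_name: str, call_args: tuple, predicted: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)                 # NOTE: run inside a sandbox in practice
    actual = namespace[fn_name](*call_args)
    return repr(actual) == predicted.strip()

code = """
def f(xs):
    return [x for x in xs if x % 2 == 0][::-1]
"""
print(output_prediction_correct(code, "f", ([1, 2, 3, 4, 6],), "[6, 4, 2]"))  # True
```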

Data science

Benchmark Paper Date Github Dataset & Website & LeaderBoard
DS-1000 DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation ICML 2023 Github 🌐HomePage 🤗Dataset
ARCADE Natural Language to Code Generation in Interactive Data Science Notebooks ACL 2023 Github Dataset
DA-Code DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models EMNLP 2024 Github 🌐Website 🤗Dataset
GeoCodeBench Evaluation of Code LLMs on Geospatial Code Generation GeoAI 2024 Github
MatPlotBench MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization ACL 2024 Findings Github
SensorBench SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing HotMobile 2025 Github

Details of Data science

  • DS-1000: data science code generation over common Python libraries (an execution-based check is sketched after this list)
  • DA-Code: agent-based data science code generation tasks
  • GeoCodeBench (inferred name): evaluation benchmark for testing LLMs on geospatial code generation tasks
  • SensorBench: benchmark for evaluating LLMs on real-world sensor data processing tasks
  • MatPlotBench: evaluating LLMs on scientific data visualization through code generation and visual feedback
  • ARCADE: benchmark of multi-turn NL-to-code generation tasks in data science notebooks
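
Data-science benchmarks like DS-1000 score a generated snippet by executing it against a prepared programming context and asserting on the resulting object rather than on printed text. Below is a minimal, unofficial sketch of that idea, using pandas.testing.assert_frame_equal and an illustrative task.

```python
# Minimal execution-based check for a data-science code generation task.
# Illustrative task: "add a column `total` that sums columns a and b".
import pandas as pd
from pandas.testing import assert_frame_equal

def check(generated_code: str) -> bool:
    df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
    expected = df.assign(total=[11, 22])
    env = {"pd": pd, "df": df.copy()}
    try:
        exec(generated_code, env)           # sandbox this in a real harness
        assert_frame_equal(env["df"], expected, check_dtype=False)
        return True
    except Exception:
        return False

print(check("df['total'] = df['a'] + df['b']"))  # True
print(check("df['total'] = df['a'] - df['b']"))  # False
```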

Text2SQL

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Spider Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task EMNLP 2018 Github 🌐Homepage
Spider-DK Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization EMNLP 2021 Github
Spider-Syn Towards Robustness of Text-to-SQL Models Against Synonym Substitution ACL 2021 Github
Spider-Realistic Structure-Grounded Pretraining for Text-to-SQL NAACL 2021 Dataset
BIRD Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs NeurIPS 2023 Github 🌐Website
Dr.Spider Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness ICLR 2023 Github
ScienceBenchmark ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems VLDB Endowment 2023
BookSQL BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain NAACL 2024 Github
Archer Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning EACL 2024
EHRSQL-2024 Overview of the EHRSQL 2024 Shared Task on Reliable Text-to-SQL Modeling on Electronic Health Records ClinicalNLP 2024 Github
SecureSQL SecureSQL: Evaluating Data Leakage of Large Language Models as Natural Language Interfaces to Databases EMNLP 2024 Findings Github
BULL FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis SIGMOD/PODS 2024 Github
cwd-benchmark-data A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases GRADES-NDA 24 Github
Spider 2.0 Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows ICLR 2025 Github 🌐Website
SNAILS SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference PACMMOD 2025
SQL2Text Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text COLING 2025 Github

Details of Text2SQL

  • Spider: large-scale cross-domain text-to-SQL (an execution-match check is sketched after this list)
  • Spider 2.0: text-to-SQL over real-world enterprise workflows
  • SNAILS: benchmark for evaluating how schema identifier naturalness affects LLM-based NL-to-SQL performance
  • BIRD: large-scale text-to-SQL benchmark focusing on value comprehension and SQL efficiency in realistic industrial settings
  • SecureSQL: benchmark for evaluating sensitive data leakage risks in LLM-generated SQL
  • SQL2Text: a dataset repurposed from Text-to-SQL resources for evaluating SQL-to-natural language generation tasks
  • Spider-Syn: derived from Spider for evaluating text-to-SQL model robustness to schema-related synonym substitution in NL questions
  • Spider-Realistic: evaluating text-to-SQL models under more realistic text-table alignment conditions
  • Dr.Spider: evaluating text-to-SQL model robustness across NL, SQL, and database variations
  • BookSQL: large-scale text-to-SQL dataset for the accounting and finance domain
  • Archer: bilingual text-to-SQL dataset focused on complex reasoning types across 20 domains
  • EHRSQL-2024: text-to-SQL dataset for question answering over electronic health records focusing on reliability in clinical settings
  • Spider-DK: evaluating text-to-SQL model robustness to rarely observed domain knowledge in NL questions
  • ScienceBenchmark: NL-to-SQL benchmark for complex, domain-specific scientific databases
  • BULL: practical text-to-SQL dataset for financial analysis, covering fund, stock, and macroeconomic databases
  • cwd-benchmark-data: enterprise SQL QA benchmark in the insurance domain for evaluating LLM accuracy in real-world business scenarios
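
Most of the text-to-SQL benchmarks above report some form of execution accuracy: the predicted query is run against the database and its result set is compared with the gold query's result. Below is a minimal, unofficial sketch with SQLite; evaluators such as those for Spider and BIRD additionally handle ordering, value normalization, and efficiency metrics.

```python
# Minimal execution-accuracy check for text-to-SQL: the prediction is correct
# if it returns the same multiset of rows as the gold query.
import sqlite3
from collections import Counter

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        gold = Counter(map(tuple, conn.execute(gold_sql).fetchall()))
        try:
            pred = Counter(map(tuple, conn.execute(pred_sql).fetchall()))
        except sqlite3.Error:
            return False                      # unexecutable prediction
        return gold == pred
    finally:
        conn.close()

# Example (hypothetical database file and schema):
# execution_match("concert_singer.sqlite",
#                 "SELECT name FROM singer WHERE age > 30",
#                 "SELECT name FROM singer WHERE age >= 31")
```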

MultiModal Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
MatPlotBench MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization ACL 2024 Findings Github 🤗Dataset
MMCode MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems EMNLP 2024 Github 🤗Dataset
Drawing Pandas Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code Arxiv 2024 Github 🤗Dataset
Web2Code Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs NeurIPS 2024 Github 🤗Dataset 🌐Website
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 Github 📊LeaderBoard 🤗Dataset
Plot2Code Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots Arxiv 2024/05 Github 🤗Dataset
HumanEval-V HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks Arxiv 2024/10 Github 🌐Website 📊LeaderBoard 🤗Dataset
WebSight-Test WAFFLE: Multi-Modal Model for Automated Front-End Development Arxiv 2024/10 Github 🤗Dataset
Sketch2Code Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping Arxiv 2024/10 Github 🌐Website
Interaction2Code Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping Arxiv 2024/11 Github 🤗Dataset 📊LeaderBoard
ScratchEval ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges Arxiv 2024/11 Github 🤗Dataset
MRWeb MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs Arxiv 2024/12 Github 🤗Dataset
BigDocs-Bench BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks Arxiv 2024/12 Github 🤗Dataset 🌐Website
Image2Struct Image2Struct: Benchmarking Structure Extraction for Vision-Language Models NeurIPS 2024 Github 🌐Website 🤗Dataset
WebCode2M WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs WWW 2025 🌐Website 🤗Dataset
Design2Code Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering NAACL 2025 Github 🤗Dataset
DiagramGenBenchmark From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing CVPR 2025 Github 🌐Website 🤗Dataset
ChartMimic ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation ICLR 2025 Github 🌐Website 🤗Dataset
SVG-Bench StarVector: Generating Scalable Vector Graphics Code from Images and Text CVPR 2025 Github 🌐Website 🤗Dataset
LLM4SVG Empowering LLMs to Understand and Generate Complex Vector Graphics CVPR 2025 Github 🌐Website
ChartCoder ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation Arxiv 2025/01 Github 🤗Dataset
SlidesBench AutoPresent: Designing Structured Visuals from Scratch Arxiv 2025/01 Github 🤗Dataset
Flame-React-Eval Advancing vision-language models in front-end development via data synthesis Arxiv 2025/03 Github 🤗Dataset

Details of MultiModal Code Generation

  • ChartMimic: chart-to-code generation via cross-modal reasoning (the shared render-and-save step is sketched after this list)
  • MatPlotBench: an LLM-based agent approach and benchmark for scientific data visualization
  • MMCode: evaluating multi-modal code LLMs on visually rich programming problems
  • DiagramGenBenchmark: the first text-to-diagram benchmark, covering a variety of diagram types
  • HumanEval-V: benchmarking high-level visual reasoning with complex diagrams in coding tasks
  • Drawing Pandas: a benchmark for LLMs in generating plotting code
  • Web2Code: a large-scale webpage-to-code dataset and evaluation framework for multimodal LLMs
  • WebCode2M: a real-world dataset for code generation from webpage designs
  • MRWeb: generating multi-page, resource-aware web code from UI designs
  • Design2Code: benchmarking multimodal code generation for automated front-end engineering
  • WebSight-Test: multi-modal model evaluation for automated front-end development
  • Interaction2Code: benchmarking MLLM-based interactive webpage code generation from interactive prototyping
  • Sketch2Code: evaluating vision-language models for interactive web design prototyping
  • ScratchEval: a set of challenging visual programming questions testing the visual code reasoning ability of large multimodal models (LMMs)
  • Flame-React-Eval: assesses syntactic precision, functional correctness, and visual consistency in React code generation across a range of design specifications
  • ChartCoder: advancing multimodal large language models for chart-to-code generation
  • Plot2Code: a comprehensive benchmark for evaluating multi-modal LLMs on code generation from scientific plots
  • CodeScope: an execution-based multilingual, multitask, multidimensional benchmark for code understanding and generation
  • Image2Struct: benchmarking structure extraction for vision-language models
  • SlidesBench: the first benchmark for slide generation, with 7k training and 585 test examples derived from 310 slide decks across 10 domains
  • SVG-Bench: spans 10 datasets and 3 tasks: Image-to-SVG generation, Text-to-SVG generation, and diagram generation
  • LLM4SVG: a multi-modal benchmark for understanding and generating complex vector graphics (text/image-to-SVG)
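
Chart- and plot-oriented benchmarks in this table (e.g., Plot2Code, ChartMimic, MatPlotBench) all rely on the same primitive step: execute the generated plotting code headlessly and obtain a rendered image, which is then compared to the reference chart with pixel-, component-, or judge-based metrics. The sketch below covers only that first step and is an unofficial illustration; the Agg backend and the exec-based runner are assumptions, not any benchmark's official harness.

```python
# Minimal sketch: render model-generated matplotlib code to a PNG without a display,
# so the image can later be scored against a reference chart.
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import matplotlib.pyplot as plt

def render(generated_code: str, out_path: str) -> bool:
    """Run the generated plotting code and save whatever figure it produced."""
    try:
        exec(generated_code, {"plt": plt})   # sandbox this in a real harness
        plt.savefig(out_path, dpi=150)
        return True
    except Exception as err:
        print("rendering failed:", err)
        return False
    finally:
        plt.close("all")

candidate = """
plt.bar(["A", "B", "C"], [3, 7, 5], color="steelblue")
plt.title("Sales by region")
"""
print(render(candidate, "candidate_chart.png"))  # True, writes candidate_chart.png
```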

Security Code Generation & Test Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
RedCode RedCode: Risky Code Execution and Generation Benchmark for Code Agents NeurIPS 2024 Github 🌐Website 📊LeaderBoard
CodeWMBench CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation ACM-TURC 2024 Github
RMCBench RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code ASE 2024 Github 🤗Dataset
Tests4Py Tests4Py: A Benchmark for System Testing FSE 2024 Github
PyP4LLMSec Benchmarking the Security Aspect of Large Language Model-Based Code Generation ICSE 2024 Github
LLMSecGuard LLM Security Guard for Code EASE 2024 Github
CyberSecEval 3 CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models Arxiv 2024/08 Github

Details of Security Code Generation & Test Generation

  • RedCode: comprehensive and practical evaluations on the safety of code agents
  • CodeWMBench: benchmark for evaluating code watermarking methods in detecting AI-generated code
  • RMCBench: benchmark to assess LLMs' resistance to generating malicious code
  • Tests4Py: benchmark for evaluating system and unit test generation on real-world Python applications
  • PyP4LLMSec: Python benchmark for evaluating LLM-generated code security across real-world vulnerability types
  • LLMSecGuard: framework integrating static code analyzers with LLMs to enhance code security and benchmark LLMs' security attributes (a minimal static-scan step is sketched after this list)
  • CyberSecEval 3: benchmark suite assessing LLMs' cybersecurity risks and capabilities across eight risk areas
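
Several of these security benchmarks, LLMSecGuard in particular, pair LLM-generated code with off-the-shelf static analyzers. The sketch below shows one common way to do that with Bandit's JSON output; it is an unofficial illustration, assumes the bandit CLI is installed, and is not any benchmark's official scoring pipeline.

```python
# Minimal sketch: scan a generated Python snippet with Bandit and list findings.
# Assumes the `bandit` CLI is installed (pip install bandit).
import json
import subprocess
import tempfile

def security_findings(generated_code: str) -> list[dict]:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(generated_code)
        path = tmp.name
    scan = subprocess.run(
        ["bandit", "-f", "json", "-q", path],
        capture_output=True, text=True,
    )
    report = json.loads(scan.stdout or "{}")
    # Each result carries fields such as test_id, issue_severity, issue_text.
    return report.get("results", [])

snippet = "import subprocess\nsubprocess.call(user_cmd, shell=True)\n"
for finding in security_findings(snippet):
    print(finding["test_id"], finding["issue_severity"], finding["issue_text"])
```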

Code Translation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
TransCoder Unsupervised Translation of Programming Languages NeurIPS 2020 Github(deprecated) Github(new)
AVATAR AVATAR: A Parallel Corpus for Java-Python Program Translation ACL 2023 Findings
G-TransEval On the Evaluation of Neural Code Translation: Taxonomy and Benchmark ASE 2023 Github 🤗Dataset
CodeTransOcean CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation EMNLP 2023 Github 🤗Dataset
xCodeEval XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ACL 2024 Github 🤗Dataset
RustRepoTrans Repository-level Code Translation Benchmark Targeting Rust Arxiv 2024/11 Github 🤗Dataset
ClassEval-T Escalating LLM-based Code Translation Benchmarking into the Class-level Era Arxiv 2024/11 Github 🤗Dataset
TRANSREPO-BENCH Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation Arxiv 2025/01 Github 🤗Dataset

Details of Code Translation

  • TransCoder: code translation among C++, Java, and Python (a minimal compile-and-run check is sketched after this list)
  • AVATAR: a parallel corpus of Java and Python program translations
  • G-TransEval: evaluates the cross-lingual capabilities of neural code translation models
  • CodeTransOcean: a comprehensive multilingual benchmark for code translation
  • RustRepoTrans: a repository-level benchmark constructed from GitHub projects, focusing on translating C, Java, and Python code to Rust
  • xCodeEval: multilingual, multitask code evaluation, including translation
  • ClassEval-T: the first class-level code translation benchmark, with parallel corpora in Python, Java, and C++, practical coding tasks, high test coverage, and rich contextual dependencies
  • TransRepo-bench: a benchmark for repository-level Java-to-C# translation, featuring high-quality open-source repositories with structural skeletons, unit tests, and build configurations for fine-grained quality evaluation
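
Execution-based translation benchmarks generally accept a translated program only if it compiles and reproduces the behaviour of the source program on shared test inputs. Below is a minimal, unofficial sketch for a Python-to-C++ direction using g++; the file names and the stdin/stdout test format are illustrative, and benchmarks like ClassEval-T or TransRepo-bench use full unit-test suites instead.

```python
# Minimal sketch: compile a translated C++ candidate and compare its stdout
# with the output of the original Python program on the same test inputs.
import subprocess
import sys

def translation_matches(py_src: str, cpp_src: str, test_inputs: list[str]) -> bool:
    build = subprocess.run(["g++", "-O2", "-o", "candidate", cpp_src],
                           capture_output=True, text=True)
    if build.returncode != 0:
        print("compilation failed:", build.stderr.strip())
        return False
    for stdin_data in test_inputs:
        expected = subprocess.run([sys.executable, py_src], input=stdin_data,
                                  capture_output=True, text=True, timeout=30)
        actual = subprocess.run(["./candidate"], input=stdin_data,
                                capture_output=True, text=True, timeout=30)
        if expected.stdout.strip() != actual.stdout.strip():
            return False
    return True

# Example (hypothetical files):
# translation_matches("solution.py", "solution_translated.cpp", ["3\n1 2 3\n"])
```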

Code Version

Benchmark Paper Date Github Dataset & Website & LeaderBoard
CodeUpdateEval Automatically Recommend Code Updates: Are We There Yet? TOSEM 2024 Github 🤗Dataset
JavaVersionGenBench On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions ICPC 2024 Github 🤗Dataset
VersiCode VersiCode: Towards Version-controllable Code Generation Arxiv 2024/10 Github 🌐Website 🤗Dataset
GitChameleon GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Arxiv 2024/11 Github 🤗Dataset
LLM-Deprecated-API LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion ICSE 2025 Github 🤗Dataset
LibEvolutionEval LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation NAACL 2025
CodeUpdateArena CodeUpdateArena: Benchmarking Knowledge Editing on API Updates Arxiv 2025/02 Github 🤗Dataset
RustEvo2 RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation Arxiv 2025/03 Github 🤗Dataset

Details of Code Version

  • CodeUpdateEval: code migration with a time-wise dataset
  • JavaVersionGenBench: code completion across evolving Java versions
  • VersiCode: version-controllable code generation (a pinned-version check is sketched after this list)
  • GitChameleon: 116 version-aware Python code-completion problems with unit tests
  • LLM-Deprecated-API: deprecated-API mapping and function-level code completion
  • LibEvolutionEval: version-specific code generation
  • CodeUpdateArena: knowledge editing for API updates
  • RustEvo2: API evolution in LLM-based Rust code generation
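
Version-aware benchmarks score a completion only against the library version named in the task, because an API call can be correct for one release and deprecated or absent in the next. The sketch below is an unofficial illustration of that idea: it checks the installed version with importlib.metadata before executing the candidate against a version-specific unit test; the package name, version constraint, and test are placeholders.

```python
# Minimal sketch: run a version-specific check before scoring a generated snippet.
# Placeholders: the target package, required version prefix, and the unit test.
from importlib.metadata import PackageNotFoundError, version

def passes_under_pinned_version(package: str, required_prefix: str,
                                generated_code: str, test_code: str) -> bool:
    try:
        installed = version(package)
    except PackageNotFoundError:
        raise RuntimeError(f"{package} is not installed in this evaluation environment")
    if not installed.startswith(required_prefix):
        raise RuntimeError(f"expected {package}=={required_prefix}.*, found {installed}")
    env: dict = {}
    try:
        exec(generated_code, env)   # sandbox this in a real harness
        exec(test_code, env)        # the test raises AssertionError on failure
        return True
    except Exception:
        return False

# Example: a hypothetical task pinned to numpy 1.x, where np.trapz still exists.
# passes_under_pinned_version("numpy", "1.",
#     "import numpy as np\nresult = np.trapz([0, 1, 2])",
#     "assert result == 2.0")
```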

Industry Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
VerilogEval VerilogEval: Evaluating Large Language Models for Verilog Code Generation ICCAD 2023 Github 🤗Dataset
VGen Benchmarking Large Language Models for Automated Verilog RTL Code Generation DATE 2023 Github 🤗Dataset
RTLLM RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model ASPDAC 2024 Github 🤗Dataset
VHDL-Eval VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation LAD 2024
VHDL-Xform Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization MLCAD 2024
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation ICCAD 2024 Github 🤗Dataset
LLM4PLC LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems ICSE 2024 Github 🌐Website
Agents4PLC Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents Arxiv 2024/10 Github 🤗Dataset
A Multi-Agent Framework for Extensible Structured Text Generation in PLCs Arxiv 2024/12
Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis ASPDAC 2025
MetRex MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs ASPDAC 2025 Github 🤗Dataset

Details of Industry Code Generation

  • VerilogEval: evaluating large language models for Verilog code generation (a minimal simulation-based check is sketched after this list)
  • RTLLM: evaluates LLM-generated RTL designs on syntax, functionality, and quality metrics
  • MetRex: a benchmark for LLM reasoning over post-synthesis Verilog metrics (area, delay, power), with 25K designs and chain-of-thought prompts
  • Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
  • VHDL-Eval: a curated dataset of 202 VHDL code problems with self-verifying testbenches to assess LLM-generated hardware designs for functional correctness
  • VHDL-Xform: chain-of-descriptions prompting for improving code LLMs on VHDL code generation and summarization
  • LLM4PLC: a Structured Text (ST) dataset built from the OSCAT IEC 61131-3 library (636 valid samples), covering generation, completion, and fixing tasks
  • Agents4PLC: 23 programming tasks with 58 properties across industrial control domains, moving from natural language requirements to human-verified formal specifications and reference PLC code, enabling rigorous evaluation of syntax correctness and functional verification in industrial control systems
  • OSCAT Library + Siemens LGF Library + Siemens Competition Dataset: a suite covering open-source IEC 61131-3 Structured Text (ST) and vendor-specific Siemens SCL variants for evaluating PLC code generation methods
  • Natural language is not enough: a hierarchical benchmark for evaluating multi-modal generative models on synthesizing Verilog code from visual-linguistic inputs, covering simple to complex hardware modules
  • VGen: 17 Verilog coding problems of varying difficulty, accompanied by testbenches for functional validation
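
Most of the Verilog benchmarks above (VerilogEval, VGen, RTLLM) judge a generated module by compiling it together with a testbench and running a simulation. The sketch below does this with the open-source Icarus Verilog toolchain (iverilog / vvp); it is an unofficial illustration, the convention of the testbench printing "PASS" on success is an assumption, and the official harnesses use their own testbenches and simulators.

```python
# Minimal sketch: functionally check a generated Verilog module with Icarus Verilog.
# Assumes `iverilog` and `vvp` are on PATH and the testbench prints "PASS" on success.
import subprocess

def verilog_passes(design_file: str, testbench_file: str) -> bool:
    build = subprocess.run(
        ["iverilog", "-o", "sim.out", design_file, testbench_file],
        capture_output=True, text=True,
    )
    if build.returncode != 0:
        print("compile error:", build.stderr.strip())
        return False
    run = subprocess.run(["vvp", "sim.out"], capture_output=True, text=True, timeout=120)
    return run.returncode == 0 and "PASS" in run.stdout

# Example (hypothetical files):
# verilog_passes("generated_adder.v", "adder_tb.v")
```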

Multi-Dimension

Benchmark Paper Date Github Dataset & Website & LeaderBoard
LiveCodeBench LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code Arxiv 2024/03 Github 🤗Dataset
RACE Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models Arxiv 2024/07 Github 📊LeaderBoard

Details of Multi-Dimension

  • LiveCodeBench: self-repair, code execution, test output prediction, code generation
  • RACE: Readability, Maintainability, Correctness, and Efficiency
  • CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  • AnalogCoder: Analog Circuit Design via Training-Free Code Generation