A comprehensive review of code-domain benchmarks for LLM research.
- [2025-03-29] We have crawled all articles related to code benchmarks from the past five years.
- [2025-03-17] We have added Code Version (version-specific code generation) benchmarks.
- [2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.
- HumanEval: function completion from docstrings, evaluated with pass@k (see the sketch after this list)
- MBPP: text-to-code generation
- APPS: a benchmark for code generation from natural language specifications
- EvalPlus: extends the HumanEval and MBPP benchmarks
- MultiPL-E: extends the HumanEval and MBPP benchmarks to 18 languages
- CodeClarQA: pairs of natural language descriptions and code, with synthetic clarification questions and answers
- DevEval: repo-level code generation
- BigCodeBench: code generation evaluated on its Complete and Instruct splits
- DynaCode: a dynamic complexity-aware code benchmark
- Stack-Repo: repo-level code completion in Java
- StudentEval: a benchmark of student-written prompts for code generation evaluation
- MCoNaLa: code generation from multiple natural languages
- LCC: long code context code completion
- RepoBench: repo-level code auto-completion
- ReCode: a comprehensive robustness evaluation benchmark for code generation
- LongBench: a bilingual, multitask benchmark for long context understanding
- CommitPack & HumanEvalPack: multilingual code editing and understanding benchmarks based on Git commits and HumanEval extensions
- COCO: instruction-level robustness benchmark for code generation
- ODEX: open-domain, execution-based natural language to code generation
- BioCoder: bioinformatics code generation
- CrossCodeEval: diverse and multilingual benchmark for cross-file code completion
- Buggy-HumanEval & Buggy-FixEval: buggy code completion
- MT-Bench-101: Multi-turn question answering
- ML-Bench: repo-level ML task solving benchmark using real-world code
- PLPilot: benchmark for automating programming language design tasks
- CoderEval: pragmatic code generation
- MultiNL-H: multilingual NL-to-code benchmark with keyword-guided generation
- APPS+: enhanced version of the APPS dataset, designed for reinforcement learning in code generation
- OOP: object-oriented programming benchmark for Python programs
- -: static analysis-based evaluation framework for LLM code completions using ASTs
- L2CEval: multilingual, multi-task NL-to-code benchmark including semantic parsing, math reasoning and Python programming.
- ICE-Score: an evaluation metric for code quality without test cases or references
- HumanExtension: auxiliary-function-based code generation benchmark
- R2E-Eval1: repo-level programming agent benchmark from GitHub repos for evaluating static and interactive code generation systems
- REval: evaluates code LLMs' reasoning and consistency with runtime behavior
- InfiBench: free-form question-answering benchmark comprising 234 high-quality Stack Overflow questions across 15 programming languages
- RobustAPI: Java API misuse benchmark from Stack Overflow for evaluating LLM code robustness
- EvoCodeBench: evolving Python code generation benchmark from real GitHub commits
- CodeBenchGen: scalable Python code generation benchmark built from GitHub functions and docstrings with execution-based evaluation
- HALLUCODE: evaluate LLMs' ability to recognize and mitigate hallucinations in code generation
- LeetCodeEval: Leetcode-based benchmark for comparing LLM vs. human code performance
- X-HumanEval-X: exploring Multi-Lingual Bias of Large Code Models in Code Generation
- CodeHalu: systematically evaluate code hallucinations in LLMs through execution-based verification
- PYCOMMITS: multi-round Python code editing benchmark from real commit histories
- CodeContests: complex competitive programming tasks
- EvoEval: a comprehensive evaluation of LLMs' coding abilities across diverse domains
- LLM4Decompile: benchmark for evaluating binary-to-C decompilation on real-world open-source binaries
- CatCoder: a framework for repo-level code generation in statically typed languages using code and type context
- AICoderEval: AI task-specific code generation benchmark for LLMs in NLP, CV, and multimodal learning
- CodeAgentBench: repo-level code generation benchmark with tool-integrated agents for real-world tasks
- AssertionBench: assertion generation for hardware design verification
- SAFIM: syntax-aware code completion benchmark focusing on code blocks and conditional expressions
- GenCodeSearchNet: benchmark for evaluating LLM generalization in programming language understanding across tasks and languages
- ConCodeEval: benchmark for assessing LLMs' understanding of code constraints in domain-specific languages like JSON and YAML
- HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
- XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- Fine-tuning Language Models for Joint Rewriting and Completion of Code with Potential Bugs (ACL 2024 Findings)
- PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs https://aclanthology.org/2024.findings-emnlp.996.pdf
- ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code (covers code generation, code completion, API recommendation, and test case generation, aiming to comprehensively evaluate LLMs in complex code scenarios)
- JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models
- HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent https://arxiv.org/abs/2406.00215
- CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios https://arxiv.org/abs/2403.19287
- CodeScore: Evaluating Code Generation by Learning Code Execution (MBPP-ET)
- CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models
- Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation
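Many of the benchmarks above (HumanEval, MBPP, EvalPlus, MultiPL-E, and others) share the same execution-based protocol: sample several completions per problem, run the hidden unit tests, and report pass@k using the unbiased estimator introduced with HumanEval. A minimal sketch of that metric; the per-problem pass counts below are made up:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: completions sampled for a problem
    c: completions that passed all unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Made-up results: three problems, 20 samples each, with 5 / 0 / 12 passing.
per_problem = [(20, 5), (20, 0), (20, 12)]
for k in (1, 10):
    score = np.mean([pass_at_k(n, c, k) for n, c in per_problem])
    print(f"pass@{k} = {score:.3f}")
```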
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
EvalPerf | Evaluating Language Models for Efficient Code Generation | COLM 2024 | Github | Dataset |
EffiBench | EffiBench: Benchmarking the Efficiency of Automatically Generated Code | NeurIPS 2024 | Github | |
Mercury | Mercury: A Code Efficiency Benchmark for Code Large Language Models | NeurIPS 2024 | Github | Dataset |
ECCO | ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | EMNLP 2024 | Github | Dataset |
PIE | Learning Performance-Improving Code Edits | ICLR 2024 | Github | Website |
ENAMEL | How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | ICLR 2025 | Github | Dataset |
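The efficiency benchmarks in this table (EffiBench, Mercury, ECCO, ENAMEL) first require functional correctness and then compare the runtime of the generated solution against a reference on shared inputs; the exact scoring differs per benchmark. A rough sketch of that measurement loop, with placeholder `reference_solve`/`generated_solve` functions and inputs:

```python
import time
from statistics import median

def time_call(fn, arg, repeats: int = 5) -> float:
    """Median wall-clock seconds for fn(arg) over several repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return median(samples)

# Placeholder solutions: both must be functionally correct before timing.
def reference_solve(n): return sum(i * i for i in range(n))
def generated_solve(n): return n * (n - 1) * (2 * n - 1) // 6

test_inputs = [10_000, 100_000, 1_000_000]
assert all(generated_solve(x) == reference_solve(x) for x in test_inputs)

ratios = [time_call(reference_solve, x) / time_call(generated_solve, x)
          for x in test_inputs]
print("per-input speedup over reference:", [round(r, 1) for r in ratios])
```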
- HumanEvalFix: code repair capabilities
- SWT-Bench: evaluating LLMs on test generation for real-world software issues
- SWE-bench: evaluating whether LLMs can resolve real-world GitHub issues (see the patch-evaluation sketch after this list)
- SWE-bench Multimodal: evaluating LLMs on fixing bugs in visual, user-facing JavaScript software
- GitBug-Java: automatic program repair and fault localization of Java bugs
- GitBug-Actions: constructing reproducible bug-fix benchmarks using GitHub Actions
- LiveCodeBench: dynamic benchmark for contamination-free evaluation of LLMs from real-world platforms
- RepoBugs: repo-level bug-fix benchmark for evaluating LLM-based program repair with full context
- RepoFixEval: repository-level program repair benchmark for evaluating LLMs on issue discovery, fault localization, and code fixing
- DebugBench: evaluating LLMs' debugging capabilities across various bug categories and types
- Multi-Bug: a dataset for evaluating LLMs on multi-bug code debugging tasks
- Socratic-Debugging: evaluating LLMs on interactive, dialogue-based bug fixing
- Coffee-Gym: interactive benchmark environment for evaluating LLMs on NL-guided code repair
- INTERVENOR: interactive code repair benchmark with multi-turn learner-teacher dialogue
- TFix & ManySStuBs4J & TSSB-3M: evaluating automatic program repair in JavaScript, Java, and Python
- StatType-SO: benchmark for resolving imports and types in incomplete Stack Overflow code snippets
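SWE-bench and the other repository-level repair benchmarks above share a common evaluation loop: check out the buggy revision, apply the model-generated patch, and re-run the project's test suite (SWE-bench distinguishes FAIL_TO_PASS tests, which the fix must turn green, from PASS_TO_PASS tests, which must stay green). A simplified sketch of that loop, assuming a Python project tested with pytest; the repo path, commit, patch file, and test IDs are placeholders:

```python
import subprocess
from pathlib import Path

def run(cmd, cwd) -> bool:
    """Run a command inside the repository; True on exit code 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def evaluate_patch(repo: Path, base_commit: str, patch_file: Path,
                   fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Reset to the buggy base revision, then apply the model's patch.
    if not run(["git", "checkout", "-f", base_commit], repo):
        return False
    if not run(["git", "apply", str(patch_file)], repo):
        return False  # the patch does not even apply
    # The fix must make the failing tests pass without breaking passing ones.
    return all(run(["python", "-m", "pytest", "-q", t], repo)
               for t in fail_to_pass + pass_to_pass)

# Placeholder usage:
# evaluate_patch(Path("repo"), "abc123", Path("model.patch"),
#                ["tests/test_bug.py::test_fixed"], ["tests/test_core.py"])
```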
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
CRUXEval | CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Arxiv 2024/01 | Github | LeaderBoard |
Poor-CodeSumEval | How Effectively Do Code Language Models Understand Poor-Readability Code? | ASE 2024 | Github | Dataset |
- | A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools | TOSEM 2024 | Github | |
CodeMMLU | CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs | ICLR 2025 | Github | Dataset LeaderBoard Website |
CodeJudge-Eval | CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? | COLING 2025 | Github | |
- CRUXEval: code reasoning, understanding, and execution capabilities (see the execution-prediction example after this list)
- CodeMMLU: code understanding and comprehension
- CodeQueries: A Dataset of Semantic Queries over Code
- A Benchmark for Testing the Capabilities of LLMs in Assessing the Quality of Multiple-choice Questions in Introductory Programming Education
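CRUXEval frames code understanding as execution reasoning: given a short Python function and an input, predict the output (CRUXEval-O), or given the function and an output, find a consistent input (CRUXEval-I); predictions are checked by simply executing the code. A toy illustration of the output-prediction check; the function and the predictions here are invented:

```python
def check_output_prediction(func, given_input, predicted_output) -> bool:
    """CRUXEval-O style check: does the model's predicted output match
    what the code actually returns on the given input?"""
    return func(given_input) == predicted_output

# Made-up benchmark item: the model must reason that the expression keeps
# only characters appearing exactly once, preserving their order.
def f(s: str) -> str:
    return "".join(ch for ch in s if s.count(ch) == 1)

print(check_output_prediction(f, "abcabd", "cd"))    # True: correct reasoning
print(check_output_prediction(f, "abcabd", "abcd"))  # False: wrong prediction
```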
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML 2023 | Github | HomePage Dataset |
ARCADE | Natural Language to Code Generation in Interactive Data Science Notebooks | ACL 2023 | Github | Dataset |
DA-Code | DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | EMNLP 2024 | Github | Website Dataset |
GeoCodeBench | Evaluation of Code LLMs on Geospatial Code Generation | GeoAI 2024 | Github | |
MatPlotBench | MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization | ACL 2024 Findings | Github | |
SensorBench | SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | HotMobile 2025 | Github |
- DS-1000: Data Science Code Generation
- DA-Code: Data science tasks
- GeoCodeBench (inferred name): benchmark for evaluating LLMs on geospatial code generation tasks
- SensorBench: benchmark for evaluating LLMs on real-world sensor data processing tasks
- MatPlotBench: evaluating LLMs on scientific data visualization through code generation and visual feedback
- ARCADE: benchmark of multi-turn NL-to-code generation tasks in interactive data science notebooks
- Spider: text-to-SQL (see the execution-accuracy sketch after this list)
- Spider 2.0: text-to-SQL
- SNAILS: benchmark for evaluating how schema identifier naturalness affects LLM-based NL-to-SQL performance
- BIRD: large-scale text-to-SQL benchmark focusing on value comprehension and SQL efficiency in realistic industrial settings
- SecureSQL: benchmark for evaluating sensitive data leakage risks in LLM-generated SQL
- SQL2Text: a dataset repurposed from Text-to-SQL resources for evaluating SQL-to-natural language generation tasks
- Spider-Syn: derived from Spider for evaluating text-to-SQL model robustness to schema-related synonym substitution in NL questions
- Spider-Realistic: evaluating text-to-SQL models under more realistic text-table alignment conditions
- Dr.Spider: evaluating text-to-SQL model robustness across NL, SQL, and database variations
- BookSQL: large-scale text-to-SQL dataset for the accounting and finance domain
- Archer: bilingual text-to-SQL dataset focused on complex reasoning types across 20 domains
- EHRSQL-2024: text-to-SQL dataset for question answering over electronic health records focusing on reliability in clinical settings
- Spider-DK: evaluating text-to-SQL model robustness to rarely observed domain knowledge in NL questions
- ScienceBenchmark: NL-to-SQL benchmark for complex, domain-specific scientific databases
- BULL: practical text-to-SQL dataset for financial analysis, covering fund, stock, and macroeconomic databases
- cwd-benchmark-data: enterprise SQL QA benchmark in the insurance domain for evaluating LLM accuracy in real-world business scenarios
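Most of the text-to-SQL benchmarks above (Spider, BIRD, the Spider robustness variants) report execution accuracy: a predicted query counts as correct when it returns the same result set as the gold query on the target database. A self-contained sketch against a throwaway SQLite database; the schema, rows, and queries are illustrative only:

```python
import sqlite3

def execution_match(db: sqlite3.Connection, predicted: str, gold: str) -> bool:
    """Execution accuracy: compare result multisets, ignoring row order."""
    try:
        pred_rows = db.execute(predicted).fetchall()
    except sqlite3.Error:
        return False  # the predicted SQL does not even run
    gold_rows = db.execute(gold).fetchall()
    return sorted(map(tuple, pred_rows)) == sorted(map(tuple, gold_rows))

# Toy database standing in for a benchmark's target schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
db.executemany("INSERT INTO singer VALUES (?, ?, ?)",
               [("A", "France", 30), ("B", "France", 41), ("C", "Japan", 25)])

# NL question: "How many singers are from France?"
gold = "SELECT COUNT(*) FROM singer WHERE country = 'France'"
pred = "SELECT COUNT(name) FROM singer WHERE country = 'France'"
print(execution_match(db, pred, gold))  # True: different SQL, same result
```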
- ChartMimic: Evaluating LMMs' Cross-Modal Reasoning Capability via Chart-to-Code Generation (see the rendering sketch after this list)
- MatPlotBench: an LLM-based agent approach for scientific data visualization
- MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems
- DiagramGenBenchmark: the first text-to-diagram task benchmark, covering a variety of diagram types
- HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
- Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
- WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs
- MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs
- Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
- WebSight-Test: Multi-Modal Model for Automated Front-End Development
- Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping
- Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
- ScratchEval: a set of highly challenging questions designed to test large multimodal models' (LMM) visual code reasoning ability
- Flame-React-Eval: assesses syntactic precision, functional correctness, and visual consistency in React code generation across a range of design specifications
- ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
- Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
- SLIDESBENCH: the first benchmark for slide generation, with 7k training and 585 test examples derived from 310 slide decks across 10 domains
- SVG-Bench: 10 datasets and 3 tasks: Image-to-SVG generation, Text-to-SVG generation, and diagram generation
- LLM4SVG: a multi-modal code generation benchmark for text/image-to-SVG synthesis
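The chart- and UI-to-code benchmarks above (ChartMimic, Plot2Code, Design2Code, and others) typically execute the model-generated rendering code and then compare the produced image with the reference, using visual metrics or an LMM judge. A minimal sketch of the execute-and-render half using matplotlib; the generated snippet is a stand-in and the image comparison itself is omitted:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# Stand-in for code emitted by a model asked to reproduce a reference bar chart.
generated_code = """
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["2021", "2022", "2023"], [12, 18, 25], color="steelblue")
ax.set_title("Annual sales")
fig.savefig(OUTPUT_PATH, dpi=100)
"""

def render(code: str, output_path: str) -> bool:
    """Execute generated plotting code in a scratch namespace; report success."""
    try:
        exec(code, {"OUTPUT_PATH": output_path})
        return True
    except Exception:
        return False
    finally:
        plt.close("all")

print(render(generated_code, "candidate.png"))
# A real harness would now compare candidate.png with the reference image.
```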
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
RedCode | RedCode: Risky Code Execution and Generation Benchmark for Code Agents | NeurIPS 2024 | Github | Website LeaderBoard |
CodeWMBench | CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation | ACM-TURC 2024 | Github | |
RMCBench | RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code | ASE 2024 | Github | Dataset |
Tests4Py | Tests4Py: A Benchmark for System Testing | FSE 2024 | Github | |
PyP4LLMSec | Benchmarking the Security Aspect of Large Language Model-Based Code Generation | ICSE 2024 | Github | |
LLMSecGuard | LLM Security Guard for Code | EASE 2024 | Github | |
CyberSecEval 3 | CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models | Arxiv 2024/08 | Github |
- RedCode: comprehensive and practical evaluations on the safety of code agents
- CodeWMBench: benchmark for evaluating code watermarking methods in detecting AI-generated code
- RMCBench: benchmark to assess LLMs' resistance to generating malicious code
- Tests4Py: benchmark for evaluating system and unit test generation on real-world Python applications
- PyP4LLMSec: Python benchmark for evaluating LLM-generated code security across real-world vulnerability types (see the static-check sketch after this list)
- LLMSecGuard: framework integrating static code analyzers with LLMs to enhance code security and benchmark LLMs' security attributes
- CyberSecEval 3: benchmark suite assessing LLMs' cybersecurity risks and capabilities across eight risk areas
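Security-oriented entries such as PyP4LLMSec and LLMSecGuard pair LLM outputs with static analysis to flag insecure patterns. A toy AST-based check over generated Python code; the deny-list below is purely illustrative and far smaller than what any real analyzer uses:

```python
import ast

# Illustrative deny-list; real benchmarks rely on full static analyzers.
RISKY_CALLS = {"eval", "exec", ("os", "system"), ("pickle", "loads")}

def risky_calls(source: str) -> list[str]:
    """Return the names of flagged calls found in the given Python source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        fn = node.func
        if isinstance(fn, ast.Name) and fn.id in RISKY_CALLS:
            findings.append(fn.id)
        elif (isinstance(fn, ast.Attribute) and isinstance(fn.value, ast.Name)
              and (fn.value.id, fn.attr) in RISKY_CALLS):
            findings.append(f"{fn.value.id}.{fn.attr}")
    return findings

generated = "import os\nos.system(user_cmd)\nresult = eval(expr)\n"
print(risky_calls(generated))  # ['os.system', 'eval']
```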
- TransCoder: code translation in C++, Java, Python
- AVATAR: a parallel corpus of Java and Python program translations
- G-TransEval: evaluating the cross-lingual capabilities of neural code translation models
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- RustRepoTrans: repository-level translation benchmark constructed from GitHub projects, focusing on translating C, Java, and Python code to Rust
- xCodeEval: multilingual, multitask code evaluation benchmark covering understanding, generation, translation, and retrieval
- ClassEval-T: the first class-level code translation benchmark with parallel corpora in Python, Java, and C++, featuring practical coding tasks, high test coverage, and rich contextual dependencies.
- TransRepo-bench: a benchmark for repository-level code translation from Java to C#, featuring high-quality open-source repositories with structural skeletons, unit tests, and build configurations to enable fine-grained quality evaluation
- CodeUpdateEval: code migration with a time-wise dataset
- JavaVersionGenBench: code completion across evolving Java versions
- VersiCode: Version-controllable Code Generation
- GitChameleon: 116 version-aware Python code-completion problems with unit tests
- LLM-Deprecated-API: deprecated API mapping and function code completion
- LibEvolutionEval: version-specific code generation (see the version-check sketch after this list)
- CodeUpdateArena: API Update Knowledge Editing Assessment
- RustEvo2: API Evolution in LLM-based Rust Code Generation
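The version-specific benchmarks above (VersiCode, GitChameleon, LLM-Deprecated-API, LibEvolutionEval, and others) hinge on whether generated code targets the pinned library version, e.g. avoiding APIs that were later removed or renamed. A small illustrative check that scans generated code for known-removed attributes of an installed dependency; the removed-API table is hand-written for this demo (np.float and np.int were removed in NumPy 1.24):

```python
import ast
from importlib.metadata import PackageNotFoundError, version

# Hand-written table for illustration: NumPy attributes removed in release 1.24.
REMOVED_APIS = {("np", "float"), ("np", "int")}

def removed_api_uses(source: str) -> list[str]:
    """Return attribute accesses in the generated code that hit the removed-API table."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if (node.value.id, node.attr) in REMOVED_APIS:
                hits.append(f"{node.value.id}.{node.attr}")
    return hits

try:
    installed = version("numpy")  # the version the generated code must target
except PackageNotFoundError:
    installed = "not installed"

generated = "import numpy as np\nx = np.float(3.5)\n"
print(f"numpy {installed}: removed APIs used -> {removed_api_uses(generated)}")
```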
- VerilogEval: evaluating large language models for Verilog code generation (see the simulation sketch after this list)
- RTLLM: evaluates LLM-generated RTL designs across syntax, functionality, and quality metrics
- MetRex: a benchmark for LLM reasoning on Verilog post-synthesis metrics (area, delay, power) using 25K designs with Chain-of-Thought prompts
- Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
- VHDL-Eval: A curated dataset of 202 VHDL code problems with self-verifying testbenches to assess LLM-generated hardware designs for functional correctness
- VHDL-Xform: Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
- LLM4PLC: Structured Text (ST) dataset built from the OSCAT IEC 61131-3 library (636 valid samples), covering three task types: generation, completion, and fixing
- Agents4PLC: 23 programming tasks with 58 properties across industrial control domains, transitioning from natural language requirements to human-verified formal specifications and reference PLC code, enabling rigorous evaluation of syntax correctness and functional verification in industrial control systems
- OSCAT Library + Siemens LGF Library + Siemens Competition Dataset: a suite covering open-source IEC 61131-3 Structured Text (ST) and vendor-specific Siemens SCL variants for evaluating PLC code generation methods
- Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation: a hierarchical benchmark for evaluating multi-modal generative models in synthesizing Verilog code from visual-linguistic inputs, covering simple to complex hardware modules
- VGen: 17 Verilog coding problems of varying difficulty, accompanied by testbenches for functional validation
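VerilogEval, RTLLM, VHDL-Eval and the other hardware benchmarks above judge functional correctness by simulating the generated design against a testbench. A rough sketch of that pattern driven from Python, assuming Icarus Verilog (iverilog/vvp) is installed and that the testbench prints PASS when all checks succeed; the file names are placeholders:

```python
import subprocess
from pathlib import Path

def simulate(design: Path, testbench: Path, workdir: Path) -> bool:
    """Compile the generated design with its testbench and run the simulation."""
    binary = workdir / "sim.out"
    compiled = subprocess.run(
        ["iverilog", "-o", str(binary), str(design), str(testbench)],
        capture_output=True, text=True)
    if compiled.returncode != 0:
        return False  # the generated RTL does not even compile
    run = subprocess.run(["vvp", str(binary)], capture_output=True, text=True)
    return run.returncode == 0 and "PASS" in run.stdout

# Placeholder usage:
# simulate(Path("generated_adder.v"), Path("adder_tb.v"), Path("."))
```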
Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
LiveCodeBench | LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code | Arxiv 2024/03 | Github | Dataset |
RACE | Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Arxiv 2024/07 | Github | LeaderBoard |
- LiveCodeBench: self-repair, code execution, test output prediction, code generation
- RACE: Readability, Maintainability, Correctness, and Efficiency
- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
- AnalogCoder: Analog Circuit Design via Training-Free Code Generation