
Commit 0bde9fd

update
1 parent 9522ab8 commit 0bde9fd

29 files changed (+243, -64 lines)

README.md (+10, -13)

@@ -1,10 +1,7 @@
-# CodeLLM Paper <img src='https://img.shields.io/github/stars/PurCL/CodeLLMPaper' width="120" height="26" />
+# CodeLLM Paper
 
 This repository provides a curated list of research papers focused on Large Language Models (LLMs) for code. It aims to facilitate researchers and practitioners in exploring the rapidly growing body of literature on this topic. The papers are systematically collected from various top-tier venues, categorized, and labeled for easier navigation.
 
-## 📍 News
-- :star: We updated the papers published in CCS 2024, USENIX Sec 2024, and NDSS 2025. (03/05/2025)
-
 ## Table of Contents
 
 - [A. Venues](#a-venues)
@@ -70,9 +67,9 @@ The papers in this repository are categorized along three dimensions: **Applicat
 This category focuses on typical tasks in Software Engineering (SE) and Programming Languages (PL).
 
 - [General Coding Task](data/papers/labels/general_coding_task.md) (32)
-- [Code Generation](data/papers/labels/code_generation.md) (197)
+- [Code Generation](data/papers/labels/code_generation.md) (198)
 - [Program Synthesis](data/papers/labels/program_synthesis.md) (83)
-- [Code Completion](data/papers/labels/code_completion.md) (22)
+- [Code Completion](data/papers/labels/code_completion.md) (23)
 - [Program Repair](data/papers/labels/program_repair.md) (41)
 - [Program Transformation](data/papers/labels/program_transformation.md) (31)
 - [Program Testing](data/papers/labels/program_testing.md) (55)
@@ -88,16 +85,16 @@ This category focuses on typical tasks in Software Engineering (SE) and Programm
 - [Debugging](data/papers/labels/debugging.md) (9)
 - [Bug Reproduction](data/papers/labels/bug_reproduction.md) (2)
 - [Vulnerability Exploitation](data/papers/labels/vulnerability_exploitation.md) (6)
-- [Static Analysis](data/papers/labels/static_analysis.md) (143)
+- [Static Analysis](data/papers/labels/static_analysis.md) (145)
 - [Syntactic Analysis](data/papers/labels/syntactic_analysis.md) (1)
 - [Pointer Analysis](data/papers/labels/pointer_analysis.md) (3)
 - [Call Graph Analysis](data/papers/labels/call_graph_analysis.md) (2)
 - [Data-flow Analysis](data/papers/labels/data-flow_analysis.md) (8)
 - [Type Inference](data/papers/labels/type_inference.md) (3)
-- [Specification Inference](data/papers/labels/specification_inference.md) (15)
+- [Specification Inference](data/papers/labels/specification_inference.md) (16)
 - [Equivalence Checking](data/papers/labels/equivalence_checking.md) (1)
 - [Code Similarity Analysis](data/papers/labels/code_similarity_analysis.md) (5)
-- [Bug Detection](data/papers/labels/bug_detection.md) (73)
+- [Bug Detection](data/papers/labels/bug_detection.md) (74)
 - [Program Verification](data/papers/labels/program_verification.md) (20)
 - [Program Optimization](data/papers/labels/program_optimization.md) (4)
 - [Program Decompilation](data/papers/labels/program_decompilation.md) (9)
@@ -116,7 +113,7 @@ This category focuses on typical tasks in Software Engineering (SE) and Programm
 
 This category concentrates on the LLMs' ability in understanding different forms of code and the non-functional properties of the LLMs (e.g., security and robustness). We also consider how to utilize the LLMs for general reasoning problems, such as typical agent-centric designs and specific PL designs for LLMs.
 
-- [Code Model](data/papers/labels/code_model.md) (111)
+- [Code Model](data/papers/labels/code_model.md) (112)
 - [Code Model Training](data/papers/labels/code_model_training.md) (84)
 - [Source Code Model](data/papers/labels/source_code_model.md) (64)
 - [IR Code Model](data/papers/labels/IR_code_model.md) (5)
@@ -136,8 +133,8 @@ This category concentrates on the LLMs' ability in understanding different forms
 
 This category includes studies on benchmarks, empirical evaluations, and surveys. The papers that do not belong to the following three categories are purely technical papers.
 
-- [Benchmark](data/papers/labels/benchmark.md) (45)
-- [Empirical Study](data/papers/labels/empirical_study.md) (78)
+- [Benchmark](data/papers/labels/benchmark.md) (47)
+- [Empirical Study](data/papers/labels/empirical_study.md) (79)
 - [Survey](data/papers/labels/survey.md) (18)
 
 ## D. How to Contribute
@@ -171,4 +168,4 @@ To facilitate timely batch updates to the paper repository, we prefer to utilize
 
 This paper repository is intended solely for research purposes. All raw data is sourced from publicly available information on ACM, IEEE, and corresponding conference websites. Any content involving additional copyright information, including full PDF versions of the papers, is not disclosed in this repository.
 
-For any questions or suggestions, please contact [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])
+For any questions or suggestions, please contact [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])

data/labeldata/labeldata.json (+62)

@@ -6516,6 +6516,68 @@
         ],
         "url": "https://arxiv.org/pdf/2311.13721"
     },
+    "Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models": {
+        "type": "article",
+        "key": "shaohua_arxiv25",
+        "title": "Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models",
+        "author": "Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li",
+        "journal": "arXiv preprint arXiv:2503.06643",
+        "year": "2025",
+        "venue": "arXiv2025",
+        "abstract": "In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input, i.e., programs, with various semantic-preserving mutations to build a syntactically new while semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks can resist against the data contamination problem.",
+        "labels": [
+            "benchmark"
+        ],
+        "url": "https://arxiv.org/abs/2503.06643"
+    },
+    "KNighter: Transforming Static Analysis with LLM-Synthesized Checkers": {
+        "type": "article",
+        "key": "knighter_2025",
+        "title": "KNighter: Transforming Static Analysis with LLM-Synthesized Checkers",
+        "author": "Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, Lingming Zhang",
+        "journal": "arXiv preprint arXiv:2503.09002v1",
+        "year": "2025",
+        "venue": "arXiv2025",
+        "abstract": "Static analysis is a powerful technique for bug detection in critical systems like operating system kernels. However, designing and implementing static analyzers is challenging, timeconsuming, and typically limited to predefined bug patterns. While large language models (LLMs) have shown promise for static analysis, directly applying them to scan large codebases remains impractical due to computational constraints and contextual limitations. We present KNighter, the first approach that unlocks practical LLM-based static analysis by automatically synthesizing static analyzers from historical bug patterns. Rather than using LLMs to directly analyze massive codebases, our key insight is leveraging LLMs to generate specialized static analyzers guided by historical patch knowledge. KNighter implements this vision through a multi-stage synthesis pipeline that validates checker correctness against original patches and employs an automated refinement process to iteratively reduce false positives. Our evaluation on the Linux kernel demonstrates that KNighter generates high-precision checkers capable of detecting diverse bug patterns overlooked by existing human-written analyzers. To date, KNighter-synthesized checkers have discovered 70 new bugs/vulnerabilities in the Linux kernel, with 56 confirmed and 41 already fixed. 11 of these findings have been assigned CVE numbers. This work establishes an entirely new paradigm for scalable, reliable, and traceable LLM-based static analysis for real-world systems via checker synthesis.",
+        "labels": [
+            "static analysis",
+            "bug detection"
+        ],
+        "url": "https://arxiv.org/pdf/2503.09002"
+    },
+    "Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference": {
+        "type": "article",
+        "key": "knighter_2025",
+        "title": "Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference",
+        "author": "Thanh Le-Cong, Bach Le, Toby Murray",
+        "journal": "arXiv preprint arXiv:2503.04779",
+        "year": "2025",
+        "venue": "arXiv2025",
+        "abstract": "Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.",
+        "labels": [
+            "static analysis",
+            "specification inference",
+            "benchmark",
+            "empirical study"
+        ],
+        "url": "https://arxiv.org/abs/2503.04779"
+    },
+    "Type-Aware Constraining for Code LLMs": {
+        "type": "article",
+        "key": "knighter_2025",
+        "title": "Type-Aware Constraining for Code LLMs",
+        "author": "Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, Martin Vechev",
+        "journal": "ICLR 2025 workshop",
+        "year": "2025",
+        "venue": "ICLR2025",
+        "abstract": "Large Language Models (LLMs) have achieved notable success in code generation. However, they still frequently produce invalid code, as they do not precisely model formal aspects of programming languages. Constrained decoding is a promising approach to alleviate this issue and has been successfully applied to domain-specific languages and syntactic features, but is not able to enforce more semantic features, such as well-typedness. To address this issue, we introduce type-aware constrained decoding. We develop a novel prefix automata formalism and introduce a sound approach to guarantee existence of a type-safe completion of a partial program based on type inference and a search over inhabitable types. We implement type-aware constraining first for a foundational simply-typed language, then extend it to TypeScript. In our evaluation across state-of-the-art open-weight LLMs of up to 34B parameters and various model families, type-aware constraining reduces compilation errors by on average 70.9% and increases functional correctness by 16.2% in code synthesis, translation, and repair tasks.",
+        "labels": [
+            "code generation",
+            "code completion",
+            "code model"
+        ],
+        "url": "https://openreview.net/forum?id=DNAapYMXkc"
+    },
     "Evaluating the Effectiveness of Small Language Models in Detecting Refactoring Bugs": {
         "type": "article",
         "key": "Rohit2025arXiv",

data/papers/labels/agent_design.md (+1, -1)

@@ -22,7 +22,7 @@
 - **Labels**: [code generation](code_generation.md), [code completion](code_completion.md), [source code model](source_code_model.md), [agent design](agent_design.md), [prompt strategy](prompt_strategy.md), [retrieval-augmented generation](retrieval-augmented_generation.md)
 
 
-- [Hierarchical Repository-Level Code Summarization for Business Applications Using Local LLMs](../venues/arXiv2025/paper_7.md), ([arXiv2025](../venues/arXiv2025/README.md))
+- [Hierarchical Repository-Level Code Summarization for Business Applications Using Local LLMs](../venues/arXiv2025/paper_10.md), ([arXiv2025](../venues/arXiv2025/README.md))
 
 - **Abstract**: In large-scale software development, understanding the functionality and intent behind complex codebases is critical for effective development and maintenance. While code summarization has been widely studied, existing methods primarily focus on smaller code units, such as functions, and struggle with larger code artifacts like files and packages. Additionally, current summarization models tend to emphasize low-level implementation details, often overlooking the domain and business context that ...
 - **Labels**: [static analysis](static_analysis.md), [code summarization](code_summarization.md), [agent design](agent_design.md), [prompt strategy](prompt_strategy.md), [retrieval-augmented generation](retrieval-augmented_generation.md)

data/papers/labels/benchmark.md (+5)

@@ -212,6 +212,11 @@
 
 ## Static Analysis
 
+- [Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference](../venues/arXiv2025/paper_5.md), ([arXiv2025](../venues/arXiv2025/README.md))
+
+- **Abstract**: Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires bo...
+- **Labels**: [static analysis](static_analysis.md), [specification inference](specification_inference.md), [benchmark](benchmark.md), [empirical study](empirical_study.md)
+
 - [CompilerGym: robust, performant compiler optimization environments for AI research](../venues/CGO2022/paper_1.md), ([CGO2022](../venues/CGO2022/README.md))
 
 - **Abstract**: Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can ser...

data/papers/labels/bug_detection.md (+8, -2)

@@ -60,7 +60,7 @@
 - **Labels**: [static analysis](static_analysis.md), [bug detection](bug_detection.md), [agent design](agent_design.md)
 
 
-- [Combining Large Language Models with Static Analyzers for Code Review Generation](../venues/arXiv2025/paper_4.md), ([arXiv2025](../venues/arXiv2025/README.md))
+- [Combining Large Language Models with Static Analyzers for Code Review Generation](../venues/arXiv2025/paper_7.md), ([arXiv2025](../venues/arXiv2025/README.md))
 
 - **Abstract**: Code review is a crucial but often complex, subjective, and time-consuming activity in software development. Over the past decades, significant efforts have been made to automate this process. Early approaches focused on knowledge-based systems (KBS) that apply rule-based mechanisms to detect code issues, providing precise feedback but struggling with complex, context-dependent cases. More recent work has shifted toward fine-tuning pre-trained language models for code review, enabling broader is...
 - **Labels**: [static analysis](static_analysis.md), [bug detection](bug_detection.md)
@@ -120,7 +120,7 @@
 - **Labels**: [static analysis](static_analysis.md), [bug detection](bug_detection.md)
 
 
-- [Evaluating the Effectiveness of Small Language Models in Detecting Refactoring Bugs](../venues/arXiv2025/paper_3.md), ([arXiv2025](../venues/arXiv2025/README.md))
+- [Evaluating the Effectiveness of Small Language Models in Detecting Refactoring Bugs](../venues/arXiv2025/paper_6.md), ([arXiv2025](../venues/arXiv2025/README.md))
 
 - **Abstract**: Popular IDEs frequently contain bugs in their refactoring implementations. Ensuring that a transformation preserves a program's behavior is a complex task. Traditional detection methods rely on predefined preconditions for each refactoring type, limiting their scalability and adaptability to new transformations. These methods often require extensive static and dynamic analyses, which are computationally expensive, time-consuming, and may still fail to detect certain refactoring bugs. This study ...
 - **Labels**: [static analysis](static_analysis.md), [bug detection](bug_detection.md)
@@ -198,6 +198,12 @@
 - **Labels**: [static analysis](static_analysis.md), [bug detection](bug_detection.md), [code model](code_model.md), [code model training](code_model_training.md), [binary code model](binary_code_model.md)
 
 
+- [KNighter: Transforming Static Analysis with LLM-Synthesized Checkers](../venues/arXiv2025/paper_4.md), ([arXiv2025](../venues/arXiv2025/README.md))
+
+- **Abstract**: Static analysis is a powerful technique for bug detection in critical systems like operating system kernels. However, designing and implementing static analyzers is challenging, timeconsuming, and typically limited to predefined bug patterns. While large language models (LLMs) have shown promise for static analysis, directly applying them to scan large codebases remains impractical due to computational constraints and contextual limitations. We present KNighter, the first approach that unlocks p...
+- **Labels**: [static analysis](static_analysis.md), [bug detection](bug_detection.md)
+
+
 - [LAMD: Context-driven Android Malware Detection and Classification with LLMs](../venues/arXiv2025/paper_1.md), ([arXiv2025](../venues/arXiv2025/README.md))
 
 - **Abstract**: The rapid growth of mobile applications has escalated Android malware threats. Although there are numerous detection methods, they often struggle with evolving attacks, dataset biases, and limited explainability. Large Language Models (LLMs) offer a promising alternative with their zero-shot inference and reasoning capabilities. However, applying LLMs to Android malware detection presents two key challenges: (1)the extensive support code in Android applications, often spanning thousands of class...
