Abstract
CL4SE presents a comprehensive benchmark for evaluating context learning in software engineering tasks, demonstrating significant performance improvements across code generation, summarization, review, and patch assessment through four distinct context types.
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics. Extensive experiments demonstrate that context learning yields an average performance improvement of 24.7% across all tasks. Specifically, procedural context boosts code review performance by up to 33% (Qwen3-Max), mixed positive-negative context improves patch assessment by 30% (DeepSeek-V3), project-specific context increases code summarization BLEU by 14.78% (GPT-Oss-120B), and interpretable examples enhance code generation PASS@1 by 5.72% (DeepSeek-V3). CL4SE establishes the first standardized evaluation framework for SE context learning, provides actionable empirical insights into task-specific context design, and releases a large-scale dataset to facilitate reproducible research in this domain.
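The abstract's one-to-one mapping between context types and SE tasks can be pictured with a small sketch of how context-augmented prompts might be assembled for evaluation. All identifiers below (e.g., `CONTEXT_TASK_MAP`, `build_prompt`) are illustrative assumptions for exposition, not the benchmark's actual code or API.

```python
# Illustrative sketch (hypothetical names): pairing each CL4SE context type
# with its representative SE task and building a context-augmented prompt.

from dataclasses import dataclass

# Mapping taken from the abstract: each context type pairs with one task.
CONTEXT_TASK_MAP = {
    "interpretable_examples": "code_generation",         # evaluated with PASS@1
    "project_specific": "code_summarization",            # evaluated with BLEU
    "procedural_decision_making": "code_review",
    "positive_negative": "patch_correctness_assessment",
}

@dataclass
class Sample:
    task: str
    instruction: str  # task description shown to the model
    context: str      # retrieved context of the matching type; "" for the no-context baseline

def build_prompt(sample: Sample) -> str:
    """Compose a prompt; an empty context reduces to the baseline setting."""
    if sample.context:
        return f"### Context\n{sample.context}\n\n### Task\n{sample.instruction}"
    return f"### Task\n{sample.instruction}"

# Usage: compare the same sample with and without its mapped context type.
s = Sample(task="code_generation",
           instruction="Implement a function that parses semantic version strings.",
           context="Example 1: input '1.2.3' -> (1, 2, 3) ...")
baseline_prompt = build_prompt(Sample(s.task, s.instruction, ""))
context_prompt = build_prompt(s)
```

Under this reading, test-time gains come purely from what is prepended to the prompt, which is why no fine-tuning is involved.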
Community
Context Learning Benchmark for Software Engineering Tasks
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering (2026)
- Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review (2026)
- Parameter-Efficient Multi-Task Fine-Tuning in Code-Related Tasks (2026)
- AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context (2026)
- Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs (2026)
- TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance (2026)
- KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development? (2026)