Can AI Only Write Code in Python? A New Benchmark Lifts the Lid
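Benchmarks for AI code generation have long tested only Python, yet real-world projects rely on a dozen or so languages such as Java, C++, and Go. The researchers converted the Python problems from LiveCodeBench into equivalent tasks in other languages, covering twelve languages in total (Python included), and evaluated 24 models. Many models score well on Python but fall off sharply once the language changes, a pattern the paper describes as "Python overfitting." To illustrate with hypothetical numbers: a model with 80% accuracy on Python might drop to 30% on C++. This is not a tool you can start using tomorrow, but it carries a lesson: do not trust blanket claims that an AI "can write code"; ask which languages it can write code in.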
📄 Original abstract (English)
LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.