Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation

doi:10.48550/arXiv.2401.03855

Source

arXiv

Date Issued

2024-01-01

DOI

10.48550/arXiv.2401.03855

Abstract

As code generation from human descriptions using large language models (LLMs) has grown in popularity, several benchmarks have been proposed to assess the capabilities of existing and emerging models. This study presents a large-scale human evaluation of HumanEval and MBPP, two widely used benchmarks for Python code generation, focusing on their diversity and difficulty. Our findings reveal a significant bias towards a limited number of programming concepts, with most concepts having negligible or no representation. Additionally, we identify a worryingly high proportion of easy programming questions, which may lead to an overestimation of model performance on code generation tasks.

URI

https://d8.irins.org/handle/IITG2025/19862