
Panorama of AI programming ability evaluation: from HumanEval to SWE-bench, the evolution and selection of benchmarks

Public benchmarks are not a decoration for model rankings, but a measurement tool for understanding the boundaries of AI programming capabilities. This article starts from benchmarks such as HumanEval, APPS, CodeContests, SWE-bench, LiveCodeBench and Aider, and explains how to read the rankings, how to choose benchmarks, and how to convert public evaluations into the team's own Coding Mentor evaluation system.

Meta

Published: 3/30/2026 · Category: Interpretation · Reading time: 19 min

Copyright Statement and Disclaimer: This article is an interpretive synthesis based on public materials for HumanEval, SWE-bench, LiveCodeBench, APPS, CodeContests, EvalPlus, and the Aider leaderboards. Copyright in the original materials belongs to their respective authors and research institutions.

Original reference

Nature of this article: this is not a benchmark-by-benchmark digest. It juxtaposes multiple sources and distills evaluation dimensions and a selection framework, to help readers build a systematic understanding of benchmarking.


Opening: a benchmark is not a leaderboard, it is a tool for measuring boundaries

If you accepted the judgment in Part 1, that developers must not only be users of AI programming assistants but also play the role of Coding Mentor, then you will soon hit the second question: how do you judge the boundaries of AI capability?

Checking whether the output runs once is not enough. Watching vendor demos is not enough. Relying on community reputation is not enough. AI programming ability is not a single score; it spans at least function generation, algorithmic reasoning, code editing, real-repository repair, test generation, fault localization, context utilization, cost, and stability.

This is where public benchmarks earn their value. HumanEval, APPS, CodeContests, SWE-bench, LiveCodeBench, and the Aider leaderboards do not hand readers a permanent answer to "who is strongest"; they provide a set of measurement methods: different benchmarks place AI in different task environments to reveal under what conditions it is reliable and under what conditions it fails.

Figure: AI programming benchmark capability map

When reading a benchmark, the most important thing is not to memorize some leaderboard ranking but to answer three questions:

| Question | Why it matters |
| --- | --- |
| What tasks does this benchmark actually measure? | Both are called "coding ability", but function generation and real-repository repair are not the same ability. |
| What does this benchmark not measure? | The uncovered portion cannot be inferred from the score. |
| Does it transfer to your team's scenario? | Public scores serve only as external reference; they cannot replace a private eval. |

This article follows that reading method. You will see how benchmarks evolved from small function problems to real GitHub issues, which questions different benchmarks are suited to answer, and how to turn public benchmarks into your team's own evaluation set.

Panoramic comparison of six categories of benchmarks

The development of open benchmarks can be roughly understood as a move from "function-level correctness" to "real engineering tasks". This is not a simple difficulty upgrade but a change in what is assessed: from a single function, to algorithm-competition problems, to issues in real repositories, to continuously updated problems and code-editing scenarios.

| Benchmark | Publisher/maintainer | Core task | Main question answered | Typical blind spot |
| --- | --- | --- | --- | --- |
| HumanEval | OpenAI | Complete Python functions from a signature and description | Whether the model has basic function-generation ability | Context, engineering constraints, and contamination risk |
| EvalPlus | Community-maintained | Extended HumanEval/MBPP test cases | Whether the original function problems were overestimated by weak tests | Still mainly function-level tasks |
| APPS | Stanford | Competition-style programming problems | Whether the model can do algorithmic reasoning and handle complex inputs | Competitive ability does not equal engineering ability |
| CodeContests | DeepMind | Programming-competition problems with multi-language submissions | Whether the model can handle hard algorithmic challenges | No real code-base maintenance context |
| SWE-bench | Princeton/UCB | Fix issues in real open-source repositories | Whether the model can understand a code base and generate verifiable patches | High environment cost; the task distribution still has boundaries |
| LiveCodeBench | UCB/MIT/Cornell | Continuously collect new problems and evaluate by time slice | Whether the model is still reliable on new problems | Mostly programming problems, not equivalent to an enterprise code base |
| Aider Leaderboards | Paul Gauthier | Code editing in a code-base context | Whether the model suits daily editing and multi-language modification | Results depend on the tool workflow and task setup |

This table carries an important takeaway: no public benchmark can stand alone as a proxy for "AI programming ability". Each benchmark measures one facet; only by stacking multiple facets together can you approach the capability boundary you actually care about.

HumanEval: The starting point for function-level generation capabilities

HumanEval's historical significance is that it turned "can the model write code" into a repeatable evaluation question. Each problem provides a function signature, a docstring, and a few examples; the model must complete the function implementation, and hidden tests then decide whether it passes.
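To make the format concrete, here is a minimal HumanEval-style task, constructed for illustration rather than taken from the actual dataset: the model sees only the signature, docstring, and examples, while the check function stays hidden and decides pass or fail.

```python
# Hypothetical HumanEval-style problem (not from the real dataset).
# The model is shown everything above `pass` and must fill in the body.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
    pass  # <- the model completes this


# Hidden test, unseen by the model, used only for grading.
def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([2, 1, 0]) == [2, 2, 2]
    assert candidate([]) == []
```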

It is suited to answering a very narrow but fundamental question: can the model write a standalone function with clear inputs and outputs?

| Design point | Takeaway for readers |
| --- | --- |
| Tasks are small | Basic capability assessment should start with isolable tasks |
| Automated tests | Evaluation conclusions must rest on repeatable tests |
| pass@k | Code generation is stochastic; one failed sample does not mean the model can never solve the problem (see the estimator sketch after this table) |
| Hidden tests | Correctness cannot be judged from public examples alone |
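The pass@k number itself is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples, c of them correct, budget k.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable product form."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# pass_at_k(200, 10, 1) == 0.05: with 10/200 correct, one draw passes 5% of the time.
```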

HumanEval also has obvious limitations. The problem count is small, the set has circulated for a long time, and the risk of data contamination is high; the tasks are mostly standalone functions, which cannot represent the dependencies, state, compatibility, and architectural constraints of a real code base; the language coverage is also narrow. EvalPlus mitigates the weak-test problem by adding stronger test cases, but it does not change the function-level nature of the evaluation.
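A constructed example of the gap EvalPlus targets: the candidate below passes a sparse HumanEval-style test, and only an added edge case exposes the bug.

```python
def median(xs: list[float]) -> float:
    """Buggy candidate: only correct for odd-length input."""
    return sorted(xs)[len(xs) // 2]

# Weak, HumanEval-style sparse test: the bug goes unnoticed.
assert median([3.0, 1.0, 2.0]) == 2.0

# EvalPlus-style extra case: even length exposes the bug.
assert median([1.0, 2.0, 3.0, 4.0]) == 2.5  # fails: the candidate returns 3.0
```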

For the Coding Mentor, the value of HumanEval is not that you copy its 164 problems, but the reminder that the minimum evaluation unit needs a clear contract and automated acceptance. When you design your team's eval seed set, you likewise need a batch of basic tasks that are clearly bounded, low cost, and repeatable, to quickly find the model's lower bound.

APPS vs. CodeContests: What algorithm challenges can and cannot test

APPS and CodeContests raise assessment to competition-level programming problems. Compared with HumanEval, these tasks usually have more complex statements, larger input ranges, stricter time constraints, and lean more heavily on algorithm composition and boundary handling.

They are suited to answering whether the model can do complex reasoning, choose algorithms, and survive intensive testing.

| Value | Explanation |
| --- | --- |
| Clearly layered difficulty | Competition problems come in natural tiers: introductory, competition, Olympiad |
| Stronger test cases | Many problems have been polished against edge cases by large numbers of contestant submissions |
| Exposes reasoning weaknesses | Complex constraints, extreme inputs, and combined algorithms make models more error-prone |

But an engineering reader must not equate competitive ability with software-engineering ability. Competition problems reward producing an accepted solution within a time limit; real projects care more about maintainability, readability, local modification, compatibility, security constraints, and long-term evolution. A model that excels at competition problems cannot necessarily change permission logic safely in your business code; a model with a mediocre competition ranking may still help with tests, documentation changes, or partial refactoring.

This type of benchmark is therefore better used as a complex-reasoning stress test than as the sole basis for enterprise selection.

SWE-bench: From writing code to solving real problems

SWE-bench's key change is shifting evaluation from "write a function" to "fix an issue in a real repository". Given the issue description and the code-base context, the model must locate the relevant files, understand the cause of the failure, generate a patch, and have it verified by tests.
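Conceptually the grading loop is apply-then-verify. The sketch below is schematic rather than the actual SWE-bench harness: paths, commands, and test identifiers are placeholders, but it captures the idea that issue-reproducing tests must flip to passing while existing tests must keep passing.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str,
                   fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Schematic apply-and-verify loop (placeholder paths and commands)."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly

    def tests_pass(test_ids: list[str]) -> bool:
        result = subprocess.run(["pytest", *test_ids],
                                cwd=repo_dir, capture_output=True)
        return result.returncode == 0

    # Issue-reproducing tests must now pass, and existing tests must not break.
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```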

This type of task is closer to real software engineering because it examines multiple capabilities simultaneously:

| Ability | How SWE-bench exercises it |
| --- | --- |
| Requirements understanding | Identify the real problem to fix from the issue description |
| Code navigation | Find the relevant call chain in a multi-file code base |
| Root-cause analysis | Distinguish surface errors from the real cause of failure |
| Minimal fix | Modify only the necessary locations; avoid large irrelevant refactors |
| Regression awareness | Fix the target problem without breaking existing behavior |

Its lesson for the Coding Mentor is direct: to evaluate whether AI can participate in real delivery, you cannot hand it a standalone problem; you must place it inside a code base with issues, tests, and review constraints.

SWE-bench is still not a complete image of production, however. It depends on specific open-source projects and reproducible test environments, and its task distribution is not your business distribution. Leaderboard results are also shaped by the agent framework, tool invocation, context strategy, and evaluation configuration. When reading SWE-bench, don't look only at the resolve rate: look at the task setup, whether the Verified subset was used, whether extra tools were allowed, whether multiple repair rounds were permitted, and which failure types the unsolved samples concentrate in.

LiveCodeBench: Why time slicing matters

When a public benchmark is used over and over for training, tuning, and marketing, it faces data contamination. If a model performs well on old problems, that may reflect improved capability, or it may mean it saw similar problems during training. LiveCodeBench's important contribution is to bring the time dimension into evaluation: it continuously collects newly released problems and reports model performance by release-time slice.
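The core mechanic is easy to sketch, assuming a simple result record with a release date and a solved flag (field names are illustrative):

```python
from datetime import date

def sliced_solve_rates(results: list[dict], cutoff: date) -> tuple[float, float]:
    """Solve rate before vs. after a model's training cutoff.
    A large drop on post-cutoff problems hints at contamination-inflated scores."""
    def rate(rs: list[dict]) -> float:
        return sum(r["solved"] for r in rs) / len(rs) if rs else 0.0
    before = [r for r in results if r["released"] <= cutoff]
    after = [r for r in results if r["released"] > cutoff]
    return rate(before), rate(after)
```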

This has two implications for readers.

First, a public score is time-dependent. A model that does well on early problems is not necessarily reliable on new ones. On problems released after its training cutoff, the model can lean far less on memorization and must depend on genuine reasoning and generation ability.

Second, task types should be read separately. Code generation, self-repair, code-execution prediction, and test-output prediction are not the same capability. A model may be good at generating solutions but bad at predicting what existing code will do; good at fixing small errors but bad at building a complete solution from scratch.

The Coding Mentor should keep time-slicing thinking in team evaluations as well. Don't leave the assessment set unchanged for years, and don't lump all tasks into one total score. New requirements, recent incidents, newly fixed defects, and recent changes to coding standards should enter the candidate task pool regularly and, after curation, become new eval samples.

Aider Leaderboards: Code editing is closer to daily development than building from scratch

Much real development work is not writing code from scratch but editing existing code: changing an interface, adding a test, migrating an API, adjusting a configuration, fixing a regression. The Aider leaderboards focus on this editing ability.

There are essential differences between code editing tasks and function generation tasks:

| Dimension | Generate from scratch | Code editing |
| --- | --- | --- |
| Input | Requirements statement plus a little context | Existing files, call relationships, modification instructions |
| Risk | The generated result may be incomplete | Modifications may break existing behavior |
| Key capabilities | Completion, inference, syntactic correctness | Localization, local modification, preserving style, reducing irrelevant diff (see the sketch after this table) |
| Verification | Unit tests or sample tests | Tests, diff review, regression checks |
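One way to make "reducing irrelevant diff" measurable is to count changed lines, as in this minimal sketch. It is a crude size proxy only; judging whether a change is truly relevant still needs human or test-based review.

```python
import difflib

def diff_stats(before: str, after: str) -> dict[str, int]:
    """Count added/removed lines in a unified diff — a crude size proxy
    for the 'minimal modification' dimension of an editing task."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="")
    added = removed = 0
    for line in diff:
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"added": added, "removed": removed}
```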

If your team mainly uses AI for day-to-day code modification, code-editing benchmarks are more valuable than pure generation benchmarks. They still cannot replace a private eval, because editing success rates depend strongly on the tool chain, the context-injection method, the language ecosystem, and the task type.

For the Coding Mentor, the significance of leaderboards like Aider's is the reminder: don't evaluate only "can it write"; also evaluate "can it change less, change accurately, and change verifiably".

How to choose a benchmark that suits your scenario

When reading benchmarks, start from "what decision do you want to make" rather than "which leaderboard is more popular". Different decisions require different evidence.

Figure: Selection path from public benchmarks to team private evals

| Your question | Preferred public reference | What to add internally |
| --- | --- | --- |
| Does the model have basic code-generation ability? | HumanEval / EvalPlus | Small tasks in the team's common languages and frameworks |
| Can the model handle complex algorithms and boundaries? | APPS / CodeContests | Business-related data structures, performance, and extreme inputs |
| Can the model fix real repository problems? | SWE-bench / SWE-bench Verified | The team's historical bug fixes, test escapes, and review-rejection samples |
| Is the model overestimated by old problems? | LiveCodeBench | Recent requirements, recent failures, and recent specification changes |
| Is the model suitable for daily code editing? | Aider Leaderboards | Local modifications and multilingual tasks in real code bases |
| Should we buy or switch tools? | Cross-reference multiple benchmarks | Cost, latency, security, permissions, data governance, and developer experience |

When choosing benchmarks there is a simple rule: a public benchmark answers "where does this model stand in the industry landscape"; a private eval answers "can this model enter our delivery process". Both questions matter, and neither can replace the other.

Metric selection: don't be carried away by a single total score

Many leaderboards publish an overall score or pass rate, but what teams really need is a capability profile. The higher the total score, the more easily it conceals local risks. A model may be strong at function generation yet weak at navigating a real repository; it may have a high edit success rate yet cost and latency unsuited to large-scale use.

| Metric | What it measures | How it can mislead |
| --- | --- | --- |
| pass@k | Probability of passing tests within k samples | Says nothing about code quality, maintainability, or security |
| Solve rate | End-to-end resolution rate on real tasks | A binary outcome conceals the cause of failure (see the tally sketch after this table) |
| Edit success rate | Whether a code-editing task succeeded | Depends on the tool chain and task setup |
| Regression rate | Whether a modification broke existing behavior | Needs a sufficiently strong test suite |
| Cost / latency | Per-task delivery cost and waiting time | Cannot represent quality on its own |
| Human review load | Manual review and rework cost | Requires long-term team records |
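Because a binary solve rate hides why runs fail, it helps to tally failure causes alongside it. A minimal sketch, with illustrative category labels:

```python
from collections import Counter

def failure_profile(runs: list[dict]) -> dict[str, float]:
    """Share of runs failing for each cause. Labels are illustrative,
    e.g. 'understanding', 'localization', 'implementation', 'testing',
    'regression' — match them to your own review taxonomy."""
    failures = Counter(r["failure_cause"] for r in runs if not r["solved"])
    return {cause: n / len(runs) for cause, n in failures.items()}
```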

An evaluation report for the Coding Mentor should ideally output a capability profile, not just a ranking:

| Capability dimension | Evidence to record |
| --- | --- |
| Functional correctness | Test pass rate, failing cases, verification results after fixes |
| Context utilization | Whether the right files were found; whether call chains and existing constraints were understood |
| Minimal modification | Diff scope, proportion of irrelevant changes, whether extra dependencies were introduced |
| Quality risk | Concurrency, security, performance, compatibility, and maintainability issues |
| Collaboration cost | Prompt rounds, manual edits, reasons for review rejection |

Such a report has more engineering value than "Model A scores 3 points higher than Model B", because it directly determines which tasks the AI should be deeply involved in, which it should only draft, and which it must stay out of.

How to convert public benchmarks into team private evals

Public benchmarks don't solve team problems directly, but they supply design principles. A team's internal evaluation set doesn't need to be large at first; what matters is that the tasks come from real work and that every sample can be verified, reviewed, and routed.

Figure: From public benchmarks to a private-eval implementation closed loop

An implementable private eval can start from 30 to 60 tasks, covering four types of samples:

| Sample type | Source | Assessment focus |
| --- | --- | --- |
| Small function implementation | Recently completed low-risk requirements | Requirements understanding, implementation completeness, testing awareness |
| Bug fixes | Historical bugs, test escapes, and production-incident postmortems | Root-cause localization, minimal fix, regression risk |
| Test enhancement | Boundary conditions missed in the past | Boundary identification, case design, assertion quality |
| Code review | Review rejections, architectural-constraint violations, security issues | Risk identification, explanation ability, correction suggestions |

Each task should retain at least six categories of information (a schema sketch follows the table):

| Field | Purpose |
| --- | --- |
| Task description | Make clear what the model must solve and what it must not |
| Code revision | Ensure the task is reproducible |
| Context boundary | Which files, documents, and tools may be accessed |
| Acceptance criteria | Define what pass, partial pass, and fail mean |
| Reference fix or reference feedback | Basis for human review and later selection of training candidates |
| Sensitive-data handling result | Keep private data, credentials, and customer information out of evaluation and training |
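As a sketch of how those six fields might be captured per sample (field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    """One private-eval sample; fields mirror the six categories above."""
    task_id: str
    description: str           # what the model must solve, and must not
    code_revision: str         # commit hash, so the task is reproducible
    context_scope: list[str]   # files, documents, and tools allowed as context
    acceptance: str            # what pass / partial pass / fail mean
    reference: str             # reference fix or review feedback
    sanitized: bool = False    # sensitive data scrubbed before eval or training
```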

The goal here is not to copy SWE-bench but to learn its spirit: put the model into real context and judge the result with verifiable evidence. A team's internal eval must sit closer to its own language stack, frameworks, coding standards, release process, and risk boundaries.

How to read a leaderboard correctly: task definition first, rankings second

As of 2026-03, model release cadence, inference modes, tool-calling capabilities, and billing are all changing rapidly. No article should freeze a leaderboard snapshot into a long-term conclusion. The safer approach is to record the method for reading leaderboards.

| Evaluation entry point | What to check on the leaderboard | Question it can answer |
| --- | --- | --- |
| HumanEval / EvalPlus | Whether scores are near saturation, whether stronger tests exist, contamination risk | Is the model's basic function generation acceptable? |
| SWE-bench Verified | Issue sources, verification method, agent setup, whether extra tools are allowed | Can the model handle real repository fixes? |
| LiveCodeBench | Release-time slices, task category, performance after the training cutoff | Is the model still reliable on new problems? |
| Aider Leaderboards | Multilingual edit success rate, cost, context strategy | Is the model suitable for daily code editing? |
| Team private eval | Business languages, frameworks, security rules, review rejections, and cost | Can the model enter your delivery process? |

It is recommended to fix the leaderboard-reading order into four steps:

| Step | Question to settle |
| --- | --- |
| First, read the task definition | What does this leaderboard measure, and what does it not? |
| Second, read the evaluation configuration | Did the model use tools, agents, extra context, or multiple attempts? |
| Third, read the failure samples | Do failures concentrate in understanding, localization, implementation, testing, or regression? |
| Finally, run a private re-check | Do the results transfer to the team's real tasks? |

If you skip the first three steps and look only at rankings, it is easy to misread an external leaderboard as an internal procurement conclusion.

Capability Boundaries: What Public Benchmarks Collectively Reveal

Viewed together, these benchmarks support a more robust judgment: AI programming capability is sufficient to take on many local tasks, but it cannot do without the task contracts, evaluation criteria, and acceptance mechanisms of a human Mentor.

AI is generally better at:

| Scenario | Reason |
| --- | --- |
| Small functions with clear inputs and outputs | Task boundaries are clear and tests are easy to automate |
| Common algorithms and data structures | Rich training corpus; stable model behavior |
| Local code editing | The modification scope is controllable and the feedback cycle is short |
| Standard error fixes | Error messages, stack traces, and common repair patterns are well represented |

AI is more often error-prone in:

| Scenario | Risk |
| --- | --- |
| Fuzzy business semantics | The model readily fills in assumptions that sound reasonable but don't match the business |
| Cross-module architectural changes | Implicit constraints break easily when context is incomplete |
| Performance-sensitive paths | Passing correctness tests does not mean acceptable resource consumption |
| Security and permission boundaries | Small changes may widen access or bypass verification |
| Low-test-coverage code bases | Without strong validation, errors are hard to catch in time |

This leads straight back to the main thread of this series: benchmarks give you an external mirror; the Coding Mentor gives you an internal judgment system. Without benchmarks, you will tend to choose models by feel; with only benchmarks, you will tend to mistake public scores for production reliability.

Conclusion: benchmarks are the entrance, the private eval is the starting point

The value of open benchmarks is not to make the final decision for the team, but to help you establish a measurement language. HumanEval shows how function-level tasks are evaluated automatically; APPS and CodeContests show how complex reasoning applies pressure; SWE-bench shows that real repository tasks must enter a context-and-test closed loop; LiveCodeBench shows why time slicing and contamination control matter; and Aider shows that code-editing ability deserves separate measurement.

But the Coding Mentor's final job is to turn these public lessons into the team's own evaluation system:

| Public benchmarks provide | The team must supply |
| --- | --- |
| Standard task paradigms | Real business tasks and code-base context |
| Automated verification ideas | Team acceptance criteria and review rules |
| Capability-boundary reference | Private risk boundaries and usage policies |
| Horizontal model comparison | Cost, latency, security, and governance decisions |

So the conclusion of Part 2 is simple: don't worship leaderboards, and don't ignore them. Public benchmarks help you see the outer boundary of AI programming capability; the private eval answers whether it can enter your project's delivery process.

When we move to Part 3, the problem narrows further: since public benchmarks cannot directly replace team assessment, how should a set of high-quality programming problems and task samples be designed?

