Article
How to design high-quality programming questions: from question surface to evaluation contract
High-quality programming questions are not longer prompts but assessment contracts that stably expose the boundaries of ability. Starting from Bloom's taxonomy levels, difficulty calibration, task contracts, test design, and question bank management, this article explains how to build a reproducible question system for an AI Coding Mentor.
Copyright statement and disclaimer
This article is a synthesized interpretation drawing on Bloom's cognitive taxonomy, the APPS benchmark design method, and the LiveCodeBench evaluation framework. Copyright of the original works belongs to their respective authors and research institutions.
Statement of attribution of opinions
The hierarchical design framework, difficulty grading standards, test case design principles, and anti-template strategies proposed in this article are the author's original synthesis of theoretical research and engineering practice.
Original reference
- Bloom Taxonomy — Benjamin Bloom et al.
- APPS Benchmark — Hendrycks et al., UC Berkeley
- LiveCodeBench Methodology — Jain et al., UC Berkeley
Original nature
This article is not a paragraph-by-paragraph translation; it combines educational theory with AI assessment practice to reconstruct a question design methodology suited to practical application.
Opening: a good question is not a longer description, but clearer constraints
When teams evaluate AI programming capability for the first time, they naturally start with a requirement that "looks like a programming problem": write a sorting function, implement a cache, validate a form, or repair a piece of existing code. Such questions get experiments started quickly, but they are hard to sustain as long-term evaluations. The reason is not that they are too simple; it is that task boundaries, inputs and outputs, constraints, acceptance criteria, and failure conditions are never stated in the question.
For human engineers, vague requirements get a second chance at clarification: a candidate can ask about sort fields, stability, exceptional input, performance requirements, and business priorities. An AI programming assistant rarely asks; it simply fills in the missing information with defaults: default language, default data structure, default boundary handling, default exception policy. It appears to have completed the task, but it has really just made assumptions on the team's behalf about everything the question left unwritten.
This is the core problem of question design from the Coding Mentor perspective: a question that humans consider "basically understandable" may not be an evaluable task for AI. An evaluable task must minimize implicit assumptions so that the model's output primarily reflects ability, not its luck in guessing the question setter's intent.
An effective criterion is whether the question can be rerun by different people, with different models, at different times, and still yield comparable scores. If not, the question is not an evaluation asset; it is just one-time interaction material.
Assessment boundaries: from traditional questions to ability levels
Why traditional question-setting methods do not transfer directly to AI assessment
Traditional programming questions usually serve teaching, interviews, or training. They assume a human reader who can interpret context, and an evaluator who can supplement explanations, answer follow-up questions, and observe the debugging process. AI evaluation is different: the model's input is the question, its output is code, an explanation, or a fix patch, and scoring usually relies on tests, static checks, manual rubrics, or records of multi-round feedback.
This creates three differences.
First, the question needs to be closed. A question for humans can leave room for interpretation, but an AI assessment question cannot hide key conditions between the lines. "Handle exceptions reasonably" is not a scorable requirement; "when the input is empty, return an empty list; when the page number is out of bounds, return an empty result and keep the total count" is.
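As a minimal illustration, the scorable version of that pagination rule can be written directly as assertions. The `paginate_items` interface below is hypothetical and exists only to show how each sentence of the requirement maps to a checkable expectation.

```python
# Hypothetical interface for illustration: returns (page_items, total),
# where total always counts the full input regardless of the requested page.
def paginate_items(items, page, page_size):
    total = len(items)
    start = (page - 1) * page_size
    return items[start:start + page_size], total

# "When the input is empty, return an empty list."
assert paginate_items([], page=1, page_size=10) == ([], 0)

# "When the page number is out of bounds, return an empty result and keep the total."
assert paginate_items([1, 2, 3], page=5, page_size=10) == ([], 3)
```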
Second, the question needs to resist pattern matching. The model has seen a huge number of public algorithm problems and common business snippets. Once a question falls into a familiar template, the model may skip analyzing the requirements and recall a memorized standard solution. High-quality questions do not manufacture difficulty for its own sake; they use business constraints, combined boundaries, and verification mechanisms to expose the gaps in template solutions.
Third, the questions need to be calibrated. AI capability updates quickly, and the pass rate of the same question can vary significantly across models, tool permissions, and context budgets. The difficulty of a question cannot rest on the question setter's intuition; it must be tied to the assessment settings, pass rate distribution, failure types, and version records.
Therefore, designing programming questions for AI is not just writing questions; it is designing an evaluation contract. The contract's job is to convert "what do I hope the model will do" into "what evidence can I observe", and then into "how do I judge whether it meets the standard".
From knowledge points to competency levels: Engineering usage of Bloom’s taxonomy
Bloom's cognitive taxonomy breaks learning objectives into six levels: remember, understand, apply, analyze, evaluate, and create. It was originally an educational framework, but it has strong engineering value in AI programming assessment because it reminds question setters not to write every question as "implement some function".
Different levels correspond to different evidence of ability.
| Bloom level | Evidence of programming ability | What the question observes | Risk in AI assessment |
|---|---|---|---|
| Remember | Recall of syntax, APIs, and concept names | Whether a language feature or library function is known | Easily covered by the training corpus; low discrimination |
| Understand | Explaining conceptual differences and applicable scenarios | Whether concepts such as shallow copy, transaction isolation, or idempotency can be explained | Answers may read smoothly yet lack engineering constraints |
| Apply | Implementing algorithms or business logic as required | Whether a known method can be turned into code | Best suited to automated testing, but easily hit by templates |
| Analyze | Decomposing complex requirements and locating bottlenecks and boundaries | Whether dependencies, state, and failure paths can be identified | Needs finer rubrics and evidence records |
| Evaluate | Deciding between alternatives and justifying the choice | Whether technical judgments and risk trade-offs can be made | Cannot be graded with unit tests alone |
| Create | Designing new structures, protocols, or system solutions | Whether open problems can be organized and structured | Requires manual review and long-term evaluation |
Most AI programming assessments fail not because the questions are too easy, but because the ability levels are muddled. The question claims to test "architectural ability", but the test only verifies that a function returns the correct list; it claims to test "problem decomposition", but the score depends only on whether the code runs; it claims to test "engineering judgment", but the rubric does not distinguish security, maintainability, performance, and observability.
A Coding Mentor should answer one question before writing the question: what evidence of ability is this question supposed to surface? If the goal is the apply level, give a clear interface, inputs and outputs, and automated tests; if the goal is the evaluate level, give multiple candidate solutions, conflicting constraints, and a required rationale; if the goal is the create level, clarify the architectural boundaries, non-functional requirements, review criteria, and unacceptable risks.
Difficulty and task contracts
Difficulty grading system: design standards from Easy to Hard
Difficulty is not the length of the question or the line count of the reference answer. A 20-line function may be hard because of complex boundary states, while a 200-line scaffold may be mere mechanical fill-in-the-blank. For AI evaluation, a more meaningful definition of difficulty is how many judgments the model must make, how many constraints are combined, and how many failure modes are exposed under a fixed evaluation budget.
Difficulty can be split into two layers.
The first layer is the human reference scale, used to estimate the size of a question: does the reference implementation take a skilled engineer roughly 15 minutes, 45 minutes, or longer? This helps the team judge whether a question is oversized, but it does not transfer directly to AI. AI does not run on human time; its effective capability depends on context length, tool calls, test feedback, retry count, and model type.
The second layer is the AI evaluation setting, used to fix the experimental conditions: whether the model may read multi-file context, whether it may run tests, whether failure logs are provided, whether a single repair attempt is allowed, and whether external documentation is accessible. Without fixing these conditions, the labels Easy, Medium, and Hard are not comparable.
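One way to make these conditions explicit is to pin them as a small, versioned record that every run of the question references. The sketch below is an assumption about what such a record might contain, not a fixed schema.

```python
# Illustrative evaluation-budget record for a Medium question; field names are assumptions.
MEDIUM_BUDGET = {
    "context": "single_file",     # may the model read multi-file context?
    "tool_calls": False,          # may the model invoke tools (shell, interpreter)?
    "test_feedback_rounds": 1,    # rounds of failing-test feedback returned to the model
    "repair_attempts": 1,         # follow-up fixes allowed after feedback
    "external_docs": False,       # may the model consult external documentation?
}
```

Two pass rates are comparable only when they were produced under the same budget record.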
| Dimension | Easy | Medium | Hard |
|---|---|---|---|
| Target ability | Correct application of a single concept | Multi-concept combination and boundary handling | Complex state, cross-module constraints, or solution trade-offs |
| Human reference scale | Usually within 15 minutes | Usually 15 to 45 minutes | Usually more than 45 minutes |
| AI evaluation budget | Single-round generation, no test feedback | Single round of generation plus one round of test feedback | Multi-file context, tool calls, and limited repairs allowed |
| Number of concepts | 1 core concept | 2 to 3 core concepts | 4+ concepts or multiple subsystems |
| Reference implementation form | A short function or local logic | A function group or small module | Multi-function, multi-state, or cross-module changes |
| Boundary conditions | Empty input, single element, duplicate values | Combined conditions, pagination out of bounds, sorting stability | Extreme data, degenerate paths, concurrency or consistency issues |
| Automated scoring | Unit tests are largely sufficient | Unit tests plus boundary and performance tests | A combination of tests, rubrics, manual review, or trace review |
| Expected pass rate | Stable and high on the target model | Differentiates between models | Mainly used to expose capability boundaries |
The value of Easy questions is not to prove that the model "can write code" but to establish a baseline. They are suited to checking syntax, interface compliance, simple boundaries, and baseline test pass rates. Even a high-quality Easy question must state the input range and exception policy explicitly; otherwise a high pass rate means nothing.
Medium questions are the best subjects for long-term assessment. They do not require the model to produce an open architecture design, but they do require consistency across multiple rules: filtering, sorting, and paging in combination; cache eviction combined with capacity control; parsing, validation, and error reporting together. Medium questions expose differences in boundary handling, complexity, and state updates more stably than other tiers.
Hard questions should not simply make the requirements longer. Real Hard comes from a larger state space, conflicting constraints, and more demanding evaluation methods. Transactional KV storage, batch task scheduling, cross-module refactoring, concurrency-safety repair, and performance bottleneck diagnosis are typical Hard forms. At this tier, "all tests passed" is rarely enough; the model's planning quality, risk identification, control of modification scope, and verification strategy also need to be documented.
A common mistake is to use human difficulty labels as AI difficulty labels. A question that is Medium for humans may be Easy for AI if the training corpus contains many near-identical templates. Conversely, a question with very simple business rules but implicit priorities, abnormal inputs, and state rollback may already be Hard for AI. Difficulty must be calibrated from actual pass rates and failure types, not inferred from the question's name.
Six-layer task contract for high-quality questions
High-quality questions can be thought of as a six-layer contract. Each layer answers a scoring question: what the model should do, within what bounds, how to prove it was done right, and how to recognize when it went wrong.
The first layer is the task objective. The objective should not stop at "implement a function"; it should describe the engineering scenario the function serves. The scenario is not there to add story, but to clarify priorities: for the same sorting requirement, an operations backend cares about stability and explainability, real-time recommendation cares about latency and throughput, and data export cares about full consistency. Different scenarios have different correct solutions.
The second layer is the interface contract: function signature, parameter meaning, return value, error handling, null-value policy, and mutability constraints. When the interface is unclear, the AI readily invents its own data structures, producing an answer that looks correct but cannot be plugged into the shared test harness. The more stable the interface, the more reproducible the scoring.
The third layer is data constraints. Input size, field range, uniqueness, sorting rules, time format, encoding rules, and concurrency assumptions should all be written explicitly in the question. Data constraints determine algorithm choice and whether the test can rule out brute-force solutions, implicit type conversions, and accidental correctness.
The fourth layer is behavioral constraints, which describe how boundaries, conflicts, and exceptions should be handled. For example: whether sorting is stable, whether an out-of-bounds page number raises an error, whether repeated requests are idempotent, whether a failed transaction is rolled back, and what happens when the cache capacity is 0. This layer is the easiest to overlook, yet it best distinguishes "runs" from "meets engineering requirements".
The fifth layer is the example contract. Public examples are not answer hints; they calibrate the meaning of the question. A good example set covers the normal path, at least one boundary path, and one easily misread rule. Example explanations should say why the output is what it is, not merely pair inputs with outputs. This lowers the chance that the model passes by guessing the intent.
The sixth layer is the acceptance contract. It states what scoring consists of: open tests, hidden tests, performance tests, static checks, manual rubrics, log review, or scope review. If the scoring dimensions are not defined in advance, later evaluation easily degrades into subjective impressions.
These six layers are not template fields but engineering judgments that must be made when setting the question. The question itself can be short, but none of the layers can be missing. Without a task objective the model does not know the priorities; without an interface contract the tests cannot be unified; without data constraints the algorithm choice is undetermined; without behavioral constraints boundary handling cannot be scored; without an example contract the question is easily misread; without an acceptance contract the results cannot be reviewed.
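To make the six layers concrete, here is a compressed sketch of one question captured as a single record. The structure, field names, and the `search_products` interface are illustrative assumptions, not a required format.

```python
# Six contract layers for one question, shown as an illustrative record.
QUESTION_SPEC = {
    "goal": "Product list for an operations backend; stability and explainability "
            "take priority over raw throughput.",
    "interface": "search_products(products, filters, sort_key, page, page_size) "
                 "-> (items, total); must not mutate the input list.",
    "data_constraints": {"max_products": 100_000, "page_starts_at": 1,
                         "sort_keys": ["price", "rating"]},
    "behavior": ["sorting is stable",
                 "out-of-bounds page returns an empty list but keeps the total",
                 "empty filter conditions mean 'no filtering', not 'match nothing'"],
    "examples": "one normal path, one boundary path, one easily misread rule, each explained",
    "acceptance": ["open tests", "hidden boundary tests", "complexity check at maximum scale"],
}
```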
From competency goals to test evidence
AI-oriented question design: Don’t test templates, test judgments
AI programming assistants often perform better on common algorithm problems than on real engineering tasks. The reason is not mysterious: public question banks, blog write-ups, solution repositories, and training data are full of standard templates. The more a question resembles a classic, the easier it is for the model to replay a memorized routine.
The goal of question design is not to set traps, but to make the key judgments impossible to bypass with a template.
The first method is to add business constraints. Do not just ask for "an LRU cache"; specify what happens at capacity 0, on repeated writes, whether a read refreshes recency, how invalid keys are handled, and whether statistics counters are part of the state. Business constraints force the model to handle the specific rules of the problem.
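As a sketch of what those constraints look like as checks, the snippet below assumes a hypothetical `LRUCache(capacity)` class with `get` and `put`; the expected behaviors encoded here are one possible contract choice that the question would have to state explicitly.

```python
def check_lru_business_constraints(LRUCache):
    # Capacity 0: writes are accepted without effect rather than crashing or storing data.
    cache = LRUCache(0)
    cache.put("a", 1)
    assert cache.get("a") is None

    # A read must refresh recency: after get("a"), "b" becomes the eviction victim.
    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")
    cache.put("c", 3)
    assert cache.get("a") == 1 and cache.get("b") is None
```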
The second method is to combine several common concepts. Filtering, sorting, and paging are not difficult on their own, but combining them exposes ordering errors: paginating before sorting, sorting before filtering, computing the total at the wrong point, or destroying stability. Combined questions are closer to engineering tasks than isolated ones.
The third method is to design adversarial boundaries. Adversarial here does not mean trick questions; it means targeting common mistakes: maximum-size inputs verify complexity, duplicate elements verify stability, empty results verify the return contract, degenerate inputs verify algorithmic assumptions, and concurrent calls verify shared state.
The fourth method is to separate question facts from model suggestions. For open questions the model may make suggestions, but suggestions must not be treated as facts. The question should state which constraints cannot be changed, which solutions may be chosen, and which risks must be explained. Only then can we evaluate whether the model respects task boundaries.
The fifth method is to retain failure evidence. A test failure is not a problem in itself; the problem is having no structured record afterwards. Failure type, triggering input, violated contract layer, the model's repair approach, and the final outcome should all feed the question calibration process. Over time this failure evidence is worth more than any single score.
How should a Medium question grow from a competency goal?
Question design should not start from “What code do I want the model to write”, but from “What evidence of ability do I want to observe”.
Suppose the target capability is combining multi-condition business logic. A suitable Medium question could revolve around filtering, sorting, and paginating a product list. The scenario is common enough that the model needs no extra industry knowledge, yet it contains enough error-prone rules to observe whether the model actually upholds the task contract.
The competency goal can be broken into four items.
| Competency goal | Observable evidence | Common mistakes |
|---|---|---|
| Multi-condition filtering | Handles price, rating, and category conditions at the same time | Confusing AND/OR relations between conditions, mishandling empty conditions |
| Stable sorting | Preserves the original relative order for equal sort keys | Using an unstable sort or a secondary sort that corrupts the order |
| Pagination semantics | Returns the current page's data and the post-filter total | Paginating before filtering, or reporting the current page's count as the total |
| Boundary handling | Empty lists, empty results, and out-of-bounds page numbers behave consistently | Throwing undeclared exceptions or returning an inconsistent structure |
With the competency goal fixed, the question can be written around the task contract: give the product fields, state that filter conditions may be empty, that the sort field must come from a whitelist, that page numbers start at 1, that the total is the post-filter total, and that an out-of-bounds page returns an empty list while keeping the total unchanged. None of these rules is decoration; each maps to a test or a rubric item.
Public examples do not need to be piled on: one normal example shows multi-condition filtering and sorting, one boundary example shows an empty result or out-of-bounds page number, and one explanation makes clear why the total is not the current page's count. For readers this is worth more than a long block of reference code, because it points at the judgments the question actually evaluates.
Hidden tests are designed around common errors: set the category condition to empty and check whether the model treats it as "no filtering" or as "match nothing"; construct products with identical price and rating to verify stable sorting; request the page after the last page to verify the pagination contract; build 100,000 products to verify complexity. Every hidden case should answer one question: which error is it guarding against?
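A sketch of such hidden cases follows, written against the hypothetical `search_products(products, filters, sort_key, page, page_size) -> (items, total)` interface used earlier; each assertion names the error it guards against.

```python
def run_hidden_cases(search_products):
    products = [{"id": i, "price": 10, "rating": 4.0, "category": "book"} for i in range(5)]

    # Empty category condition must mean "no filtering", not "match nothing".
    _, total = search_products(products, {"category": None}, "price", 1, 10)
    assert total == 5

    # Identical price and rating: a stable sort must preserve the original id order.
    page, _ = search_products(products, {}, "price", 1, 10)
    assert [p["id"] for p in page] == [0, 1, 2, 3, 4]

    # Page after the last page: empty list, but the filtered total is kept.
    page, total = search_products(products, {}, "price", 99, 10)
    assert page == [] and total == 5
```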
If the question only asks the model to implement one function, it stays Medium. Drop it into an existing code base and require the model to understand existing data structures, update the tests, keep the API compatible, and explain why the public types were not changed, and it may rise to Hard. What changes the difficulty is not the business scenario but the task boundaries, context dependencies, and verification responsibilities.
Test case design: public examples are not evaluation sets
Public examples explain the question, hidden tests verify capability, performance tests rule out bad algorithms, and adversarial tests expose common misunderstandings. Mixing these roles together makes the question lose its calibration value.
A healthy test suite typically contains five categories of use cases.
| Test type | Purpose | Design focus |
|---|---|---|
| Basic cases | Verify that the main path is correct | Cover the most common inputs without being tricky |
| Boundary cases | Verify that the contract is complete | Empty input, single element, duplicate values, out-of-bounds pages, empty results |
| Combination cases | Verify the ordering of multiple rules | Combinations of filtering, sorting, paging, and state updates |
| Stress cases | Verify complexity and resource usage | Maximum scale, degenerate distributions, long strings, or large graphs |
| Adversarial cases | Verify whether a wrong template was applied | Inputs constructed against known error patterns |
The ratio between these types does not have to be fixed. An initial question bank can let basic and boundary cases take a larger share, then adjust according to the failure distribution. If the model's main failures are in boundary handling, increase the weight of boundary cases; if the model often times out, add stress cases; if the model repeatedly applies the wrong template, add more explicit adversarial cases.
Tests also need to be tied to scoring explanations. A bare failure log makes it hard to distill Mentor signals. A better approach is to map each failure to the contract layer it violates: interface mismatch, data constraint violation, behavioral rule error, complexity above the stated bound, inconsistent exception policy, or out-of-scope modification. That way a question not only yields a score but also tells the team what the model does not know.
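A minimal sketch of that mapping, with invented test names, might look like this; the point is that every hidden test declares up front which contract layer a failure would implicate.

```python
# Invented test names mapped to the contract layer a failure would implicate.
FAILURE_CONTRACT_MAP = {
    "test_return_shape": "interface contract",
    "test_hundred_thousand_products": "data constraint (complexity)",
    "test_stable_sort": "behavioral constraint",
    "test_out_of_bounds_page": "behavioral constraint (exception policy)",
    "test_untouched_modules": "acceptance contract (modification scope)",
}
```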
Rubric: Turn subjective evaluations into learnable signals
Automated testing is good at evaluating deterministic behavior, but it cannot cover all programming capabilities. Code review, architecture design, debugging process, risk judgment, migration strategy, and performance analysis all require rubrics.
High-quality rubrics are not generic judgments such as "good code quality" or "clear structure"; they break the evaluation into observable conditions. Maintainability, for example, can be broken down into whether naming expresses domain concepts, whether rule duplication is avoided, whether boundary conditions are centralized, and whether the public interface stays stable. Security can be broken down into input validation, permission boundaries, sensitive data in logs, concurrency state, and failure rollback. Performance can be broken down into complexity, hot paths, memory footprint, and behavior on degenerate inputs.
Rubrics also need to distinguish a passing bar from an excellence bar. The passing bar answers "is this acceptable"; the excellence bar answers "does this show stronger engineering judgment". With only a full-marks description, raters oscillate on intermediate cases; with only a defect list, excellent solutions cannot be identified consistently.
In the Coding Mentor scenario, another role of rubrics is to accumulate training signals. A manual review that only records "the answer was not good this time" has no long-term value; one recorded as "violated the sorting stability contract, triggered by products with identical price and rating, root cause: the secondary sort overwrote the original order" can feed the error type library, the evaluation set, the feedback protocol, and even candidate SFT data.
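Captured as a structured record rather than a free-form comment, that same review might look like the sketch below; the field names are illustrative.

```python
REVIEW_RECORD = {
    "violated_contract": "sorting stability",
    "trigger_case": "products with identical price and rating",
    "root_cause": "secondary sort overwrote the original relative order",
    "rubric_item": "stable sorting (passing bar)",
    "downstream_use": ["error type library", "evaluation set", "SFT candidate"],
}
```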
Question bank management and common anti-patterns
Question bank management: keeping questions valid as models iterate
Once a question enters long-term use, it is no longer just a Markdown file; it is an evaluation asset. Evaluation assets need governance, or they quickly lose their validity.
The most basic unit of management is question metadata. Each question should record at least a question ID, competency level, difficulty label, domain label, language or framework, evaluation budget, number of public tests, number of hidden tests, expected pass rate interval, contamination risk, version number, and latest calibration time. Not all of this has to appear on the question sheet, but it must be queryable in the question bank system.
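A minimal sketch of such a metadata record, assuming a simple dataclass rather than any particular question bank system:

```python
from dataclasses import dataclass

@dataclass
class QuestionMeta:
    question_id: str
    bloom_level: str              # e.g. "apply", "analyze"
    difficulty: str               # "easy" | "medium" | "hard"
    domain: str                   # e.g. "backend pagination"
    language: str                 # language or framework under test
    eval_budget: str              # reference to a fixed evaluation-budget record
    public_tests: int
    hidden_tests: int
    expected_pass_rate: tuple     # e.g. (0.4, 0.7) on the main model
    contamination_risk: str       # "private" | "partially_public" | "public"
    version: str
    last_calibrated: str          # ISO date of the most recent calibration run
```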
Difficulty calibration should be done periodically. The team can pick several representative model types: a main closed-source model, a cost-friendly model, a local open-source model, an agent with tool calling, and a base model without tools, and record the pass rate and failure types after each rerun. If a Medium question sits above its target pass rate on the main model for a long time, it may need to be downgraded or given constraints that better expose capability boundaries. If almost nothing passes a question for a long time, the question may not be Hard; it may simply be unclear, the tests too narrow, or the acceptance criteria unreasonable.
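Each rerun can then append one calibration entry per model. The record below is a shape sketch with invented values, not real measurements.

```python
CALIBRATION_ENTRY = {
    "question_id": "pagination-medium-012",   # invented id for illustration
    "question_version": "1.3",
    "model": "main-closed-source",
    "eval_budget": "medium-default",
    "runs": 20,
    "pass_rate": 0.65,
    "failure_types": {"stable_sort": 4, "out_of_bounds_page": 3},
}
```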
Contamination risk must also be managed. Publicly released questions, tests, reference implementations, and write-ups will end up in the training corpora or search results of future models. Questions used in public articles can serve teaching purposes but should not double as private evaluation sets. For questions that genuinely sit in the closed loop of model selection, internal acceptance, and training, control the scope of exposure and record which parts have been made public.
Version records are equally important. Changing a boundary condition in the question, adding a hidden case to the tests, or adjusting an exception rule in the scoring script all affect historical scores. Without version numbers, the team will end up comparing results across versions and drawing the wrong conclusions.
Common anti-patterns
The first anti-pattern is writing the question as a requirements story without an acceptance contract. The scenario reads like real business, but inputs and outputs, exception policies, and test criteria are never pinned down. Such questions prompt the model to generate long explanations but are hard to score.
The second anti-pattern is writing questions as fill-in-the-blank code templates. Templates reduce implementation variance, but over-templating turns the evaluation into format compliance: the model passes as long as it fills in the local logic. This suits training entry-level skills, not evaluating real engineering judgment.
The third anti-pattern is relying only on public examples. The more public examples there are, the easier it is for the model to induce the rules from them without understanding the full contract. Public examples should help readers understand the problem; they are no substitute for hidden tests and boundary coverage.
The fourth anti-pattern is uncalibrated difficulty labels. A question marked Hard may just be very long; a question marked Easy may hide implicit state and exceptional paths. Difficulty labels that are not bound to an evaluation budget and actual pass rates mislead model selection and capability judgments.
The fifth anti-pattern is leaving human reviews in the comments section. Manual feedback cannot be reused if it is not structured. Good feedback should flow into error classification, rubric revision, test supplementation, and training candidate samples, rather than staying at “this answer doesn’t feel right.”
Conclusion: The ability to formulate questions is the ability to evaluate
Designing high-quality programming questions for AI is essentially building a reproducible environment for ability assessment. A question is not better for being longer or harder; it is better for exposing the boundaries of the target capability more stably.
From the Coding Mentor perspective, a good question must accomplish at least four things: clarify the capability level to be observed, write down the task contract, design tests that expose error patterns, and retain failure evidence that can enter the governance closed loop. Once these four things hold, the question is no longer a one-time interaction but an organizational asset that can be accumulated, rerun, compared, and improved.
The next article moves into assessment practice itself: once the questions, tests, and rubrics are ready, how the team can systematically evaluate AI programming ability and convert the results into actionable Mentor feedback.
References and Acknowledgments
- Bloom’s Taxonomy — Benjamin Bloom et al.
- APPS Benchmark Design — Hendrycks et al., UC Berkeley
- LiveCodeBench Methodology — Jain et al., UC Berkeley