Practical cases: feedback protocol, evaluation closed loop, code review and programming education data
Case studies should not stop at "how to use AI tools better". This article walks through four engineering scenarios, model selection evaluation, feedback protocol design, code review signal precipitation, and the programming education data closed loop, to explain how humans can turn the AI collaboration process into evaluable, trainable, and reusable mentor signals.
Copyright statement and disclaimer
This article is an original interpretation based on public materials from OpenAI, Anthropic, LangChain, GitHub, SWE-bench, and others. Copyright of the original texts belongs to their authors and sources. This article is not an official translation and does not represent the views of those organizations.
Original nature
The case framework, feedback protocol, data routing, evaluation closed loop, and implementation path in this article are the author's original reconstruction. External materials serve only as industry practice and technical grounding and are not translated paragraph by paragraph.
Beginning: extracting Mentor signals from the cases
When reading this set of cases, the most important thing is not to learn how to use a certain tool, nor to collect a set of prompts that can be copied directly. What really deserves attention is how AI exposes its capability boundaries in real engineering scenarios, how humans give learnable feedback, and how teams turn the judgments left behind by each collaboration into assets for subsequent evaluation, training, and governance.
If a case only tells you "how to write better prompts", it solves the problem of first-pass output quality. If a case can break AI's poor output down into error types, feedback protocols, evaluation criteria, and training candidate samples, it has truly entered the main line of Coding Mentor.
Therefore, the subject of the next four cases is not the tools but the human Mentor's judgment. Tools are just entry points, prompts are just one way to consume a protocol, and code review comments are just raw signals. What really matters is whether these entry points can keep generating verifiable, routable, and reusable Mentor signals.
When reading each case, you can ask five fixed questions:
| Question | Why it matters |
|---|---|
| What mistakes is AI most likely to make in this scenario? | You can only talk about guidance once you can see error patterns |
| What feedback should a human Mentor give? | Feedback must move from subjective evaluation to learnable signals |
| What evidence can prove the improvement worked? | Without evaluation evidence, a case is just experience sharing |
| What assets do these processes precipitate into? | Long-term value comes from capability building, not one-off efficiency gains |
| What data must not enter training or evaluation? | Data governance determines whether the closed loop is reliable |
The four cases correspond to the four Mentor capabilities:
- Model selection evaluation: turn "which AI feels good to use" into "which AI fits our task boundaries".
- Feedback protocol design: turn "prompt optimization" into "error diagnosis, feedback structure, and acceptance criteria".
- Code review signal precipitation: turn "AI helps with review" into "trainable samples of the team's engineering standards".
- Programming education data closed loop: turn "AI teaches students" into "learning trajectories, misconception labels, and ability assessment data".
If you have followed the earlier articles on assessment, question design, and collaboration methods, this set of cases is the entry point to engineering practice. First get clear about the errors, feedback, and evidence in a concrete scenario, then move to the organization-level data closed loop; first accumulate high-quality Mentor signals, then discuss which samples are eligible to enter the SFT data generation pipeline. The order cannot be reversed.
Case 1: model selection is not about picking the champion, but about task fitness
When teams choose an AI programming assistant for the first time, they naturally open the public leaderboards: SWE-bench rankings, the Aider leaderboard, model vendor documentation, and community discussions. This is a correct move, but it only answers a rough question: roughly how capable a given model is on public tasks.
What the team really wants to answer is a different question: is this model reliable within our codebase, our task types, our security boundaries, and our review habits?
The difference between public leaderboards and a team's private evaluation can be understood like this:
| Assessment source | What it answers | What it cannot answer |
|---|---|---|
| Public leaderboards | Whether the model's general coding ability is in the usable range | Whether it fits the team's codebase and business constraints |
| Vendor documentation | Model context, tool capabilities, pricing, and interface limits | Rework cost on real PRs |
| Small-sample trial | Whether developers' subjective experience is smooth | Stability, boundary failures, and long-term quality |
| Private eval | Whether the model can complete the team's real tasks | Not fully representative of all future tasks |
Therefore, the core of model selection is not to "pick the strongest model" but to establish a private task evaluation protocol. The protocol should be sampled from the team's real work, not a few improvised algorithm puzzles.
Scenario background
Suppose a 50-person engineering team is preparing to introduce an AI programming assistant. The candidates include a general-purpose chat model, an IDE built-in assistant, a command-line coding agent, and an internally deployed private model. The initial discussion splits into camps: some value generation speed, some the context window, some the price, some worry about security, and some think it is enough that developers like it.
If this kind of discussion never converges, it easily becomes a dispute over tool preferences. The Coding Mentor approach is to define the task boundary first, then evaluate each model inside that boundary.
Selection questions from the Mentor's perspective
Model selection should be broken down into at least five capability categories, rather than just looking at "how well it writes code".
| Capability dimension | Assessment question | Task sample source |
|---|---|---|
| Requirements understanding | Can it identify true goals, non-goals, and acceptance criteria? | Historical requirements, PR descriptions, defect tickets |
| Code localization | Can it find the right files and call chains? | Historical bug fixes, refactoring tasks |
| Minimal modification | Does it avoid unrelated changes and over-refactoring? | Review rework records |
| Verification awareness | Does it proactively add tests and run the right commands? | CI failure records, test escapes |
| Constraint compliance | Does it respect security, performance, compatibility, and architectural boundaries? | Architecture decisions, incident reviews |
The key here is the "task sample source". If the samples come from real historical tasks, the problems the model exposes will be close to real delivery. If the samples are improvised demonstration questions, the evaluation results will skew optimistic.
This is where SWE-bench is instructive. It builds its evaluation from real GitHub issues and code repositories, emphasizing that models must fix issues in a real engineering context. A team does not need to replicate SWE-bench's complex environment, but it should learn its core spirit: to evaluate software engineering capability, put the model into a real code environment with real acceptance conditions.
How to design an assessment protocol
A practical selection assessment does not need to be large at the start. Thirty to sixty private tasks, split into four groups, is a reasonable beginning:
| Task group | Suggested count | Goal |
|---|---|---|
| Bug fixes | 10 to 20 | Test the model's root-cause localization and minimal-repair ability |
| Test enhancement | 8 to 12 | Test whether the model understands boundary conditions and verification paths |
| Small feature implementation | 8 to 15 | Test the model's completeness from requirements to implementation |
| Code review | 8 to 12 | Test whether the model identifies risks, explains reasons, and suggests corrections |
Each task should contain at least five fields: task description, code version, allowed context, acceptance criteria, and a reference evaluation rubric. Don't just store "standard answers". Standard answers are useful for training but insufficient for evaluation. Evaluation needs to reveal where the model failed, why it failed, and what exactly went wrong.
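To make this concrete, a private eval task can be stored as a small structured record. Below is a minimal Python sketch; the field names (`allowed_context`, `rubric`, and so on) and the example values are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    """One private selection-eval task sampled from the team's real work."""
    task_id: str
    group: str                      # "bugfix" | "test_enhancement" | "small_feature" | "code_review"
    description: str                # what the model is asked to do
    code_version: str               # commit SHA pinning the codebase state
    allowed_context: list[str]      # files and docs the model may see
    acceptance_criteria: list[str]  # checkable conditions, not a single "standard answer"
    rubric: dict[str, str]          # dimension -> what qualified vs. failed looks like

task = EvalTask(
    task_id="bugfix-017",
    group="bugfix",
    description="Fix the pagination off-by-one reported in defect ticket 4821.",
    code_version="a1b2c3d",
    allowed_context=["api/pagination.py", "tests/test_pagination.py"],
    acceptance_criteria=[
        "existing tests pass",
        "a new boundary test is added",
        "the diff touches only pagination code",
    ],
    rubric={
        "minimal_modification": "no unrelated files changed",
        "verification": "the model runs or proposes the relevant tests",
    },
)
```

Storing acceptance criteria as checkable conditions rather than one reference answer is what lets the same task later explain why a model failed, not just that it failed.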
How to interpret the results
Don't reduce the selection report to a single total-score ranking. Total scores easily mask risk. What is more valuable is the "capability profile" (a small aggregation sketch follows the table).
| Model performance | Possible conclusion |
|---|---|
| Strong at bug fixing, weak at testing | Suitable as a development assistant, not for independent submission |
| Fast generation, but large modification scope | Suitable for drafts, not for pushing directly to main |
| Many review suggestions, but many false positives | Usable for pre-review, not for blocking merges |
| Stable context utilization, but expensive | Suitable for high-risk modules, not for full rollout |
| Very strong on small tasks, weak at planning large ones | Suitable for local implementation, not cross-module refactoring |
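A minimal aggregation sketch, assuming each eval run is recorded as a dict with `model`, `dimension`, and `passed` keys (an illustrative shape): it produces per-dimension pass rates instead of one total score.

```python
from collections import defaultdict

def capability_profile(results: list[dict]) -> dict:
    """Aggregate eval results into {model: {dimension: pass_rate}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        cell = counts[r["model"]][r["dimension"]]
        cell[0] += int(r["passed"])
        cell[1] += 1
    return {m: {d: p / t for d, (p, t) in dims.items()}
            for m, dims in counts.items()}

profile = capability_profile([
    {"model": "agent-a", "dimension": "bug_fix", "passed": True},
    {"model": "agent-a", "dimension": "verification", "passed": False},
    {"model": "agent-a", "dimension": "verification", "passed": False},
])
# A model at 0.9 on bug_fix but 0.3 on verification should not submit
# independently, no matter how good its average looks.
```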
GitHub and Accenture's research on Copilot in the enterprise reminds us that enterprise scenarios should combine telemetry, quality, and process metrics rather than rely on developers' subjective experience alone. The same applies to model selection: developer comfort matters, but it is no substitute for evaluation evidence.
What can be learned from this case?
After completing the selection evaluation, don't leave behind just a slide deck. It should precipitate at least four types of assets:
| Asset | Subsequent use |
|---|---|
| Private task set | Regression when models are upgraded, tools replaced, or prompts revised |
| Capability profile | Deciding which tasks allow deep AI involvement |
| Error type distribution | Guiding subsequent feedback protocols and training sample collection |
| Risk boundary | Deciding whether high-risk modules disable or restrict AI |
This is the Coding Mentor way of selection: instead of asking "which model is the strongest", ask "which model can be trusted, to what extent, on which tasks".
Case 2: from prompt optimization to feedback protocol design
The original version of Case 2 was about API documentation generation, iterating from a simple prompt to structured prompts, few-shot examples, and quality checklists. The problem with that write-up is not the technique but the shallowness of the conclusion: it attributes the jump from 60 to 90 points to writing the prompt in more detail.
A more accurate explanation is that humans made their implicit quality standards explicit.
When a team says "the AI-generated API documentation is poor", that does not help the model. Poor how? Is the structure incomplete, are parameters missing, are error codes unclear, are examples not runnable, or is the terminology inconsistent with team conventions? Only by breaking "poor" down into a protocol that can be diagnosed, fed back, and accepted does the AI have a path to improvement.
Scenario background
A backend team wanted AI to assist in generating API documentation. At first, developers threw interface code at the AI and asked it to "generate documentation". The results looked fine, but the review meeting surfaced many problems: missing fields in parameter tables, incomplete error responses, sample code that would not run, the same concept named inconsistently across documents, and authentication instructions that appeared only sometimes.
If the team only optimizes the prompt at this point, it easily falls into a loop: find a problem, append a requirement to the prompt. After a few rounds the prompt is very long, but quality is still unstable, because the team has not established a feedback protocol; it is merely piling up constraints.
What exactly is wrong with the 60-point output?
The first step is not to change the prompt but to diagnose the types of output failure.
| Failure type | Surface symptom | Mentor judgment |
|---|---|---|
| Missing structure | Document sections are out of order | The AI does not know the team's agreed document structure |
| Missing fields | Parameter descriptions are incomplete | The AI does not check consistency between code fields and document fields |
| Missing error paths | Only success responses are documented | The AI covers only the happy path by default |
| Non-runnable examples | curl or SDK examples are missing parameters | The AI performs no reproducibility check |
| Terminology drift | Token, access token, and credential used interchangeably | The AI lacks a team glossary |
| Unclear security boundaries | Authentication and permission descriptions are ambiguous | The AI does not know which security constraints must be stated explicitly |
This table matters more than a "better prompt". It unpacks vague human dissatisfaction into error classifications. These classifications can enter the evaluation rubric or review tags, and they can also become metadata on subsequent SFT samples.
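A minimal sketch of how such an error taxonomy can travel with samples as metadata; the enum values follow the table above, while the record shape is an assumption.

```python
from enum import Enum

class DocFailure(Enum):
    MISSING_STRUCTURE = "missing_structure"
    MISSING_FIELDS = "missing_fields"
    MISSING_ERROR_PATHS = "missing_error_paths"
    NON_RUNNABLE_EXAMPLE = "non_runnable_example"
    TERMINOLOGY_DRIFT = "terminology_drift"
    UNCLEAR_SECURITY = "unclear_security"

# The same labels serve three consumers: the eval rubric, review tags,
# and metadata on SFT candidate samples.
sample = {
    "task": "generate docs for POST /orders",
    "ai_output_ref": "doc-run-0042",
    "failures": [DocFailure.MISSING_FIELDS, DocFailure.TERMINOLOGY_DRIFT],
    "mentor_note": "response table omits `currency`; mixes token and credential",
}
```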
The four-layer structure of a feedback protocol
In this scenario I suggest splitting the feedback protocol into four layers instead of writing a single prompt template.
| Layer | Role | Artifact |
|---|---|---|
| Task contract | Clarify inputs, outputs, non-goals, and acceptance criteria | Document generation task card |
| Quality rubric | Define what counts as qualified, excellent, and unacceptable | Scoring sheet and error types |
| Example library | Demonstrate high-quality output and typical bad examples | Few-shot example library |
| Verification mechanism | Check fields, examples, terminology, and security statements | Automatic checks plus manual spot checks |
Both OpenAI's and Anthropic's prompt engineering documentation emphasize clear instructions, examples, and constraints. But in an engineering team, these things should not live inside one prompt alone. They should be extracted into stable assets: task contracts that can be reused, rubrics that can be scored against, example libraries that can be updated, and verification mechanisms that can be regressed.
What role should the prompt play here?
The prompt is not the core asset; the feedback protocol is. The prompt is just one way to temporarily inject the protocol into the model.
The same set of feedback protocols can be consumed in different ways:
| Consumption pattern | Usage scenario |
|---|---|
| Prompt template | One-off generation or in-IDE assistant |
| System instructions | Internal documentation agent |
| RAG context | Retrieve specs and examples by interface type |
| Eval rubric | Automatic or semi-automatic scoring |
| SFT samples | Train the model toward a stable documentation style |
| Review checklist | Keep human review consistent |
This is why I don't recommend writing Case 2 up as a "final prompt template". A final template expires quickly. The lasting value is the protocol behind the template.
How to evaluate the move from 60 to 90 points
Many teams describe changes in AI output quality as "60 points, 70 points, 80 points, 90 points", but a score with no source is just a subjective feeling. For improvements to be repeatable, scores must become auditable metrics.
| Metric | 60-point behavior | 90-point requirement |
|---|---|---|
| Structural integrity | Sections missing or in unstable order | Fixed sections complete, order follows team convention |
| Field coverage | Only main parameters covered | Parameters, response fields, and error codes consistent with the code |
| Example runnability | Examples missing parameters or not copy-pastable | At least one runnable success example and one error-handling example |
| Terminology consistency | Synonyms mixed | Team glossary used |
| Security statements | Authentication mentioned occasionally | Authentication, permissions, and sensitive fields stated explicitly |
| Manual modification | Reviewer rewrites a lot | Reviewer only does light confirmation |
This way, "90 points" is no longer the author's subjective feeling but a degree of protocol compliance.
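As a sketch of the verification-mechanism layer, a checker might compare documented parameters against the code's field list and flag glossary violations. The function names and the way fields are matched are assumptions; a real checker would parse your actual documentation format.

```python
import re

GLOSSARY_BANNED = {"credential": "access token", "auth token": "access token"}

def check_field_coverage(doc_text: str, code_fields: set[str]) -> list[str]:
    """Report code fields that never appear in the generated document."""
    return [f for f in sorted(code_fields) if f not in doc_text]

def check_terminology(doc_text: str) -> list[str]:
    """Report banned synonyms that should be replaced by the glossary term."""
    lowered = doc_text.lower()
    return [f"use '{good}' instead of '{bad}'"
            for bad, good in GLOSSARY_BANNED.items() if bad in lowered]

def check_error_paths(doc_text: str) -> list[str]:
    """Crude check that at least one non-2xx status code is documented."""
    if not re.search(r"\b[45]\d\d\b", doc_text):
        return ["no error response (4xx/5xx) documented"]
    return []

doc = "POST /orders returns order_id and amount. Auth via credential. 400 on bad input."
issues = (check_field_coverage(doc, {"order_id", "currency", "amount"})
          + check_terminology(doc)
          + check_error_paths(doc))
# -> ["currency", "use 'access token' instead of 'credential'"]
```

Checks this crude still move "field coverage" and "terminology consistency" from reviewer memory into a regression that runs on every generation.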
What can be learned from this case?
The most valuable output of this case is not the prompt but six types of assets:
| Asset | Use |
|---|---|
| API documentation task contract | Constrain AI output boundaries |
| Error type table | Evaluation and training metadata |
| High-quality example library | Support few-shot, RAG, and SFT |
| Bad-example and correction pairs | Support Mentor feedback samples |
| Automatic check rules | Check field coverage, terminology, and security statements |
| Private eval tasks | Regression across models and prompt versions |
The conclusion of this case should read: prompt optimization is only the surface action; the real engineering action is turning the humans' document review standards into a feedback protocol. Only then does the AI stop being "prompted more clearly this time" and gradually learn how the team judges a good document.
Case 3: the code review assistant is not a gatekeeper but a signal collector
Code review is one of the scenarios best suited to Coding Mentor. The reason is simple: review is where human engineering judgment is naturally densest. Security, performance, boundary conditions, abstraction costs, testing strategy, compatibility, and release risk all show up in review.
The problem is that most review comments only serve the current PR. After the merge they are rarely stored in a structured way, let alone flowed back into model evaluation or training. AI code review tools can make suggestions automatically, and GitHub Copilot provides PR-level code review. But if the team treats it as just "one more reviewer", the value stays limited.
A better positioning: the code review assistant expands the scope of problem discovery, while the human Coding Mentor calibrates the signals, turning high-value reviews into team engineering norms, evaluation tasks, and training candidate data.
Scenario background
One team tried having AI pre-review every PR. The initial results were poor: the AI did find obvious problems like null pointers, missing tests, and bad variable naming, but it also produced many low-value suggestions. Developers complained about the noise, and reviewers found that judging whether the AI's comments were credible made review more tiring, not less.
This is not because AI review is worthless, but because the team had not broken the review task down clearly.
What an AI review should and shouldn’t cover
| Review type | Suitable for AI pre-screening | Human calibration needed |
|---|---|---|
| Syntax, lint, simple bugs | Suitable | Low |
| Missing tests, missing error paths | Suitable | Medium |
| Security-sensitive patterns | Suitable for candidate discovery | High |
| Performance risks | Suitable for hints | High |
| Architectural boundaries | Requires team context | High |
| Product trade-offs | Not suitable for final judgment | Very high |
If AI gave the final verdict on every issue, it would cross the line. If AI only discovers candidates and humans calibrate, the team reduces the probability of missed detections while keeping human control over organizational constraints.
How review signals are structured
A high-value review should leave at least four layers of signals:
| Hierarchy | example |
|---|---|
| factual evidence | Which file, which diff, which test or log triggers the problem |
| Question type | Security, performance, wrong paths, contract violations, insufficient testing |
| engineering consequences | What risks will it cause if we don’t change it? |
| Correction principle | How to deal with similar problems in the future |
A typical comment says: "there may be a performance issue here". A Mentor-grade comment says: "this loop queries inventory status one record at a time on the request hot path; as data volume grows, database latency is exposed directly to user requests; change it to a batch query and add an integration test with more than 100 orders." Only the latter can enter the knowledge base, an eval, or training candidate samples.
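A minimal sketch of the four-layer signal as a record; the field names and verdict labels are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ReviewSignal:
    """One calibrated review comment, structured for reuse beyond the current PR."""
    evidence: str        # file/diff/test that triggered the finding
    problem_type: str    # "security" | "performance" | "error_path" | "contract" | "testing"
    consequence: str     # concrete risk if left unchanged
    principle: str       # transferable rule for similar cases
    source: str          # "ai" or "human"
    verdict: str | None = None  # human calibration: "valid" | "false_positive" | "duplicate" | "style"

signal = ReviewSignal(
    evidence="orders/service.py: per-item inventory query inside request handler loop",
    problem_type="performance",
    consequence="DB latency surfaces directly in user-facing request time as order count grows",
    principle="hot-path collection lookups must be batched; add an integration test above 100 items",
    source="ai",
    verdict="valid",
)
```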
From review to data assets
The code review scenario can accumulate four types of assets.
| Asset | Source | Use |
|---|---|---|
| Review rubric | Team norms and historical reviews | Unify AI and human review standards |
| Error pattern library | High-frequency review findings | Guide prompts, RAG, and training |
| Correction sample pairs | Problems found by AI or humans plus the final patch | SFT or preference data candidates |
| Private review eval | Issue-localization tasks built from historical PRs | Test whether the model surfaces the issues the team cares about |
Pay special attention here to the isolation of eval from training. A historical PR can become either a review eval or a training sample, but it must not enter both in a way that leaks the answer. Otherwise the model appears to have improved at review when it has merely memorized historical answers.
How to control noise
The biggest engineering problem with AI review is usually not missed findings but noise. Noise burns reviewer trust, and once developers get used to ignoring AI review, the genuinely valuable findings get skipped too.
Noise can be controlled with three mechanisms (a routing sketch follows the table):
| Mechanism | Practice |
|---|---|
| Severity stratification | Only high-risk findings block; the rest are suggestions |
| Evidence requirement | Comments without documentation, diff, test, or spec references are downgraded |
| Feedback flow | Humans label findings as false positive, valid, duplicate, or style preference |
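A minimal routing sketch for the three mechanisms, assuming each AI finding is a dict carrying evidence, a problem type, and an optional human verdict; the high-risk set and return labels are illustrative.

```python
HIGH_RISK = {"security", "contract"}

def route_finding(finding: dict) -> str:
    """Decide whether an AI finding blocks the merge, stays a suggestion,
    or is downgraded for lacking evidence."""
    if not finding.get("evidence"):
        return "downgraded"      # evidence requirement
    if finding["problem_type"] in HIGH_RISK and finding.get("verdict") != "false_positive":
        return "blocking"        # severity stratification
    return "suggestion"          # everything else stays advisory

print(route_finding({"problem_type": "performance",
                     "evidence": "profile trace", "verdict": "valid"}))
# -> suggestion: performance findings advise but never block on their own
```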
LangSmith's ideas of human feedback and annotation queues are highly relevant here: instead of treating all model outputs as conclusions, humans annotate trajectories and outputs, then use those annotations to improve datasets and evaluations. The same applies to code review: AI comments themselves need review.
What can be learned from this case?
The value of the code review case is not letting AI replace the reviewer, but turning the review process into a data source for the team's engineering judgment.
Eventually it should settle:
- Which problems AI can detect reliably.
- Which issues must be left to human judgment.
- Which review rules can enter the knowledge base.
- Which corrected samples are suitable for SFT or preference data.
- Which historical PRs can become private review evals.
This is what "training AI to be the team's code gatekeeper" really means: not a gatekeeper deciding everything alone, but one continuously calibrated inside a closed loop of clear rules and feedback.
Case 4: programming education is not about letting AI be the teacher, but about making the learning process assessable
The programming education scenario is most easily written up as "AI can personalize teaching". The direction is right, but it is still tool-biased. In the Coding Mentor series, the more critical question is whether the interaction between learners and AI can precipitate ability assessment, misconception diagnosis, and teaching feedback data.
The education scenario has a unique advantage: the learning process naturally contains errors. Students write wrong code, misunderstand concepts, fail tests, and correct themselves after hints. These are all high-value trajectories. Compared with production code, education data carries lower security risk and clearer teaching signals, making it well suited to training AI's feedback and guidance capabilities.
Scenario background
One team trains new engineers internally and wanted AI to assist with learning Python, testing, code review, and simple system design. Initially they let the AI act as a teaching assistant: answering questions, explaining concepts, and providing exercises. It worked, but problems soon appeared: the AI sometimes gave answers directly, so students bypassed thinking; different students received inconsistent hints; tutors could not tell what students had actually mastered; and learning records could not be reused.
The core of this scenario is not "whether the AI can teach" but "whether the learning process can be evaluated".
Mentor principles for a teaching AI
In teaching scenarios, the AI should not default to giving the final answer. It should choose its feedback intensity based on the learner's state.
| Learner state | Feedback the AI should give | What not to do |
|---|---|---|
| Completely stuck | Hint at problem decomposition and related concepts | Post the complete answer directly |
| Right idea, wrong implementation | Point to the error location and the minimal correction direction | Rewrite the whole solution |
| Passes tests but poorly structured | Guide comparison of readability, complexity, and boundaries | Just say "this can be optimized" |
| Repeats the same mistake | Return to the concept level and explain the misconception | Keep applying local patches |
| Has mastered the basics | Add constraints and boundary tasks | Keep repeating simple exercises |
This table is the feedback protocol of the education scenario. It changes the AI's teaching behavior from "answering questions" to "adjusting feedback based on ability state".
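A minimal sketch of state-based feedback selection, assuming an upstream classifier already labels the learner's state; the state and feedback labels mirror the table above and are illustrative.

```python
FEEDBACK_POLICY = {
    "stuck": "hint_decomposition",              # point at sub-problems and concepts
    "right_idea_wrong_impl": "locate_error",    # smallest correction direction
    "passes_but_poor_structure": "guided_comparison",
    "repeated_mistake": "concept_review",
    "mastered_basics": "add_constraints",
}

def choose_feedback(state: str, attempt_count: int) -> str:
    """Map learner state to feedback intensity; never default to the answer."""
    if state == "stuck" and attempt_count >= 3:
        # Escalate hints gradually instead of revealing the full solution.
        return "worked_partial_example"
    return FEEDBACK_POLICY.get(state, "hint_decomposition")
```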
How learning trajectories become evaluation data
A learning task should record more than the final answer. At least five kinds of trajectory should be kept:
| Trajectory | Value |
|---|---|
| Initial solution | Reflects the learner's default thinking |
| Test failures | Expose conceptual misconceptions or missing boundaries |
| AI hints | Record feedback intensity and hint content |
| Correction process | Shows whether the learner truly understands |
| Final explanation | Checks whether the learner can explain the solution in their own words |
These data form a learner's ability profile and, at the same time, AI teaching evaluation data. For the same mistake, did the AI give the answer directly or guide the student to locate it? Could the student correct independently after the hint? This judges teaching quality better than "whether the final code passes".
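A minimal sketch of a trajectory record plus one teaching-quality signal derived from it; all field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class LearningTrajectory:
    """Full record of one exercise attempt."""
    task_id: str
    initial_solution: str
    test_failures: list[str] = field(default_factory=list)  # failing test names/messages
    ai_hints: list[dict] = field(default_factory=list)      # {"intensity": ..., "content": ...}
    corrections: list[str] = field(default_factory=list)    # successive diffs or snapshots
    final_explanation: str = ""
    misconception_labels: list[str] = field(default_factory=list)

def corrected_after_hint(t: LearningTrajectory) -> bool:
    """Teaching-quality signal: did hints (not a revealed answer)
    precede the learner's own correction?"""
    return bool(t.ai_hints) and bool(t.corrections) and all(
        h.get("intensity") != "full_answer" for h in t.ai_hints
    )
```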
Misconception labels matter more than the number of exercises
Educational platforms easily pile up exercises; the more there are, the more complete it looks. But what really determines teaching effectiveness is misconception labels.
| Misconception label | Example |
|---|---|
| Missing boundary conditions | Empty array, duplicate values, None, oversized input |
| State update errors | Modifying a collection inside a loop, sharing mutable default values |
| Complexity misjudgment | Handling large inputs with nested loops |
| Inadequate testing understanding | Testing only happy paths, not error paths |
| Confused abstraction levels | Mixing I/O, business logic, and formatting together |
Misconception labels connect three things: student ability assessment, AI teaching strategy, and training sample construction. Without them, learning records are just a running log; with them, learning records become analyzable data.
What can be learned from this case?
The programming education scenario can accumulate five types of assets:
| Asset | Use |
|---|---|
| Tiered question set | Covers basics, boundaries, testing, refactoring, and review skills |
| Misconception label library | Diagnose learners and the quality of AI feedback |
| Guided feedback samples | Train AI to guide step by step instead of giving answers |
| Learning process evals | Evaluate whether AI promotes understanding rather than ghostwriting |
| Teaching review data | Optimize exercises, hint strategies, and tutor intervention points |
This case also connects to production engineering data. The misconceptions new engineers repeatedly expose in training camp are often the same mistakes AI makes in real code: omitted boundaries, thin error paths, weak testing awareness, and over-abstraction. Well-structured education data can feed back into the broader Coding Mentor system.
From cases to engineering assets: a unified architecture and implementation path
The four cases look different: model selection, feedback protocol, code review, programming education. Behind them is the same architecture.
The unified process is:
- Real tasks enter the system.
- AI gives planning, output, review or teaching feedback.
- Human Mentors judge whether the output meets engineering goals.
- Feedback is structured into error types, root causes, correction strategies, and validation evidence.
- Data is routed to eval, knowledge base, SFT candidate, preference data, or drop zone.
- The next round of models, prompts, toolchains, and team processes are adjusted based on the evaluation results.
This architecture connects to the data closed loop in Part 7. Part 6 uses four cases to explain why the closed loop is necessary; Part 7 abstracts the closed loop into an organization-level system.
Project implementation: from case write-ups to team practice
If the team wants to implement these four cases, don't roll them out simultaneously. A more stable approach is to sequence them by value density.
Stage 1: use model selection to create a private eval seed set
Start with 30 to 60 real tasks and don't aim to cover every scenario. The goal is for the team to have, for the first time, its own baseline of AI programming capability.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Task list | Drawn from real bugs, test escapes, and review disputes |
| Evaluation rubrics | Make correctness, minimality, verification, and constraints explicit |
| Model comparison report | Not just a total-score ranking; include capability profiles |
| Risk boundary | Mark which tasks allow deep AI involvement |
Stage 2: turn one high-frequency scenario into a feedback protocol
Choose a scenario with high frequency, low risk, and clear standards, such as API documentation, unit test generation, or error log interpretation. Don't start with architecture design or security modules.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Task contract | Make input, output, and acceptance criteria explicit |
| Error types | Keep to 6 to 10 categories |
| High-quality examples | Keep only team-approved samples |
| Bad-example corrections | Record why the output failed and why the correction works |
| Regression eval | Re-runnable after changing the model or prompt |
Stage 3: let code review generate Mentor signals
Don't let AI review block merges at first. Let it pre-review, and have humans label findings as valid, false positive, duplicate, or style preference. Once the signal stabilizes, decide which rules may block automatically.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Review tags | Security, performance, error paths, contracts, testing, etc. |
| Validity feedback | Human labels on whether AI findings are valid |
| High-frequency issue bank | Monthly statistics on recurring issues |
| Rule promotion | High-value comments become team norms |
Stage 4: use the education scenario to train feedback ability
The education scenario can serve as a low-risk training ground. Let the AI first learn to "help people without giving away the answer", then transfer that feedback ability to engineering scenarios.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Misconception labels | Cover common mistakes: boundaries, complexity, testing, abstraction |
| Guidance strategies | Distinguish hints, counter-questions, local correction, and concept review |
| Learning trajectories | Save initial answers, hints, corrections, and final explanations |
| Teaching evals | Evaluate whether AI promotes understanding rather than ghostwriting |
A data model shared by all four cases
If these four cases remain stories in an article, they generate no long-term value. To enter team practice, they must be pressed into the same data model. The model does not need to be complicated at first, but it must answer four questions: what was the task, what did the AI do, why did humans change it, and where should this record go next.
I recommend abstracting each case into a Mentor Event. It is neither a complete training sample nor a complete evaluation task, but an intermediate layer between raw logs and data assets. This is where many teams fail: they either hoard raw conversations or jump straight to organizing a training set, with no auditable, filterable, routable fact layer in between.
| Field group | Recorded content | Corresponding cases |
|---|---|---|
| Task context | Task type, business goal, code scope, risk level, acceptance criteria | All four cases |
| AI behavior | Generation, review, explanation, guidance, planning, modification suggestions | Model selection, code review, education |
| Human feedback | Error types, reasons for correction, reasons for rejection, transferable principles | Feedback protocol, code review |
| Verification evidence | Test results, review conclusions, learner corrections, manual scoring | Model selection, education |
| Data routing | Eval, knowledge base, SFT candidate, preference data, discard | All four cases |
The key to this model is retaining the "why". If only the AI output and final result are recorded, you can later do only rough statistics; only by recording why humans approved or rejected can training and evaluation assets form. In the API documentation case, "incomplete parameter description" is merely the symptom, "the AI does no consistency check against interface fields and response fields" is the root cause, and "document generation must include a field coverage check" is the transferable principle.
Mentor Events have another benefit: they let data from all four cases merge. Failed tasks from model selection can enter the private eval; bad-example corrections from the feedback protocol can enter SFT candidates; valid comments from code review can enter the knowledge base; misconception labels from the education scenario can shape the next round of exercise design. Without a unified model, these assets scatter across systems and end up living only in people's heads.
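A minimal Python sketch of a Mentor Event record; the field groups follow the table above, while the field names and route values are illustrative assumptions rather than a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class MentorEvent:
    """Intermediate fact layer between raw logs and data assets."""
    # Task context
    task_type: str
    risk_level: str
    acceptance_criteria: list[str]
    # AI behavior
    ai_action: str               # "generate" | "review" | "explain" | "guide" | "plan"
    ai_output_ref: str           # pointer to the raw artifact, not a copy
    # Human feedback: keep the "why", not just the verdict
    error_types: list[str] = field(default_factory=list)
    correction_reason: str = ""
    transferable_principle: str = ""
    # Verification evidence
    evidence_refs: list[str] = field(default_factory=list)  # tests, CI runs, review links
    # Data routing
    route: str = "undecided"     # "eval" | "knowledge_base" | "sft_candidate" | "preference" | "discard"
```

Keeping `ai_output_ref` as a pointer rather than a copy is deliberate: the event layer stays small and auditable, while raw artifacts remain in their source systems.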
A human Mentor workbench: don't leave feedback in the chat window
To actually get the four cases running, the team needs a lightweight Mentor workbench. It does not have to be a standalone platform; it can start embedded in PRs, issues, documentation systems, or internal forms. The point is not the interface but letting human Mentors complete three things at minimal cost: confirm facts, add judgment, and decide routing.
A practical Mentor workbench can be divided into four areas.
| Area | Role | Minimal implementation |
|---|---|---|
| Fact area | Show the task, AI output, diffs, tests, reviews, or learning trajectories | Pull automatically from existing systems |
| Judgment area | Let the Mentor mark error types, root causes, and correction principles | Preset labels plus short free text |
| Evidence area | Link tests, screenshots, CI runs, human scoring, or learner corrections | Mostly automatic association, manual supplements as needed |
| Routing area | Decide whether a sample enters eval, the knowledge base, training candidates, or is discarded | Single choice plus a stated reason |
Two design principles apply. First, the Mentor must never have to re-enter facts by hand; facts should come from the system, and humans should only judge. Second, not every record can be forced through fine-grained annotation; most records need only rough marking, and only a few high-value records deserve deep curation.
If the team has only a few senior engineers who can act as Mentors, their time must be protected. A junior developer can first mark "the AI went wrong here", the automation can attach the tests and diff, and the Mentor finally judges whether the record deserves to enter the asset library. This way expert time is spent judging value, not doing data entry.
Data governance and assessment boundaries
Metric system: whether a case is valid depends on signal quality
Once the four cases are implemented, don't just ask "is the AI better to use". That metric is too coarse. Metrics should be split into three layers: delivery, feedback, and assets.
| Layer | Metrics | Explanation |
|---|---|---|
| Delivery layer | Task completion time, PR rework rate, AI review efficiency, learning task pass rate | Whether AI improves current work |
| Feedback layer | Error type coverage, Mentor annotation consistency, feedback reuse rate | Whether human feedback is structured |
| Asset layer | Number of private evals, number of SFT candidate samples, knowledge base rule hit rate, discard rate | Whether the case has accumulated long-term assets |
The most overlooked is the feedback layer. Many teams can count how much code the AI generated and how much time it saved, but cannot say whether human feedback is reusable. If feedback is still "this part is not good", "try again", and "doesn't meet the spec", then even a high AI participation rate has built no Mentor capability.
I pay particular attention to two metrics.
The first is the recurrence rate of similar errors. If the team has labeled missing API documentation fields as an error type and added a checking rule to the protocol, that error count should gradually drop. A drop means the feedback was absorbed by the system; no drop means the feedback was merely recorded.
The second is the sample routing hit rate. If 95 out of 100 AI collaboration records have no clear destination, the data model is too coarse or the gating too weak. A more mature system routes most records clearly: some to eval, some to the knowledge base, some to training candidates, and some discarded as sensitive or low-value.
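A minimal sketch of both metrics, assuming each record is a dict with month, error_types, and route keys (an illustrative shape, compatible with the Mentor Event above once serialized).

```python
def recurrence_rate(events: list[dict], error_type: str, month: str) -> float:
    """Share of that month's events carrying a given error type."""
    monthly = [e for e in events if e["month"] == month]
    if not monthly:
        return 0.0
    hits = sum(error_type in e["error_types"] for e in monthly)
    return hits / len(monthly)

def routing_hit_rate(events: list[dict]) -> float:
    """Share of records with a clear destination (anything but 'undecided')."""
    if not events:
        return 0.0
    return sum(e["route"] != "undecided" for e in events) / len(events)

# If the missing-fields recurrence rate does not fall month over month after
# the checker shipped, the feedback was recorded but not absorbed.
```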
Review rhythm: the case library must be continuously pruned
A bigger case library is not automatically better. Model selection tasks expire, feedback protocols shift as team standards change, code review rules become invalid as the architecture evolves, and education misconceptions change with learner levels. A case library that nobody prunes ends up disguising old constraints as current standards.
I recommend establishing both a monthly and a quarterly rhythm.
| Rhythm | Work | Owner |
|---|---|---|
| Monthly | Tally high-frequency errors, review noise, eval failures, and reasons for discarded samples | Team Mentors and platform engineering |
| Quarterly | Remove expired tasks, recalibrate rubrics, check train/eval isolation, update knowledge base rules | Architecture lead, QA, security, and AI engineering |
This is not process mysticism. What AI systems fear most is data that is "stale but authoritative". A human looking at an old norm may realize it is obsolete; a model will not. The model treats whatever appears in context or training examples as current fact. The case library therefore needs a life cycle, especially for samples that carry strong rule implications.
When should cases not become training data?
This article keeps talking about precipitating data, but not every high-quality case should enter training. Training is the heaviest consumption mode; in many situations a knowledge base entry, an eval, or a written spec is more appropriate.
| Scenario | More suitable destination | Reason |
|---|---|---|
| Contains sensitive business logic | Knowledge base after desensitization, or keep only a summary | Raw data is risky |
| Reflects an interim release strategy | Review documents | Training would solidify a temporary approach |
| Dispute is mainly personal style | Not training; at most a team-norm discussion | Preferences are unstable |
| Task depends on a specific historical version | Private eval or archive | May not generalize |
| Error was caused by a toolchain defect | Toolchain backlog | The model should not learn to work around tool flaws |
This boundary keeps teams from confusing "data asset awareness" with "training on everything". Often the most effective improvement is not fine-tuning the model, but adding an eval, changing a checker, updating a spec, or fixing a toolchain defect.
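A minimal sketch of a gating function encoding the table above; the boolean flags are illustrative stand-ins for real checks (a sensitivity scanner, a version pin detector, and so on).

```python
def route_record(record: dict) -> str:
    """Decide a record's destination before anyone proposes it for training."""
    if record.get("contains_sensitive_logic"):
        return "knowledge_base_after_desensitization"
    if record.get("interim_release_strategy"):
        return "review_document"
    if record.get("style_dispute_only"):
        return "team_norm_discussion"
    if record.get("version_specific"):
        return "private_eval_or_archive"
    if record.get("toolchain_defect"):
        return "toolchain_backlog"
    return "sft_candidate"  # only the remainder is even eligible for training
```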
Data governance: the more real the cases, the more boundaries are needed
If case studies stay inside an article, the risk is small. Once the team turns cases into data assets, governance must be addressed.
| Risk | Handling |
|---|---|
| Customer data enters traces | Desensitize before collection; sensitive fields not stored by default |
| Training set contaminates eval | Train/eval/holdout partitions with version management |
| Historical errors get solidified | Samples need a life cycle and a removal mechanism |
| Personal preferences become rules | Preference samples must state their constraints and justification |
| AI review noise erodes trust | Severity stratification and human validity feedback |
| Over-tracking of education data | Minimize collection of learning data and state its purpose |
Verbal agreement is not enough here. Once data enters the training or evaluation pipeline, it will in turn shape model behavior. Bad data governance trains the team's historical debt into the model's future default behavior.
Action path: calibrating your practice with the four cases
What readers should take away: don't start with a template
After reading these four cases, the easiest immediate action is to "organize a better prompt template". That works in the short term, but it should not be the first step. A template is only a way to temporarily hand rules to the model. What must be built first is the judgment system behind the rules.
In team practice, a more stable order is: define the task contract first, then break down the error types, then write the rubric, and only then decide whether these rules are consumed via prompt, RAG, eval, review checklist, or SFT samples. The benefit is that even when the model or toolchain changes, the team keeps stable quality standards.
If you are facing a scenario where "AI output is unstable", ask four questions first:
| Question | Artifact |
|---|---|
| What exactly did the AI get wrong? | Error type table |
| Why do humans consider it wrong? | Feedback reasons and transferable principles |
| How do we prove it has been corrected? | Acceptance criteria and verification evidence |
| Where should this record go next? | Eval, knowledge base, training candidate, or discard |
Only when these questions have clear answers does the prompt become meaningful. Otherwise the prompt is just a pile of requirements, and the team still has not expressed what counts as good, what counts as wrong, and which improvements are worth keeping.
Use the four cases to calibrate your own practice direction
The four cases also work as a self-check list. Whether an AI programming practice is really close to Coding Mentor depends not on how many tools it uses or how many automation scripts it writes, but on whether it turns human judgment into reusable signals.
| Scenario | If you are only doing this | You should further ask |
|---|---|---|
| Model selection | Comparing leaderboards and subjective impressions | Do we have our own private eval and task-fitness profile? |
| Feedback protocol | Endlessly editing prompt copy | Have we precipitated error types, rubrics, and acceptance protocols? |
| Code review | Letting AI generate more review comments | Do we know which comments are valid, which are noise, and which can settle into team rules? |
| Programming education | Letting AI give answers faster | Have we recorded learning trajectories, misconception labels, and ability changes? |
The point of this self-check is not to reject tool use. Leaderboards, prompts, AI reviewers, and AI assistants can all keep being used, but they only count as entry points. The real watershed is whether the team can continuously extract task, error, feedback, verification, and routing information from those entry points.
A practice that only improves current delivery efficiency, without being able to review why the AI succeeded or failed, remains at "using AI". A practice that makes the next evaluation more accurate, the next round of feedback more consistent, and the next batch of training candidates higher quality is beginning to approach "serving as a Coding Mentor for AI".
Next step: from cases to the organization-level closed loop
The four cases provide entrances, not destinations. Model selection lets the team see task boundaries, feedback protocols make quality standards explicit, code review lets engineering preferences be labeled, and programming education makes the learning process assessable. They all point to the same question: how an organization stably collects, filters, routes, and reuses these signals.
That is where Part 7 picks up. A single case can be driven by a few senior engineers, but an organization-level closed loop needs clearer system design: who collects, who annotates, which data enters eval, which enters the knowledge base, which can become SFT candidates, and which must be discarded as sensitive, expired, or low-value.
Part 8 then discusses SFT data generation, and the ordering is deliberate. Training data should not be scraped straight from chat logs or PR comments; it should come from engineering assets that have passed through task contracts, error classification, human feedback, verification evidence, and governance gates. High-quality feedback comes before high-quality data; a private eval comes before claims about training effects; governance boundaries come before automated pipelines. The final article, Part 9, returns to long-term evolution and future judgment.
Conclusion: a case's value lies not in the story but in the reusable signal
The core judgment of this set of cases is simple: case studies exist not to prove that AI tools are useful, but to show how humans turn the AI collaboration process into a reusable guidance system.
The model selection case says: don't ask "which model is best", ask "which model is reliable within our task boundaries". The feedback protocol case says: don't treat the prompt as the core asset; structure human quality standards instead. The code review case says: don't let AI pose as the final gatekeeper; make it a review signal collector. The programming education case says: don't just let AI hand out answers; turn the learning process into ability assessment data.
If the team takes only one action away from this article, I suggest starting with Case 2: pick a high-frequency, low-risk scenario, split "the AI output is not good" into 6 to 10 error types, write down the qualification criteria, collect 20 good samples and 20 bad-case corrections, and run a small eval. That is worth more than another round of prompt editing.
The acceptance criteria should be equally simple: after one month, have similar errors decreased, is human feedback more consistent, and can high-value samples be routed clearly? If none of these changed, the team has merely written a new template, not built a real Mentor mechanism.
Because the real Coding Mentor is not about prompting the AI into obedience; it is about turning human engineering judgment into signals the AI can learn from, the team can review, and the system can evaluate.
References and Acknowledgments
- OpenAI: Prompt engineering
- Anthropic: Prompt engineering overview
- OpenAI: OpenAI Evals API
- OpenAI: How evals drive business results
- LangChain: Evaluating AI apps with LangSmith
- LangChain: Human feedback and annotation queues
- GitHub Docs: About code review in GitHub Copilot
- GitHub Blog: Research: quantifying GitHub Copilot’s impact in the enterprise with Accenture
- Jimenez et al.: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Series context
You are reading: AI Coding Mentor Series
This is article 6 of 9.
Series chapters
- Why do you need to be a coding mentor for AI? When AI programming assistants become standard equipment, the real competitiveness is no longer whether they can use AI, but whether they can judge, calibrate and constrain the engineering output of AI. This article starts from trust gaps, feedback protocols, evaluation standards and closed-loop capabilities to establish the core framework of "Humans as Coding Mentors".
- Panorama of AI programming ability evaluation: from HumanEval to SWE-bench, the evolution and selection of benchmarks Public benchmarks are not a decoration for model rankings, but a measurement tool for understanding the boundaries of AI programming capabilities. This article starts from benchmarks such as HumanEval, APPS, CodeContests, SWE-bench, LiveCodeBench and Aider, and explains how to read the rankings, how to choose benchmarks, and how to convert public evaluations into the team's own Coding Mentor evaluation system.
- How to design high-quality programming questions: from question surface to evaluation contract High-quality programming questions are not longer prompts, but assessment contracts that can stably expose the boundaries of abilities. This article starts from Bloom level, difficulty calibration, task contract, test design and question bank management to explain how to build a reproducible question system for AI Coding Mentor.
- Four-step approach to AI capability assessment: from one test to continuous system evaluation Serving as a coding mentor for AI is not about doing a model evaluation, but establishing an evaluation operation system that can continuously expose the boundaries of capabilities, record failure evidence, drive special improvements, and support collaborative decision-making.
- Best Practices for Collaborating with AI: Task Agreement, Dialogue Control and Feedback Closed Loop The core skill of being a Coding Mentor for AI is not to write longer prompt words, but to design task protocols, control the rhythm of conversations, identify error patterns, and precipitate the collaboration process into verifiable and reusable feedback signals.
- Practical cases: feedback protocol, evaluation closed loop, code review and programming education data Case studies should not stop at “how to use AI tools better”. This article uses four engineering scenarios: model selection evaluation, feedback protocol design, code review signal precipitation, and programming education data closed loop to explain how humans can transform the AI collaboration process into evaluable, trainable, and reusable mentor signals.
- From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop The real organizational value of AI programming assistants is not just to increase delivery speed, but to precipitate trainable, evaluable, and reusable mentor signals in every requirement disassembly, code generation, review and revision, test verification, and online review. This article reconstructs the closed-loop framework of AI training, AI-assisted product engineering delivery, high-quality SFT data precipitation, and model evaluation.
- From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.
- Future Outlook: Evolutionary Trends and Long-term Thinking of AI Programming Assessment As the final article in the series, this article reconstructs the future route of AI Coding Mentor from the perspective of engineering decision-making: how evaluation objects evolve, how organizational capabilities are layered, and how governance boundaries are advanced.