Practical cases: feedback protocol, evaluation closed loop, code review and programming education data
Case studies should not stop at "how to use AI tools better". This article walks through four engineering scenarios, model selection evaluation, feedback protocol design, code review signal precipitation, and the programming education data closed loop, to explain how humans can turn the AI collaboration process into evaluable, trainable, and reusable mentor signals.
Copyright statement and disclaimer
This article is an original interpretation based on public materials from OpenAI, Anthropic, LangChain, GitHub, SWE-bench, and others. Copyright of the original texts belongs to their authors and sources. This article is not an official translation and does not represent the views of those organizations.
Original nature
The case framework, feedback protocol, data routing, evaluation closed loop, and implementation path in this article are the author's original reconstruction. External materials serve only as industry practice and technical grounding and are not translated paragraph by paragraph.
Beginning: extracting Mentor signals from the cases
When reading this set of cases, the most important thing is not to learn how to use a certain tool, nor to collect a set of prompts that can be copied directly. What really deserves attention is how AI exposes its capability boundaries in real engineering scenarios, how humans give learnable feedback, and how teams turn the judgments left behind by each collaboration into assets for subsequent evaluation, training, and governance.
If a case only tells you "how to write better prompts", it solves the problem of first-pass output quality. If a case can break AI's poor output down into error types, feedback protocols, evaluation criteria, and training candidate samples, it has truly entered the main line of Coding Mentor.
Therefore, the subject of the next four cases is not the tools but the human Mentor's judgment. Tools are just entry points, prompts are just one way to consume a protocol, and code review comments are just raw signals. What really matters is whether these entry points can keep generating verifiable, routable, and reusable Mentor signals.
When reading each case, you can ask five fixed questions:
| Question | Why it matters |
|---|---|
| What mistakes is AI most likely to make in this scenario? | You can only talk about guidance once you can see error patterns |
| What feedback should a human Mentor give? | Feedback must move from subjective evaluation to learnable signals |
| What evidence can prove the improvement worked? | Without evaluation evidence, a case is just experience sharing |
| What assets do these processes precipitate into? | Long-term value comes from capability building, not one-off efficiency gains |
| What data must not enter training or evaluation? | Data governance determines whether the closed loop is reliable |
The four cases correspond to the four Mentor capabilities:
- Model selection evaluation: turn "which AI feels good to use" into "which AI fits our task boundaries".
- Feedback protocol design: turn "prompt optimization" into "error diagnosis, feedback structure, and acceptance criteria".
- Code review signal precipitation: turn "AI helps with review" into "trainable samples of the team's engineering standards".
- Programming education data closed loop: turn "AI teaches students" into "learning trajectories, misconception labels, and ability assessment data".
If you have followed the earlier articles on assessment, question design, and collaboration methods, this set of cases is the entry point to engineering practice. First get clear about the errors, feedback, and evidence in a concrete scenario, then move to the organization-level data closed loop; first accumulate high-quality Mentor signals, then discuss which samples are eligible to enter the SFT data generation pipeline. The order cannot be reversed.
Case 1: model selection is not about picking the champion, but about task fitness
When teams choose an AI programming assistant for the first time, they naturally open the public leaderboards: SWE-bench rankings, the Aider leaderboard, model vendor documentation, and community discussions. This is a correct move, but it only answers a rough question: roughly how capable a given model is on public tasks.
What the team really wants to answer is a different question: is this model reliable within our codebase, our task types, our security boundaries, and our review habits?
The difference between public leaderboards and a team's private evaluation can be understood like this:
| Assessment source | What it answers | What it cannot answer |
|---|---|---|
| Public leaderboards | Whether the model's general coding ability is in the usable range | Whether it fits the team's codebase and business constraints |
| Vendor documentation | Model context, tool capabilities, pricing, and interface limits | Rework cost on real PRs |
| Small-sample trial | Whether developers' subjective experience is smooth | Stability, boundary failures, and long-term quality |
| Private eval | Whether the model can complete the team's real tasks | Not fully representative of all future tasks |
Therefore, the core of model selection is not to "pick the strongest model" but to establish a private task evaluation protocol. The protocol should be sampled from the team's real work, not a few improvised algorithm puzzles.
Scenario background
Suppose a 50-person engineering team is preparing to introduce an AI programming assistant. The candidates include a general-purpose chat model, an IDE built-in assistant, a command-line coding agent, and an internally deployed private model. The initial discussion splits into camps: some value generation speed, some the context window, some the price, some worry about security, and some think it is enough that developers like it.
If this kind of discussion never converges, it easily becomes a dispute over tool preferences. The Coding Mentor approach is to define the task boundary first, then evaluate each model inside that boundary.
Selection questions from the Mentor's perspective
Model selection should be broken down into at least five capability categories, rather than just looking at "how well it writes code".
| Capability dimension | Assessment question | Task sample source |
|---|---|---|
| Requirements understanding | Can it identify true goals, non-goals, and acceptance criteria? | Historical requirements, PR descriptions, defect tickets |
| Code localization | Can it find the right files and call chains? | Historical bug fixes, refactoring tasks |
| Minimal modification | Does it avoid unrelated changes and over-refactoring? | Review rework records |
| Verification awareness | Does it proactively add tests and run the right commands? | CI failure records, test escapes |
| Constraint compliance | Does it respect security, performance, compatibility, and architectural boundaries? | Architecture decisions, incident reviews |
The key here is the "task sample source". If the samples come from real historical tasks, the problems the model exposes will be close to real delivery. If the samples are improvised demonstration questions, the evaluation results will skew optimistic.
This is where SWE-bench is instructive. It builds its evaluation from real GitHub issues and code repositories, emphasizing that models must fix issues in a real engineering context. A team does not need to replicate SWE-bench's complex environment, but it should learn its core spirit: to evaluate software engineering capability, put the model into a real code environment with real acceptance conditions.
How to design an assessment protocol
A practical selection assessment does not need to be large at the start. Thirty to sixty private tasks, split into four groups, is a reasonable beginning:
| Task group | Suggested count | Goal |
|---|---|---|
| Bug fixes | 10 to 20 | Test the model's root-cause localization and minimal-repair ability |
| Test enhancement | 8 to 12 | Test whether the model understands boundary conditions and verification paths |
| Small feature implementation | 8 to 15 | Test the model's completeness from requirements to implementation |
| Code review | 8 to 12 | Test whether the model identifies risks, explains reasons, and suggests corrections |
Each task should contain at least five fields: task description, code version, allowed context, acceptance criteria, and a reference evaluation rubric. Don't just store "standard answers". Standard answers are useful for training but insufficient for evaluation. Evaluation needs to reveal where the model failed, why it failed, and what exactly went wrong.
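To make this concrete, a private eval task can be stored as a small structured record. Below is a minimal Python sketch; the field names (`allowed_context`, `rubric`, and so on) and the example values are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    """One private selection-eval task sampled from the team's real work."""
    task_id: str
    group: str                      # "bugfix" | "test_enhancement" | "small_feature" | "code_review"
    description: str                # what the model is asked to do
    code_version: str               # commit SHA pinning the codebase state
    allowed_context: list[str]      # files and docs the model may see
    acceptance_criteria: list[str]  # checkable conditions, not a single "standard answer"
    rubric: dict[str, str]          # dimension -> what qualified vs. failed looks like

task = EvalTask(
    task_id="bugfix-017",
    group="bugfix",
    description="Fix the pagination off-by-one reported in defect ticket 4821.",
    code_version="a1b2c3d",
    allowed_context=["api/pagination.py", "tests/test_pagination.py"],
    acceptance_criteria=[
        "existing tests pass",
        "a new boundary test is added",
        "the diff touches only pagination code",
    ],
    rubric={
        "minimal_modification": "no unrelated files changed",
        "verification": "the model runs or proposes the relevant tests",
    },
)
```

Storing acceptance criteria as checkable conditions rather than one reference answer is what lets the same task later explain why a model failed, not just that it failed.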
How to interpret the results
Don't reduce the selection report to a single total-score ranking. Total scores easily mask risk. What is more valuable is the "capability profile" (a small aggregation sketch follows the table).
| Model performance | Possible conclusion |
|---|---|
| Strong at bug fixing, weak at testing | Suitable as a development assistant, not for independent submission |
| Fast generation, but large modification scope | Suitable for drafts, not for pushing directly to main |
| Many review suggestions, but many false positives | Usable for pre-review, not for blocking merges |
| Stable context utilization, but expensive | Suitable for high-risk modules, not for full rollout |
| Very strong on small tasks, weak at planning large ones | Suitable for local implementation, not cross-module refactoring |
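A minimal aggregation sketch, assuming each eval run is recorded as a dict with `model`, `dimension`, and `passed` keys (an illustrative shape): it produces per-dimension pass rates instead of one total score.

```python
from collections import defaultdict

def capability_profile(results: list[dict]) -> dict:
    """Aggregate eval results into {model: {dimension: pass_rate}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        cell = counts[r["model"]][r["dimension"]]
        cell[0] += int(r["passed"])
        cell[1] += 1
    return {m: {d: p / t for d, (p, t) in dims.items()}
            for m, dims in counts.items()}

profile = capability_profile([
    {"model": "agent-a", "dimension": "bug_fix", "passed": True},
    {"model": "agent-a", "dimension": "verification", "passed": False},
    {"model": "agent-a", "dimension": "verification", "passed": False},
])
# A model at 0.9 on bug_fix but 0.3 on verification should not submit
# independently, no matter how good its average looks.
```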
GitHub and Accenture's research on Copilot in the enterprise reminds us that enterprise scenarios should combine telemetry, quality, and process metrics rather than rely on developers' subjective experience alone. The same applies to model selection: developer comfort matters, but it is no substitute for evaluation evidence.
What can be learned from this case?
After completing the selection evaluation, don't leave behind just a slide deck. It should precipitate at least four types of assets:
| Asset | Subsequent use |
|---|---|
| Private task set | Regression when models are upgraded, tools replaced, or prompts revised |
| Capability profile | Deciding which tasks allow deep AI involvement |
| Error type distribution | Guiding subsequent feedback protocols and training sample collection |
| Risk boundary | Deciding whether high-risk modules disable or restrict AI |
This is the Coding Mentor way of selection: instead of asking "which model is the strongest", ask "which model can be trusted, to what extent, on which tasks".
Case 2: from prompt optimization to feedback protocol design
The original version of Case 2 was about API documentation generation, iterating from a simple prompt to structured prompts, few-shot examples, and quality checklists. The problem with that write-up is not the technique but the shallowness of the conclusion: it attributes the jump from 60 to 90 points to writing the prompt in more detail.
A more accurate explanation is that humans made their implicit quality standards explicit.
When a team says "the AI-generated API documentation is poor", that does not help the model. Poor how? Is the structure incomplete, are parameters missing, are error codes unclear, are examples not runnable, or is the terminology inconsistent with team conventions? Only by breaking "poor" down into a protocol that can be diagnosed, fed back, and accepted does the AI have a path to improvement.
Scenario background
A backend team wanted AI to assist in generating API documentation. At first, developers threw interface code at the AI and asked it to "generate documentation". The results looked fine, but the review meeting surfaced many problems: missing fields in parameter tables, incomplete error responses, sample code that would not run, the same concept named inconsistently across documents, and authentication instructions that appeared only sometimes.
If the team only optimizes the prompt at this point, it easily falls into a loop: find a problem, append a requirement to the prompt. After a few rounds the prompt is very long, but quality is still unstable, because the team has not established a feedback protocol; it is merely piling up constraints.
What exactly is wrong with the 60-point output?
The first step is not to change the prompt but to diagnose the types of output failure.
| Failure type | Surface symptom | Mentor judgment |
|---|---|---|
| Missing structure | Document sections are out of order | The AI does not know the team's agreed document structure |
| Missing fields | Parameter descriptions are incomplete | The AI does not check consistency between code fields and document fields |
| Missing error paths | Only success responses are documented | The AI covers only the happy path by default |
| Non-runnable examples | curl or SDK examples are missing parameters | The AI performs no reproducibility check |
| Terminology drift | Token, access token, and credential used interchangeably | The AI lacks a team glossary |
| Unclear security boundaries | Authentication and permission descriptions are ambiguous | The AI does not know which security constraints must be stated explicitly |
This table matters more than a "better prompt". It unpacks vague human dissatisfaction into error classifications. These classifications can enter the evaluation rubric or review tags, and they can also become metadata on subsequent SFT samples.
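A minimal sketch of how such an error taxonomy can travel with samples as metadata; the enum values follow the table above, while the record shape is an assumption.

```python
from enum import Enum

class DocFailure(Enum):
    MISSING_STRUCTURE = "missing_structure"
    MISSING_FIELDS = "missing_fields"
    MISSING_ERROR_PATHS = "missing_error_paths"
    NON_RUNNABLE_EXAMPLE = "non_runnable_example"
    TERMINOLOGY_DRIFT = "terminology_drift"
    UNCLEAR_SECURITY = "unclear_security"

# The same labels serve three consumers: the eval rubric, review tags,
# and metadata on SFT candidate samples.
sample = {
    "task": "generate docs for POST /orders",
    "ai_output_ref": "doc-run-0042",
    "failures": [DocFailure.MISSING_FIELDS, DocFailure.TERMINOLOGY_DRIFT],
    "mentor_note": "response table omits `currency`; mixes token and credential",
}
```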
The four-layer structure of a feedback protocol
In this scenario I suggest splitting the feedback protocol into four layers instead of writing a single prompt template.
| Layer | Role | Artifact |
|---|---|---|
| Task contract | Clarify inputs, outputs, non-goals, and acceptance criteria | Document generation task card |
| Quality rubric | Define what counts as qualified, excellent, and unacceptable | Scoring sheet and error types |
| Example library | Demonstrate high-quality output and typical bad examples | Few-shot example library |
| Verification mechanism | Check fields, examples, terminology, and security statements | Automatic checks plus manual spot checks |
Both OpenAI's and Anthropic's prompt engineering documentation emphasize clear instructions, examples, and constraints. But in an engineering team, these things should not live inside one prompt alone. They should be extracted into stable assets: task contracts that can be reused, rubrics that can be scored against, example libraries that can be updated, and verification mechanisms that can be regressed.
What role should the prompt play here?
The prompt is not the core asset; the feedback protocol is. The prompt is just one way to temporarily inject the protocol into the model.
The same set of feedback protocols can be consumed in different ways:
| Consumption pattern | Usage scenario |
|---|---|
| Prompt template | One-off generation or in-IDE assistant |
| System instructions | Internal documentation agent |
| RAG context | Retrieve specs and examples by interface type |
| Eval rubric | Automatic or semi-automatic scoring |
| SFT samples | Train the model toward a stable documentation style |
| Review checklist | Keep human review consistent |
This is why I don't recommend writing Case 2 up as a "final prompt template". A final template expires quickly. The lasting value is the protocol behind the template.
How to evaluate the move from 60 to 90 points
Many teams describe changes in AI output quality as "60 points, 70 points, 80 points, 90 points", but a score with no source is just a subjective feeling. For improvements to be repeatable, scores must become auditable metrics.
| Metric | 60-point behavior | 90-point requirement |
|---|---|---|
| Structural integrity | Sections missing or in unstable order | Fixed sections complete, order follows team convention |
| Field coverage | Only main parameters covered | Parameters, response fields, and error codes consistent with the code |
| Example runnability | Examples missing parameters or not copy-pastable | At least one runnable success example and one error-handling example |
| Terminology consistency | Synonyms mixed | Team glossary used |
| Security statements | Authentication mentioned occasionally | Authentication, permissions, and sensitive fields stated explicitly |
| Manual modification | Reviewer rewrites a lot | Reviewer only does light confirmation |
This way, "90 points" is no longer the author's subjective feeling but a degree of protocol compliance.
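As a sketch of the verification-mechanism layer, a checker might compare documented parameters against the code's field list and flag glossary violations. The function names and the way fields are matched are assumptions; a real checker would parse your actual documentation format.

```python
import re

GLOSSARY_BANNED = {"credential": "access token", "auth token": "access token"}

def check_field_coverage(doc_text: str, code_fields: set[str]) -> list[str]:
    """Report code fields that never appear in the generated document."""
    return [f for f in sorted(code_fields) if f not in doc_text]

def check_terminology(doc_text: str) -> list[str]:
    """Report banned synonyms that should be replaced by the glossary term."""
    lowered = doc_text.lower()
    return [f"use '{good}' instead of '{bad}'"
            for bad, good in GLOSSARY_BANNED.items() if bad in lowered]

def check_error_paths(doc_text: str) -> list[str]:
    """Crude check that at least one non-2xx status code is documented."""
    if not re.search(r"\b[45]\d\d\b", doc_text):
        return ["no error response (4xx/5xx) documented"]
    return []

doc = "POST /orders returns order_id and amount. Auth via credential. 400 on bad input."
issues = (check_field_coverage(doc, {"order_id", "currency", "amount"})
          + check_terminology(doc)
          + check_error_paths(doc))
# -> ["currency", "use 'access token' instead of 'credential'"]
```

Checks this crude still move "field coverage" and "terminology consistency" from reviewer memory into a regression that runs on every generation.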
What can be learned from this case?
The most valuable output of this case is not the prompt but six types of assets:
| Asset | Use |
|---|---|
| API documentation task contract | Constrain AI output boundaries |
| Error type table | Evaluation and training metadata |
| High-quality example library | Support few-shot, RAG, and SFT |
| Bad-example and correction pairs | Support Mentor feedback samples |
| Automatic check rules | Check field coverage, terminology, and security statements |
| Private eval tasks | Regression across models and prompt versions |
The conclusion of this case should read: prompt optimization is only the surface action; the real engineering action is turning the humans' document review standards into a feedback protocol. Only then does the AI stop being "prompted more clearly this time" and gradually learn how the team judges a good document.
Case 3: the code review assistant is not a gatekeeper but a signal collector
Code review is one of the scenarios best suited to Coding Mentor. The reason is simple: review is where human engineering judgment is naturally densest. Security, performance, boundary conditions, abstraction costs, testing strategy, compatibility, and release risk all show up in review.
The problem is that most review comments only serve the current PR. After the merge they are rarely stored in a structured way, let alone flowed back into model evaluation or training. AI code review tools can make suggestions automatically, and GitHub Copilot provides PR-level code review. But if the team treats it as just "one more reviewer", the value stays limited.
A better positioning: the code review assistant expands the scope of problem discovery, while the human Coding Mentor calibrates the signals, turning high-value reviews into team engineering norms, evaluation tasks, and training candidate data.
Scenario background
One team tried having AI pre-review every PR. The initial results were poor: the AI did find obvious problems like null pointers, missing tests, and bad variable naming, but it also produced many low-value suggestions. Developers complained about the noise, and reviewers found that judging whether the AI's comments were credible made review more tiring, not less.
This is not because AI review is worthless, but because the team had not broken the review task down clearly.
What an AI review should and shouldn’t cover
| Review type | Suitable for AI pre-screening | Human calibration needed |
|---|---|---|
| Syntax, lint, simple bugs | Suitable | Low |
| Missing tests, missing error paths | Suitable | Medium |
| Security-sensitive patterns | Suitable for candidate discovery | High |
| Performance risks | Suitable for hints | High |
| Architectural boundaries | Requires team context | High |
| Product trade-offs | Not suitable for final judgment | Very high |
If AI gave the final verdict on every issue, it would cross the line. If AI only discovers candidates and humans calibrate, the team reduces the probability of missed detections while keeping human control over organizational constraints.
How review signals are structured
A high-value review should leave at least four layers of signals:
| Hierarchy | example |
|---|---|
| factual evidence | Which file, which diff, which test or log triggers the problem |
| Question type | Security, performance, wrong paths, contract violations, insufficient testing |
| engineering consequences | What risks will it cause if we don’t change it? |
| Correction principle | How to deal with similar problems in the future |
A typical comment says: "there may be a performance issue here". A Mentor-grade comment says: "this loop queries inventory status one record at a time on the request hot path; as data volume grows, database latency is exposed directly to user requests; change it to a batch query and add an integration test with more than 100 orders." Only the latter can enter the knowledge base, an eval, or training candidate samples.
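A minimal sketch of the four-layer signal as a record; the field names and verdict labels are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ReviewSignal:
    """One calibrated review comment, structured for reuse beyond the current PR."""
    evidence: str        # file/diff/test that triggered the finding
    problem_type: str    # "security" | "performance" | "error_path" | "contract" | "testing"
    consequence: str     # concrete risk if left unchanged
    principle: str       # transferable rule for similar cases
    source: str          # "ai" or "human"
    verdict: str | None = None  # human calibration: "valid" | "false_positive" | "duplicate" | "style"

signal = ReviewSignal(
    evidence="orders/service.py: per-item inventory query inside request handler loop",
    problem_type="performance",
    consequence="DB latency surfaces directly in user-facing request time as order count grows",
    principle="hot-path collection lookups must be batched; add an integration test above 100 items",
    source="ai",
    verdict="valid",
)
```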
From review to data assets
The code review scenario can accumulate four types of assets.
| Asset | Source | Use |
|---|---|---|
| Review rubric | Team norms and historical reviews | Unify AI and human review standards |
| Error pattern library | High-frequency review findings | Guide prompts, RAG, and training |
| Correction sample pairs | Problems found by AI or humans plus the final patch | SFT or preference data candidates |
| Private review eval | Issue-localization tasks built from historical PRs | Test whether the model surfaces the issues the team cares about |
Pay special attention here to the isolation of eval from training. A historical PR can become either a review eval or a training sample, but it must not enter both in a way that leaks the answer. Otherwise the model appears to have improved at review when it has merely memorized historical answers.
How to control noise
The biggest engineering problem with AI review is usually not missed findings but noise. Noise burns reviewer trust, and once developers get used to ignoring AI review, the genuinely valuable findings get skipped too.
Noise can be controlled with three mechanisms (a routing sketch follows the table):
| Mechanism | Practice |
|---|---|
| Severity stratification | Only high-risk findings block; the rest are suggestions |
| Evidence requirement | Comments without documentation, diff, test, or spec references are downgraded |
| Feedback flow | Humans label findings as false positive, valid, duplicate, or style preference |
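A minimal routing sketch for the three mechanisms, assuming each AI finding is a dict carrying evidence, a problem type, and an optional human verdict; the high-risk set and return labels are illustrative.

```python
HIGH_RISK = {"security", "contract"}

def route_finding(finding: dict) -> str:
    """Decide whether an AI finding blocks the merge, stays a suggestion,
    or is downgraded for lacking evidence."""
    if not finding.get("evidence"):
        return "downgraded"      # evidence requirement
    if finding["problem_type"] in HIGH_RISK and finding.get("verdict") != "false_positive":
        return "blocking"        # severity stratification
    return "suggestion"          # everything else stays advisory

print(route_finding({"problem_type": "performance",
                     "evidence": "profile trace", "verdict": "valid"}))
# -> suggestion: performance findings advise but never block on their own
```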
LangSmith's ideas of human feedback and annotation queues are highly relevant here: instead of treating all model outputs as conclusions, humans annotate trajectories and outputs, then use those annotations to improve datasets and evaluations. The same applies to code review: AI comments themselves need review.
What can be learned from this case?
The value of the code review case is not letting AI replace the reviewer, but turning the review process into a data source for the team's engineering judgment.
Eventually it should settle:
- Which problems AI can detect reliably.
- Which issues must be left to human judgment.
- Which review rules can enter the knowledge base.
- Which corrected samples are suitable for SFT or preference data.
- Which historical PRs can become private review evals.
This is what "training AI to be the team's code gatekeeper" really means: not a gatekeeper deciding everything alone, but one continuously calibrated inside a closed loop of clear rules and feedback.
Case 4: programming education is not about letting AI be the teacher, but about making the learning process assessable
The programming education scenario is most easily written up as "AI can personalize teaching". The direction is right, but it is still tool-biased. In the Coding Mentor series, the more critical question is whether the interaction between learners and AI can precipitate ability assessment, misconception diagnosis, and teaching feedback data.
The education scenario has a unique advantage: the learning process naturally contains errors. Students write wrong code, misunderstand concepts, fail tests, and correct themselves after hints. These are all high-value trajectories. Compared with production code, education data carries lower security risk and clearer teaching signals, making it well suited to training AI's feedback and guidance capabilities.
Scenario background
One team trains new engineers internally and wanted AI to assist with learning Python, testing, code review, and simple system design. Initially they let the AI act as a teaching assistant: answering questions, explaining concepts, and providing exercises. It worked, but problems soon appeared: the AI sometimes gave answers directly, so students bypassed thinking; different students received inconsistent hints; tutors could not tell what students had actually mastered; and learning records could not be reused.
The core of this scenario is not "whether the AI can teach" but "whether the learning process can be evaluated".
Mentor principles for a teaching AI
In teaching scenarios, the AI should not default to giving the final answer. It should choose its feedback intensity based on the learner's state.
| Learner state | Feedback the AI should give | What not to do |
|---|---|---|
| Completely stuck | Hint at problem decomposition and related concepts | Post the complete answer directly |
| Right idea, wrong implementation | Point to the error location and the minimal correction direction | Rewrite the whole solution |
| Passes tests but poorly structured | Guide comparison of readability, complexity, and boundaries | Just say "this can be optimized" |
| Repeats the same mistake | Return to the concept level and explain the misconception | Keep applying local patches |
| Has mastered the basics | Add constraints and boundary tasks | Keep repeating simple exercises |
This table is the feedback protocol of the education scenario. It changes the AI's teaching behavior from "answering questions" to "adjusting feedback based on ability state".
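A minimal sketch of state-based feedback selection, assuming an upstream classifier already labels the learner's state; the state and feedback labels mirror the table above and are illustrative.

```python
FEEDBACK_POLICY = {
    "stuck": "hint_decomposition",              # point at sub-problems and concepts
    "right_idea_wrong_impl": "locate_error",    # smallest correction direction
    "passes_but_poor_structure": "guided_comparison",
    "repeated_mistake": "concept_review",
    "mastered_basics": "add_constraints",
}

def choose_feedback(state: str, attempt_count: int) -> str:
    """Map learner state to feedback intensity; never default to the answer."""
    if state == "stuck" and attempt_count >= 3:
        # Escalate hints gradually instead of revealing the full solution.
        return "worked_partial_example"
    return FEEDBACK_POLICY.get(state, "hint_decomposition")
```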
How learning trajectories become evaluation data
A learning task should record more than the final answer. At least five kinds of trajectory should be kept:
| Trajectory | Value |
|---|---|
| Initial solution | Reflects the learner's default thinking |
| Test failures | Expose conceptual misconceptions or missing boundaries |
| AI hints | Record feedback intensity and hint content |
| Correction process | Shows whether the learner truly understands |
| Final explanation | Checks whether the learner can explain the solution in their own words |
These data form a learner's ability profile and, at the same time, AI teaching evaluation data. For the same mistake, did the AI give the answer directly or guide the student to locate it? Could the student correct independently after the hint? This judges teaching quality better than "whether the final code passes".
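A minimal sketch of a trajectory record plus one teaching-quality signal derived from it; all field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class LearningTrajectory:
    """Full record of one exercise attempt."""
    task_id: str
    initial_solution: str
    test_failures: list[str] = field(default_factory=list)  # failing test names/messages
    ai_hints: list[dict] = field(default_factory=list)      # {"intensity": ..., "content": ...}
    corrections: list[str] = field(default_factory=list)    # successive diffs or snapshots
    final_explanation: str = ""
    misconception_labels: list[str] = field(default_factory=list)

def corrected_after_hint(t: LearningTrajectory) -> bool:
    """Teaching-quality signal: did hints (not a revealed answer)
    precede the learner's own correction?"""
    return bool(t.ai_hints) and bool(t.corrections) and all(
        h.get("intensity") != "full_answer" for h in t.ai_hints
    )
```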
Misconception labels matter more than the number of exercises
Educational platforms easily pile up exercises; the more there are, the more complete it looks. But what really determines teaching effectiveness is misconception labels.
| Misconception label | Example |
|---|---|
| Missing boundary conditions | Empty array, duplicate values, None, oversized input |
| State update errors | Modifying a collection inside a loop, sharing mutable default values |
| Complexity misjudgment | Handling large inputs with nested loops |
| Inadequate testing understanding | Testing only happy paths, not error paths |
| Confused abstraction levels | Mixing I/O, business logic, and formatting together |
Misconception labels connect three things: student ability assessment, AI teaching strategy, and training sample construction. Without them, learning records are just a running log; with them, learning records become analyzable data.
What can be learned from this case?
The programming education scenario can accumulate five types of assets:
| Asset | Use |
|---|---|
| Tiered question set | Covers basics, boundaries, testing, refactoring, and review skills |
| Misconception label library | Diagnose learners and the quality of AI feedback |
| Guided feedback samples | Train AI to guide step by step instead of giving answers |
| Learning process evals | Evaluate whether AI promotes understanding rather than ghostwriting |
| Teaching review data | Optimize exercises, hint strategies, and tutor intervention points |
This case also connects to production engineering data. The misconceptions new engineers repeatedly expose in training camp are often the same mistakes AI makes in real code: omitted boundaries, thin error paths, weak testing awareness, and over-abstraction. Well-structured education data can feed back into the broader Coding Mentor system.
From cases to engineering assets: a unified architecture and implementation path
The four cases look different: model selection, feedback protocol, code review, programming education. Behind them is the same architecture.
The unified process is:
- Real tasks enter the system.
- AI gives planning, output, review or teaching feedback.
- Human Mentors judge whether the output meets engineering goals.
- Feedback is structured into error types, root causes, correction strategies, and validation evidence.
- Data is routed to eval, knowledge base, SFT candidate, preference data, or drop zone.
- The next round of models, prompts, toolchains, and team processes are adjusted based on the evaluation results.
This architecture connects to the data closed loop in Part 7. Part 6 uses four cases to explain why the closed loop is necessary; Part 7 abstracts the closed loop into an organization-level system.
Project implementation: from case write-ups to team practice
If the team wants to implement these four cases, don't roll them out simultaneously. A more stable approach is to sequence them by value density.
Stage 1: use model selection to create a private eval seed set
Start with 30 to 60 real tasks and don't aim to cover every scenario. The goal is for the team to have, for the first time, its own baseline of AI programming capability.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Task list | Drawn from real bugs, test escapes, and review disputes |
| Evaluation rubrics | Make correctness, minimality, verification, and constraints explicit |
| Model comparison report | Not just a total-score ranking; include capability profiles |
| Risk boundary | Mark which tasks allow deep AI involvement |
Stage 2: turn one high-frequency scenario into a feedback protocol
Choose a scenario with high frequency, low risk, and clear standards, such as API documentation, unit test generation, or error log interpretation. Don't start with architecture design or security modules.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Task contract | Make input, output, and acceptance criteria explicit |
| Error types | Keep to 6 to 10 categories |
| High-quality examples | Keep only team-approved samples |
| Bad-example corrections | Record why the output failed and why the correction works |
| Regression eval | Re-runnable after changing the model or prompt |
Stage 3: let code review generate Mentor signals
Don't let AI review block merges at first. Let it pre-review, and have humans label findings as valid, false positive, duplicate, or style preference. Once the signal stabilizes, decide which rules may block automatically.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Review tags | Security, performance, error paths, contracts, testing, etc. |
| Validity feedback | Human labels on whether AI findings are valid |
| High-frequency issue bank | Monthly statistics on recurring issues |
| Rule promotion | High-value comments become team norms |
Stage 4: use the education scenario to train feedback ability
The education scenario can serve as a low-risk training ground. Let the AI first learn to "help people without giving away the answer", then transfer that feedback ability to engineering scenarios.
Minimum deliverables:
| Deliverable | Requirement |
|---|---|
| Misconception labels | Cover common mistakes: boundaries, complexity, testing, abstraction |
| Guidance strategies | Distinguish hints, counter-questions, local correction, and concept review |
| Learning trajectories | Save initial answers, hints, corrections, and final explanations |
| Teaching evals | Evaluate whether AI promotes understanding rather than ghostwriting |
A data model shared by all four cases
If these four cases remain stories in an article, they generate no long-term value. To enter team practice, they must be pressed into the same data model. The model does not need to be complicated at first, but it must answer four questions: what was the task, what did the AI do, why did humans change it, and where should this record go next.
I recommend abstracting each case into a Mentor Event. It is neither a complete training sample nor a complete evaluation task, but an intermediate layer between raw logs and data assets. This is where many teams fail: they either hoard raw conversations or jump straight to organizing a training set, with no auditable, filterable, routable fact layer in between.
| Field group | Recorded content | Corresponding cases |
|---|---|---|
| Task context | Task type, business goal, code scope, risk level, acceptance criteria | All four cases |
| AI behavior | Generation, review, explanation, guidance, planning, modification suggestions | Model selection, code review, education |
| Human feedback | Error types, reasons for correction, reasons for rejection, transferable principles | Feedback protocol, code review |
| Verification evidence | Test results, review conclusions, learner corrections, manual scoring | Model selection, education |
| Data routing | Eval, knowledge base, SFT candidate, preference data, discard | All four cases |
The key to this model is retaining the "why". If only the AI output and final result are recorded, you can later do only rough statistics; only by recording why humans approved or rejected can training and evaluation assets form. In the API documentation case, "incomplete parameter description" is merely the symptom, "the AI does no consistency check against interface fields and response fields" is the root cause, and "document generation must include a field coverage check" is the transferable principle.
Mentor Events have another benefit: they let data from all four cases merge. Failed tasks from model selection can enter the private eval; bad-example corrections from the feedback protocol can enter SFT candidates; valid comments from code review can enter the knowledge base; misconception labels from the education scenario can shape the next round of exercise design. Without a unified model, these assets scatter across systems and end up living only in people's heads.
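A minimal Python sketch of a Mentor Event record; the field groups follow the table above, while the field names and route values are illustrative assumptions rather than a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class MentorEvent:
    """Intermediate fact layer between raw logs and data assets."""
    # Task context
    task_type: str
    risk_level: str
    acceptance_criteria: list[str]
    # AI behavior
    ai_action: str               # "generate" | "review" | "explain" | "guide" | "plan"
    ai_output_ref: str           # pointer to the raw artifact, not a copy
    # Human feedback: keep the "why", not just the verdict
    error_types: list[str] = field(default_factory=list)
    correction_reason: str = ""
    transferable_principle: str = ""
    # Verification evidence
    evidence_refs: list[str] = field(default_factory=list)  # tests, CI runs, review links
    # Data routing
    route: str = "undecided"     # "eval" | "knowledge_base" | "sft_candidate" | "preference" | "discard"
```

Keeping `ai_output_ref` as a pointer rather than a copy is deliberate: the event layer stays small and auditable, while raw artifacts remain in their source systems.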
A human Mentor workbench: don't leave feedback in the chat window
To actually get the four cases running, the team needs a lightweight Mentor workbench. It does not have to be a standalone platform; it can start embedded in PRs, issues, documentation systems, or internal forms. The point is not the interface but letting human Mentors complete three things at minimal cost: confirm facts, add judgment, and decide routing.
A practical Mentor workbench can be divided into four areas.
| Area | Role | Minimal implementation |
|---|---|---|
| Fact area | Show the task, AI output, diffs, tests, reviews, or learning trajectories | Pull automatically from existing systems |
| Judgment area | Let the Mentor mark error types, root causes, and correction principles | Preset labels plus short free text |
| Evidence area | Link tests, screenshots, CI runs, human scoring, or learner corrections | Mostly automatic association, manual supplements as needed |
| Routing area | Decide whether a sample enters eval, the knowledge base, training candidates, or is discarded | Single choice plus a stated reason |
Two design principles apply. First, the Mentor must never have to re-enter facts by hand; facts should come from the system, and humans should only judge. Second, not every record can be forced through fine-grained annotation; most records need only rough marking, and only a few high-value records deserve deep curation.
If the team has only a few senior engineers who can act as Mentors, their time must be protected. A junior developer can first mark "the AI went wrong here", the automation can attach the tests and diff, and the Mentor finally judges whether the record deserves to enter the asset library. This way expert time is spent judging value, not doing data entry.
Data governance and assessment boundaries
Metric system: whether a case is valid depends on signal quality
Once the four cases are implemented, don't just ask "is the AI better to use". That metric is too coarse. Metrics should be split into three layers: delivery, feedback, and assets.
| Layer | Metrics | Explanation |
|---|---|---|
| Delivery layer | Task completion time, PR rework rate, AI review efficiency, learning task pass rate | Whether AI improves current work |
| Feedback layer | Error type coverage, Mentor annotation consistency, feedback reuse rate | Whether human feedback is structured |
| Asset layer | Number of private evals, number of SFT candidate samples, knowledge base rule hit rate, discard rate | Whether the case has accumulated long-term assets |
The most overlooked is the feedback layer. Many teams can count how much code the AI generated and how much time it saved, but cannot say whether human feedback is reusable. If feedback is still "this part is not good", "try again", and "doesn't meet the spec", then even a high AI participation rate has built no Mentor capability.
I pay particular attention to two metrics.
The first is the recurrence rate of similar errors. If the team has labeled missing API documentation fields as an error type and added a checking rule to the protocol, that error count should gradually drop. A drop means the feedback was absorbed by the system; no drop means the feedback was merely recorded.
The second is the sample routing hit rate. If 95 out of 100 AI collaboration records have no clear destination, the data model is too coarse or the gating too weak. A more mature system routes most records clearly: some to eval, some to the knowledge base, some to training candidates, and some discarded as sensitive or low-value.
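A minimal sketch of both metrics, assuming each record is a dict with month, error_types, and route keys (an illustrative shape, compatible with the Mentor Event above once serialized).

```python
def recurrence_rate(events: list[dict], error_type: str, month: str) -> float:
    """Share of that month's events carrying a given error type."""
    monthly = [e for e in events if e["month"] == month]
    if not monthly:
        return 0.0
    hits = sum(error_type in e["error_types"] for e in monthly)
    return hits / len(monthly)

def routing_hit_rate(events: list[dict]) -> float:
    """Share of records with a clear destination (anything but 'undecided')."""
    if not events:
        return 0.0
    return sum(e["route"] != "undecided" for e in events) / len(events)

# If the missing-fields recurrence rate does not fall month over month after
# the checker shipped, the feedback was recorded but not absorbed.
```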
Review rhythm: the case library must be continuously pruned
A bigger case library is not automatically better. Model selection tasks expire, feedback protocols shift as team standards change, code review rules become invalid as the architecture evolves, and education misconceptions change with learner levels. A case library that nobody prunes ends up disguising old constraints as current standards.
I recommend establishing both a monthly and a quarterly rhythm.
| Rhythm | Work | Owner |
|---|---|---|
| Monthly | Tally high-frequency errors, review noise, eval failures, and reasons for discarded samples | Team Mentors and platform engineering |
| Quarterly | Remove expired tasks, recalibrate rubrics, check train/eval isolation, update knowledge base rules | Architecture lead, QA, security, and AI engineering |
This is not process mysticism. What AI systems fear most is data that is "stale but authoritative". A human looking at an old norm may realize it is obsolete; a model will not. The model treats whatever appears in context or training examples as current fact. The case library therefore needs a life cycle, especially for samples that carry strong rule implications.
When should cases not become training data?
This article keeps talking about precipitating data, but not every high-quality case should enter training. Training is the heaviest consumption mode; in many situations a knowledge base entry, an eval, or a written spec is more appropriate.
| Scenario | More suitable destination | Reason |
|---|---|---|
| Contains sensitive business logic | Knowledge base after desensitization, or keep only a summary | Raw data is risky |
| Reflects an interim release strategy | Review documents | Training would solidify a temporary approach |
| Dispute is mainly personal style | Not training; at most a team-norm discussion | Preferences are unstable |
| Task depends on a specific historical version | Private eval or archive | May not generalize |
| Error was caused by a toolchain defect | Toolchain backlog | The model should not learn to work around tool flaws |
This boundary keeps teams from confusing "data asset awareness" with "training on everything". Often the most effective improvement is not fine-tuning the model, but adding an eval, changing a checker, updating a spec, or fixing a toolchain defect.
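A minimal sketch of a gating function encoding the table above; the boolean flags are illustrative stand-ins for real checks (a sensitivity scanner, a version pin detector, and so on).

```python
def route_record(record: dict) -> str:
    """Decide a record's destination before anyone proposes it for training."""
    if record.get("contains_sensitive_logic"):
        return "knowledge_base_after_desensitization"
    if record.get("interim_release_strategy"):
        return "review_document"
    if record.get("style_dispute_only"):
        return "team_norm_discussion"
    if record.get("version_specific"):
        return "private_eval_or_archive"
    if record.get("toolchain_defect"):
        return "toolchain_backlog"
    return "sft_candidate"  # only the remainder is even eligible for training
```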
Data governance: the more real the cases, the more boundaries are needed
If case studies stay inside an article, the risk is small. Once the team turns cases into data assets, governance must be addressed.
| Risk | Handling |
|---|---|
| Customer data enters traces | Desensitize before collection; sensitive fields not stored by default |
| Training set contaminates eval | Train/eval/holdout partitions with version management |
| Historical errors get solidified | Samples need a life cycle and a removal mechanism |
| Personal preferences become rules | Preference samples must state their constraints and justification |
| AI review noise erodes trust | Severity stratification and human validity feedback |
| Over-tracking of education data | Minimize collection of learning data and state its purpose |
Verbal agreement is not enough here. Once data enters the training or evaluation pipeline, it will in turn shape model behavior. Bad data governance trains the team's historical debt into the model's future default behavior.
Action path: calibrating your practice with the four cases
What readers should take away: don't start with a template
After reading these four cases, the easiest immediate action is to "organize a better prompt template". That works in the short term, but it should not be the first step. A template is only a way to temporarily hand rules to the model. What must be built first is the judgment system behind the rules.
In team practice, a more stable order is: define the task contract first, then break down the error types, then write the rubric, and only then decide whether these rules are consumed via prompt, RAG, eval, review checklist, or SFT samples. The benefit is that even when the model or toolchain changes, the team keeps stable quality standards.
If you are facing a scenario where "AI output is unstable", ask four questions first:
| Question | Artifact |
|---|---|
| What exactly did the AI get wrong? | Error type table |
| Why do humans consider it wrong? | Feedback reasons and transferable principles |
| How do we prove it has been corrected? | Acceptance criteria and verification evidence |
| Where should this record go next? | Eval, knowledge base, training candidate, or discard |
Only when these questions have clear answers does the prompt become meaningful. Otherwise the prompt is just a pile of requirements, and the team still has not expressed what counts as good, what counts as wrong, and which improvements are worth keeping.
Use the four cases to calibrate your own practice direction
The four cases also work as a self-check list. Whether an AI programming practice is really close to Coding Mentor depends not on how many tools it uses or how many automation scripts it writes, but on whether it turns human judgment into reusable signals.
| Scenario | If you are only doing this | You should further ask |
|---|---|---|
| Model selection | Comparing leaderboards and subjective impressions | Do we have our own private eval and task-fitness profile? |
| Feedback protocol | Endlessly editing prompt copy | Have we precipitated error types, rubrics, and acceptance protocols? |
| Code review | Letting AI generate more review comments | Do we know which comments are valid, which are noise, and which can settle into team rules? |
| Programming education | Letting AI give answers faster | Have we recorded learning trajectories, misconception labels, and ability changes? |
The point of this self-check is not to reject tool use. Leaderboards, prompts, AI reviewers, and AI assistants can all keep being used, but they only count as entry points. The real watershed is whether the team can continuously extract task, error, feedback, verification, and routing information from those entry points.
A practice that only improves current delivery efficiency, without being able to review why the AI succeeded or failed, remains at "using AI". A practice that makes the next evaluation more accurate, the next round of feedback more consistent, and the next batch of training candidates higher quality is beginning to approach "serving as a Coding Mentor for AI".
Next step: from cases to the organization-level closed loop
The four cases provide entrances, not destinations. Model selection lets the team see task boundaries, feedback protocols make quality standards explicit, code review lets engineering preferences be labeled, and programming education makes the learning process assessable. They all point to the same question: how an organization stably collects, filters, routes, and reuses these signals.
That is where Part 7 picks up. A single case can be driven by a few senior engineers, but an organization-level closed loop needs clearer system design: who collects, who annotates, which data enters eval, which enters the knowledge base, which can become SFT candidates, and which must be discarded as sensitive, expired, or low-value.
Part 8 then discusses SFT data generation, and the ordering is deliberate. Training data should not be scraped straight from chat logs or PR comments; it should come from engineering assets that have passed through task contracts, error classification, human feedback, verification evidence, and governance gates. High-quality feedback comes before high-quality data; a private eval comes before claims about training effects; governance boundaries come before automated pipelines. The final article, Part 9, returns to long-term evolution and future judgment.
Conclusion: a case's value lies not in the story but in the reusable signal
The core judgment of this set of cases is simple: case studies exist not to prove that AI tools are useful, but to show how humans turn the AI collaboration process into a reusable guidance system.
The model selection case says: don't ask "which model is best", ask "which model is reliable within our task boundaries". The feedback protocol case says: don't treat the prompt as the core asset; structure human quality standards instead. The code review case says: don't let AI pose as the final gatekeeper; make it a review signal collector. The programming education case says: don't just let AI hand out answers; turn the learning process into ability assessment data.
If the team takes only one action away from this article, I suggest starting with Case 2: pick a high-frequency, low-risk scenario, split "the AI output is not good" into 6 to 10 error types, write down the qualification criteria, collect 20 good samples and 20 bad-case corrections, and run a small eval. That is worth more than another round of prompt editing.
The acceptance criteria should be equally simple: after one month, have similar errors decreased, is human feedback more consistent, and can high-value samples be routed clearly? If none of these changed, the team has merely written a new template, not built a real Mentor mechanism.
Because the real Coding Mentor is not about prompting the AI into obedience; it is about turning human engineering judgment into signals the AI can learn from, the team can review, and the system can evaluate.
References and Acknowledgments
- OpenAI: Prompt engineering
- Anthropic: Prompt engineering overview
- OpenAI: OpenAI Evals API
- OpenAI: How evals drive business results
- LangChain: Evaluating AI apps with LangSmith
- LangChain: Human feedback and annotation queues
- GitHub Docs: About code review in GitHub Copilot
- GitHub Blog: Research: quantifying GitHub Copilot’s impact in the enterprise with Accenture
- Jimenez et al.: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Series context
You are reading: AI Coding Mentor Series
This is article 6 of 9.
Series chapters
- Why do you need to be a coding mentor for AI? When AI programming assistants become standard equipment, the real competitiveness is no longer whether they can use AI, but whether they can judge, calibrate and constrain the engineering output of AI. This article starts from trust gaps, feedback protocols, evaluation standards and closed-loop capabilities to establish the core framework of "Humans as Coding Mentors".
- Panorama of AI programming ability evaluation: from HumanEval to SWE-bench, the evolution and selection of benchmarks Public benchmarks are not a decoration for model rankings, but a measurement tool for understanding the boundaries of AI programming capabilities. This article starts from benchmarks such as HumanEval, APPS, CodeContests, SWE-bench, LiveCodeBench and Aider, and explains how to read the rankings, how to choose benchmarks, and how to convert public evaluations into the team's own Coding Mentor evaluation system.
- How to design high-quality programming questions: from question surface to evaluation contract High-quality programming questions are not longer prompts, but assessment contracts that can stably expose the boundaries of abilities. This article starts from Bloom level, difficulty calibration, task contract, test design and question bank management to explain how to build a reproducible question system for AI Coding Mentor.
- Four-step approach to AI capability assessment: from one test to continuous system evaluation Serving as a coding mentor for AI is not about doing a model evaluation, but establishing an evaluation operation system that can continuously expose the boundaries of capabilities, record failure evidence, drive special improvements, and support collaborative decision-making.
- Best Practices for Collaborating with AI: Task Agreement, Dialogue Control and Feedback Closed Loop The core skill of being a Coding Mentor for AI is not to write longer prompt words, but to design task protocols, control the rhythm of conversations, identify error patterns, and precipitate the collaboration process into verifiable and reusable feedback signals.
- Practical cases: feedback protocol, evaluation closed loop, code review and programming education data Case studies should not stop at “how to use AI tools better”. This article uses four engineering scenarios: model selection evaluation, feedback protocol design, code review signal precipitation, and programming education data closed loop to explain how humans can transform the AI collaboration process into evaluable, trainable, and reusable mentor signals.
- From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop The real organizational value of AI programming assistants is not just to increase delivery speed, but to precipitate trainable, evaluable, and reusable mentor signals in every requirement disassembly, code generation, review and revision, test verification, and online review. This article reconstructs the closed-loop framework of AI training, AI-assisted product engineering delivery, high-quality SFT data precipitation, and model evaluation.
- From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.
- Future Outlook: Evolutionary Trends and Long-term Thinking of AI Programming Assessment As the final article in the series, this article reconstructs the future route of AI Coding Mentor from the perspective of engineering decision-making: how evaluation objects evolve, how organizational capabilities are layered, and how governance boundaries are advanced.