Article
From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop
The real organizational value of AI programming assistants is not just faster delivery, but the trainable, evaluable, and reusable mentor signals that can be captured from every requirement breakdown, code generation, review and revision, test verification, and post-release review. This article reconstructs the closed loop of AI training, AI-assisted product engineering delivery, high-quality SFT data accumulation, and model evaluation.
Copyright statement and disclaimer: This article is an original interpretation based on public materials from GitHub, Anthropic, LangChain, OpenAI, SWE-bench, and SWE-agent. Copyright of the original texts belongs to their authors and sources. This article is neither an official translation nor a representation of the views of the above institutions.
Originality: The closed-loop architecture, data routing model, Mentor signal definitions, quality gating, and implementation route in this article are the author's original reconstruction. External materials serve only as sources of arguments and industry cases and are not translated paragraph by paragraph.
Beginning: What is really lost is not the code, but the process
A team has involved AI in daily development for half a year: AI drafts the breakdown when requirements are dismantled, AI writes the first version of feature implementations, AI analyzes logs when tests fail, and AI scans the PR before code review. Half a year later, there is more code in the repository, more commits in the PRs, and more build records in the CI system. The question is: can any of this be used, in turn, to train an AI that understands the team better?
In most cases, no.
The reason is not that AI-generated code has no value, but that the team saves only the delivery results, not the delivery process. Why did a failed implementation fail? Which business constraints did the AI miss? Why did the human reviewer reject a patch that seemed to work? After a test failed, which log actually pointed to the root cause? This information is usually scattered across chat windows, terminal history, PR comments, and people's heads. Once delivery is complete, it is discarded just like a temporary debug branch.
I prefer to put it more directly: if an AI-assisted piece of development leaves behind only the final code, it is merely a delivery; if it also leaves behind the requirements, context, plans, tool calls, failure logs, manual corrections, and acceptance evidence, it has a chance to become a reusable Coding Mentor training sample.
The previous articles discussed how to evaluate AI, how to design tasks, and how to organize collaboration and review cases. Article 7 cannot simply continue as "how to use an AI programming assistant", nor should it stop at "building an evaluation platform". This article answers a question closer to the engineering system: how to involve AI in the entire product engineering delivery process, and design that involvement as a closed loop that continuously accumulates high-quality training and evaluation data.
Four things happen simultaneously in this closed loop:
- AI participates in real product delivery, not doing toy problems in a sandbox.
- Human Coding Mentor gives structured feedback at key nodes instead of just saying “This section doesn’t work.”
- Delivery traces are routed to eval, SFT, preference data, the knowledge base, or the discard pile, rather than being lumped together in logs.
- Models, prompts, toolchains, and team processes are continuously adjusted based on evaluation results, rather than upgrading based on personal taste.
The core judgment of this article is that being able to write code with AI is not a barrier; being able to transform the entire process of AI participation in delivery into training assets, evaluation assets, and organizational memory is the real barrier for teams to serve as Coding Mentors for AI.
Problem boundaries and value judgments
Calibrate the direction first: this article is not a preview of Part 8
This series already has an article dedicated to SFT data generation. That article will go into more detailed engineering implementation: data structure, cleaning, quality scoring, export format, and training process. This article does not repeat those details.
The responsibility of Part 7 sits further upstream: first design "where the data comes from, why it is trustworthy, how it is routed, who is responsible for judgment, what must never enter the training set, and how evaluation in turn affects delivery." If you think of Part 8 as a data processing factory, this article is about the raw material production line, the quality inspection rules, and how the closed loop is operated.
I suggest using the following boundary to separate the two articles:
| Article | Focus | Core question | Out of scope |
|---|---|---|---|
| Article 7 | Data closed loop in AI engineering delivery | How every AI collaboration can become a trainable, evaluable, and reusable Mentor signal | Specific training scripts, export formats, and model fine-tuning parameters |
| Article 8 | SFT data generation pipeline | How to process the screened engineering assets into training samples and connect them to the training process | Organizational closed loop, data ownership, online delivery indicator system |
This boundary is important. When many teams talk about “training data”, they immediately jump to JSONL, SFTTrainer, LoRA, and evaluation scripts. That’s the question in the second half. If the first half is not designed well, the second half will only process the dirty data more beautifully.
The value of AI programming assistant is not equal to the value of Coding Mentor
GitHub’s early research on Copilot focused on developer speed and experience: AI can help developers complete tasks faster and improve subjective satisfaction. Later, enterprise research by GitHub and Accenture further put the perspective into enterprise practice, measuring the impact of AI programming tools based on dimensions such as telemetry, code quality, and developer feedback. This type of research is valuable because it proves that the involvement of AI in software delivery is not a gimmick but will actually change the development process.
But there is a bifurcation here that is easily overlooked: improving delivery efficiency and having a team with Coding Mentor capabilities are two different things.
When using AI programming assistants, the team’s concern is “Can AI help me complete the task faster?” When working as a Coding Mentor for AI, the team is concerned about “what capability boundaries of AI have been exposed through this collaboration, and can I transform this boundary into a trainable, evaluable, and reusable feedback signal?” The former is the ability to consume models, and the latter is the ability to build models.
These two goals are not in conflict, but engineering design is completely different.
| Goal | Main action | Typical outputs | Risk |
|---|---|---|---|
| Use AI to improve delivery efficiency | Let AI write code, explain errors, generate tests, and assist in reviews | PR, commit, test results, delivery report | Leave only the results, not the process |
| Be a Coding Mentor for AI | Record AI inputs, decisions, failures, revisions, validations, and human judgments | Structured trajectories, error labels, evaluation tasks, training candidate samples | Mistaking all logs as training data |
The former can be measured by delivery cycle, task completion speed, developer satisfaction, and PR rework rate. The latter measurement indicators are more difficult: whether the AI reduces repeated mistakes, whether it is better at using team norms, whether it can make more stable choices in similar contexts, whether it can pass private eval sets, and whether it can internalize problems repeatedly pointed out by human reviewers into subsequent behaviors.
A mature team should not separate these two lines. The right direction is: use AI-assisted delivery to create real task scenarios, use real task scenarios to produce auditable trajectories, use human mentor signals to screen data, and then use training and evaluation to in turn improve the quality of the next round of delivery.
Closed Loop Overview: From a PR to a Training Flywheel
If you think of AI-assisted development as an ordinary tool call, the process is usually very short: the developer asks a question, the AI gives the code, the human copies the modifications, the test passes, and the PR is submitted. This process increases speed, but it does not produce reusable organizational assets.
To transform it into a Coding Mentor data closed loop, the process must have three more layers: observation layer, judgment layer and routing layer.
The first layer is the observation layer. What the team wants to capture is not “what the AI finally said”, but how the AI acted in the engineering environment: what files it looked at, what context it relied on, what plan it proposed, what commands it executed, how it corrected after failure, and what the final diff and test results were.
The second layer is the judgment layer. Human Coding Mentors do not just accept or reject the code; they turn their judgments into learnable signals: is this error a requirement misunderstanding, a context omission, an interface contract violation, insufficient test design, or a wrong engineering choice? Why is the correct fix in this direction? Can this experience transfer to other tasks?
The third layer is the routing layer. Not all trajectories can be included in the training set. A delivery track may be more suitable for entering a private eval, it may only be suitable for entering the team knowledge base, or it may have to be discarded because it contains sensitive information. Routing determines the fate of data assets.
This set of closed loops can be summarized in one sentence:
The delivery process is responsible for generating real tasks, Mentor feedback is responsible for creating high-quality supervision signals, data routing is responsible for determining usage, and evaluation results are responsible for driving the next round of delivery strategy adjustments.
Anthropic has repeatedly emphasized in its agent-system practice that you should not build a complex autonomous agent from day one, but start with a simple, controllable workflow and gradually add tools, memory, and autonomy. That advice holds here too. The team does not need to build a fully automated training platform on day one. The real first step is to make the process of AI participating in delivery observable, replayable, and annotatable.
LangChain's discussion of the agent improvement loop offers a useful engineering perspective: a trace is not a debugging by-product but the entry point of the agent improvement loop. Without traces, failure analysis relies on impressions; with traces, the team can cluster failures, construct datasets, run evals, and then drive system improvement.
Data Acquisition and Mentor Signals
Data collection points in the delivery process: Don’t just focus on prompts and answers
Many people say “Save the AI dialogue and use it for training later.” This sentence is only half correct. Conversations are certainly important, but in software engineering scenarios, the most valuable signals are often not found in chat text, but in context, tool feedback, and human corrections.
A real AI-assisted delivery should be viewed as at least eight stages.
| Stage | Data to collect | What the Mentor needs to judge | Assets that may accumulate |
|---|---|---|---|
| Requirement intake | Original requirements, business goals, acceptance criteria, non-goals, risk constraints | Does the AI understand the real boundaries rather than just the literal feature? | Private eval task statements, requirement-understanding cases |
| Task planning | AI's breakdown, assumptions, dependencies, modification scope, and verification plan | Does the plan miss migration, compatibility, error paths, rollback strategy? | Planning capability samples, planning evaluation criteria |
| Context retrieval | Files read by the AI, referenced documents, search paths, missing files | Whether the model obtained the necessary context and whether it references outdated information | RAG index optimization, context selection samples |
| Code generation | Initial diff, alternatives, interface changes, test additions | Does the code comply with architecture, naming, error handling, and security requirements? | SFT candidate samples, bad-example samples |
| Tool execution | Shell commands, test logs, lint, type checking, build failures | Can the AI locate problems from environmental feedback instead of patching blindly? | Tool-usage traces, failure-repair cases |
| Human review | Reviewer comments, reasons for rejection, design disputes, modification suggestions | Which issues must be judged by humans and which can be automated | Preference data, review rules |
| Repair iteration | Failed attempts, corrected patch, final patch | What the correct repair path is and why the wrong attempts are wrong | DPO/RFT candidate data, error pattern library |
| Acceptance review | CI, regression testing, production results, defect escapes, rework records | Does the sample really represent "success", or did it just not blow up at the time? | Holdout eval, basis for down-weighting samples |
Here is a practical judgment: if the team can only save three types of data, I would prioritize "requirements and acceptance criteria", "failure trajectories and tool feedback", and "reasons for human corrections". The code repository will survive anyway; what is really easy to lose are these intermediate signals.
Don't turn this into a heavy form-filling process. Forms get bypassed by developers. A better approach is to embed data collection into existing engineering actions: PR templates, CI annotations, review bots, agent traces, issue state transitions, and test report archiving. Let developers write one more key judgment instead of filling in twenty more fields.
Mentor signal: Turn “I don’t think it works” into a structure that the model can learn
The AI won't learn much from a comment like "this is badly written". Human reviewers themselves often cannot articulate the basis of a judgment. The first skill of a Coding Mentor is to break implicit engineering judgments into explicit feedback signals.
I recommend splitting Mentor signals into six categories.
| Signal type | Question to answer | Typical annotations |
|---|---|---|
| Task understanding signals | Does the AI understand the real problem to be solved? | Requirement misunderstanding, non-goal expansion, omitted acceptance criteria |
| Context selection signals | Has the AI found the necessary context? | Interface contract ignored, old implementation referenced, callers not read |
| Implementation quality signals | Does the code meet team engineering standards? | Missing error handling, missing boundary conditions, excessive abstraction |
| Verification behavior signals | Does the AI know how to prove itself right? | No tests added, only the happy path run, type checking ignored |
| Repair strategy signals | How does the AI adjust when faced with failure? | Blind modification, partial repair, root cause located, rollback and redo |
| Organizational constraint signals | Does the AI respect the team's implicit rules? | Security red lines, performance budgets, compatibility strategies, release rhythm |
These signals cannot be too general. For example, “poor code quality” is not a good label because it cannot guide training and evaluation. A better way to write it is: “The error path does not retain the original exception, causing the upper-layer retry strategy to be unable to distinguish between temporary failures and permanent failures.” This sentence is a bit longer than “Error handling is bad”, but it contains the cause and effect chain.
When I look at AI-generated code, what I am most wary of are implementations that "run now but defer the cost". They pass the tests but implicitly violate the team's long-term constraints: synchronous calls stuffed into hot paths, temporary workarounds written as default logic, one-off migrations written as permanent branches. This kind of problem is the best candidate for the Mentor signal library, because it is hard to cover with ordinary benchmarks yet occurs repeatedly in real projects.
The quality of the Mentor signal determines the upper limit of subsequent data. Without structured signals, the training data is just “code modified by humans”; with structured signals, the training data becomes learnable evidence of “why humans changed it this way”.
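As a sketch of what "structured" can look like in practice, the record below captures one signal in the six categories above; the field names and the example evidence are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class SignalType(Enum):
    TASK_UNDERSTANDING = "task_understanding"
    CONTEXT_SELECTION = "context_selection"
    IMPLEMENTATION_QUALITY = "implementation_quality"
    VERIFICATION_BEHAVIOR = "verification_behavior"
    REPAIR_STRATEGY = "repair_strategy"
    ORGANIZATIONAL_CONSTRAINT = "organizational_constraint"

@dataclass
class MentorSignal:
    signal_type: SignalType
    label: str                # short, specific tag; "poor code quality" would be too generic to use here
    rationale: str            # the cause-and-effect chain a model could actually learn from
    evidence: list[str] = field(default_factory=list)  # files, tests, comments backing the judgment

signal = MentorSignal(
    signal_type=SignalType.IMPLEMENTATION_QUALITY,
    label="error path drops the original exception",
    rationale=(
        "The error path does not retain the original exception, so the upper-layer retry "
        "strategy cannot distinguish temporary failures from permanent ones."
    ),
    evidence=["services/payment/client.py", "review thread on the retry PR"],  # hypothetical references
)
```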
Data Contract: A delivery trace must answer at least ten questions
The team does not need to define a huge schema from the beginning, but there must be a minimum data contract. The goal of this contract is not to serve the database design, but to ensure that when looking back at the sample in the future, we can still judge whether it is trustworthy.
| Question | Why it must be answered |
|---|---|
| What is the true goal of this task? | Without the goal, there is no way to tell whether the AI output solves the right problem |
| What are the acceptance criteria? | Without acceptance criteria, a passing test may be an illusion |
| What context did the AI use? | Distinguishes insufficient model capability from insufficient context supply |
| What plan did the AI come up with? | The plan exposes the model's engineering judgment, not just its coding ability |
| Where did the initial output fail? | Failure points are among the most valuable training signals |
| What did humans modify? | The diff shows what changed, but not why |
| Why did humans change it this way? | This is the core of the Coding Mentor signal |
| What are the results of automated verification? | Prevents subjective satisfaction from being mistaken for success |
| Was it later reworked or rolled back? | Prevents short-term success from being mistaken for long-term success |
| Where should this trace go? | Decides whether it becomes eval, SFT, preference data, knowledge base content, or is discarded |
There is a principle behind this contract: training data must contain “causal information”, not only “result information”. Only outcome information encourages the model to imitate surface answers; causal information has the opportunity to allow the model to learn engineering judgment.
The research value of SWE-agent is not only that it allows the model to solve SWE-bench tasks, but also that it puts the action process of the software engineering agent into the “agent-computer interface”: the model does not output answers at once, but reads files, edits, runs tests, and observes feedback in the environment. For the closed data loop within the team, this perspective is closer to real engineering than a single round of prompt-answer.
Data routing: not everything should go into the training set
I object to the statement "collect all AI collaboration logs and use them for training later". It sounds like data-asset awareness, but it actually creates three problems: sensitive information leakage, low-quality sample contamination, and confusion between eval and train.
A more reasonable approach is to establish data routing. Each delivery trace, after passing quality gating, can enter only one or a few destinations.
| Destination | Suitable data | Unsuitable data | Main consumer |
|---|---|---|---|
| Private eval set | Real tasks, reproducible, clear acceptance, stable tests, answers not leaked | Flaky tests, vague requirements, samples already used for training | Model evaluation, toolchain regression |
| SFT candidate set | High-quality manual corrections, clearly explained, transferable, fully verified | Unreviewed AI output, occasional passes, one-off business details | Model fine-tuning, behavior demonstration |
| Preference data | Comparable alternatives, clear review reasons, clear boundaries between better and worse | No clear basis for preference, mere personal style differences | DPO/RFT, strategy-selection training |
| Team knowledge base | Architectural constraints, common mistakes, review rules, review conclusions | Keys, customer data, temporary workarounds | RAG, prompt context, engineering specifications |
| Discard zone | Sensitive, contaminated, irreproducible, low value, unverified | Nothing here should flow back into training or evaluation | Data governance, auditing |
The most common mistake here is to equate "humans corrected it" with "it can be used for training". A human correction only shows that the original output had a problem; it does not mean the final sample is suitable for learning. A patch may be a temporary fix to put out a fire, it may bypass the root cause, it may sacrifice long-term maintainability, or it may contain business logic that must not be leaked. Without routing and gating, the training set swallows this historical baggage along with everything else.
OpenAI later stopped using SWE-bench Verified as the main public evaluation standard for cutting-edge models. There was a practical reason behind it: public benchmarks would be repeatedly optimized by the community, causing contamination and overfitting, and also exposing the limitations of the test itself. The same goes for internal team eval sets. If the training set, parameter tuning set, and evaluation set are mixed together, an increase in indicators only means that the system is better at “memorizing questions”, but it does not mean that it is better at delivering.
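Keeping these destinations separate is easier when the routing decision is written as code rather than left as convention. Below is a minimal sketch, assuming upstream gating has already produced these flags; all field and destination names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GatedTrace:
    has_sensitive_data: bool
    reproducible: bool
    fully_verified: bool        # tests, CI, review, or acceptance evidence attached
    correction_explained: bool  # the human recorded why the fix goes in this direction
    has_alternatives: bool      # comparable options with a clear preference rationale
    reserved_for_eval: bool     # pre-marked to keep eval and training material disjoint

def route(trace: GatedTrace) -> set[str]:
    """Assign a gated delivery trace to destinations; the default is the discard zone."""
    if trace.has_sensitive_data:
        return {"discard"}
    destinations: set[str] = set()
    if trace.reserved_for_eval and trace.reproducible and trace.fully_verified:
        destinations.add("private_eval")       # never reused for training
    elif trace.fully_verified and trace.correction_explained:
        destinations.add("sft_candidate")
        if trace.has_alternatives:
            destinations.add("preference_data")
    if trace.correction_explained:
        destinations.add("knowledge_base")     # transferable rules can still feed RAG
    return destinations or {"discard"}
```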
Quality Gating: The brake of the data flywheel is more important than the accelerator
The word “data flywheel” is very exciting, as if as long as enough trajectories are collected, the model will naturally become stronger and stronger. In engineering it’s just the opposite. What really determines whether the flywheel can spin is not the acquisition speed, but the gating quality.
I recommend dividing the gate control into seven lanes.
| Gate | What it stops | Consequence of not stopping |
|---|---|---|
| Privacy and security | Keys, tokens, customer data, internal addresses, sensitive fields in production logs | Data assets become a source of security incidents |
| IP and licensing | Third-party code and restrictively licensed content that cannot be used for training | The usable scope of subsequent models is limited |
| Data contamination | Public benchmark answers and eval samples that have entered the training set | Evaluation metrics are distorted |
| Reproducibility | Unreproducible problems, missing environments, tests that cannot be run | The model learns unverifiable experience |
| Verification adequacy | No tests, no review, no acceptance evidence | "Looks right" is mistaken for "engineered right" |
| Teaching value | Only one-off business details, no transferable judgment | Adds noise without improving model capability |
| Life cycle | Expired schemas, temporary workarounds, deprecated APIs | Historical debt is trained into default behavior |
Among these seven gates, I value "life cycle" the most. Training data is not an archive. If expired experience is not retired, the model will keep adhering to old constraints the team has already abandoned. For example, after the team switches from REST to an event-driven architecture, the "synchronously query the downstream service" pattern in old samples should be down-weighted or removed. As organizational memory, a model has no sense of time; data governance must supply it.
Gating should not be entirely manual. Humans handle boundary decisions and spot checks, while automation handles the repetitive checks: sensitive-field scanning, license marking, test-stability statistics, sample deduplication, train/eval split checks, and expired-API detection. A human mentor's scarcest hours should be spent judging "does this sample have teaching value", not manually hunting for leaked tokens.
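A sketch of that automatable slice; the secret patterns, deprecated API names, and flakiness threshold are placeholders a team would replace with its own:

```python
import re

SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+"),
]
DEPRECATED_APIS = {"legacy_inventory_client", "sync_order_lookup"}   # hypothetical names

def automated_gates(sample_text: str, task_id: str, eval_task_ids: set[str],
                    flaky_rate: float) -> dict[str, bool]:
    """The repetitive half of gating; teaching value and borderline cases stay with humans."""
    return {
        "privacy": not any(p.search(sample_text) for p in SECRET_PATTERNS),
        "contamination": task_id not in eval_task_ids,   # train/eval split check
        "reproducibility": flaky_rate < 0.05,            # threshold is a placeholder
        "lifecycle": not any(api in sample_text for api in DEPRECATED_APIS),
    }
```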
Evaluation closed loop: the private eval is the team's engineering health check
If the training data determines what the model learns, eval determines what the team believes.
Public benchmarks are still important. SWE-bench introduces real GitHub issues into software engineering evaluation, which is closer to engineering reality than traditional algorithm questions. SWE-Gym further attempts to convert real issues, environments and trajectories into trainable tasks. These works indicate a direction: the evaluation of coding agents is moving from static questions to real warehouses, real environments, and real feedback.
But the team cannot look only at public benchmarks. Public benchmarks measure general capability; the team's private eval measures "whether this AI is reliable in our engineering system". The relationship between the two is like a health check report versus a probation period on the job: the former surfaces basic problems, the latter judges fitness for a specific position.
Team-private eval should cover at least four categories of tasks.
| Task type | Capability assessed | Sample source |
|---|---|---|
| Regression repair tasks | Can it locate and fix root causes in the real code base? | Historical bugs, production defects, CI failures |
| Architectural constraint tasks | Can it respect team boundaries, contracts, and performance budgets? | Review disputes, architecture decision records |
| Test reinforcement tasks | Can it supplement critical paths, boundary conditions, and error paths? | Test escapes, defect reviews |
| Refactoring trade-off tasks | Can it improve structure without changing behavior? | Historical refactorings, technical debt governance |
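To make these task categories concrete and regressible, each private eval task can be pinned down as a small record. The sketch below is one assumption about the minimum fields; the repository reference, command, and ids are made up:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTask:
    task_id: str
    category: str        # regression_repair | architectural_constraint | test_reinforcement | refactoring_tradeoff
    repo_ref: str        # repository plus pinned commit, so the task stays reproducible
    statement: str       # what the agent is asked to do
    acceptance_cmd: str  # command whose result decides pass or fail
    reason: str          # why this task is worth evaluating at all

task = EvalTask(
    task_id="orders-0042",
    category="regression_repair",
    repo_ref="internal/orders@a1b2c3d",   # hypothetical repository and commit
    statement="The payment callback applies the same event twice under retry; make it idempotent.",
    acceptance_cmd="pytest tests/payment/test_callback_idempotency.py -q",
    reason="Tests whether the model recognizes concurrency and duplicate-message risk.",
)
```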
OpenAI’s practice on eval emphasizes that evaluation should serve real business results, rather than just creating a beautiful score. In an AI programming scenario, private eval should not only look at the pass rate, but also whether the repair path is reasonable, whether the context reference is correct, whether it breaks architectural constraints, and whether verification evidence can be given.
I suggest dividing the evaluation indicators into three tiers.
| Tier | What it measures | Example indicators |
|---|---|---|
| Offline capability | The performance of a model or agent on a fixed set of tasks | Pass rate, repair success rate, failure type distribution |
| Collaboration process | AI delivery quality in real PRs and issues | Number of manual corrections, review defect density, CI failure rate |
| Business results | The long-term impact of AI on the engineering delivery system | Lead time, rollbacks, defect escape rate, maintenance cost |
These three layers cannot replace each other. Offline eval can return quickly, but is easily disconnected from real work; process indicators are close to the team’s daily routine, but are affected by fluctuations in task difficulty; business results are the most realistic, but feedback is slow. A mature closed loop requires looking at all three together.
New responsibilities of human Coding Mentor: from reviewer to signal calibrator
In traditional software engineering, the reviewer’s main responsibility is to prevent bad code from entering the trunk. In the closed loop of AI data, reviewers still have to guard the gate, but they have an additional responsibility: to transform their judgments into signals that can be absorbed by the system.
This changes how review comments are written.
| Ordinary review comment | Mentor-style review comment |
|---|---|
| More tests are needed here | Error-path tests are missing here; the current implementation only covers successful returns and cannot prove the retry strategy is correct on timeout or downstream 500 |
| This abstraction is unnecessary | This abstraction is used by only one call site but introduces a new lifecycle and configuration branch; the maintenance cost exceeds the reuse benefit |
| Don't query the database like this | This query runs on the hot path of the list page and lacks paging and index constraints; as data volume grows, the latency risk transfers to the user request path |
| This does not meet our standards | This interface bypasses the existing permission middleware and breaks the team's architectural constraint of "concentrate authentication at the gateway layer" |
Mentor-style comments are not about writing longer; they add three kinds of information: problem type, cause and effect, and a transferable principle. That way a comment can itself become a training sample, an eval criterion, or a knowledge base entry.
The team can divide Mentor feedback into several fixed fields, but do not make the process rigid.
| Field | Example |
|---|---|
| Question type | Missing context, boundary conditions, architectural constraints, security risks, insufficient validation |
| trigger evidence | Which file, which test, which log, which PR comment |
| Root cause judgment | AI did not read the caller, leading to the mistaken belief that the interface only serves a single scenario. |
| Correction strategy | First add the contract test, then change the implementation, and finally add the migration instructions. |
| Transferable experience | All consumers must be checked before modifying the shared contract |
| Data routing recommendations | Enter eval set, do not enter SFT, because it contains customer fields and needs to be desensitized before evaluation. |
The key here is not that “humans are more advanced than AI”, but that humans have mastered organizational constraints outside the model. Many engineering judgments do not exist in the public code: why the team does not upgrade a certain dependency, why an old interface is retained, why this module cannot import a certain package, and why a seemingly inefficient implementation is actually for compatibility with historical customers. The value of Coding Mentor lies in making these implicit constraints explicit.
Engineering Practice: From PR to Evaluation Closed Loop
Engineering Practice 1: Start with a PR template, not a training platform
Many teams want to build a platform as soon as they talk about closed loop. My advice is the opposite: start with a PR template and review spec.
The reason is simple. PR is the smallest audit unit of real project delivery. Requirements, code, testing, review, CI, and merge results can all be gathered here. Turning PR into a data closed-loop entrance is easier to implement than building an “AI data collection system” alone.
A PR template for the Coding Mentor data closed loop only needs five extra fields.
| Field | Purpose |
|---|---|
| AI participation scope | Distinguish whether the AI wrote code, generated tests, interpreted logs, assisted review, or only touched documentation |
| Key context | Record which files, documents, issues, and architectural constraints the AI used |
| Failure and correction | Save at least one valuable failed attempt and the reason for the human correction |
| Verification evidence | Linked tests, lint, screenshots, performance data, security scans |
| Data routing recommendation | Tag whether it can go into eval, SFT, preference data, the knowledge base, or must be discarded |
These five fields do not need to be filled every time. Low-risk changes can be abbreviated, while high-risk changes must be complete. The key point is to let the team develop a habit: AI is not a black box tool, and the process it participates in delivery should be auditable.
CI should also work with this set of templates. For example, when the PR is marked “AI generation core implementation”, CI can require more complete test evidence; when the PR modifies the shared contract, the system can remind the reviewer to check the consumer; when the PR is marked as an eval candidate, the system can archive issue, diff, test and review comments as candidate tasks.
This is not to burden the process, but to harden engineering judgments that would otherwise occur verbally into the system.
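As a sketch of how CI could enforce this template, assuming the five field headings above appear verbatim in the PR body; the label name and the escalation rule are examples, not policy:

```python
REQUIRED_SECTIONS = [
    "AI participation scope",
    "Key context",
    "Failure and correction",
    "Verification evidence",
    "Data routing recommendation",
]

def check_pr_body(body: str, labels: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the PR passes the template gate."""
    lowered = body.lower()
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s.lower() not in lowered]
    # Example escalation: PRs labeled as AI-generated core implementation must attach verification evidence.
    if "ai-core-implementation" in labels and "verification evidence" not in lowered:
        problems.append("AI-generated core implementation requires verification evidence")
    return problems
```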
Engineering Practice 2: The trace store should hold "replayable facts", not chat screenshots
If the team uses Claude Code, Copilot Agent, Cursor, Aider, OpenHands or internal coding agent, they will all encounter the same problem: the process of an AI collaboration is very long, and the content is scattered in the editor, terminal, browser, PR and chat interface.
This requires a trace store. It’s not a log bin, but a system that holds “replayable facts.”
The trace store must store at least five types of facts.
| Type | Content | Use |
|---|---|---|
| Input facts | User tasks, system prompts, context files, retrieval results | Determine what the model saw |
| Action facts | Plans, tool calls, file edits, command executions | Reconstruct what the model did |
| Environment facts | Test output, compilation errors, lint, run logs | Determine whether failure feedback was sufficient |
| Human facts | Review comments, manual modifications, acceptance conclusions | Provide Mentor signals |
| Result facts | Final diff, merge status, production feedback, rework records | Calibrate success and failure |
Don’t save non-auditable screenshots as your primary data source. Screenshots can aid understanding, but cannot serve as basic facts for training and evaluation. To be truly usable, data must be parsed, masked, retrieved, versioned, and correlated.
The trace store should also not keep everything permanently by default. Sensitive fields must be desensitized as soon as possible, low-value traces must be expired and cleared, and data entering eval or training candidates must have a version number. If a data system does not have life cycle management, it will soon become a historical black box that no one dares to use.
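A minimal sketch of such a record, organized around the five fact types and appended as versioned JSONL; the schema and paths are assumptions, and masking is assumed to happen before writing:

```python
import json
import time
from pathlib import Path

SCHEMA_VERSION = "0.1"

def append_trace(store_dir: str, trace_id: str, *, inputs: dict, actions: list,
                 environment: list, human: list, result: dict) -> Path:
    """Append one replayable trace; sensitive fields are assumed to be masked before this call."""
    record = {
        "schema_version": SCHEMA_VERSION,
        "trace_id": trace_id,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "inputs": inputs,            # task, prompts, context files, retrieval results
        "actions": actions,          # plan, tool calls, file edits, executed commands
        "environment": environment,  # test output, compile errors, lint, run logs
        "human": human,              # review comments, manual edits, acceptance verdicts
        "result": result,            # final diff, merge status, rework records
    }
    path = Path(store_dir) / "traces.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```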
Engineering Practice 3: Treat test failure as the most valuable mentor signal
In AI programming collaboration, test failures are not noise, but the cheapest supervision signal.
A patch that passes the test can only show that it meets the existing tests; a patch that is correctly repaired after failure can often tell us what the model originally misunderstood, how to recover from environmental feedback, and which types of errors are most likely to reoccur. The latter has a higher pedagogical value for training and assessment.
I recommend that teams specifically keep track of three types of failures.
| Failure type | Why it is valuable | Data use |
|---|---|---|
| Initial implementation failed | Exposes the model's default assumptions and common omissions | Error pattern library, SFT counterexample explanations |
| Repair failed after tool feedback | Exposes whether the model misreads logs or modifies blindly | Agent eval, tool-usage evaluation |
| Repair succeeded after a human pointed it out | Exposes how Mentor signals change the outcome | Preference data, repair-strategy training |
Here’s a detail: Don’t just save the final successful version. Keep at least one link between a failed version, evidence of the failure, human or environmental feedback, and the final fix. Without this link, the model can only learn “what a successful answer looks like” but cannot learn “how to go from failure to success.”
Anthropic's Claude Code best practices emphasize giving the model ways to verify its work, such as tests, lint, screenshots, logs, and command feedback. The essence of this advice is not "run more tests"; it is turning the verification path into an environmental signal the model can use. The Coding Mentor data closed loop should go a step further and preserve these environmental signals as evidence for later evaluation and training.
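One way to keep that chain as data rather than memory is a small record per attempt; the diff ids, test names, and comments below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RepairStep:
    attempt: str          # reference to the patch tried at this step, e.g. a diff id
    outcome: str          # "tests_failed" | "rejected_in_review" | "tests_passed"
    evidence: str         # failing test name, log excerpt, or review comment reference
    feedback_source: str  # "environment" or "human"
    feedback: str         # what the failure or the reviewer actually said

# The chain is the asset: initial failure -> environmental feedback -> human correction -> final fix.
chain = [
    RepairStep("diff-001", "tests_failed", "test_retry_on_timeout", "environment",
               "AssertionError: retried 0 times, expected 3"),
    RepairStep("diff-002", "rejected_in_review", "review comment on the retry PR", "human",
               "Retry added, but the original exception is swallowed; keep the cause chain."),
    RepairStep("diff-003", "tests_passed", "full CI run", "environment", "all checks green"),
]
```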
Engineering Practice 4: The eval set should be maintained like a test set, not collected like a document
Many teams will build an “AI assessment question bank”, but the maintenance method is like a document: it is added occasionally, there are few versions, the coverage is unclear, and no one keeps track of expiration. Such question banks will soon become ineffective.
Private eval sets should be maintained like test sets.
| Maintenance action | Engineering meaning |
|---|---|
| Versioning | Every added, deleted, or modified task has a traceable reason |
| Coverage statistics | Know which languages, modules, risk types, and capability dimensions the eval covers |
| Train/eval isolation | Prevent training data from contaminating evaluation data |
| Flakiness monitoring | Unstable tasks cannot be used to judge model capability |
| Difficulty calibration | Avoid a set of all trivial bugs or all extreme problems |
| Expiry and removal | Outdated architectures, obsolete APIs, and historical stopgaps must be retired |
Each task in the eval set should ideally carry a reason why it is worth evaluating. For example:
| Task | Reasons for evaluation |
|---|---|
| Fix payment callback idempotency issue | Tests whether the model can identify concurrency and duplicate-message risks |
| Supplement tests for cache penetration | Tests whether the model understands boundary inputs and error paths |
| Refactor the order status flow | Tests whether the model preserves behavior and respects state machine constraints |
| Handle third-party API timeouts | Tests whether the model can distinguish retry, degradation, and error propagation |
If an eval task has no stated reason, it will be hard to judge later whether it is worth keeping.
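A sketch of how two of these maintenance actions, coverage statistics and flakiness monitoring, could be computed over EvalTask-like records such as the one sketched earlier; the structures are assumptions, not a prescribed format:

```python
from collections import Counter

def maintenance_report(tasks, run_history: dict[str, list[bool]]) -> dict:
    """tasks: EvalTask-like records; run_history maps task_id to pass/fail across repeated identical runs."""
    coverage = Counter(t.category for t in tasks)
    flaky = [task_id for task_id, runs in run_history.items()
             if len(set(runs)) > 1]                # same task, different outcomes -> unstable
    return {
        "total_tasks": sum(coverage.values()),
        "coverage_by_category": dict(coverage),
        "flaky_tasks": flaky,                      # exclude these when judging model capability
    }
```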
Engineering Practice 5: Training candidate samples should be few and hard, not too many and scattered
When it comes to training data, many teams will naturally pursue quantity. I value hardness more.
The so-called hardness means that the sample contains real engineering constraints, clear error modes, clear reasons for correction, and reliable verification evidence. A hard sample may be more valuable than twenty generic questions and answers.
Samples suitable for entering the SFT candidate set usually have the following characteristics:
| Feature | Judgment criterion |
|---|---|
| The task is real | From real issues, real PRs, real defects, not made-up questions |
| Context is appropriate | Contains the necessary background without relying on large amounts of undisclosed detail |
| The correction is explained | Human modifications are not stylistic preferences but solve a specific engineering problem |
| Fully verified | Supported by tests, CI, review, or production feedback |
| Transferable | The experience transfers to similar tasks rather than one-off business exceptions |
| Risk-clean | Desensitized, no licensing issues, does not contaminate the eval |
Samples that are not suitable for entering the training set must also be identified.
| Sample type | Why it is unsuitable |
|---|---|
| Output judged "fine" purely by human intuition | Lacks verifiable evidence |
| A patch that finally passed but with a chaotic process | May train bad strategies into default behavior |
| Traces containing customer data or production logs | High security risk |
| Complete answers from public benchmarks | Contaminates evaluation |
| Modifications reflecting only personal coding style | Low transferable value |
| Historical fixes for deprecated architectures | Solidifies expired experience |
Part 8 will discuss in detail how to process these candidate samples into SFT data. This article only gives one principle: the training data does not feed the “successful results” to the model, but feeds the “engineering judgment behind the successful results” to the model.
Engineering Practice 6: Preference data comes from controversy, not from pretty answers
SFT is suitable for teaching a model “how it should be done”. But there is no single answer to many engineering questions: should this abstraction be torn down, which layer should the cache be placed at, whether errors should be swallowed or propagated upwards, whether testing should be done with units or integrated, and whether refactoring should be done this time or split into subsequent tasks.
These scenarios are more suitable for settling preference data. Preference data is not as simple as “answer A is better than answer B”, but rather “under what constraints is A better than answer B”.
A good preference sample usually comes from review disputes, not from code that passed smoothly in the first place.
| Scenario | Preference judgment |
|---|---|
| The AI proposes a large refactor; humans demand a minimal fix | The current task goal is to stop the bleeding; the risk of refactoring outweighs the benefit |
| The AI abstracts a common component; humans keep the local implementation | There is only one call site, and the lifecycle cost introduced by the abstraction is not worth it |
| The AI adds caching for performance; humans reject it | Data consistency requirements are high and the cache invalidation strategy is incomplete |
| The AI adds many snapshot tests; humans require behavioral assertions | Snapshots solidify implementation details and cannot prove business semantics |
These controversial samples are valuable because they make the team’s engineering tastes explicit. For a model to become a reliable collaborator in the team, it must not only write the right code, but also learn to make choices within constraints.
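A preference sample in the chosen/rejected form used by DPO-style training, with the constraint context carried explicitly; the field names and the example are illustrative:

```python
from dataclasses import dataclass

@dataclass
class PreferenceSample:
    task: str          # the shared task both options responded to
    constraints: str   # the conditions under which the preference holds
    chosen: str        # the option the reviewer accepted (diff reference or summary)
    rejected: str      # the option the reviewer declined
    rationale: str     # why chosen beats rejected under these constraints

sample = PreferenceSample(
    task="Fix incorrect totals on the order list page",
    constraints="Hotfix window; refactoring risk outweighs the benefit in this release",
    chosen="Minimal fix: correct the aggregation in the existing query",
    rejected="Refactor the pricing module and introduce a shared calculator abstraction",
    rationale="The task goal is to stop the bleeding; the larger refactor belongs in a follow-up task.",
)
```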
Engineering Practice 7: Knowledge base is not a trash can for training sets
A lot of data that cannot enter training can still enter the team knowledge base. Such as architectural decisions, review rules, common error patterns, migration guides, and module boundary descriptions. These can influence the AI’s next behavior through RAG or cue word context.
But the knowledge base cannot become a trash can for training sets. Just because a piece of data is not suitable for SFT, you cannot just throw it into the knowledge base. The content in the knowledge base will enter the model context, and poor quality will also contaminate the output.
Knowledge base entries should ideally meet four conditions:
| Condition | Explanation |
|---|---|
| Clear rule | Can guide subsequent tasks rather than merely recording a historical fact |
| Clear life cycle | Know when it expires and who is responsible for updating it |
| Clear scope of application | Identify modules, languages, scenarios, exceptions |
| Linked to eval | Important rules should ideally have a corresponding eval task verifying them |
For example, “The order service cannot directly call the inventory database” is a knowledge base rule; “A certain order service caused problems because it directly checked the inventory database” is review material; “AI once wrote this error code” is a trace. The three are related, but cannot be mixed into one type of data.
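A sketch of what a rule-type knowledge base entry could carry so that the four conditions are checkable; the rule text, owner, dates, and linked eval id are examples:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KnowledgeEntry:
    rule: str                    # phrased as guidance for future tasks, not as a historical fact
    scope: str                   # modules, languages, or scenarios where it applies
    exceptions: str              # known cases where the rule does not apply
    owner: str                   # who updates or retires it
    review_by: str               # date after which the rule must be reconfirmed or removed
    linked_eval_task: Optional[str] = None   # eval task that checks the rule is actually followed

entry = KnowledgeEntry(
    rule="The order service must not query the inventory database directly; go through the inventory API.",
    scope="order-service, all languages",
    exceptions="One-off data backfill scripts reviewed by the platform team",
    owner="platform-engineering",
    review_by="2026-06-30",
    linked_eval_task="orders-0042",   # hypothetical id reusing the earlier eval example
)
```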
Organizational Division of Labor and Maturity Model
Organizational division of labor: Who owns this closed loop
Data closed loop is not a one-man show of a certain tool team. It spans product, R&D, test, security, platform, and model teams. Without clear ownership, you end up with a dashboard that no one maintains.
I recommend using the following boundaries of responsibility.
| Role | Main responsibilities |
|---|---|
| Business leader | Define real test acceptance criteria and determine task value and risk level |
| Developer | Document AI engagement scope, key context, failure fixes, and verification evidence |
| Reviewer / Coding Mentor | Provide structured feedback, labeling error types and transferable experiences |
| QA/Test Leader | Maintain verification evidence, flaky tags, regression tasks |
| Security and Compliance | Define desensitization, permissions, retention periods, and untrainable boundaries |
| Platform engineering | Build trace store, data routing, eval runner, quality gate control |
| Model/AI Engineering | Use data for prompt, RAG, SFT, RFT or toolchain optimization |
The most critical one is the Coding Mentor role. This role is not equivalent to a senior engineer, nor is it equivalent to a model training engineer. He needs to understand the engineering context and also needs to know what feedback has learning value for the model. What many teams lack is not AI tools, but this kind of intermediate role that “understands engineering, understands feedback, and understands data boundaries.”
Maturity Model: From Personal Habits to Organizational Flywheel
Closed-loop construction cannot be achieved in one step. The mature path can be roughly divided into four levels.
The first level is personal habits. Developers themselves document the scope of AI involvement, failure cases, and reasons for corrections. This stage does not require a platform, only discipline. The goal is to allow individuals to review why an AI collaboration succeeded or failed.
The second level is team norms. PR templates, review tags, AI usage records, and verification evidence begin to be unified. This stage begins to produce comparable data, and the team can see which tasks the AI is reliable on and which tasks it reworks frequently.
The third level is platform closed loop. Trace store, eval runner, data masking, sample routing, and quality gating start to be automated. The team no longer relies on manual sorting, but continues to generate candidate data in daily delivery.
The fourth level is model and tool chain optimization. Private eval, SFT candidate set, preference data, knowledge base and prompt version form a closed loop. Model upgrades, prompt word changes, and tool chain adjustments all require private eval and online indicator regression.
| Stage | Main goal | Minimum viable action | Don't rush into |
|---|---|---|---|
| Personal habits | Make AI collaboration replayable | Save key prompts, failure logs, and reasons for manual corrections | Training models |
| Team norms | Make data structures consistent | PR template, review tags, verification evidence fields | Fully automatic collection |
| Platform closed loop | Make traces searchable, gated, and routable | Trace store, masking, eval runner, sample versioning | Complex multi-model scheduling |
| Model optimization | Let data feed back into capability | Private eval, SFT/RFT candidate set, A/B comparison | Blind pursuit of large-scale fine-tuning |
This maturity model has a realistic premise: each level must be able to generate value independently. Personal habits can improve review quality, team standards can reduce repeated disputes, platform closed-loop can reduce data collection costs, and model optimization can improve the quality of the next round of delivery. If one level only serves the next level, it can easily fall by the wayside.
Anti-pattern: Where closed loops are most likely to fail
Anti-pattern 1: Treat AI logs as training data
Logs are not training data. Logs are just raw material.
A raw AI conversation may contain faulty context, expired constraints, sensitive information, half-baked reasoning, invalid attempts, and ad hoc human instructions. Taking it directly for training is equivalent to letting the model learn the most chaotic side of the team.
The correct approach is to layer the logs:
| Layer | Handling |
|---|---|
| Raw logs | Retained short-term for audit and problem review |
| Structured traces | Facts extracted and linked to tasks, tools, diffs, tests, reviews |
| Candidate samples | After masking, deduplication, quality scoring, and manual spot checks |
| Training/evaluation assets | With explicit purpose, version, life cycle, and isolation relationships |
Logs that are not structured and gated are at most debugging materials, not model assets.
Anti-pattern 2: Only reward "finally passing"
If the team only regards the output that “finally passes the test” as a good sample, the model will learn a dangerous preference: as long as it can pass in the end, it doesn’t matter how the process goes.
Software engineering is not about answering questions. An implementation that finally passes may rely on expanding the scope, bypassing interfaces, adding hidden states, sacrificing performance, or creating subsequent maintenance costs. Real teams care about “passing maintainably”, not “passing by chance”.
Therefore, sample scoring cannot only look at the results. Look at at least five dimensions simultaneously:
| Dimension | Question |
|---|---|
| Correctness | Does the function meet the acceptance criteria? |
| Minimality | Is the scope of the change reasonable? Are irrelevant changes introduced? |
| Maintainability | Does the structure fit the team's long-term evolution direction? |
| Verifiability | Are sufficient tests and evidence provided? |
| Constraint compliance | Does it respect security, performance, compatibility, and architectural boundaries? |
This is why human mentor feedback is important. Automated testing can judge many things, but it cannot completely judge engineering choices.
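For the part of this scoring that can be automated, one sketch is to require a floor on every dimension instead of letting a strong result average away a weak one; the dimension names follow the table above and the threshold is a placeholder:

```python
DIMENSIONS = ("correctness", "minimality", "maintainability", "verifiability", "constraint_compliance")

def score_sample(scores: dict[str, float], floor: float = 0.6) -> tuple[float, bool]:
    """scores holds one value in [0, 1] per dimension; accept only if no dimension drops below the floor."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    accepted = all(scores[d] >= floor for d in DIMENSIONS)
    return overall, accepted
```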
Anti-pattern 3: Use public benchmarks to replace team private tasks
Public benchmarks can help the team compare the basic capabilities of the model, but they cannot replace the team’s private tasks. Especially for AI programming, whether a model “can write code” is only the threshold. Whether it “understands your code base, your constraints, and your release method” determines whether it can enter real delivery.
The value of SWE-bench is that it brings real GitHub issues, code repositories and tests into the evaluation, approaching real software engineering. But for a specific team, the most critical assessment tasks should come from its own historical bugs, architectural constraints, test escapes, and review disputes.
I suggest that the public benchmark only answers two questions:
- Does this model have the basic capabilities to enter the team trial?
- Is there any obvious degradation in general capabilities after the model is upgraded?
Team-private eval answers more critical questions:
- Can this model reliably locate problems in our repositories?
- Does it adhere to our architectural boundaries and security redlines?
- Does it reduce duplication of work for the reviewer, or does it create new rework?
- Can it continue to improve from the Mentor signals we give it?
Anti-pattern 4: Make Coding Mentor an additional burden for a few experts
If all structured feedback relied on hand-written notes from a handful of experts, the system would break down quickly. Expert time is too expensive to spend on collation.
The correct approach is to layer.
| Layer | Who is responsible | Degree of automation |
|---|---|---|
| Basic fact collection | System | High; extracted automatically from PRs, CI, and agent traces |
| General labeling | Developers and reviewers | Medium; completed through templates and preset tags |
| High-value sample judgment | Coding Mentor | Low; manual judgment of teaching value and boundaries |
| Dataset sampling | AI/platform/security jointly | Medium; automated scanning plus manual sampling |
Experts should only deal with high-value judgments: Does this sample represent a certain type of engineering capability? Is this error typical enough? Is this fix portable? Has this experience expired? The rest is automated as much as possible.
90-day roadmap, indicator system and integration boundaries
An actionable 90-day roadmap
If a team wants to start this closed loop today, I don’t recommend directly setting up a half-year platform project. Start by running for 90 days.
Days 1-30: Make AI collaboration auditable
The goal is not to collect big data, but to have a minimal record of each AI engagement delivery.
action:
- Modify the PR template to include AI engagement scope, key context, verification evidence, failure remediation, and data routing recommendations.
- Define 6-8 error type tags, no more than 10.
- Choose 2 real projects to pilot, not company-wide.
- Select 5 AI collaboration cases to review every week to determine which ones have Mentor value.
Acceptance criteria:
| Indicator | Target |
|---|---|
| Completeness of AI participation records in PRs | Above 70% |
| Cases reviewed each week | At least 5 |
| First version of error type labels | Covers 80% of common problems |
| Clear discard rules | Cover at least three categories: sensitive information, non-reproducible, low value |
Days 31-60: Building a private eval seed set
The goal is to solidify high-value tasks in real delivery into regressible evaluations.
action:
- Pick 20-50 tasks from historical bugs, review controversies, and test escapes.
- Complete the task statement, repository version, acceptance criteria, test commands, and reference fix for each task.
- Make the train/eval isolation rules explicit.
- Select 2-3 models or toolchain versions for offline evaluation.
Acceptance criteria:
| Indicator | Target |
|---|---|
| Number of eval tasks | 20-50 |
| Reproducibility of each task | 100% |
| Flaky task labeling | 100% labeled |
| Model comparison report | At least 1 |
Days 61-90: Open up sample routing and feedback for improvements
The goal is for data to start feeding back into the tool chain, rather than just reporting.
action:
- To create a minimal version of the trace store, you can first use structured files or internal tables, without having to install a complex system at the beginning.
- Route delivery trajectories into evals, SFT candidates, preference data, knowledge bases, and drop zones.
- Make a prompt, RAG or toolchain fix for a high-frequency error.
- Use the private eval to regression-test the effect of the fix.
Acceptance criteria:
| Indicator | Target |
|---|---|
| Data routing coverage | More than 60% of the pilot projects' AI PRs |
| SFT candidate samples | 30-100 hard samples |
| High-frequency error correction | At least one error type shows a significant decrease |
| Eval regression mechanism | Runs stably before and after toolchain changes |
A decision on whether to platform will be made after 90 days. If PR templates, eval seed sets, and sample routing cannot run, hastily building a platform will only solidify process problems into system problems.
Indicator system: Don’t just ask how much time AI saves
Of course we need to look at the productivity indicators of the AI programming assistant, but the Coding Mentor data closed loop also needs to look at several other sets of indicators.
| Indicator group | Representative indicators | Purpose |
|---|---|---|
| Delivery efficiency | Lead time, completion time of AI-involved tasks, PR cycle time | Determine whether AI actually helps delivery |
| Engineering quality | CI failure rate, review defect density, defect escapes, rollbacks | Determine whether AI creates quality debt |
| Collaboration burden | Number of manual corrections, review rounds, frequency of repeated feedback | Determine whether AI reduces the mentor's burden |
| Data assets | Number of usable eval tasks, number of hard samples, proportion of samples passing the gates | Determine whether the closed loop produces reusable assets |
| Model improvement | Private eval improvement, reduction of high-frequency errors, toolchain regression stability | Determine whether the data actually feeds back into capability |
The most misleading is “AI participation rate.” A high AI participation rate does not mean high value. A team can let AI write a lot of code, while making the reviewers more tired, with more defects, and the architecture more messy. What really needs to be seen is whether the participation of AI reduces repeated errors and makes engineering judgments more reusable.
Integration boundaries with existing engineering systems
This closed loop should not be reinvented. It should be embedded into existing engineering systems.
| Existing system | Integration method |
|---|---|
| Issue system | Save requirements, acceptance criteria, defect classification, and business priorities |
| Git/PR | Save diff, review, merge status, AI participation fields |
| CI/CD | Save test, build, security scan, deployment results |
| Logging/Monitoring | Save online feedback, error rate, and performance changes |
| Document system | Save architectural constraints, specifications, reviews and knowledge bases |
| model platform | Run eval, prompt versions, SFT/RFT experiments and A/B comparisons |
Boundaries also need to be clear. AI data closed loop does not replace ALM, CI, code review or knowledge base, but connects the key signals in these systems. It is more like an “AI engineering learning layer” that is responsible for converting delivery facts into assets that the model can learn and the team can evaluate.
What problem does this closed loop really solve?
By this point the article may make the closed loop sound like a very heavy system. It is indeed heavier than “open the AI tool and write code directly”, but it solves problems the latter can never solve.
First, it makes AI progress attributable. Did results improve because the model changed, the prompt changed, more context was added, the training samples got better, or the tasks simply got easier? Without evals and traces, the team can only guess.
Second, it makes human experience reusable. The judgment of senior engineers in each review no longer only serves the current PR, but becomes an asset for subsequent models, knowledge bases and evals.
Third, it makes AI risks governable. Sensitive data, expired experience, benchmark contamination, and unreproducible samples no longer depend on personal awareness; they are caught by the gate control system (a minimal gate sketch follows the fifth point below).
Fourth, it makes the model optimize for delivery rather than for scores. If an improvement on a public leaderboard does not reduce the team’s rework, it is not a real gain for the team.
Fifth, it allows Coding Mentor to transform from individual capabilities to organizational capabilities. A person who can train AI is valuable, but a team that can stably produce Mentor signals is even more valuable.
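Returning to the third point, “entering the gate control system” can start as a handful of checks expressed in code. The sketch below is illustrative: the patterns, the one-year staleness threshold, and the field names are assumptions rather than a complete policy.

```python
# Sketch: a data gate as code instead of personal awareness.
# Patterns, thresholds, and field names are illustrative assumptions.
import re
from datetime import datetime, timedelta, timezone

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS-style access key ids
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # embedded private keys
]

def gate(sample: dict, benchmark_task_ids: set[str]) -> tuple[bool, str]:
    """Return (passes, reason). A sample must pass every check to become a training asset."""
    text = sample.get("ai_diff", "") + sample.get("mentor_feedback", "")
    if any(p.search(text) for p in SECRET_PATTERNS):
        return False, "sensitive data"            # secrets never leave the gate
    created = datetime.fromisoformat(sample["created_at"])
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)
    if datetime.now(timezone.utc) - created > timedelta(days=365):
        return False, "expired experience"        # stale constraints can mislead the model
    if sample.get("source_task_id") in benchmark_task_ids:
        return False, "benchmark contamination"   # keep public-benchmark overlap out
    if not sample.get("reproducible", False):
        return False, "not reproducible"          # no acceptance evidence was recorded
    return True, "ok"
```

Whether a sample was dropped, and why, also becomes part of the trace, which is what feeds the “proportion of samples passing the gate” indicator above.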
Conclusion: Build the feedback system first, then talk about the training system
My advice is clear: don’t start from “we are going to train a team-specific model.” Start from “can we turn the process of AI participating in delivery into a reliable feedback system?”
If the team hasn’t recorded why the AI failed, don’t rush into SFT. If the team doesn’t have a private eval yet, don’t trust fine-tuned scores. If the team doesn’t have data gating yet, don’t treat collaboration logs as an asset. If the reviewers’ judgment still stops at “it doesn’t work here,” first teach the humans how to write Mentor signals.
The real order should be:
- Make the AI delivery process observable.
- Make human feedback structured.
- Make data routing gated.
- Make the private eval regressable.
- Keep the training data small and hard.
- Let the model and the toolchain keep improving against the evaluation.
This is what this article calls a closed loop: AI-assisted product engineering delivery is not the end point but the production site of training and evaluation data; the human Coding Mentor is not the final approver but the designer of feedback signals; SFT does not pour logs into the model but distills gated engineering judgments into learnable assets.
The next article (Part 8) turns to a more concrete question: once these trajectories, feedback records, and candidate samples exist, how to clean, filter, and label them, convert them into high-quality SFT data, and connect them to the training pipeline. After that step, the final article (Part 9) returns to long-term evolution and the future outlook.