From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop

The real organizational value of AI programming assistants is not just faster delivery, but the trainable, evaluable, and reusable mentor signals that can be precipitated from every requirement breakdown, code generation, review revision, test verification, and post-release review. This article reconstructs the closed-loop framework connecting AI training, AI-assisted product engineering delivery, high-quality SFT data accumulation, and model evaluation.

Published: 3/30/2026 · Category: Interpretation · Reading time: 46 min

Copyright statement and disclaimer: This article is an original interpretation based on public materials from GitHub, Anthropic, LangChain, OpenAI, SWE-bench, and SWE-agent. Copyright of the source materials belongs to their original authors and sources. This article is not an official translation and does not represent the views of these organizations.

Originality: The closed-loop architecture, data routing model, Mentor signal definitions, quality gating, and implementation route in this article are the author's original reconstructions. External materials serve only as sources of evidence and industry cases and are not translated paragraph by paragraph.


Beginning: What is really lost is not the code, but the process

A team spent half a year involving AI in daily development: AI wrote the first draft when breaking down requirements, produced the first version of each feature implementation, analyzed logs when tests failed, and scanned PRs before code review. Half a year later, the repository holds more code, the PRs hold more commits, and the CI system holds more build records. The question is: can any of this be used in turn to train an AI that understands the team better?

In most cases, no.

The reason is not that AI-generated code has no value, but that the team saved only the delivery results, not the delivery process. Why did a failed implementation fail? Which business constraints did the AI miss? Why did the human reviewer reject a patch that seemed to work? After a test failed, which log actually pointed to the root cause? This information is usually scattered across chat windows, terminal history, PR comments, and personal memory. Once delivery is complete, it is cleared away like a temporary debug branch.

I prefer to state it more directly: if an AI-assisted development session leaves behind only the final code, it is merely a delivery; if it also leaves behind the requirements, context, plans, tool calls, failure logs, manual corrections, and acceptance evidence, it has the chance to become a reusable Coding Mentor training sample.

The previous articles discussed how to evaluate AI, how to design questions, and how to organize collaboration and review cases. Article 7 cannot simply continue as "how to use an AI programming assistant", nor should it stop at "building an evaluation platform". This article answers a question closer to the engineering system: how to involve AI in the entire product engineering delivery process, and how to design that process into a closed loop that continuously accumulates high-quality training data and evaluation data.

Four things happen simultaneously in this closed loop:

  1. AI participates in real product delivery, not toy problems in a sandbox.
  2. Human Coding Mentors give structured feedback at key nodes instead of just saying "this section doesn't work."
  3. Delivery traces are routed to eval sets, SFT, preference data, the knowledge base, or the discard zone, rather than all mixed together in logs.
  4. Models, prompts, toolchains, and team processes are adjusted continuously based on evaluation results, not upgraded on personal taste.

The core judgment of this article: being able to write code with AI is not a moat; being able to transform the entire process of AI participation in delivery into training assets, evaluation assets, and organizational memory is the real moat for teams acting as Coding Mentors for AI.

Problem boundaries and value judgments

Calibrate the direction first: this article is not an early draft of Part 8

This series already has an article dedicated to SFT data generation. That article goes into detailed engineering implementation: data structure, cleaning, quality scoring, export formats, and the training process. This article does not repeat those details.

The responsibility of Part 7 sits further upstream: first design "where the data comes from, why it is trustworthy, how it is routed, who is responsible for judgment, what must not enter the training set, and how evaluation in turn affects delivery." If Part 8 is the data processing factory, this article is about the raw material production line, the quality inspection rules, and how the closed loop operates.

I suggest using the following boundary to separate the two articles:

| Article | Focus | Core question | Not covered |
| --- | --- | --- | --- |
| Part 7 | The data closed loop in AI engineering delivery | How every AI collaboration can become a trainable, evaluable, and reusable Mentor signal | Specific training scripts, export formats, model fine-tuning parameters |
| Part 8 | The SFT data generation pipeline | How to process the screened engineering assets into training samples and connect them to the training process | Organizational closed loop, data ownership, online delivery indicator system |

This boundary matters. When many teams talk about "training data", they jump straight to JSONL, SFTTrainer, LoRA, and evaluation scripts. Those are second-half questions. If the first half is not designed well, the second half only processes dirty data more beautifully.

The value of an AI programming assistant does not equal the value of a Coding Mentor

GitHub's early research on Copilot focused on developer speed and experience: AI helps developers complete tasks faster and improves subjective satisfaction. Later enterprise research by GitHub and Accenture pushed the perspective into enterprise practice, measuring the impact of AI programming tools along dimensions such as telemetry, code quality, and developer feedback. This research is valuable because it shows that AI involvement in software delivery is not a gimmick; it genuinely changes the development process.

But there is a bifurcation here that is easily overlooked: improving delivery efficiency and having a team with Coding Mentor capabilities are two different things.

When using an AI programming assistant, the team asks, "Can AI help me complete this task faster?" When acting as a Coding Mentor for AI, the team asks, "What capability boundaries did this collaboration expose, and can I turn those boundaries into trainable, evaluable, and reusable feedback signals?" The former is the ability to consume models; the latter is the ability to build them.

These two goals are not in conflict, but engineering design is completely different.

| Target | Main action | Typical products | Risk |
| --- | --- | --- | --- |
| Use AI to improve delivery efficiency | Let AI write code, explain errors, generate tests, assist in reviews | PRs, commits, test results, delivery reports | Only the results are kept, not the process |
| Be a Coding Mentor for AI | Record AI inputs, decisions, failures, revisions, validations, and human judgments | Structured trajectories, error labels, evaluation tasks, training candidate samples | Mistaking all logs for training data |

The former can be measured by delivery cycle time, task completion speed, developer satisfaction, and PR rework rate. The latter is harder to measure: does the AI repeat fewer mistakes, does it apply team norms better, does it make more stable choices in similar contexts, can it pass private eval sets, and does it internalize the problems human reviewers repeatedly point out into its subsequent behavior?

A mature team should not run these two lines separately. The right direction: use AI-assisted delivery to create real task scenarios, use real task scenarios to produce auditable trajectories, use human mentor signals to screen the data, and then use training and evaluation to improve the quality of the next round of delivery.

Closed Loop Overview: From a PR to a Training Flywheel

If you treat AI-assisted development as an ordinary tool call, the process is usually very short: the developer asks a question, the AI gives code, the human copies and modifies it, the tests pass, the PR is submitted. This process adds speed, but it produces no reusable organizational assets.

To transform it into a Coding Mentor data closed loop, the process needs three more layers: an observation layer, a judgment layer, and a routing layer.

[Figure: AI Coding Mentor data flywheel]

The first layer is the observation layer. What the team wants to capture is not "what the AI finally said" but how the AI acted in the engineering environment: which files it looked at, what context it relied on, what plan it proposed, what commands it executed, how it corrected itself after failure, and what the final diff and test results were.

The second layer is the judgment layer. Human Coding Mentors do more than accept the code; they write their judgments into learnable signals: Is this error a requirement misunderstanding, a context omission, an interface contract violation, insufficient test design, or a wrong engineering choice? Why is the correct fix in this direction? Does this experience transfer to other tasks?

The third layer is the routing layer. Not all trajectories belong in the training set. A delivery trace may be better suited to a private eval, may only be suitable for the team knowledge base, or may have to be discarded because it contains sensitive information. Routing determines the fate of data assets.

This set of closed loops can be summarized in one sentence:

The delivery process is responsible for generating real tasks, Mentor feedback is responsible for creating high-quality supervision signals, data routing is responsible for determining usage, and evaluation results are responsible for driving the next round of delivery strategy adjustments.

Anthropic has repeatedly emphasized in its agent-system practice that you should not build a complex autonomous agent from the start, but begin with a simple, controllable workflow and gradually add tools, memory, and autonomy. That advice holds here too. The team does not need to build a fully automated training platform on day one. The real first step is to make AI participation in delivery observable, replayable, and annotatable.

LangChain's discussion of the agent improvement loop offers a useful engineering perspective: a trace is not a debugging by-product but the entrance to the agent improvement loop. Without traces, failure review relies on impressions; with traces, the team can cluster failures, construct data sets, run evals, and drive system improvement.
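To make the three layers concrete, here is a minimal sketch of what a single delivery trace record might look like. All field names are illustrative assumptions, not a fixed schema; the point is that observation, judgment, and routing each get explicit fields.

```python
from dataclasses import dataclass


@dataclass
class DeliveryTrace:
    """One AI-assisted delivery, split along the three layers (fields illustrative)."""
    # Observation layer: what the AI saw and did
    task_goal: str
    context_files: list[str]
    plan: str
    tool_calls: list[dict]   # e.g. {"cmd": "pytest", "exit_code": 1, "log": "..."}
    final_diff: str
    # Judgment layer: what the human mentor concluded
    error_types: list[str]   # e.g. ["context_omission", "contract_violation"]
    correction_reason: str
    # Routing layer: where this trace is allowed to go
    route: str = "undecided" # "eval" | "sft" | "preference" | "kb" | "discard"
```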

Data Acquisition and Mentor Signals

Data collection points in the delivery process: Don’t just focus on prompts and answers

Many people say, "Save the AI dialogue and use it for training later." That is only half right. Conversations matter, but in software engineering scenarios the most valuable signals are often not in the chat text; they are in the context, the tool feedback, and the human corrections.

A real AI-assisted delivery should be viewed as at least eight stages.

| Stage | Data to collect | What the Mentor judges | Assets that may be precipitated |
| --- | --- | --- | --- |
| Requirement intake | Original requirements, business goals, acceptance criteria, non-goals, risk constraints | Does the AI understand the real boundaries, not just the literal feature? | Private eval question stems, requirement-understanding cases |
| Task planning | AI's decomposition, assumptions, dependencies, modification scope, verification plan | Does the plan miss migrations, compatibility, error paths, rollback strategies? | Planning capability samples, planning evaluation criteria |
| Context retrieval | Files read by the AI, referenced documents, search paths, missing files | Did the model obtain the necessary context, and does it reference outdated information? | RAG index optimization, context selection samples |
| Code generation | Initial diff, alternatives, interface changes, test additions | Does the code comply with architecture, naming, error handling, and security requirements? | SFT candidate samples, negative examples |
| Tool execution | Shell commands, test logs, lint, type checking, build failures | Can the AI locate problems from environment feedback instead of patching blindly? | Tool-use traces, failure repair cases |
| Human review | Reviewer comments, rejection reasons, design disputes, modification suggestions | Which issues must be judged by humans, which can be automated | Preference data, review rules |
| Repair iteration | Failed attempts, corrected patches, final patch | What the correct repair path is, and why the wrong attempts were wrong | DPO/RFT candidate data, error pattern library |
| Acceptance review | CI, regression tests, online results, defect escapes, rework records | Does the sample really represent success, or did it just not blow up yet? | Holdout eval, basis for down-weighting samples |

A practical judgment: if the team can save only three types of data, I would prioritize "requirements and acceptance criteria", "failure trajectories and tool feedback", and "reasons for human corrections". The code repository will survive in any case; these intermediate signals are what is easily lost.

Do not turn this into a heavy form-filling process; forms get bypassed by developers. A better approach is to embed data collection into existing engineering actions: PR templates, CI annotations, review bots, agent traces, issue state transitions, and test report archiving. Let developers write one more key judgment, not fill in twenty more fields.
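As a sketch of that embedding, the snippet below shows how a CI step might lift mentor fields out of a PR description so collection piggybacks on an action developers already perform. The section names and heading format are assumptions that would have to match the team's actual PR template.

```python
import re

# Hypothetical PR-body sections; must match the team's PR template.
SECTIONS = ["AI participation scope", "Key context", "Failure and correction",
            "Verification evidence", "Routing recommendation"]


def extract_mentor_fields(pr_body: str) -> dict[str, str]:
    """Pull the mentor-signal sections out of a PR description in CI.
    Assumes each section starts with '### <name>' and runs to the next heading."""
    fields = {}
    for name in SECTIONS:
        match = re.search(rf"### {re.escape(name)}\n(.*?)(?=\n### |\Z)",
                          pr_body, re.DOTALL)
        fields[name] = match.group(1).strip() if match else ""
    return fields
```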

Mentor signals: turn "I don't think this works" into a structure the model can learn from

An AI will not learn much from "this is badly written here". Human reviewers themselves often cannot articulate the basis of a judgment. The first skill of a Coding Mentor is breaking implicit engineering judgment into explicit feedback signals.

I recommend splitting Mentor signals into six categories.

| Signal type | Question to answer | Typical annotations |
| --- | --- | --- |
| Task understanding | Does the AI understand the real problem to be solved? | Requirement misunderstanding, non-goal expansion, omitted acceptance criteria |
| Context selection | Did the AI find the necessary context? | Interface contract ignored, old implementation referenced, callers not read |
| Implementation quality | Does the code meet team engineering standards? | Missing error handling, missing boundary conditions, over-abstraction |
| Verification behavior | Does the AI know how to prove itself right? | No tests added, only the happy path exercised, type checking ignored |
| Repair strategy | How does the AI adjust when faced with failure? | Blind modification, partial fix, root cause located, rollback and redo |
| Organizational constraints | Does the AI comply with the team's implicit rules? | Security red lines, performance budgets, compatibility strategies, release rhythm |

These signals must not be too general. "Poor code quality" is not a good label because it cannot guide training or evaluation. A better version: "The error path does not preserve the original exception, so the upper-layer retry strategy cannot distinguish temporary failures from permanent ones." That sentence is longer than "error handling is bad", but it contains the causal chain.

When I review AI-generated code, what I am most wary of are implementations that run but defer their cost. They pass the tests while implicitly violating the team's long-term constraints: synchronous calls stuffed into hot paths, temporary cover-ups written as default logic, one-off migrations written as permanent branches. These problems are the best candidates for the Mentor signal library, because ordinary benchmarks rarely cover them, yet they recur in real projects.

The quality of Mentor signals determines the ceiling of all subsequent data. Without structured signals, the training data is just "code modified by humans"; with them, it becomes learnable evidence of "why humans changed it this way".
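A minimal structure for such signals might look like the following sketch. The type names and fields are illustrative, and the example instance encodes the error-path case above.

```python
from dataclasses import dataclass
from enum import Enum


class SignalType(Enum):
    TASK_UNDERSTANDING = "task_understanding"
    CONTEXT_SELECTION = "context_selection"
    IMPLEMENTATION_QUALITY = "implementation_quality"
    VERIFICATION_BEHAVIOR = "verification_behavior"
    REPAIR_STRATEGY = "repair_strategy"
    ORG_CONSTRAINT = "org_constraint"


@dataclass
class MentorSignal:
    signal_type: SignalType
    observation: str   # what the AI actually did
    consequence: str   # why it matters: the causal chain
    principle: str     # the transferable rule


signal = MentorSignal(
    signal_type=SignalType.IMPLEMENTATION_QUALITY,
    observation="Error path swallows the original exception",
    consequence="Upper-layer retry cannot distinguish temporary from permanent failures",
    principle="Preserve or wrap the original exception on every error path",
)
```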

Data Contract: A delivery trace must answer at least ten questions

The team does not need a huge schema from day one, but it does need a minimum data contract. The goal of this contract is not to serve database design; it is to ensure that when a sample is revisited later, it is still possible to judge whether it is trustworthy.

| Question | Why it must be answered |
| --- | --- |
| What is the true goal of this task? | Without the goal, there is no way to tell whether the AI output solves the right problem |
| What are the acceptance criteria? | Without acceptance criteria, a passing test may be an illusion |
| What context did the AI use? | Distinguishes insufficient model capability from insufficient context supply |
| What plan did the AI propose? | Plans expose engineering judgment, not just coding ability |
| Where did the initial output fail? | Failure points are among the most valuable training signals |
| What did humans modify? | The diff shows what changed, but not why |
| Why did humans change it this way? | This is the core of the Coding Mentor signal |
| What were the automated verification results? | Prevents subjective satisfaction from being mistaken for success |
| Was it later reworked or rolled back? | Prevents short-term success from being mistaken for long-term success |
| Where should this trajectory go? | Decides whether it becomes eval, SFT, preference data, knowledge base, or discard |

Behind this contract is a principle: training data must contain causal information, not only result information. Result information alone encourages the model to imitate surface answers; causal information gives the model a chance to learn engineering judgment.
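A minimal completeness check against this contract could look like the sketch below; the field names are illustrative and would map onto whatever trace schema the team actually uses.

```python
REQUIRED_FIELDS = [
    "goal", "acceptance_criteria", "context_used", "plan",
    "initial_failures", "human_diff", "correction_reason",
    "verification_results", "rework_or_rollback", "routing",
]


def contract_gaps(trace: dict) -> list[str]:
    """Return the contract questions this trace cannot answer.
    A trace with gaps can still be stored, but should not be routed to training."""
    return [f for f in REQUIRED_FIELDS if not trace.get(f)]
```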

The research value of SWE-agent is not only that it lets models solve SWE-bench tasks, but that it frames the software engineering agent's action process as an "agent-computer interface": the model does not emit an answer in one shot; it reads files, edits, runs tests, and observes feedback in the environment. For a team's internal data closed loop, this perspective is much closer to real engineering than a single round of prompt and answer.

Data routing: not everything should go into the training set

I object to the statement "collect all AI collaboration logs and use them for training later." It sounds like data-asset awareness, but it actually creates three problems: sensitive information leakage, low-quality sample contamination, and train/eval confusion.

A more reasonable approach is data routing. After each delivery trace passes quality gating, it may enter only one or a few destinations.

[Figure: Routing from delivery traces to data assets]

| Destination | Data suitable for entry | Data not suitable for entry | Main users |
| --- | --- | --- | --- |
| Private eval set | Real, reproducible tasks with clear acceptance, stable tests, answers not leaked | Flaky tests, vague requirements, samples already used for training | Model evaluation, toolchain regression |
| SFT candidate set | High-quality manual corrections, clearly explained, transferable, fully verified | Unreviewed AI output, lucky passes, one-off business details | Model fine-tuning, behavioral demonstration |
| Preference data | Comparable alternatives, clear review reasons, clear boundaries between better and worse | No clear basis for preference, mere personal style differences | DPO/RFT, strategy selection training |
| Team knowledge base | Architectural constraints, common mistakes, review rules, review conclusions | Keys, customer data, temporary workarounds | RAG, prompt context, engineering specifications |
| Discard zone | Sensitive, contaminated, irreproducible, low-value, unverified | Nothing here may return to training or assessment | Data governance, auditing |

The most common mistake here is to conflate "humans corrected it" with "it can be trained on." A human correction only shows that the original output had a problem; it does not mean the final sample is suitable for learning. A patch may be a temporary fire extinguisher, may bypass the root cause, may sacrifice long-term maintainability, or may contain business logic that must not leak. Without routing gates, the training set swallows this historical baggage whole.
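A routing function under this model might look like the following sketch. The boolean gate flags are assumed to be set upstream by automated checks and mentor review; the key point encoded here is that the eval route is exclusive of the training routes.

```python
def route_trace(trace: dict) -> set[str]:
    """Route one gated delivery trace to destinations (flag names illustrative)."""
    if trace["sensitive"] or not trace["reproducible"]:
        return {"discard"}
    # Eval is exclusive of training routes, preserving train/eval isolation.
    if trace["eval_worthy"] and trace["stable_tests"] and not trace["used_in_training"]:
        return {"eval"}
    routes: set[str] = set()
    if trace["verified"] and trace["correction_explained"] and trace["transferable"]:
        routes.add("sft_candidate")
    if trace.get("alternatives_compared") and trace.get("preference_reason"):
        routes.add("preference")
    if trace.get("rule_extracted"):
        routes.add("knowledge_base")
    return routes or {"discard"}
```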

OpenAI later stopped using SWE-bench Verified as the main public evaluation standard for frontier models, for a practical reason: public benchmarks get repeatedly optimized against by the community, which causes contamination and overfitting and exposes the limits of the test itself. The same applies to internal team eval sets. If the training set, tuning set, and evaluation set are mixed together, rising metrics only mean the system has gotten better at memorizing questions, not at delivering.

Quality Gating: The brake of the data flywheel is more important than the accelerator

The phrase "data flywheel" is exciting, as if collecting enough trajectories will naturally make the model stronger and stronger. In engineering, the opposite is true: what determines whether the flywheel spins is not acquisition speed but gating quality.

I recommend dividing the gate control into seven lanes.

| Gate | What it stops | Consequence if it fails |
| --- | --- | --- |
| Privacy and security | Keys, tokens, customer data, internal addresses, sensitive fields in production logs | Data assets become a source of security incidents |
| IP and licensing | Third-party code and restrictively licensed content that cannot be used for training | The usable scope of subsequent models is limited |
| Data contamination | Public benchmark answers, eval samples that have already entered the training set | Evaluation metrics become distorted |
| Reproducibility | Irreproducible problems, missing environments, tests that cannot be run | The model learns unverifiable experience |
| Verification adequacy | No testing, no review, no acceptance evidence | "Looks right" is mistaken for "engineered right" |
| Teaching value | One-off business details with no transferable judgment | Noise is added without improving model capability |
| Life cycle | Expired schemas, temporary workarounds, deprecated APIs | Historical debt is trained into default behavior |

Of these seven gates, I value "life cycle" the most. Training data is not an archive. If expired experience is not retired, the model will keep enforcing old constraints the team has already abandoned. For example, after the team moves from REST to an event-driven architecture, old samples that synchronously query the downstream service should be down-weighted or removed. As organizational memory, a model has no sense of time; data governance must supply it.

Gating should not be entirely manual. Humans handle boundary calls and spot checks; automation handles the repetitive checks: sensitive field scanning, license marking, test stability statistics, sample deduplication, train/eval split checks, and expired API detection. The scarcest resource, human mentor time, should be spent judging whether a sample has teaching value, not hunting for leaked tokens by hand.
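The automated half of such gating might look like the sketch below: two of the repetitive checks (secret scanning and train/eval overlap), with illustrative patterns that any real deployment would extend.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                       # AWS access key id
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"), # private key block
    re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*\S{16,}"),
]


def automated_gate(trace_text: str, eval_ids: set[str], trace_id: str) -> list[str]:
    """Return gate violations for one trace. Humans still judge teaching value;
    this handles only the repetitive checks."""
    violations = []
    if any(p.search(trace_text) for p in SECRET_PATTERNS):
        violations.append("privacy_security")
    if trace_id in eval_ids:
        violations.append("data_pollution")  # sample already lives in an eval set
    return violations
```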

Evaluation closed loop: private eval is the team’s engineering physical examination form

If the training data determines what the model learns, eval determines what the team believes.

Public benchmarks still matter. SWE-bench brought real GitHub issues into software engineering evaluation, far closer to engineering reality than traditional algorithm problems. SWE-Gym goes further, converting real issues, environments, and trajectories into trainable tasks. These works point in one direction: the evaluation of coding agents is moving from static questions toward real repositories, real environments, and real feedback.

But a team cannot look only at public benchmarks. Public benchmarks measure general capability; a team's private eval measures whether this AI is reliable inside our engineering system. The relationship is like that between a physical examination report and a job probation period: the former uncovers basic problems, the latter judges fit for a specific position.

Team-private eval should cover at least four categories of tasks.

| Task type | Capability assessed | Sample sources |
| --- | --- | --- |
| Regression repair tasks | Can it locate and fix root causes in the real code base? | Historical bugs, online defects, CI failures |
| Architectural constraint tasks | Can it respect team boundaries, contracts, and performance budgets? | Review disputes, architectural decision records |
| Test reinforcement tasks | Can it supplement critical paths, boundary conditions, and error paths? | Test escapes, defect reviews |
| Refactoring trade-off tasks | Can it improve structure without changing behavior? | Historical refactorings, technical debt governance |

OpenAI's practice on evals emphasizes that evaluation should serve real business results rather than produce a pretty score. In an AI programming scenario, a private eval should look not only at pass rate but at whether the repair path is reasonable, whether context references are correct, whether architectural constraints were broken, and whether verification evidence can be given.

I suggest dividing the evaluation indicators into three tiers.

| Tier | What to look at | Example indicators |
| --- | --- | --- |
| Offline capability | Performance of a model or agent on a fixed task set | Pass rate, repair success rate, failure type distribution |
| Collaboration process | AI delivery quality in real PRs/issues | Number of manual corrections, review defect density, CI failure rate |
| Business results | Long-term impact of AI on the engineering delivery system | Lead time, rollbacks, defect escape rate, maintenance cost |

These three tiers cannot substitute for one another. Offline eval gives fast feedback but easily disconnects from real work; process indicators are close to the team's daily routine but fluctuate with task difficulty; business results are the most real but the slowest to arrive. A mature closed loop watches all three.
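A private eval task pinned down for the offline tier might be recorded as in the following sketch; the fields are illustrative, and the key properties are an exact repo revision, a runnable test command, and a stated reason for the task's existence.

```python
from dataclasses import dataclass


@dataclass
class PrivateEvalTask:
    """One private eval task, pinned to a reproducible repo state (fields illustrative)."""
    task_id: str
    category: str          # regression_fix | arch_constraint | test_reinforcement | refactor_tradeoff
    repo: str
    commit: str            # exact revision the task reproduces against
    statement: str         # question stem, derived from the real issue
    acceptance: list[str]  # observable acceptance criteria
    test_command: str      # hypothetical, e.g. "pytest tests/payments -k idempotency"
    reason: str            # why this task is worth evaluating
    flaky: bool = False    # flaky tasks are excluded from capability judgments
```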

New responsibilities of human Coding Mentor: from reviewer to signal calibrator

In traditional software engineering, the reviewer's main responsibility is keeping bad code out of the trunk. In the AI data closed loop, reviewers still guard that gate, but they gain an additional responsibility: transforming their judgments into signals the system can absorb.

This changes how review comments are written.

| Normal review comment | Mentorized review comment |
| --- | --- |
| Additional testing is needed here | Error-path tests are missing here; the current implementation covers only success returns and cannot prove the retry strategy is correct on timeout and downstream 500s |
| This abstraction is not necessary | This abstraction is used by only one call site but introduces a new lifecycle and configuration branch; the maintenance cost exceeds the reuse benefit |
| Don't query the database like this | This query runs on the hot path of the list page and lacks paging and index constraints; as data volume grows, the latency risk transfers to the user request path |
| This does not meet the standards | This interface bypasses the existing permission middleware and breaks the team's architectural constraint of centralizing authentication at the gateway layer |

Mentorizing a comment does not mean writing longer; it means supplementing three kinds of information: the problem type, the cause and effect, and a transferable principle. Comments written this way can themselves become training samples, eval criteria, or knowledge base entries.

The team can split Mentor feedback into a few fixed fields, as long as the process does not become rigid.

| Field | Example |
| --- | --- |
| Problem type | Missing context, boundary conditions, architectural constraints, security risks, insufficient validation |
| Triggering evidence | Which file, which test, which log, which PR comment |
| Root cause judgment | The AI did not read the callers and mistakenly believed the interface serves a single scenario |
| Correction strategy | First add the contract test, then change the implementation, finally add the migration instructions |
| Transferable experience | All consumers must be checked before modifying a shared contract |
| Data routing recommendation | Enter the eval set, do not enter SFT, because it contains customer fields and needs desensitization before evaluation |

The key here is not that "humans are superior to AI" but that humans hold organizational constraints that live outside the model. Much engineering judgment does not exist in public code: why the team does not upgrade a certain dependency, why an old interface is retained, why this module must not import a certain package, why a seemingly inefficient implementation exists for compatibility with historical customers. The value of a Coding Mentor lies in making these implicit constraints explicit.
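Serialized, one mentorized review might look like the record below. Every value is a hypothetical illustration following the field table above.

```python
mentor_feedback = {
    "problem_type": "architectural_constraint",
    "evidence": "diff of the gateway routing module; failing contract test",  # hypothetical references
    "root_cause": "AI did not read the callers and assumed the interface serves one scenario",
    "correction": "Add the contract test first, then change the implementation, then add migration notes",
    "principle": "Check all consumers before modifying a shared contract",
    "routing": {"eval": True, "sft": False,
                "reason": "contains customer fields; desensitize before evaluation"},
}
```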

Engineering Practice: From PR to Evaluation Closed Loop

Engineering Practice 1: Start with a PR template, not a training platform

Many teams want to build a platform the moment they hear "closed loop". My advice is the opposite: start with a PR template and a review spec.

The reason is simple: the PR is the smallest auditable unit of real project delivery. Requirements, code, tests, review, CI, and merge results all converge there. Turning the PR into the entrance of the data closed loop is far easier to land than building a standalone "AI data collection system".

A PR template for the Coding Mentor data closed loop needs only five additional fields.

| Field | Purpose |
| --- | --- |
| AI participation scope | Distinguish whether AI wrote code, generated tests, interpreted logs, assisted review, or only documentation |
| Key context | Record which files, documents, issues, and architectural constraints the AI used |
| Failure and correction | Save at least one valuable failed attempt and the reason for the human correction |
| Verification evidence | Link tests, lint, screenshots, performance data, security scans |
| Data routing recommendation | Tag whether it can go into eval, SFT, preference data, the knowledge base, or must be discarded |

These five fields need not be filled in every time. Low-risk changes can be abbreviated; high-risk changes must be complete. The point is to build a habit: AI is not a black-box tool, and the process by which it participates in delivery should be auditable.

CI should cooperate with this template. For example, when a PR is labeled "AI-generated core implementation", CI can require fuller test evidence; when a PR modifies a shared contract, the system can remind the reviewer to check the consumers; when a PR is marked as an eval candidate, the system can archive the issue, diff, tests, and review comments as a candidate task.
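A sketch of that CI cooperation follows; the label names and the path rule are team-specific assumptions.

```python
def ci_requirements(labels: set[str], changed_files: list[str]) -> list[str]:
    """Map PR labels to extra CI demands (label names and path rules hypothetical)."""
    demands = []
    if "ai-core-implementation" in labels:
        demands.append("require-test-evidence")   # fuller evidence for AI-written cores
    if any(f.startswith("contracts/") for f in changed_files):
        demands.append("notify-consumer-review")  # shared contract touched: check consumers
    if "eval-candidate" in labels:
        demands.append("archive-trace")           # snapshot issue, diff, tests, review comments
    return demands
```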

This is not about burdening the process; it hardens engineering judgments that would otherwise happen verbally into the system.

Engineering Practice 2: The trace store should hold replayable facts, not chat screenshots

If the team uses Claude Code, Copilot Agent, Cursor, Aider, OpenHands, or an internal coding agent, it will hit the same problem: a single AI collaboration is long, and its content is scattered across the editor, terminal, browser, PR, and chat interface.

This calls for a trace store. It is not a log bin but a system that holds replayable facts.

The trace store must store at least five types of facts.

| Type | Content | Use |
| --- | --- | --- |
| Input facts | User task, system prompts, context files, search results | Determine what the model saw |
| Action facts | Planning, tool calls, file edits, command execution | Reconstruct what the model did |
| Environment facts | Test output, compilation errors, lint, runtime logs | Determine whether failure feedback was sufficient |
| Human facts | Review comments, manual modifications, acceptance conclusions | Provide Mentor signals |
| Result facts | Final diff, merge status, online feedback, rework records | Calibrate success and failure |

Do not make non-auditable screenshots your primary data source. Screenshots can aid understanding, but they cannot serve as base facts for training and evaluation. To be truly usable, data must be parseable, maskable, retrievable, versioned, and correlatable.

The trace store should also not keep everything forever by default. Sensitive fields must be desensitized as early as possible, low-value traces must expire and be cleared, and anything entering eval or training candidacy must carry a version number. A data system without life-cycle management soon becomes a historical black box no one dares to use.
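A minimal sketch of such a store, assuming JSONL files and masking of top-level string fields only; a real store would need recursive masking, retention policies, and access control.

```python
import hashlib
import json
import re
from pathlib import Path

MASK = re.compile(r"(?i)(api[_-]?key|token|password|secret)\s*[:=]\s*\S+")


def append_trace(store: Path, trace: dict) -> str:
    """Append one fact record to a JSONL trace store: desensitize first,
    then derive a content-addressed version id for later referencing."""
    masked = {k: MASK.sub(r"\1=<redacted>", v) if isinstance(v, str) else v
              for k, v in trace.items()}
    payload = json.dumps(masked, ensure_ascii=False, sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"version": version, "trace": masked},
                           ensure_ascii=False) + "\n")
    return version
```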

Engineering Practice 3: Treat test failure as the most valuable mentor signal

In AI programming collaboration, test failures are not noise; they are the cheapest supervision signal.

A patch that passes the tests only shows that it satisfies the existing tests; a patch that is correctly repaired after a failure can tell us what the model originally misunderstood, how it recovered from environmental feedback, and which error types are most likely to recur. The latter has far higher teaching value for training and evaluation.

I recommend that teams specifically keep track of three types of failures.

| Failure type | Why it is valuable | Data usage |
| --- | --- | --- |
| Initial implementation failed | Exposes the model's default assumptions and common omissions | Error pattern library, SFT counterexample explanations |
| Repair failed after tool feedback | Exposes whether the model misreads logs or patches blindly | Agent eval, tool-use evaluation |
| Repair succeeded after a human pointed it out | Exposes how Mentor signals change the outcome | Preference data, repair strategy training |

One detail: do not keep only the final successful version. Keep at least one linked chain of a failed version, the evidence of its failure, the human or environmental feedback, and the final fix. Without this chain, the model can only learn what a successful answer looks like, not how to get from failure to success.
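The chain worth keeping can be as small as the following sketch; the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FailureToFixLink:
    """The minimal chain worth keeping: without it, a model can imitate
    successful answers but cannot learn the path from failure to success."""
    failed_patch: str      # the rejected or failing diff
    failure_evidence: str  # test log, CI output, or error message proving the failure
    feedback: str          # human comment or environmental signal that triggered the fix
    final_patch: str       # the accepted diff
    failure_type: str      # "initial_failure" | "post_feedback_failure" | "human_pointed"
```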

Anthropic's Claude Code practice emphasizes giving the model ways to verify its work: tests, lint, screenshots, logs, and command feedback. The essence of this advice is not "run more tests" but turning the verification path into an environmental signal the model can use. The Coding Mentor data closed loop should go one step further and preserve these environmental signals as evidence for later evaluation and training.

Engineering Practice 4: The eval set should be maintained like a test set, not collected like a document

Many teams build an "AI assessment question bank" but maintain it like a document: additions are occasional, versions are few, coverage is unclear, and no one tracks expiry. Such question banks quickly lose their force.

Private eval sets should be maintained like test sets.

| Maintenance action | Engineering meaning |
| --- | --- |
| Versioning | Every task addition, deletion, or modification is traceable |
| Coverage statistics | Know which languages, modules, risk types, and capability dimensions the eval covers |
| Train/eval isolation | Prevent training data from contaminating evaluation data |
| Flakiness monitoring | Unstable tasks cannot serve as a basis for judging model capability |
| Difficulty calibration | Avoid a set that is all trivial bugs or all extreme problems |
| Retirement of invalid tasks | Outdated architectures, obsolete APIs, and historical workarounds must exit |

Each task in the eval set should carry the reason it is worth evaluating. For example:

| Task | Reason to evaluate |
| --- | --- |
| Fix a payment callback idempotency issue | Tests whether the model can identify concurrency and duplicate-message risks |
| Patch tests for cache penetration | Tests whether the model understands boundary inputs and abnormal paths |
| Refactor the order state flow | Tests whether the model preserves behavior and respects state machine constraints |
| Handle third-party API timeouts | Tests whether the model can distinguish retries, degradation, and error propagation |

If an eval task has no stated reason, it becomes hard to judge later whether it is still worth keeping.
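Among the maintenance actions above, flakiness monitoring is the easiest to automate. A minimal sketch, assuming each task is rerun several times against the same revision; the 10% threshold is an illustrative choice, not a standard.

```python
def flakiness_rate(results: list[bool]) -> float:
    """Fraction of reruns that disagree with the majority outcome for one task."""
    if not results:
        return 0.0
    majority = results.count(True) >= len(results) / 2
    return sum(r != majority for r in results) / len(results)


FLAKY_THRESHOLD = 0.1  # illustrative; tune per team


def is_flaky(results: list[bool]) -> bool:
    # A task whose repeated runs disagree too often cannot judge model capability.
    return flakiness_rate(results) > FLAKY_THRESHOLD
```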

Engineering Practice 5: Training candidate samples should be few and hard, not too many and scattered

When training data comes up, many teams instinctively pursue quantity. I value hardness more.

By hardness I mean that a sample contains real engineering constraints, a clear error mode, an explicit reason for the correction, and reliable verification evidence. One hard sample can be worth more than twenty generic question-and-answer pairs.

Samples suitable for entering the SFT candidate set usually have the following characteristics:

| Feature | Judgment criteria |
| --- | --- |
| The task is real | From real issues, real PRs, real defects, not made-up questions |
| Context is appropriate | Contains the necessary background without relying on large amounts of undisclosed detail |
| Corrections are clear | Human modifications are not stylistic preferences but solve specific engineering problems |
| Fully verified | Supported by tests, CI, review, or online feedback |
| Transferable | Experience transfers to similar tasks rather than one-off business exceptions |
| Risk-clean | Desensitized, no licensing issues, does not contaminate eval |

Samples that are not suitable for entering the training set must also be identified.

| Sample type | Why it is unsuitable |
| --- | --- |
| Output judged "okay" by human intuition alone | Lacks verifiable evidence |
| Patches that finally passed but with a chaotic process | May train bad strategies into default behavior |
| Traces containing customer data or production logs | High security risk |
| Complete answers from public benchmarks | Contaminates evaluation |
| Modifications reflecting only personal coding style | Low transferable value |
| Historical fixes for deprecated architectures | Would solidify expired experience |

Part 8 discusses in detail how to process these candidate samples into SFT data. This article gives only one principle: training data should not feed the model "successful results"; it should feed the model "the engineering judgment behind the successful results".
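A first-pass eligibility filter over the two tables above might look like this sketch; the flag names are illustrative and assumed to be set by gating and mentor review.

```python
def sft_eligible(sample: dict) -> bool:
    """Apply the 'few and hard' bar (flag names illustrative, set upstream)."""
    required = ["real_task", "context_sufficient", "correction_explained",
                "fully_verified", "transferable", "risk_clean"]
    disqualifying = ["intuition_only", "chaotic_process", "contains_customer_data",
                     "benchmark_answer", "style_only", "deprecated_architecture"]
    return (all(sample.get(k) for k in required)
            and not any(sample.get(k) for k in disqualifying))
```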

Engineering Practice 6: Preference data comes from controversy, not from pretty answers

SFT is suited to teaching a model "how it should be done". But many engineering questions have no single answer: should this abstraction be torn down, which layer should the cache sit at, should errors be swallowed or propagated upward, should testing be unit or integration, should the refactoring happen now or be split into follow-up tasks?

These scenarios are better suited to preference data. Preference data is not simply "answer A is better than answer B"; it is "under what constraints A is better than B".

A good preference sample usually comes from a review dispute, not from code that sailed through on the first pass.

| Scenario | Preference judgment |
| --- | --- |
| AI proposes a massive refactoring; humans demand a minimal fix | The current goal is to stop the bleeding; refactoring risk outweighs the benefit |
| AI abstracts a common component; humans retain the local implementation | There is only one call site; the lifecycle cost the abstraction introduces is not worth it |
| AI adds caching to improve performance; humans reject it | Data consistency requirements are high and the cache invalidation strategy is incomplete |
| AI adds many snapshot tests; humans require behavioral assertions | Snapshots solidify implementation details and cannot prove business semantics |

These contested samples are valuable because they make the team's engineering taste explicit. For a model to become a reliable collaborator in the team, it must not only write correct code but learn to make choices within constraints.
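A preference sample that carries its constraints might be recorded as below; all values are hypothetical illustrations.

```python
preference_sample = {
    # "Under what constraints is A better than B": the constraints are part of the label.
    "context": "Hotfix for a production incident in the order service",
    "constraints": ["stop the bleeding first", "minimize blast radius", "refactor later"],
    "chosen": "three-line guard clause limited to the failing code path",
    "rejected": "module-wide refactoring that also fixes the bug",
    "reason": "Refactoring risk outweighs its benefit under incident pressure",
    "source": "review dispute",  # disputes, not smooth passes, produce preference data
}
```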

Engineering Practice 7: Knowledge base is not a trash can for training sets

Plenty of data that cannot enter training can still enter the team knowledge base: architectural decisions, review rules, common error patterns, migration guides, module boundary descriptions. These influence the AI's next actions through RAG or prompt context.

But the knowledge base must not become the training set's trash can. That a piece of data is unsuitable for SFT does not mean it can simply be dumped into the knowledge base. Knowledge base content enters the model's context, and poor quality contaminates the output just the same.

Knowledge base entries should ideally meet four conditions:

| Condition | Explanation |
| --- | --- |
| Clear rule | Can guide subsequent tasks rather than just recording historical facts |
| Clear life cycle | Know when it expires and who is responsible for updating it |
| Clear scope of application | Identifies modules, languages, scenarios, exceptions |
| Associated with eval | Important rules ideally have corresponding eval tasks to verify them |

For example, "the order service must not query the inventory database directly" is a knowledge base rule; "a certain order service caused problems because it queried the inventory database directly" is review material; "the AI once wrote this faulty code" is a trace. The three are related but must not be mixed into one data type.
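An entry meeting the four conditions might look like the following sketch; the dates, owners, and identifiers are hypothetical.

```python
kb_entry = {
    "rule": "The order service must not query the inventory database directly",
    "scope": {"modules": ["order-service"],
              "exceptions": ["offline batch reconciliation"]},       # hypothetical exception
    "lifecycle": {"review_by": "2026-09-30", "owner": "platform-team"},  # hypothetical
    "linked_eval": "arch-constraint-017",  # hypothetical eval task verifying the rule
    # The incident that motivated the rule stays in trace/review data, not here.
    "provenance": "incident review",
}
```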

Organizational Division of Labor and Maturity Model

Organizational division of labor: Who owns this closed loop

The data closed loop is not a one-team show. It spans product, R&D, test, security, platform, and model teams. Without clear ownership, you end up with a dashboard no one maintains.

I recommend using the following boundaries of responsibility.

| Role | Main responsibilities |
| --- | --- |
| Business leader | Define real acceptance criteria; determine task value and risk level |
| Developer | Document AI participation scope, key context, failures and fixes, verification evidence |
| Reviewer / Coding Mentor | Provide structured feedback; label error types and transferable experience |
| QA / test lead | Maintain verification evidence, flaky tags, regression tasks |
| Security and compliance | Define desensitization, permissions, retention periods, untrainable boundaries |
| Platform engineering | Build the trace store, data routing, eval runner, quality gates |
| Model / AI engineering | Use the data for prompt, RAG, SFT, RFT, or toolchain optimization |

The most critical role is the Coding Mentor. It is not equivalent to a senior engineer, nor to a model training engineer. The person must understand the engineering context and also know what feedback has learning value for the model. What many teams lack is not AI tools but this intermediate role that understands engineering, understands feedback, and understands data boundaries.

Maturity Model: From Personal Habits to Organizational Flywheel

Closed-loop construction cannot happen in one step. The maturity path divides roughly into four levels.

[Figure: AI Coding Mentor data closed-loop maturity roadmap]

The first level is personal habit. Developers themselves document the scope of AI involvement, failure cases, and reasons for corrections. This stage needs no platform, only discipline. The goal is to let an individual review why an AI collaboration succeeded or failed.

The second level is team norms. PR templates, review tags, AI usage records, and verification evidence begin to be unified. At this stage comparable data appears, and the team can see which tasks the AI is reliable on and which it frequently reworks.

The third level is the platform closed loop. The trace store, eval runner, data masking, sample routing, and quality gating start to be automated. The team no longer relies on manual sorting; daily delivery continuously generates candidate data.

The fourth level is model and toolchain optimization. Private eval, the SFT candidate set, preference data, the knowledge base, and prompt versions form a closed loop. Model upgrades, prompt changes, and toolchain adjustments all require private eval and online-indicator regression.

| Stage | Main goal | Minimum viable action | Don't rush into |
| --- | --- | --- | --- |
| Personal habit | Make AI collaboration replayable | Save key prompts, failure logs, reasons for manual corrections | Training models |
| Team norms | Make data structures consistent | PR template, review tags, verification evidence fields | Fully automated collection |
| Platform closed loop | Make trajectories searchable, gated, routable | Trace store, desensitization, eval runner, sample versioning | Complex multi-model scheduling |
| Model optimization | Let data feed back into capability | Private eval, SFT/RFT candidate sets, A/B comparison | Blind pursuit of large-scale fine-tuning |

This maturity model has a practical premise: each level must generate value on its own. Personal habits improve review quality, team norms reduce repeated disputes, the platform closed loop cuts data collection costs, and model optimization improves the next round of delivery. If a level exists only to serve the next one, it will die halfway.

Anti-pattern: Where closed loops are most likely to fail

Anti-pattern 1: Treat AI logs as training data

Logs are not training data. Logs are just raw material.

A raw AI conversation may contain faulty context, expired constraints, sensitive information, half-finished reasoning, failed attempts, and ad hoc human instructions. Training on it directly means letting the model learn the team's most chaotic side.

The correct approach is to layer the logs:

| Layer | Processing method |
| --- | --- |
| Raw logs | Retained short-term for auditing and problem review |
| Structured traces | Facts extracted and linked to tasks, tools, diffs, tests, reviews |
| Candidate samples | Desensitized, deduplicated, quality-scored, manually spot-checked |
| Training/evaluation assets | Purpose, version, life cycle, and isolation relationships made explicit |

Logs that have not been structured and gated are at most debugging material, not model assets.
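Under those assumptions, the promotion path through the layers can be made explicit, as in this sketch; gate() and register() stand in for the team's own pipeline stages and are passed in rather than invented here.

```python
def structure(raw: dict) -> dict | None:
    """Layer 2: extract linked facts; return None if nothing replayable survives."""
    keys = ("task", "context", "tool_calls", "diff", "tests", "review")
    facts = {k: raw[k] for k in keys if k in raw}
    return facts if "task" in facts and "diff" in facts else None


def promote(raw: dict, gate, register) -> str:
    """Walk one log up the layers; each step can stop the promotion."""
    trace = structure(raw)
    if trace is None:
        return "raw-only"        # layer 1: audit retention, then expiry
    candidate = gate(trace)      # layer 3: desensitize, dedupe, score, spot-check
    if candidate is None:
        return "trace"           # replayable fact, but not a model asset
    return register(candidate)   # layer 4: versioned asset with explicit purpose
```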

Anti-pattern 2: Only reward the "final pass"

If the team treats only output that "finally passes the tests" as good samples, the model learns a dangerous preference: as long as it passes in the end, the process does not matter.

Software engineering is not exam-taking. An implementation that finally passes may do so by expanding scope, bypassing interfaces, adding hidden state, sacrificing performance, or creating downstream maintenance costs. Real teams care about "passing maintainably", not "passing by luck".

Sample scoring therefore cannot look only at results. Look at five dimensions at once:

| Dimension | Question |
| --- | --- |
| Correctness | Does the function meet the acceptance criteria? |
| Minimality | Is the modification scope reasonable? Were irrelevant changes introduced? |
| Maintainability | Does the structure fit the team's long-term evolution direction? |
| Verifiability | Are sufficient tests and evidence provided? |
| Constraint compliance | Does it respect security, performance, compatibility, and architectural boundaries? |

This is why human mentor feedback matters. Automated tests can judge many things, but they cannot fully judge engineering choices.
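One way to encode "results are necessary but not sufficient" is a weighted score with a hard floor on every dimension, as in this sketch; the weights and the floor are illustrative.

```python
WEIGHTS = {"correctness": 0.30, "minimality": 0.15, "maintainability": 0.20,
           "verifiability": 0.20, "constraint_compliance": 0.15}  # illustrative


def sample_score(scores: dict[str, float]) -> float:
    """Weighted score over the five dimensions (each in [0, 1]).
    Correctness alone cannot qualify a sample: any near-zero dimension disqualifies."""
    if min(scores[d] for d in WEIGHTS) < 0.2:
        return 0.0
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```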

Anti-pattern 3: Use public benchmarks to replace team private tasks

Public benchmarks help the team compare base model capability, but they cannot replace the team's private tasks. In AI programming especially, whether a model "can write code" is only the entry ticket; whether it "understands your code base, your constraints, and your release process" determines whether it can enter real delivery.

The value of SWE-bench is that it brings real GitHub issues, code repositories and tests into the evaluation, approaching real software engineering. But for a specific team, the most critical assessment tasks should come from its own historical bugs, architectural constraints, test escapes, and review disputes.

I suggest letting public benchmarks answer only two questions:

  1. Does this model have the basic capabilities to enter the team trial?
  2. Is there any obvious degradation in general capabilities after the model is upgraded?

Team-private evals answer the more critical questions:

  1. Can this model reliably locate problems in our repositories?
  2. Does it respect our architectural boundaries and security red lines?
  3. Does it reduce duplicated work for reviewers, or does it create new rework?
  4. Can it keep improving from the Mentor signals we give it?

Anti-pattern 4: Make the Coding Mentor an additional burden on a few experts

If all structured feedback depended on the handwriting of a handful of experts, the system would break down quickly. Expert time is too expensive to spend on data housekeeping.

The correct approach is to layer.

| Layer | Who is responsible | Degree of automation |
| --- | --- | --- |
| Basic fact collection | System | High, automatically extracted from PRs, CI, agent traces |
| General labeling | Developers and reviewers | Medium, completed through templates and preset tags |
| High-value sample judgment | Coding Mentor | Low, manual judgment of teaching value and boundaries |
| Data set spot checks | AI / platform / security jointly | Medium, automatic scanning plus manual sampling |

Experts should handle only the high-value judgments: Does this sample represent a class of engineering capability? Is this error typical enough? Is this fix transferable? Has this experience expired? Everything else should be automated as far as possible.

90-day roadmap, indicator system and integration boundaries

An actionable 90-day roadmap

If a team wants to start this closed loop today, I do not recommend launching a half-year platform project. Start by running it for 90 days.

Days 1-30: Make AI collaboration auditable

The goal is not to collect big data, but to leave a minimal record for every AI-assisted delivery.

Actions:

  1. Modify the PR template to include AI participation scope, key context, verification evidence, failure and correction, and data routing recommendations.
  2. Define 6-8 error type tags, and no more than 10.
  3. Pilot in 2 real projects, not company-wide.
  4. Review 5 AI collaboration cases each week and judge which carry Mentor value.

Acceptance criteria:

| Indicator | Target |
| --- | --- |
| Completeness rate of AI participation records in PRs | Above 70% |
| Reviewable cases per week | At least 5 |
| First version of error-type labels | Covers 80% of frequent issues |
| Clear discard rules | Covers at least three categories: sensitive information, irreproducible, low value |

Days 31-60: Building a private eval seed set

The goal is to solidify high-value tasks from real delivery into regressible evaluations.

Actions:

  1. Pick 20-50 tasks from historical bugs, review controversies, and test escapes.
  2. Complete question stems, warehouse versions, acceptance criteria, test commands and reference fixes for each task.
  3. Make train/eval isolation rules explicit.
  4. Select 2-3 models or toolchain versions for offline evaluation.

Acceptance criteria:

| Indicator | Target |
| --- | --- |
| Number of eval tasks | 20-50 |
| Reproducibility of each task | 100% |
| Flaky task labeling | 100% of tasks labeled |
| Model comparison report | At least 1 |

Days 61-90: Connect sample routing and feed improvements back

The goal is for the data to start feeding back into the toolchain, rather than just into reports.

Actions:

  1. Create a minimal trace store; structured files or internal tables are enough at first, with no need to build a complex system up front.
  2. Route delivery trajectories into evals, SFT candidates, preference data, the knowledge base, and the discard zone.
  3. Ship one prompt, RAG, or toolchain fix for a high-frequency error.
  4. Verify the fix's effect with a private eval regression.

Acceptance criteria:

| Indicator | Target |
| --- | --- |
| Data routing coverage | More than 60% of AI PRs in pilot projects |
| SFT candidate samples | 30-100 hard samples |
| High-frequency error correction | At least 1 error type shows a significant decrease |
| Eval regression mechanism | Runs stably before and after toolchain changes |

Whether to platformize is decided after the 90 days. If PR templates, the eval seed set, and sample routing cannot run, hastily building a platform only solidifies process problems into system problems.

Indicator system: Don’t just ask how much time AI saves

Of course the productivity indicators of the AI programming assistant still matter, but the Coding Mentor data closed loop needs several other groups of indicators as well.

| Indicator group | Representative indicators | Explanation |
| --- | --- | --- |
| Delivery efficiency | Lead time, completion time of AI-assisted tasks, PR cycle time | Determines whether AI actually helps delivery |
| Engineering quality | CI failure rate, review defect density, defect escapes, rollbacks | Determines whether AI creates quality debt |
| Collaboration burden | Number of manual corrections, review rounds, frequency of repeated feedback | Determines whether AI reduces the mentor burden |
| Data assets | Number of usable evals, number of hard samples, proportion of samples passing gates | Determines whether the closed loop produces reusable assets |
| Model improvement | Private eval improvement, reduction of high-frequency errors, toolchain regression stability | Determines whether the data actually feeds back into capability |

The most misleading indicator is "AI participation rate." High participation does not mean high value: a team can have AI write a great deal of code while reviewers grow more tired, defects multiply, and the architecture gets messier. What really matters is whether AI participation reduces repeated errors and makes engineering judgment more reusable.

Integration boundaries with existing engineering systems

This closed loop should not be reinvented from scratch. It should be embedded in the existing engineering systems.

| Existing system | Integration method |
| --- | --- |
| Issue system | Stores requirements, acceptance criteria, defect classification, business priorities |
| Git/PR | Stores diffs, reviews, merge status, AI participation fields |
| CI/CD | Stores test, build, security scan, deployment results |
| Logging/monitoring | Stores online feedback, error rates, performance changes |
| Documentation system | Stores architectural constraints, specifications, reviews, the knowledge base |
| Model platform | Runs evals, prompt versions, SFT/RFT experiments, A/B comparisons |

The boundaries must also be clear. The AI data closed loop does not replace ALM, CI, code review, or the knowledge base; it connects the key signals across these systems. It is more like an "AI engineering learning layer" responsible for converting delivery facts into assets the model can learn from and the team can evaluate.

What problem does this closed loop really solve?

By this point the article may make the whole thing look heavy. It is indeed heavier than "open the AI tool and start writing code". But it solves problems the latter can never solve.

First, it makes AI progress attributable. Did the model get better because the model changed, the prompt changed, context was added, training samples improved, or the tasks got easier? Without evals and traces, the team can only guess.

Second, it makes human experience reusable. The judgment of senior engineers in each review no longer only serves the current PR, but becomes an asset for subsequent models, knowledge bases and evals.

Third, it makes AI risks governable. Sensitive data, expired experience, benchmark contamination, and unreproducible samples no longer rely on personal awareness, but enter the gate control system.

Fourth, it makes model optimization serve delivery rather than scores. If a gain on a public leaderboard does not reduce the team's rework, it is not a real gain for the team.

Fifth, it allows Coding Mentor to transform from individual capabilities to organizational capabilities. A person who can train AI is valuable, but a team that can stably produce Mentor signals is even more valuable.

Conclusion: Build the feedback system first, then talk about the training system

My advice is clear: do not start from "we are going to train a team-specific model." Start from "can we turn the process of AI participation in delivery into a reliable feedback system?"

If the team has not yet documented why the AI failed, do not rush into SFT. If the team has no private eval, do not trust the fine-tuned score. If the team has no data gating, do not treat collaboration logs as an asset. If reviewer judgment still stops at "this doesn't work here", first teach humans to write Mentor signals.

The real order should be:

  1. Make the AI delivery process observable.
  2. Make human feedback structured.
  3. Make data routing gated.
  4. Make private eval returnable.
  5. Keep training data small and hard.
  6. Let models and toolchains continue to improve based on evaluation.

This is what this article means by a closed loop: AI-assisted product engineering delivery is not the end point but the production site of training data and evaluation data; the human Coding Mentor is not the final acceptor but the designer of the feedback signals; SFT is not pouring logs into the model but precipitating gated engineering judgment into learnable assets.

The next article (Part 8) turns to a more concrete question: once these trajectories, feedback, and candidate samples exist, how to clean, filter, and label them, convert them into high-quality SFT data, and connect them to the training pipeline. After that, the final Part 9 returns to long-term evolution and future prospects.
