Article
From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop
The real organizational value of AI programming assistants is not just faster delivery, but the trainable, evaluable, and reusable mentor signals that can be captured from every requirement breakdown, code generation, review and revision, test verification, and post-release review. This article reconstructs the closed loop of AI training, AI-assisted product engineering delivery, high-quality SFT data accumulation, and model evaluation.
Copyright statement and disclaimer: This article is an original interpretation based on public materials from GitHub, Anthropic, LangChain, OpenAI, SWE-bench, and SWE-agent. Copyright of the original texts belongs to their authors and sources. This article is neither an official translation nor a representation of the views of the above institutions.
Originality: The closed-loop architecture, data routing model, Mentor signal definitions, quality gating, and implementation route in this article are the author's original reconstruction. External materials serve only as sources of arguments and industry cases and are not translated paragraph by paragraph.
Beginning: What is really lost is not the code, but the process
A team has involved AI in daily development for half a year: AI drafts the breakdown when requirements are dismantled, AI writes the first version of feature implementations, AI analyzes logs when tests fail, and AI scans the PR before code review. Half a year later, there is more code in the repository, more commits in the PRs, and more build records in the CI system. The question is: can any of this be used, in turn, to train an AI that understands the team better?
In most cases, no.
The reason is not that AI-generated code has no value, but that the team saves only the delivery results, not the delivery process. Why did a failed implementation fail? Which business constraints did the AI miss? Why did the human reviewer reject a patch that seemed to work? After a test failed, which log actually pointed to the root cause? This information is usually scattered across chat windows, terminal history, PR comments, and people's heads. Once delivery is complete, it is discarded just like a temporary debug branch.
I prefer to put it more directly: if an AI-assisted piece of development leaves behind only the final code, it is merely a delivery; if it also leaves behind the requirements, context, plans, tool calls, failure logs, manual corrections, and acceptance evidence, it has a chance to become a reusable Coding Mentor training sample.
The previous articles discussed how to evaluate AI, how to design tasks, and how to organize collaboration and review cases. Article 7 cannot simply continue as "how to use an AI programming assistant", nor should it stop at "building an evaluation platform". This article answers a question closer to the engineering system: how to involve AI in the entire product engineering delivery process, and design that involvement as a closed loop that continuously accumulates high-quality training and evaluation data.
Four things happen simultaneously in this closed loop:
- AI participates in real product delivery, not doing toy problems in a sandbox.
- Human Coding Mentor gives structured feedback at key nodes instead of just saying “This section doesn’t work.”
- Delivery traces are routed to eval, SFT, preference data, the knowledge base, or the discard pile, rather than being lumped together in logs.
- Models, prompts, toolchains, and team processes are continuously adjusted based on evaluation results, rather than upgrading based on personal taste.
The core judgment of this article is that being able to write code with AI is not a barrier; being able to transform the entire process of AI participation in delivery into training assets, evaluation assets, and organizational memory is the real barrier for teams to serve as Coding Mentors for AI.
Problem boundaries and value judgments
Calibrate the direction first: this article is not a preview of Part 8
This series already has an article dedicated to SFT data generation. That article will go into more detailed engineering implementation: data structure, cleaning, quality scoring, export format, and training process. This article does not repeat those details.
The responsibility of Part 7 sits further upstream: first design "where the data comes from, why it is trustworthy, how it is routed, who is responsible for judgment, what must never enter the training set, and how evaluation in turn affects delivery." If you think of Part 8 as a data processing factory, this article is about the raw material production line, the quality inspection rules, and how the closed loop is operated.
I suggest using the following boundary to separate the two articles:
| Article | Focus | Core question | Out of scope |
|---|---|---|---|
| Article 7 | Data closed loop in AI engineering delivery | How every AI collaboration can become a trainable, evaluable, and reusable Mentor signal | Specific training scripts, export formats, and model fine-tuning parameters |
| Article 8 | SFT data generation pipeline | How to process the screened engineering assets into training samples and connect them to the training process | Organizational closed loop, data ownership, online delivery indicator system |
This boundary is important. When many teams talk about “training data”, they immediately jump to JSONL, SFTTrainer, LoRA, and evaluation scripts. That’s the question in the second half. If the first half is not designed well, the second half will only process the dirty data more beautifully.
The value of AI programming assistant is not equal to the value of Coding Mentor
GitHub’s early research on Copilot focused on developer speed and experience: AI can help developers complete tasks faster and improve subjective satisfaction. Later, enterprise research by GitHub and Accenture further put the perspective into enterprise practice, measuring the impact of AI programming tools based on dimensions such as telemetry, code quality, and developer feedback. This type of research is valuable because it proves that the involvement of AI in software delivery is not a gimmick but will actually change the development process.
But there is a bifurcation here that is easily overlooked: improving delivery efficiency and having a team with Coding Mentor capabilities are two different things.
When using AI programming assistants, the team’s concern is “Can AI help me complete the task faster?” When working as a Coding Mentor for AI, the team is concerned about “what capability boundaries of AI have been exposed through this collaboration, and can I transform this boundary into a trainable, evaluable, and reusable feedback signal?” The former is the ability to consume models, and the latter is the ability to build models.
These two goals are not in conflict, but engineering design is completely different.
| Goal | Main action | Typical outputs | Risk |
|---|---|---|---|
| Use AI to improve delivery efficiency | Let AI write code, explain errors, generate tests, and assist in reviews | PR, commit, test results, delivery report | Leave only the results, not the process |
| Be a Coding Mentor for AI | Record AI inputs, decisions, failures, revisions, validations, and human judgments | Structured trajectories, error labels, evaluation tasks, training candidate samples | Mistaking all logs as training data |
The former can be measured by delivery cycle, task completion speed, developer satisfaction, and PR rework rate. The latter measurement indicators are more difficult: whether the AI reduces repeated mistakes, whether it is better at using team norms, whether it can make more stable choices in similar contexts, whether it can pass private eval sets, and whether it can internalize problems repeatedly pointed out by human reviewers into subsequent behaviors.
A mature team should not separate these two lines. The right direction is: use AI-assisted delivery to create real task scenarios, use real task scenarios to produce auditable trajectories, use human mentor signals to screen data, and then use training and evaluation to in turn improve the quality of the next round of delivery.
Closed Loop Overview: From a PR to a Training Flywheel
If you think of AI-assisted development as an ordinary tool call, the process is usually very short: the developer asks a question, the AI gives the code, the human copies the modifications, the test passes, and the PR is submitted. This process increases speed, but it does not produce reusable organizational assets.
To transform it into a Coding Mentor data closed loop, the process must have three more layers: observation layer, judgment layer and routing layer.
The first layer is the observation layer. What the team wants to capture is not “what the AI finally said”, but how the AI acted in the engineering environment: what files it looked at, what context it relied on, what plan it proposed, what commands it executed, how it corrected after failure, and what the final diff and test results were.
The second layer is the judgment layer. Human Coding Mentors do not just accept or reject the code; they turn their judgments into learnable signals: is this error a requirement misunderstanding, a context omission, an interface contract violation, insufficient test design, or a wrong engineering choice? Why is the correct fix in this direction? Can this experience transfer to other tasks?
The third layer is the routing layer. Not all trajectories can be included in the training set. A delivery track may be more suitable for entering a private eval, it may only be suitable for entering the team knowledge base, or it may have to be discarded because it contains sensitive information. Routing determines the fate of data assets.
This set of closed loops can be summarized in one sentence:
The delivery process is responsible for generating real tasks, Mentor feedback is responsible for creating high-quality supervision signals, data routing is responsible for determining usage, and evaluation results are responsible for driving the next round of delivery strategy adjustments.
Anthropic has repeatedly emphasized in its agent-system practice that you should not build a complex autonomous agent from day one, but start with a simple, controllable workflow and gradually add tools, memory, and autonomy. That advice holds here too. The team does not need to build a fully automated training platform on day one. The real first step is to make the process of AI participating in delivery observable, replayable, and annotatable.
LangChain's discussion of the agent improvement loop offers a useful engineering perspective: a trace is not a debugging by-product but the entry point of the agent improvement loop. Without traces, failure analysis relies on impressions; with traces, the team can cluster failures, construct datasets, run evals, and then drive system improvement.
Data Acquisition and Mentor Signals
Data collection points in the delivery process: Don’t just focus on prompts and answers
Many people say “Save the AI dialogue and use it for training later.” This sentence is only half correct. Conversations are certainly important, but in software engineering scenarios, the most valuable signals are often not found in chat text, but in context, tool feedback, and human corrections.
A real AI-assisted delivery should be viewed as at least eight stages.
| Stage | Data to collect | What the Mentor needs to judge | Assets that may accumulate |
|---|---|---|---|
| Requirement intake | Original requirements, business goals, acceptance criteria, non-goals, risk constraints | Does the AI understand the real boundaries rather than just the literal feature? | Private eval task statements, requirement-understanding cases |
| Task planning | AI's breakdown, assumptions, dependencies, modification scope, and verification plan | Does the plan miss migration, compatibility, error paths, rollback strategy? | Planning capability samples, planning evaluation criteria |
| Context retrieval | Files read by the AI, referenced documents, search paths, missing files | Whether the model obtained the necessary context and whether it references outdated information | RAG index optimization, context selection samples |
| Code generation | Initial diff, alternatives, interface changes, test additions | Does the code comply with architecture, naming, error handling, and security requirements? | SFT candidate samples, bad-example samples |
| Tool execution | Shell commands, test logs, lint, type checking, build failures | Can the AI locate problems from environmental feedback instead of patching blindly? | Tool-usage traces, failure-repair cases |
| Human review | Reviewer comments, reasons for rejection, design disputes, modification suggestions | Which issues must be judged by humans and which can be automated | Preference data, review rules |
| Repair iteration | Failed attempts, corrected patch, final patch | What the correct repair path is and why the wrong attempts are wrong | DPO/RFT candidate data, error pattern library |
| Acceptance review | CI, regression testing, production results, defect escapes, rework records | Does the sample really represent "success", or did it just not blow up at the time? | Holdout eval, basis for down-weighting samples |
Here is a practical judgment: if the team can only save three types of data, I would prioritize "requirements and acceptance criteria", "failure trajectories and tool feedback", and "reasons for human corrections". The code repository will survive anyway; what is really easy to lose are these intermediate signals.
Don't turn this into a heavy form-filling process. Forms get bypassed by developers. A better approach is to embed data collection into existing engineering actions: PR templates, CI annotations, review bots, agent traces, issue state transitions, and test report archiving. Let developers write one more key judgment instead of filling in twenty more fields.
Mentor signal: Turn “I don’t think it works” into a structure that the model can learn
The AI won't learn much from a comment like "this is badly written". Human reviewers themselves often cannot articulate the basis of a judgment. The first skill of a Coding Mentor is to break implicit engineering judgments into explicit feedback signals.
I recommend splitting Mentor signals into six categories.
| Signal type | Question to answer | Typical annotations |
|---|---|---|
| Task understanding signals | Does the AI understand the real problem to be solved? | Requirement misunderstanding, non-goal expansion, omitted acceptance criteria |
| Context selection signals | Has the AI found the necessary context? | Interface contract ignored, old implementation referenced, callers not read |
| Implementation quality signals | Does the code meet team engineering standards? | Missing error handling, missing boundary conditions, excessive abstraction |
| Verification behavior signals | Does the AI know how to prove itself right? | No tests added, only the happy path run, type checking ignored |
| Repair strategy signals | How does the AI adjust when faced with failure? | Blind modification, partial repair, root cause located, rollback and redo |
| Organizational constraint signals | Does the AI respect the team's implicit rules? | Security red lines, performance budgets, compatibility strategies, release rhythm |
These signals cannot be too general. For example, “poor code quality” is not a good label because it cannot guide training and evaluation. A better way to write it is: “The error path does not retain the original exception, causing the upper-layer retry strategy to be unable to distinguish between temporary failures and permanent failures.” This sentence is a bit longer than “Error handling is bad”, but it contains the cause and effect chain.
When I look at AI-generated code, what I am most wary of are implementations that "run now but defer the cost". They pass the tests but implicitly violate the team's long-term constraints: synchronous calls stuffed into hot paths, temporary workarounds written as default logic, one-off migrations written as permanent branches. This kind of problem is the best candidate for the Mentor signal library, because it is hard to cover with ordinary benchmarks yet occurs repeatedly in real projects.
The quality of the Mentor signal determines the upper limit of subsequent data. Without structured signals, the training data is just “code modified by humans”; with structured signals, the training data becomes learnable evidence of “why humans changed it this way”.
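As a sketch of what "structured" can look like in practice, the record below captures one signal in the six categories above; the field names and the example evidence are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class SignalType(Enum):
    TASK_UNDERSTANDING = "task_understanding"
    CONTEXT_SELECTION = "context_selection"
    IMPLEMENTATION_QUALITY = "implementation_quality"
    VERIFICATION_BEHAVIOR = "verification_behavior"
    REPAIR_STRATEGY = "repair_strategy"
    ORGANIZATIONAL_CONSTRAINT = "organizational_constraint"

@dataclass
class MentorSignal:
    signal_type: SignalType
    label: str                # short, specific tag; "poor code quality" would be too generic to use here
    rationale: str            # the cause-and-effect chain a model could actually learn from
    evidence: list[str] = field(default_factory=list)  # files, tests, comments backing the judgment

signal = MentorSignal(
    signal_type=SignalType.IMPLEMENTATION_QUALITY,
    label="error path drops the original exception",
    rationale=(
        "The error path does not retain the original exception, so the upper-layer retry "
        "strategy cannot distinguish temporary failures from permanent ones."
    ),
    evidence=["services/payment/client.py", "review thread on the retry PR"],  # hypothetical references
)
```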
Data Contract: A delivery trace must answer at least ten questions
The team does not need to define a huge schema from the beginning, but there must be a minimum data contract. The goal of this contract is not to serve the database design, but to ensure that when looking back at the sample in the future, we can still judge whether it is trustworthy.
| Question | Why it must be answered |
|---|---|
| What is the true goal of this task? | Without the goal, there is no way to tell whether the AI output solves the right problem |
| What are the acceptance criteria? | Without acceptance criteria, a passing test may be an illusion |
| What context did the AI use? | Distinguishes insufficient model capability from insufficient context supply |
| What plan did the AI come up with? | The plan exposes the model's engineering judgment, not just its coding ability |
| Where did the initial output fail? | Failure points are among the most valuable training signals |
| What did humans modify? | The diff shows what changed, but not why |
| Why did humans change it this way? | This is the core of the Coding Mentor signal |
| What are the results of automated verification? | Prevents subjective satisfaction from being mistaken for success |
| Was it later reworked or rolled back? | Prevents short-term success from being mistaken for long-term success |
| Where should this trace go? | Decides whether it becomes eval, SFT, preference data, knowledge base content, or is discarded |
There is a principle behind this contract: training data must contain “causal information”, not only “result information”. Only outcome information encourages the model to imitate surface answers; causal information has the opportunity to allow the model to learn engineering judgment.
The research value of SWE-agent is not only that it allows the model to solve SWE-bench tasks, but also that it puts the action process of the software engineering agent into the “agent-computer interface”: the model does not output answers at once, but reads files, edits, runs tests, and observes feedback in the environment. For the closed data loop within the team, this perspective is closer to real engineering than a single round of prompt-answer.
Data routing: not everything should go into the training set
I object to the statement "collect all AI collaboration logs and use them for training later". It sounds like data-asset awareness, but it actually creates three problems: sensitive information leakage, low-quality sample contamination, and confusion between eval and train.
A more reasonable approach is to establish data routing. Each delivery trace, after passing quality gating, can enter only one or a few destinations.
| Destination | Suitable data | Unsuitable data | Main consumer |
|---|---|---|---|
| Private eval set | Real tasks, reproducible, clear acceptance, stable tests, answers not leaked | Flaky tests, vague requirements, samples already used for training | Model evaluation, toolchain regression |
| SFT candidate set | High-quality manual corrections, clearly explained, transferable, fully verified | Unreviewed AI output, occasional passes, one-off business details | Model fine-tuning, behavior demonstration |
| Preference data | Comparable alternatives, clear review reasons, clear boundaries between better and worse | No clear basis for preference, mere personal style differences | DPO/RFT, strategy-selection training |
| Team knowledge base | Architectural constraints, common mistakes, review rules, review conclusions | Keys, customer data, temporary workarounds | RAG, prompt context, engineering specifications |
| Discard zone | Sensitive, contaminated, irreproducible, low value, unverified | Nothing here should flow back into training or evaluation | Data governance, auditing |
The most common mistake here is to equate "humans corrected it" with "it can be used for training". A human correction only shows that the original output had a problem; it does not mean the final sample is suitable for learning. A patch may be a temporary fix to put out a fire, it may bypass the root cause, it may sacrifice long-term maintainability, or it may contain business logic that must not be leaked. Without routing and gating, the training set swallows this historical baggage along with everything else.
OpenAI later stopped using SWE-bench Verified as the main public evaluation standard for cutting-edge models. There was a practical reason behind it: public benchmarks would be repeatedly optimized by the community, causing contamination and overfitting, and also exposing the limitations of the test itself. The same goes for internal team eval sets. If the training set, parameter tuning set, and evaluation set are mixed together, an increase in indicators only means that the system is better at “memorizing questions”, but it does not mean that it is better at delivering.
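Keeping these destinations separate is easier when the routing decision is written as code rather than left as convention. Below is a minimal sketch, assuming upstream gating has already produced these flags; all field and destination names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GatedTrace:
    has_sensitive_data: bool
    reproducible: bool
    fully_verified: bool        # tests, CI, review, or acceptance evidence attached
    correction_explained: bool  # the human recorded why the fix goes in this direction
    has_alternatives: bool      # comparable options with a clear preference rationale
    reserved_for_eval: bool     # pre-marked to keep eval and training material disjoint

def route(trace: GatedTrace) -> set[str]:
    """Assign a gated delivery trace to destinations; the default is the discard zone."""
    if trace.has_sensitive_data:
        return {"discard"}
    destinations: set[str] = set()
    if trace.reserved_for_eval and trace.reproducible and trace.fully_verified:
        destinations.add("private_eval")       # never reused for training
    elif trace.fully_verified and trace.correction_explained:
        destinations.add("sft_candidate")
        if trace.has_alternatives:
            destinations.add("preference_data")
    if trace.correction_explained:
        destinations.add("knowledge_base")     # transferable rules can still feed RAG
    return destinations or {"discard"}
```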
Quality Gating: The brake of the data flywheel is more important than the accelerator
The word “data flywheel” is very exciting, as if as long as enough trajectories are collected, the model will naturally become stronger and stronger. In engineering it’s just the opposite. What really determines whether the flywheel can spin is not the acquisition speed, but the gating quality.
I recommend dividing the gate control into seven lanes.
| Gate | What it stops | Consequence of not stopping |
|---|---|---|
| Privacy and security | Keys, tokens, customer data, internal addresses, sensitive fields in production logs | Data assets become a source of security incidents |
| IP and licensing | Third-party code and restrictively licensed content that cannot be used for training | The usable scope of subsequent models is limited |
| Data contamination | Public benchmark answers and eval samples that have entered the training set | Evaluation metrics are distorted |
| Reproducibility | Unreproducible problems, missing environments, tests that cannot be run | The model learns unverifiable experience |
| Verification adequacy | No tests, no review, no acceptance evidence | "Looks right" is mistaken for "engineered right" |
| Teaching value | Only one-off business details, no transferable judgment | Adds noise without improving model capability |
| Life cycle | Expired schemas, temporary workarounds, deprecated APIs | Historical debt is trained into default behavior |
Among these seven gates, I value "life cycle" the most. Training data is not an archive. If expired experience is not retired, the model will keep adhering to old constraints the team has already abandoned. For example, after the team switches from REST to an event-driven architecture, the "synchronously query the downstream service" pattern in old samples should be down-weighted or removed. As organizational memory, a model has no sense of time; data governance must supply it.
Gating should not be entirely manual. Humans handle boundary decisions and spot checks, while automation handles the repetitive checks: sensitive-field scanning, license marking, test-stability statistics, sample deduplication, train/eval split checks, and expired-API detection. A human mentor's scarcest hours should be spent judging "does this sample have teaching value", not manually hunting for leaked tokens.
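A sketch of that automatable slice; the secret patterns, deprecated API names, and flakiness threshold are placeholders a team would replace with its own:

```python
import re

SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+"),
]
DEPRECATED_APIS = {"legacy_inventory_client", "sync_order_lookup"}   # hypothetical names

def automated_gates(sample_text: str, task_id: str, eval_task_ids: set[str],
                    flaky_rate: float) -> dict[str, bool]:
    """The repetitive half of gating; teaching value and borderline cases stay with humans."""
    return {
        "privacy": not any(p.search(sample_text) for p in SECRET_PATTERNS),
        "contamination": task_id not in eval_task_ids,   # train/eval split check
        "reproducibility": flaky_rate < 0.05,            # threshold is a placeholder
        "lifecycle": not any(api in sample_text for api in DEPRECATED_APIS),
    }
```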
Evaluation closed loop: the private eval is the team's engineering health check
If the training data determines what the model learns, eval determines what the team believes.
Public benchmarks are still important. SWE-bench introduces real GitHub issues into software engineering evaluation, which is closer to engineering reality than traditional algorithm questions. SWE-Gym further attempts to convert real issues, environments and trajectories into trainable tasks. These works indicate a direction: the evaluation of coding agents is moving from static questions to real warehouses, real environments, and real feedback.
But the team cannot look only at public benchmarks. Public benchmarks measure general capability; the team's private eval measures "whether this AI is reliable in our engineering system". The relationship between the two is like a health check report versus a probation period on the job: the former surfaces basic problems, the latter judges fitness for a specific position.
Team-private eval should cover at least four categories of tasks.
| Task type | Capability assessed | Sample source |
|---|---|---|
| Regression repair tasks | Can it locate and fix root causes in the real code base? | Historical bugs, production defects, CI failures |
| Architectural constraint tasks | Can it respect team boundaries, contracts, and performance budgets? | Review disputes, architecture decision records |
| Test reinforcement tasks | Can it supplement critical paths, boundary conditions, and error paths? | Test escapes, defect reviews |
| Refactoring trade-off tasks | Can it improve structure without changing behavior? | Historical refactorings, technical debt governance |
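To make these task categories concrete and regressible, each private eval task can be pinned down as a small record. The sketch below is one assumption about the minimum fields; the repository reference, command, and ids are made up:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTask:
    task_id: str
    category: str        # regression_repair | architectural_constraint | test_reinforcement | refactoring_tradeoff
    repo_ref: str        # repository plus pinned commit, so the task stays reproducible
    statement: str       # what the agent is asked to do
    acceptance_cmd: str  # command whose result decides pass or fail
    reason: str          # why this task is worth evaluating at all

task = EvalTask(
    task_id="orders-0042",
    category="regression_repair",
    repo_ref="internal/orders@a1b2c3d",   # hypothetical repository and commit
    statement="The payment callback applies the same event twice under retry; make it idempotent.",
    acceptance_cmd="pytest tests/payment/test_callback_idempotency.py -q",
    reason="Tests whether the model recognizes concurrency and duplicate-message risk.",
)
```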
OpenAI’s practice on eval emphasizes that evaluation should serve real business results, rather than just creating a beautiful score. In an AI programming scenario, private eval should not only look at the pass rate, but also whether the repair path is reasonable, whether the context reference is correct, whether it breaks architectural constraints, and whether verification evidence can be given.
I suggest dividing the evaluation indicators into three tiers.
| Tier | What it measures | Example indicators |
|---|---|---|
| Offline capability | The performance of a model or agent on a fixed set of tasks | Pass rate, repair success rate, failure type distribution |
| Collaboration process | AI delivery quality in real PRs and issues | Number of manual corrections, review defect density, CI failure rate |
| Business results | The long-term impact of AI on the engineering delivery system | Lead time, rollbacks, defect escape rate, maintenance cost |
These three layers cannot replace each other. Offline eval can return quickly, but is easily disconnected from real work; process indicators are close to the team’s daily routine, but are affected by fluctuations in task difficulty; business results are the most realistic, but feedback is slow. A mature closed loop requires looking at all three together.
New responsibilities of human Coding Mentor: from reviewer to signal calibrator
In traditional software engineering, the reviewer’s main responsibility is to prevent bad code from entering the trunk. In the closed loop of AI data, reviewers still have to guard the gate, but they have an additional responsibility: to transform their judgments into signals that can be absorbed by the system.
This changes how review comments are written.
| Ordinary review comment | Mentor-style review comment |
|---|---|
| More tests are needed here | Error-path tests are missing here; the current implementation only covers successful returns and cannot prove the retry strategy is correct on timeout or downstream 500 |
| This abstraction is unnecessary | This abstraction is used by only one call site but introduces a new lifecycle and configuration branch; the maintenance cost exceeds the reuse benefit |
| Don't query the database like this | This query runs on the hot path of the list page and lacks paging and index constraints; as data volume grows, the latency risk transfers to the user request path |
| This does not meet our standards | This interface bypasses the existing permission middleware and breaks the team's architectural constraint of "concentrate authentication at the gateway layer" |
Mentor-style comments are not about writing longer; they add three kinds of information: problem type, cause and effect, and a transferable principle. That way a comment can itself become a training sample, an eval criterion, or a knowledge base entry.
The team can divide Mentor feedback into several fixed fields, but do not make the process rigid.
| Field | Example |
|---|---|
| Question type | Missing context, boundary conditions, architectural constraints, security risks, insufficient validation |
| trigger evidence | Which file, which test, which log, which PR comment |
| Root cause judgment | AI did not read the caller, leading to the mistaken belief that the interface only serves a single scenario. |
| Correction strategy | First add the contract test, then change the implementation, and finally add the migration instructions. |
| Transferable experience | All consumers must be checked before modifying the shared contract |
| Data routing recommendations | Enter eval set, do not enter SFT, because it contains customer fields and needs to be desensitized before evaluation. |
The key here is not that “humans are more advanced than AI”, but that humans have mastered organizational constraints outside the model. Many engineering judgments do not exist in the public code: why the team does not upgrade a certain dependency, why an old interface is retained, why this module cannot import a certain package, and why a seemingly inefficient implementation is actually for compatibility with historical customers. The value of Coding Mentor lies in making these implicit constraints explicit.
Engineering Practice: From PR to Evaluation Closed Loop
Engineering Practice 1: Start with a PR template, not a training platform
Many teams want to build a platform as soon as they talk about closed loop. My advice is the opposite: start with a PR template and review spec.
The reason is simple. PR is the smallest audit unit of real project delivery. Requirements, code, testing, review, CI, and merge results can all be gathered here. Turning PR into a data closed-loop entrance is easier to implement than building an “AI data collection system” alone.
A PR template for the Coding Mentor data closed loop only needs five extra fields.
| Field | Purpose |
|---|---|
| AI participation scope | Distinguish whether the AI wrote code, generated tests, interpreted logs, assisted review, or only touched documentation |
| Key context | Record which files, documents, issues, and architectural constraints the AI used |
| Failure and correction | Save at least one valuable failed attempt and the reason for the human correction |
| Verification evidence | Linked tests, lint, screenshots, performance data, security scans |
| Data routing recommendation | Tag whether it can go into eval, SFT, preference data, the knowledge base, or must be discarded |
These five fields do not need to be filled every time. Low-risk changes can be abbreviated, while high-risk changes must be complete. The key point is to let the team develop a habit: AI is not a black box tool, and the process it participates in delivery should be auditable.
CI should also work with this set of templates. For example, when the PR is marked “AI generation core implementation”, CI can require more complete test evidence; when the PR modifies the shared contract, the system can remind the reviewer to check the consumer; when the PR is marked as an eval candidate, the system can archive issue, diff, test and review comments as candidate tasks.
This is not to burden the process, but to harden engineering judgments that would otherwise occur verbally into the system.
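As a sketch of how CI could enforce this template, assuming the five field headings above appear verbatim in the PR body; the label name and the escalation rule are examples, not policy:

```python
REQUIRED_SECTIONS = [
    "AI participation scope",
    "Key context",
    "Failure and correction",
    "Verification evidence",
    "Data routing recommendation",
]

def check_pr_body(body: str, labels: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the PR passes the template gate."""
    lowered = body.lower()
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s.lower() not in lowered]
    # Example escalation: PRs labeled as AI-generated core implementation must attach verification evidence.
    if "ai-core-implementation" in labels and "verification evidence" not in lowered:
        problems.append("AI-generated core implementation requires verification evidence")
    return problems
```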
Engineering Practice 2: The trace store should hold "replayable facts", not chat screenshots
If the team uses Claude Code, Copilot Agent, Cursor, Aider, OpenHands or internal coding agent, they will all encounter the same problem: the process of an AI collaboration is very long, and the content is scattered in the editor, terminal, browser, PR and chat interface.
This requires a trace store. It’s not a log bin, but a system that holds “replayable facts.”
The trace store must store at least five types of facts.
| Type | Content | Use |
|---|---|---|
| Input facts | User tasks, system prompts, context files, retrieval results | Determine what the model saw |
| Action facts | Plans, tool calls, file edits, command executions | Reconstruct what the model did |
| Environment facts | Test output, compilation errors, lint, run logs | Determine whether failure feedback was sufficient |
| Human facts | Review comments, manual modifications, acceptance conclusions | Provide Mentor signals |
| Result facts | Final diff, merge status, production feedback, rework records | Calibrate success and failure |
Don’t save non-auditable screenshots as your primary data source. Screenshots can aid understanding, but cannot serve as basic facts for training and evaluation. To be truly usable, data must be parsed, masked, retrieved, versioned, and correlated.
The trace store should also not keep everything permanently by default. Sensitive fields must be desensitized as soon as possible, low-value traces must be expired and cleared, and data entering eval or training candidates must have a version number. If a data system does not have life cycle management, it will soon become a historical black box that no one dares to use.
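A minimal sketch of such a record, organized around the five fact types and appended as versioned JSONL; the schema and paths are assumptions, and masking is assumed to happen before writing:

```python
import json
import time
from pathlib import Path

SCHEMA_VERSION = "0.1"

def append_trace(store_dir: str, trace_id: str, *, inputs: dict, actions: list,
                 environment: list, human: list, result: dict) -> Path:
    """Append one replayable trace; sensitive fields are assumed to be masked before this call."""
    record = {
        "schema_version": SCHEMA_VERSION,
        "trace_id": trace_id,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "inputs": inputs,            # task, prompts, context files, retrieval results
        "actions": actions,          # plan, tool calls, file edits, executed commands
        "environment": environment,  # test output, compile errors, lint, run logs
        "human": human,              # review comments, manual edits, acceptance verdicts
        "result": result,            # final diff, merge status, rework records
    }
    path = Path(store_dir) / "traces.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```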
Engineering Practice 3: Treat test failure as the most valuable mentor signal
In AI programming collaboration, test failures are not noise, but the cheapest supervision signal.
A patch that passes the test can only show that it meets the existing tests; a patch that is correctly repaired after failure can often tell us what the model originally misunderstood, how to recover from environmental feedback, and which types of errors are most likely to reoccur. The latter has a higher pedagogical value for training and assessment.
I recommend that teams specifically keep track of three types of failures.
| Failure type | Why it is valuable | Data use |
|---|---|---|
| Initial implementation failed | Exposes the model's default assumptions and common omissions | Error pattern library, SFT counterexample explanations |
| Repair failed after tool feedback | Exposes whether the model misreads logs or modifies blindly | Agent eval, tool-usage evaluation |
| Repair succeeded after a human pointed it out | Exposes how Mentor signals change the outcome | Preference data, repair-strategy training |
Here’s a detail: Don’t just save the final successful version. Keep at least one link between a failed version, evidence of the failure, human or environmental feedback, and the final fix. Without this link, the model can only learn “what a successful answer looks like” but cannot learn “how to go from failure to success.”
Anthropic's Claude Code best practices emphasize giving the model ways to verify its work, such as tests, lint, screenshots, logs, and command feedback. The essence of this advice is not "run more tests"; it is turning the verification path into an environmental signal the model can use. The Coding Mentor data closed loop should go a step further and preserve these environmental signals as evidence for later evaluation and training.
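One way to keep that chain as data rather than memory is a small record per attempt; the diff ids, test names, and comments below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RepairStep:
    attempt: str          # reference to the patch tried at this step, e.g. a diff id
    outcome: str          # "tests_failed" | "rejected_in_review" | "tests_passed"
    evidence: str         # failing test name, log excerpt, or review comment reference
    feedback_source: str  # "environment" or "human"
    feedback: str         # what the failure or the reviewer actually said

# The chain is the asset: initial failure -> environmental feedback -> human correction -> final fix.
chain = [
    RepairStep("diff-001", "tests_failed", "test_retry_on_timeout", "environment",
               "AssertionError: retried 0 times, expected 3"),
    RepairStep("diff-002", "rejected_in_review", "review comment on the retry PR", "human",
               "Retry added, but the original exception is swallowed; keep the cause chain."),
    RepairStep("diff-003", "tests_passed", "full CI run", "environment", "all checks green"),
]
```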
Engineering Practice 4: The eval set should be maintained like a test set, not collected like a document
Many teams will build an “AI assessment question bank”, but the maintenance method is like a document: it is added occasionally, there are few versions, the coverage is unclear, and no one keeps track of expiration. Such question banks will soon become ineffective.
Private eval sets should be maintained like test sets.
| Maintenance action | Engineering meaning |
|---|---|
| Versioning | Every added, deleted, or modified task has a traceable reason |
| Coverage statistics | Know which languages, modules, risk types, and capability dimensions the eval covers |
| Train/eval isolation | Prevent training data from contaminating evaluation data |
| Flakiness monitoring | Unstable tasks cannot be used to judge model capability |
| Difficulty calibration | Avoid a set of all trivial bugs or all extreme problems |
| Expiry and removal | Outdated architectures, obsolete APIs, and historical stopgaps must be retired |
Each task in the eval set should ideally carry a reason why it is worth evaluating. For example:
| Task | Reasons for evaluation |
|---|---|
| Fix payment callback idempotency issue | Tests whether the model can identify concurrency and duplicate-message risks |
| Supplement tests for cache penetration | Tests whether the model understands boundary inputs and error paths |
| Refactor the order status flow | Tests whether the model preserves behavior and respects state machine constraints |
| Handle third-party API timeouts | Tests whether the model can distinguish retry, degradation, and error propagation |
If an eval task has no stated reason, it will be hard to judge later whether it is worth keeping.
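A sketch of how two of these maintenance actions, coverage statistics and flakiness monitoring, could be computed over EvalTask-like records such as the one sketched earlier; the structures are assumptions, not a prescribed format:

```python
from collections import Counter

def maintenance_report(tasks, run_history: dict[str, list[bool]]) -> dict:
    """tasks: EvalTask-like records; run_history maps task_id to pass/fail across repeated identical runs."""
    coverage = Counter(t.category for t in tasks)
    flaky = [task_id for task_id, runs in run_history.items()
             if len(set(runs)) > 1]                # same task, different outcomes -> unstable
    return {
        "total_tasks": sum(coverage.values()),
        "coverage_by_category": dict(coverage),
        "flaky_tasks": flaky,                      # exclude these when judging model capability
    }
```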
Engineering Practice 5: Training candidate samples should be few and hard, not too many and scattered
When it comes to training data, many teams will naturally pursue quantity. I value hardness more.
The so-called hardness means that the sample contains real engineering constraints, clear error modes, clear reasons for correction, and reliable verification evidence. A hard sample may be more valuable than twenty generic questions and answers.
Samples suitable for entering the SFT candidate set usually have the following characteristics:
| Feature | Judgment criterion |
|---|---|
| The task is real | From real issues, real PRs, real defects, not made-up questions |
| Context is appropriate | Contains the necessary background without relying on large amounts of undisclosed detail |
| The correction is explained | Human modifications are not stylistic preferences but solve a specific engineering problem |
| Fully verified | Supported by tests, CI, review, or production feedback |
| Transferable | The experience transfers to similar tasks rather than one-off business exceptions |
| Risk-clean | Desensitized, no licensing issues, does not contaminate the eval |
Samples that are not suitable for entering the training set must also be identified.
| Sample type | Why it is unsuitable |
|---|---|
| Output judged "fine" purely by human intuition | Lacks verifiable evidence |
| A patch that finally passed but with a chaotic process | May train bad strategies into default behavior |
| Traces containing customer data or production logs | High security risk |
| Complete answers from public benchmarks | Contaminates evaluation |
| Modifications reflecting only personal coding style | Low transferable value |
| Historical fixes for deprecated architectures | Solidifies expired experience |
Part 8 will discuss in detail how to process these candidate samples into SFT data. This article only gives one principle: the training data does not feed the “successful results” to the model, but feeds the “engineering judgment behind the successful results” to the model.
Engineering Practice 6: Preference data comes from controversy, not from pretty answers
SFT is suitable for teaching a model “how it should be done”. But there is no single answer to many engineering questions: should this abstraction be torn down, which layer should the cache be placed at, whether errors should be swallowed or propagated upwards, whether testing should be done with units or integrated, and whether refactoring should be done this time or split into subsequent tasks.
These scenarios are more suitable for settling preference data. Preference data is not as simple as “answer A is better than answer B”, but rather “under what constraints is A better than answer B”.
A good preference sample usually comes from review disputes, not from code that passed smoothly in the first place.
| Scenario | Preference judgment |
|---|---|
| The AI proposes a large refactor; humans demand a minimal fix | The current task goal is to stop the bleeding; the risk of refactoring outweighs the benefit |
| The AI abstracts a common component; humans keep the local implementation | There is only one call site, and the lifecycle cost introduced by the abstraction is not worth it |
| The AI adds caching for performance; humans reject it | Data consistency requirements are high and the cache invalidation strategy is incomplete |
| The AI adds many snapshot tests; humans require behavioral assertions | Snapshots solidify implementation details and cannot prove business semantics |
These controversial samples are valuable because they make the team’s engineering tastes explicit. For a model to become a reliable collaborator in the team, it must not only write the right code, but also learn to make choices within constraints.
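A preference sample in the chosen/rejected form used by DPO-style training, with the constraint context carried explicitly; the field names and the example are illustrative:

```python
from dataclasses import dataclass

@dataclass
class PreferenceSample:
    task: str          # the shared task both options responded to
    constraints: str   # the conditions under which the preference holds
    chosen: str        # the option the reviewer accepted (diff reference or summary)
    rejected: str      # the option the reviewer declined
    rationale: str     # why chosen beats rejected under these constraints

sample = PreferenceSample(
    task="Fix incorrect totals on the order list page",
    constraints="Hotfix window; refactoring risk outweighs the benefit in this release",
    chosen="Minimal fix: correct the aggregation in the existing query",
    rejected="Refactor the pricing module and introduce a shared calculator abstraction",
    rationale="The task goal is to stop the bleeding; the larger refactor belongs in a follow-up task.",
)
```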
Engineering Practice 7: Knowledge base is not a trash can for training sets
A lot of data that cannot enter training can still enter the team knowledge base. Such as architectural decisions, review rules, common error patterns, migration guides, and module boundary descriptions. These can influence the AI’s next behavior through RAG or cue word context.
But the knowledge base cannot become a trash can for training sets. Just because a piece of data is not suitable for SFT, you cannot just throw it into the knowledge base. The content in the knowledge base will enter the model context, and poor quality will also contaminate the output.
Knowledge base entries should ideally meet four conditions:
| Condition | Explanation |
|---|---|
| Clear rule | Can guide subsequent tasks rather than merely recording a historical fact |
| Clear life cycle | Know when it expires and who is responsible for updating it |
| Clear scope of application | Identify modules, languages, scenarios, exceptions |
| Linked to eval | Important rules should ideally have a corresponding eval task verifying them |
For example, “The order service cannot directly call the inventory database” is a knowledge base rule; “A certain order service caused problems because it directly checked the inventory database” is review material; “AI once wrote this error code” is a trace. The three are related, but cannot be mixed into one type of data.
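A sketch of what a rule-type knowledge base entry could carry so that the four conditions are checkable; the rule text, owner, dates, and linked eval id are examples:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KnowledgeEntry:
    rule: str                    # phrased as guidance for future tasks, not as a historical fact
    scope: str                   # modules, languages, or scenarios where it applies
    exceptions: str              # known cases where the rule does not apply
    owner: str                   # who updates or retires it
    review_by: str               # date after which the rule must be reconfirmed or removed
    linked_eval_task: Optional[str] = None   # eval task that checks the rule is actually followed

entry = KnowledgeEntry(
    rule="The order service must not query the inventory database directly; go through the inventory API.",
    scope="order-service, all languages",
    exceptions="One-off data backfill scripts reviewed by the platform team",
    owner="platform-engineering",
    review_by="2026-06-30",
    linked_eval_task="orders-0042",   # hypothetical id reusing the earlier eval example
)
```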
Organizational Division of Labor and Maturity Model
Organizational division of labor: Who owns this closed loop
Data closed loop is not a one-man show of a certain tool team. It spans product, R&D, test, security, platform, and model teams. Without clear ownership, you end up with a dashboard that no one maintains.
I recommend using the following boundaries of responsibility.
| Role | Main responsibilities |
|---|---|
| Business leader | Define real test acceptance criteria and determine task value and risk level |
| Developer | Document AI engagement scope, key context, failure fixes, and verification evidence |
| Reviewer / Coding Mentor | Provide structured feedback, labeling error types and transferable experiences |
| QA/Test Leader | Maintain verification evidence, flaky tags, regression tasks |
| Security and Compliance | Define desensitization, permissions, retention periods, and untrainable boundaries |
| Platform engineering | Build trace store, data routing, eval runner, quality gate control |
| Model/AI Engineering | Use data for prompt, RAG, SFT, RFT or toolchain optimization |
The most critical one is the Coding Mentor role. This role is not equivalent to a senior engineer, nor is it equivalent to a model training engineer. He needs to understand the engineering context and also needs to know what feedback has learning value for the model. What many teams lack is not AI tools, but this kind of intermediate role that “understands engineering, understands feedback, and understands data boundaries.”
Maturity Model: From Personal Habits to Organizational Flywheel
Closed-loop construction cannot be achieved in one step. The mature path can be roughly divided into four levels.
The first level is personal habits. Developers themselves document the scope of AI involvement, failure cases, and reasons for corrections. This stage does not require a platform, only discipline. The goal is to allow individuals to review why an AI collaboration succeeded or failed.
The second level is team norms. PR templates, review tags, AI usage records, and verification evidence begin to be unified. This stage begins to produce comparable data, and the team can see which tasks the AI is reliable on and which tasks it reworks frequently.
The third level is platform closed loop. Trace store, eval runner, data masking, sample routing, and quality gating start to be automated. The team no longer relies on manual sorting, but continues to generate candidate data in daily delivery.
The fourth level is model and tool chain optimization. Private eval, SFT candidate set, preference data, knowledge base and prompt version form a closed loop. Model upgrades, prompt word changes, and tool chain adjustments all require private eval and online indicator regression.
| Stage | Main goal | Minimum viable action | Don't rush into |
|---|---|---|---|
| Personal habits | Make AI collaboration replayable | Save key prompts, failure logs, and reasons for manual corrections | Training models |
| Team norms | Make data structures consistent | PR template, review tags, verification evidence fields | Fully automatic collection |
| Platform closed loop | Make traces searchable, gated, and routable | Trace store, masking, eval runner, sample versioning | Complex multi-model scheduling |
| Model optimization | Let data feed back into capability | Private eval, SFT/RFT candidate set, A/B comparison | Blind pursuit of large-scale fine-tuning |
This maturity model has a realistic premise: each level must be able to generate value independently. Personal habits can improve review quality, team standards can reduce repeated disputes, platform closed-loop can reduce data collection costs, and model optimization can improve the quality of the next round of delivery. If one level only serves the next level, it can easily fall by the wayside.
Anti-pattern: Where closed loops are most likely to fail
Anti-pattern 1: Treat AI logs as training data
Logs are not training data. Logs are just raw material.
A raw AI conversation may contain faulty context, expired constraints, sensitive information, half-baked reasoning, invalid attempts, and ad hoc human instructions. Taking it directly for training is equivalent to letting the model learn the most chaotic side of the team.
The correct approach is to layer the logs:
| Layer | Handling |
|---|---|
| Raw logs | Retained short-term for audit and problem review |
| Structured traces | Facts extracted and linked to tasks, tools, diffs, tests, reviews |
| Candidate samples | After masking, deduplication, quality scoring, and manual spot checks |
| Training/evaluation assets | With explicit purpose, version, life cycle, and isolation relationships |
Logs that are not structured and gated are at most debugging materials, not model assets.
Anti-pattern 2: Only reward "finally passing"
If the team only regards the output that “finally passes the test” as a good sample, the model will learn a dangerous preference: as long as it can pass in the end, it doesn’t matter how the process goes.
Software engineering is not about answering questions. An implementation that finally passes may rely on expanding the scope, bypassing interfaces, adding hidden states, sacrificing performance, or creating subsequent maintenance costs. Real teams care about “passing maintainably”, not “passing by chance”.
Therefore, sample scoring cannot only look at the results. Look at at least five dimensions simultaneously:
| Dimension | Question |
|---|---|
| Correctness | Does the function meet the acceptance criteria? |
| Minimality | Is the scope of the change reasonable? Are irrelevant changes introduced? |
| Maintainability | Does the structure fit the team's long-term evolution direction? |
| Verifiability | Are sufficient tests and evidence provided? |
| Constraint compliance | Does it respect security, performance, compatibility, and architectural boundaries? |
This is why human mentor feedback is important. Automated testing can judge many things, but it cannot completely judge engineering choices.
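For the part of this scoring that can be automated, one sketch is to require a floor on every dimension instead of letting a strong result average away a weak one; the dimension names follow the table above and the threshold is a placeholder:

```python
DIMENSIONS = ("correctness", "minimality", "maintainability", "verifiability", "constraint_compliance")

def score_sample(scores: dict[str, float], floor: float = 0.6) -> tuple[float, bool]:
    """scores holds one value in [0, 1] per dimension; accept only if no dimension drops below the floor."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    accepted = all(scores[d] >= floor for d in DIMENSIONS)
    return overall, accepted
```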
Anti-pattern 3: Use public benchmarks to replace team private tasks
Public benchmarks can help the team compare the basic capabilities of the model, but they cannot replace the team’s private tasks. Especially for AI programming, whether a model “can write code” is only the threshold. Whether it “understands your code base, your constraints, and your release method” determines whether it can enter real delivery.
The value of SWE-bench is that it brings real GitHub issues, code repositories and tests into the evaluation, approaching real software engineering. But for a specific team, the most critical assessment tasks should come from its own historical bugs, architectural constraints, test escapes, and review disputes.
I suggest that the public benchmark only answers two questions:
- Does this model have the basic capabilities to enter the team trial?
- Is there any obvious degradation in general capabilities after the model is upgraded?
Team-private eval answers more critical questions:
- Can this model reliably locate problems in our repositories?
- Does it adhere to our architectural boundaries and security redlines?
- Does it reduce duplication of work for the reviewer, or does it create new rework?
- Can it continue to improve from the Mentor signals we give it?
Anti-pattern 4: Make Coding Mentor an additional burden for a few experts
If all structured feedback relied on hand-written notes from a handful of experts, the system would break down quickly. Expert time is too expensive to spend on collation.
The correct approach is to layer.
| Layer | Who is responsible | Degree of automation |
|---|---|---|
| Basic fact collection | System | High; extracted automatically from PRs, CI, and agent traces |
| General labeling | Developers and reviewers | Medium; completed through templates and preset tags |
| High-value sample judgment | Coding Mentor | Low; manual judgment of teaching value and boundaries |
| Dataset sampling | AI/platform/security jointly | Medium; automated scanning plus manual sampling |
Experts should only deal with high-value judgments: Does this sample represent a certain type of engineering capability? Is this error typical enough? Is this fix portable? Has this experience expired? The rest is automated as much as possible.
90-day roadmap, indicator system and integration boundaries
An actionable 90-day roadmap
If a team wants to start this closed loop today, I don’t recommend directly setting up a half-year platform project. Start by running for 90 days.
Days 1-30: Make AI collaboration auditable
The goal is not to collect big data, but to have a minimal record of each AI engagement delivery.
action:
- Modify the PR template to include AI engagement scope, key context, verification evidence, failure remediation, and data routing recommendations.
- Define 6-8 error type tags, no more than 10.
- Choose 2 real projects to pilot, not company-wide.
- Select 5 AI collaboration cases to review every week to determine which ones have Mentor value.
Acceptance criteria:
| Indicator | Target |
|---|---|
| Completeness of AI participation records in PRs | Above 70% |
| Cases reviewed each week | At least 5 |
| First version of error type labels | Covers 80% of common problems |
| Clear discard rules | Cover at least three categories: sensitive information, non-reproducible, low value |
Days 31-60: Building a private eval seed set
The goal is to solidify high-value tasks in real delivery into regressible evaluations.
action:
- Pick 20-50 tasks from historical bugs, review controversies, and test escapes.
- Complete the task statement, repository version, acceptance criteria, test commands, and reference fix for each task.
- Make the train/eval isolation rules explicit.
- Select 2-3 models or toolchain versions for offline evaluation.
Acceptance criteria:
| Indicator | Target |
|---|---|
| Number of eval tasks | 20-50 |
| Reproducibility of each task | 100% |
| Flaky task labeling | 100% labeled |
| Model comparison report | At least 1 |
Days 61-90: Open up sample routing and feedback for improvements
The goal is for data to start feeding back into the tool chain, rather than just reporting.
action:
- To create a minimal version of the trace store, you can first use structured files or internal tables, without having to install a complex system at the beginning.
- Route delivery trajectories into evals, SFT candidates, preference data, knowledge bases, and drop zones.
- Make a prompt, RAG or toolchain fix for a high-frequency error.
- Use the private eval to regression-test the effect of the fix.
Acceptance criteria:
| Indicator | Target |
|---|---|
| Data routing coverage | More than 60% of the pilot projects' AI PRs |
| SFT candidate samples | 30-100 hard samples |
| High-frequency error correction | At least one error type shows a significant decrease |
| Eval regression mechanism | Runs stably before and after toolchain changes |
A decision on whether to platform will be made after 90 days. If PR templates, eval seed sets, and sample routing cannot run, hastily building a platform will only solidify process problems into system problems.
Indicator system: Don’t just ask how much time AI saves
Of course we need to look at the productivity indicators of the AI programming assistant, but the Coding Mentor data closed loop also needs to look at several other sets of indicators.
| Indicator group | Representative indicators | Purpose |
|---|---|---|
| Delivery efficiency | Lead time, completion time of AI-involved tasks, PR cycle time | Determine whether AI actually helps delivery |
| Engineering quality | CI failure rate, review defect density, defect escapes, rollbacks | Determine whether AI creates quality debt |
| Collaboration burden | Number of manual corrections, review rounds, frequency of repeated feedback | Determine whether AI reduces the mentor's burden |
| Data assets | Number of usable eval tasks, number of hard samples, proportion of samples passing the gates | Determine whether the closed loop produces reusable assets |
| Model improvement | Private eval improvement, reduction of high-frequency errors, toolchain regression stability | Determine whether the data actually feeds back into capability |
The most misleading is “AI participation rate.” A high AI participation rate does not mean high value. A team can let AI write a lot of code, while making the reviewers more tired, with more defects, and the architecture more messy. What really needs to be seen is whether the participation of AI reduces repeated errors and makes engineering judgments more reusable.
Integration boundaries with existing engineering systems
This closed loop should not be reinvented. It should be embedded into existing engineering systems.
| Existing system | Integration method |
|---|---|
| Issue system | Save requirements, acceptance criteria, defect classification, and business priorities |
| Git/PR | Save diff, review, merge status, AI participation fields |
| CI/CD | Save test, build, security scan, deployment results |
| Logging/Monitoring | Save online feedback, error rate, and performance changes |
| Document system | Save architectural constraints, specifications, reviews and knowledge bases |
| model platform | Run eval, prompt versions, SFT/RFT experiments and A/B comparisons |
Boundaries also need to be clear. AI data closed loop does not replace ALM, CI, code review or knowledge base, but connects the key signals in these systems. It is more like an “AI engineering learning layer” that is responsible for converting delivery facts into assets that the model can learn and the team can evaluate.
What problem does this closed loop really solve?
By this point the article may make the closed loop sound like a very heavy system. It is indeed heavier than “open the AI tool and write code directly”, but it solves problems the latter can never solve.
First, it makes AI progress attributable. Did results improve because the model changed, the prompt changed, more context was added, the training samples got better, or the tasks simply got easier? Without evals and traces, the team can only guess.
Second, it makes human experience reusable. The judgment of senior engineers in each review no longer only serves the current PR, but becomes an asset for subsequent models, knowledge bases and evals.
Third, it makes AI risks governable. Sensitive data, expired experience, benchmark contamination, and unreproducible samples no longer depend on personal awareness; they are caught by the gate control system (a minimal gate sketch follows the fifth point below).
Fourth, it makes the model optimize for delivery rather than for scores. If an improvement on a public leaderboard does not reduce the team’s rework, it is not a real gain for the team.
Fifth, it allows Coding Mentor to transform from individual capabilities to organizational capabilities. A person who can train AI is valuable, but a team that can stably produce Mentor signals is even more valuable.
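Returning to the third point, “entering the gate control system” can start as a handful of checks expressed in code. The sketch below is illustrative: the patterns, the one-year staleness threshold, and the field names are assumptions rather than a complete policy.

```python
# Sketch: a data gate as code instead of personal awareness.
# Patterns, thresholds, and field names are illustrative assumptions.
import re
from datetime import datetime, timedelta, timezone

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS-style access key ids
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # embedded private keys
]

def gate(sample: dict, benchmark_task_ids: set[str]) -> tuple[bool, str]:
    """Return (passes, reason). A sample must pass every check to become a training asset."""
    text = sample.get("ai_diff", "") + sample.get("mentor_feedback", "")
    if any(p.search(text) for p in SECRET_PATTERNS):
        return False, "sensitive data"            # secrets never leave the gate
    created = datetime.fromisoformat(sample["created_at"])
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)
    if datetime.now(timezone.utc) - created > timedelta(days=365):
        return False, "expired experience"        # stale constraints can mislead the model
    if sample.get("source_task_id") in benchmark_task_ids:
        return False, "benchmark contamination"   # keep public-benchmark overlap out
    if not sample.get("reproducible", False):
        return False, "not reproducible"          # no acceptance evidence was recorded
    return True, "ok"
```

Whether a sample was dropped, and why, also becomes part of the trace, which is what feeds the “proportion of samples passing the gate” indicator above.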
Conclusion: Build the feedback system first, then talk about the training system
My advice is clear: don’t start from “we are going to train a team-specific model.” Start from “can we turn the process of AI participating in delivery into a reliable feedback system?”
If the team hasn’t recorded why the AI failed, don’t rush into SFT. If the team doesn’t have a private eval yet, don’t trust fine-tuned scores. If the team doesn’t have data gating yet, don’t treat collaboration logs as an asset. If the reviewers’ judgment still stops at “it doesn’t work here,” first teach the humans how to write Mentor signals.
The real order should be:
- Make the AI delivery process observable.
- Make human feedback structured.
- Make data routing gated.
- Make the private eval regressable.
- Keep the training data small and hard.
- Let the model and the toolchain keep improving against the evaluation.
This is what this article calls a closed loop: AI-assisted product engineering delivery is not the end point but the production site of training and evaluation data; the human Coding Mentor is not the final approver but the designer of feedback signals; SFT does not pour logs into the model but distills gated engineering judgments into learnable assets.
The next article (Part 8) turns to a more concrete question: once these trajectories, feedback records, and candidate samples exist, how to clean, filter, and label them, convert them into high-quality SFT data, and connect them to the training pipeline. After that step, the final article (Part 9) returns to long-term evolution and the future outlook.