From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering
Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.
Copyright statement and disclaimer: this article is a synthesis based on InstructGPT, Flan, and other SFT training research, combined with hands-on engineering experience.
Originality: the reverse data generation framework, data quality assessment standards, and construction process proposed in this article are the author's original work.
Opening: from "we have plenty of records" to "we have trainable samples"
In Part 7, we addressed the closed-loop entry problem: which AI collaboration trajectories are worth keeping, which must be discarded, and which should be entered into evaluation. What will really stall the team next is another, more concrete problem:
These are all "records left by AI-assisted delivery." Why can some of them only be used for review, why can some enter the knowledge base, and why can only a few become training samples?
This article only does one thing: explain this screening and processing process clearly. The focus is not “how to export logs to JSONL”, but how to define the quality boundaries, data contracts and routing rules of trainable samples in the engineering context.
You can think of this article as a mid-stage pipeline:
- The inputs are the engineering assets that have been managed in Part 7 (task contracts, error types, review feedback, verification evidence).
- The middle stage of processing is sample construction, quality gating, desensitization, bucketing, exporting and versioning.
- The outputs are candidate samples that can go into SFT, and evaluation assets that can be regression verified.
Therefore, this article does not discuss the strategic question of “should the model be trained?” but answers the question of “how to avoid training low-quality engineering noise as the model’s default behavior when the team decides to train.”
Overall architecture: automated pipeline from development to training data
Before diving into the details, let's look at the overall architecture. Ideally, SFT data is not curated by hand but generated automatically during the development process.
This pipeline integrates the BMAD-Speckit-SDD-Flow development process with SFT data extraction, achieving "development as training".
Core concept: What is SFT training data?
Basic concepts of SFT
SFT (supervised fine-tuning) adapts a pre-trained model to a specific task by training it on labeled input-output examples.
Basic form: input (instruction/question) → output (expected answer)
SFT data for programming scenarios:
{
"instruction": "Implement a cache class with LRU eviction",
"input": "Requirements: \n- support get and put operations\n- time complexity O(1)\n- configurable capacity",
"output": "class LRUCache:\n def __init__(self, capacity: int):\n ...",
"metadata": {
"difficulty": "medium",
"source": "engineering_project",
"verified": true,
"quality_score": 0.92
}
}
Why engineering outputs are suitable for conversion to SFT data
Feature 1: Strong authenticity
- From actual project requirements
- Tested by practice
- Not a theoretical hypothesis
Feature 2: Controllable quality
- manually reviewed
- Have clear standards of right and wrong
- There is a track record of improvements
Feature 3: Domain related
- Align with team technology stack
- Suitable for business scenarios
- Comply with coding standards
Feature 4: Continuous generation
- Every project is generating new data
- Data volume grows over time
- Models can be continuously updated
Convertible Engineering Asset Types
Asset Type 1: Assessment Question Set
Source: questions accumulated while setting assessment tasks for AI
Conversion method:
Original problem: description plus test cases
↓
High-quality answer: human-written or selected AI answer
↓
SFT data: instruction-input-output tuple
Example:
{
"instruction": "Write a function that finds the kth largest element in an array",
"input": "Requirements: \n- function signature: find_kth_largest(nums: List[int], k: int) -> int\n- time complexity: O(n) or O(n log n)\n- do not use built-in sorting",
"output": "def find_kth_largest(nums: List[int], k: int) -> int:\n def partition(left, right, pivot_idx):\n ...",
"metadata": {
"algorithm": "quickselect",
"difficulty": "medium",
"source": "evaluation_suite",
"pass_rate": 0.75
}
}
Asset Type 2: Code Review Feedback
Source: Review comments on AI generated code
Conversion method:
Problematic code generated by AI
↓
Review feedback: issues plus improvement suggestions
↓
Corrected high-quality code after human revision
↓
SFT data for training code-repair capability
Example:
{
"instruction": "Fix the security issue in the following code",
"input": "```python\ndef query_user(username):\n sql = f\"SELECT * FROM users WHERE name = '{username}'\"\n return db.execute(sql)\n```",
"output": "```python\ndef query_user(username):\n sql = \"SELECT * FROM users WHERE name = %s\"\n return db.execute(sql, (username,))\n```\n\nFix explanation: \n1. use parameterized queries to prevent SQL injection\n2. never concatenate user input into SQL",
"metadata": {
"issue_type": "security",
"severity": "high",
"source": "code_review",
"cwe_id": "CWE-89"
}
}
Asset Type 3: Multi-turn conversation records
Source: full conversations from collaborating with AI to solve problems
Conversion method:
Full conversation: question -> AI answer -> feedback -> improvement ->...
↓
Extract high-quality turns
↓
SFT data for training dialogue capability
Example:
{
"messages": [
{
"role": "user",
"content": "Implement a thread-safe singleton pattern"
},
{
"role": "assistant",
"content": "class Singleton:\n _instance = None\n \n def __new__(cls):\n if cls._instance is None:\n cls._instance = super().__new__(cls)\n return cls._instance"
},
{
"role": "user",
"content": "This implementation is unsafe in a multithreaded environment. If two threads see _instance as None at the same time, two instances can be created."
},
{
"role": "assistant",
"content": "class Singleton:\n _instance = None\n _lock = threading.Lock()\n \n def __new__(cls):\n if cls._instance is None:\n with cls._lock:\n if cls._instance is None:\n cls._instance = super().__new__(cls)\n return cls._instance"
}
],
"metadata": {
"conversation_type": "iterative_refinement",
"topic": "concurrency",
"source": "chat_logs"
}
}
Asset Type 4: Successful Project Cases
Source: Complete project code completed with AI assistance
Conversion method:
Project code: modular, commented, and high quality
↓
Break down into teaching cases
↓
SFT data for training complex task decomposition
Example:
{
"instruction": "Design and implement a JWT authentication system",
"input": "Requirements: \n1. support token generation and verification\n2. support token refresh\n3. support token revocation\n4. use Redis to store sessions\n\nimplement it in the following steps: \n1. design the data model\n2. implement core functionality\n3. add error handling\n4. write test cases",
"output": "## Step 1: data model design\n...\n\n## Step 2: core functionality implementation\n...\n\n## Step 3: error handling\n...\n\n## Step 4: test cases\n...",
"metadata": {
"project_type": "authentication_system",
"complexity": "high",
"source": "successful_project",
"verified_in_production": true
}
}
BMAD-Speckit-SDD-Flow architecture: automated SFT data generation system
In previous chapters, we discussed methods of manually extracting SFT data from engineering assets. However, manual methods are inefficient and difficult to scale. This section will provide an in-depth introduction to the BMAD-Speckit-SDD-Flow architecture - a systematic solution that can automatically generate SFT training data during the AI engineering development process.
Architecture design concept
BMAD-Speckit-SDD-Flow combines the BMAD method (multi-agent agile development) with Spec-Driven Development to automatically capture and transform training data throughout the development workflow.
Core data model: CanonicalSftSample
The core of the system is a standardized data model CanonicalSftSample, which unifies training data from all sources into a standardized format:
interface CanonicalSftSample {
// unique sample identifier
sample_id: string;
sample_version: 'v1';
// data source tracking
source: {
run_id: string; // execution run ID
stage: string; // development stage
flow: string; // workflow type
epic_id?: string; // owning epic
story_id?: string; // owning story
artifact_refs: Array<{ // original artifact references
path: string;
content_hash: string;
kind: string;
}>;
};
// messages compatible with OpenAI format
messages: Array<{
role: 'system' | 'user' | 'assistant' | 'tool';
content: string;
tool_calls?: ToolCall[];
tool_call_id?: string;
weight?: 0 | 1;
}>;
// tool definitions for tool-calling training
tools?: Tool[];
// metadata
metadata: {
schema_targets: string[]; // target formats
language: string; // language
tags?: string[];
notes?: string[];
};
// quality evaluation
quality: {
acceptance_decision: 'accepted' | 'rejected' | 'downgraded';
phase_score: number | null;
dimension_scores?: Record<string, number>;
veto_triggered: boolean;
iteration_count: number;
has_code_pair: boolean;
token_estimate: number;
rejection_reasons: string[];
warnings: string[];
};
// data source traceability
provenance: {
base_commit_hash: string | null;
content_hash: string | null;
source_path: string | null;
patch_ref: string | null;
lineage: string[];
generated_at: string;
};
// dataset split
split: {
assignment: 'train' | 'validation' | 'test' | 'holdout';
seed: number;
strategy: string;
group_key: string | null;
};
// data redaction information
redaction: {
status: 'clean' | 'redacted' | 'blocked';
applied_rules: string[];
findings: Array<{
kind: string;
severity: 'low' | 'medium' | 'high' | 'critical';
field_path: string;
action?: string;
}>;
redacted_fields: string[];
};
// export compatibility
export_compatibility: {
openai_chat: ExportDecision;
hf_conversational: ExportDecision;
hf_tool_calling: ExportDecision;
};
}
Detailed explanation of SFT data extraction pipeline
The data extraction pipeline consists of four core stages:
Stage 1: Candidate Builder
// build candidate samples from run records
function buildCanonicalSample(
record: RunScoreRecord, // evaluation run record
sourceContent: string, // original artifact content
codePair: { input: string; output: string }, // code pair comparison
options: BuildOptions
): CanonicalSftSample {
// 1. extract the instruction from the audit report: section 1 problem description plus section 4 fix plan
const instruction = extractInstruction(sourceContent);
// 2. build conversation messages
const messages = buildCanonicalMessages(
instruction,
codePair.input, // code before the change
codePair.output // code after the change
);
// 3. calculate deterministic dataset split based on the story hash;
// parsedStory is the story identifier parsed from the source path (parsing omitted in this sketch)
const split = assignDeterministicSplit({
seed: options.splitSeed ?? 42,
groupKey: parsedStory
? `epic-${parsedStory.epicId}/story-${parsedStory.storyId}`
: record.run_id,
});
// 4. build the complete sample
return {
sample_id: buildCanonicalSampleId({...}),
source: { run_id, stage, epic_id, story_id, artifact_refs },
messages,
quality: { phase_score, iteration_count, has_code_pair, ... },
provenance: { base_commit_hash, content_hash, patch_ref, ... },
split,
// ... other fields
};
}
Key Features:
- Git Diff Extraction: Automatically extract code changes from base_commit to the current HEAD
- Instruction Extraction: Extract training instructions from the standard sections of the audit report (§1 Questions, §4 Plans)
- Caching mechanism: avoid repeated construction and improve performance
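The diff extraction step can be approximated as a thin wrapper around the Git CLI. A minimal sketch, assuming a repository checkout is available locally; the helper name and error handling are illustrative, not the actual pipeline API:

import { execFileSync } from 'node:child_process';

// Hypothetical helper: recover the code pair for a sample by comparing
// the file at the recorded base commit against the current HEAD.
function extractCodePair(
  repoPath: string,
  baseCommit: string,
  filePath: string
): { input: string; output: string } | null {
  try {
    // code before the change, as it existed at base_commit
    const input = execFileSync('git', ['show', `${baseCommit}:${filePath}`], {
      cwd: repoPath,
      encoding: 'utf8',
    });
    // code after the change, at the current HEAD
    const output = execFileSync('git', ['show', `HEAD:${filePath}`], {
      cwd: repoPath,
      encoding: 'utf8',
    });
    // identical content means there is no trainable code pair
    return input === output ? null : { input, output };
  } catch {
    // file missing at either revision -> recorded as "git diff failed"
    return null;
  }
}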
Stage 2: Quality Gates
The quality gating system evaluates candidate samples in multiple dimensions and decides whether to accept:
interface QualityGateOptions {
minScore?: number; // minimum score threshold, default 90
maxIterations?: number; // maximum iteration count
maxTokens?: number; // maximum token count
requireCodePair?: boolean; // whether a code pair is required
}
function applyQualityGates(
sample: CanonicalSftSample,
options: QualityGateOptions
): CanonicalSftSample {
const { minScore = 90, maxIterations = 4 } = options; // defaults shown are illustrative
const hardReasons: string[] = []; // hard rejection reasons
const softReasons: string[] = []; // soft warning reasons
// hard checks
if (!sample.provenance.base_commit_hash) {
hardReasons.push('prov_missing_hash');
}
if ((sample.quality.phase_score ?? 0) < minScore) {
hardReasons.push('score_below_floor');
}
if (sample.quality.veto_triggered) {
hardReasons.push('veto_triggered'); // critical audit item vetoed
}
if (sample.redaction.status === 'blocked') {
hardReasons.push('redaction_blocked'); // redaction blocked
}
// soft checks
if (sample.quality.iteration_count > maxIterations) {
softReasons.push('too_many_iterations');
}
if (!sample.quality.has_code_pair) {
softReasons.push('missing_code_pair');
}
// decision: accepted / rejected / downgraded
const acceptanceDecision =
hardReasons.length > 0 ? 'rejected' :
softReasons.length > 0 ? 'downgraded' : 'accepted';
return { ...sample, quality: { ...sample.quality,
acceptance_decision: acceptanceDecision,
rejection_reasons: [...hardReasons, ...softReasons]
}};
}
Quality dimensions:
| Check item | Type | Description |
|---|---|---|
| Source integrity | Hard | Must have a commit hash and source path |
| Score threshold | Hard | phase_score >= 90 (configurable) |
| Veto trigger | Hard | A critical audit item failed |
| Redaction block | Hard | Sensitive information such as private keys detected |
| Message integrity | Hard | Must contain both user and assistant messages |
| Iteration count | Soft | Downgraded beyond maxIterations |
| Code pair | Soft | Downgraded when no input/output code pair exists |
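Applied over a batch, the gate produces the accepted/downgraded/rejected tallies that the CLI reports later in this article. A short usage sketch, assuming applyQualityGates as above and a candidates array built in stage 1:

// tally gate decisions over a batch of candidate samples
const gated = candidates.map(s => applyQualityGates(s, { minScore: 90 }));
const tally = { accepted: 0, downgraded: 0, rejected: 0 };
for (const s of gated) {
  tally[s.quality.acceptance_decision] += 1;
}
console.log(tally); // -> { accepted: ..., downgraded: ..., rejected: ... }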
Stage 3: Redaction (data sanitization)
Sensitive information is automatically detected and redacted so the training data is safe to retain:
// redaction rules
const REDACTION_RULES = {
// email address -> medium risk, redact it
email: {
pattern: /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi,
severity: 'medium',
action: 'redact', // replace with [REDACTED_EMAIL]
},
// Secret Token -> high risk, redact it
secretToken: {
pattern: /\bsk-[A-Za-z0-9]{16,}\b/g, // OpenAI API keys and similar secrets
severity: 'high',
action: 'redact', // replace with [REDACTED_SECRET]
},
// private key -> critical risk, block the sample
privateKey: {
pattern: /BEGIN [A-Z ]+ PRIVATE KEY/,
severity: 'critical',
action: 'block', // sample is rejected
},
};
function applyCanonicalRedaction(sample: CanonicalSftSample): CanonicalSftSample {
let status: 'clean' | 'redacted' | 'blocked' = 'clean';
const findings: RedactionFinding[] = [];
const messages = sample.messages.map((message) => {
let content = message.content;
// apply each rule (global regexes are stateful, so reset lastIndex before re-testing)
for (const [ruleName, rule] of Object.entries(REDACTION_RULES)) {
rule.pattern.lastIndex = 0;
if (rule.pattern.test(content)) {
if (rule.action === 'block') {
status = 'blocked';
findings.push({ kind: ruleName, severity: rule.severity, ... });
} else if (rule.action === 'redact') {
status = status === 'blocked' ? 'blocked' : 'redacted';
content = content.replace(rule.pattern, `[REDACTED_${ruleName.toUpperCase()}]`);
findings.push({ kind: ruleName, severity: rule.severity, action: 'redact' });
}
}
}
return { ...message, content };
});
return { ...sample, messages, redaction: { ...sample.redaction, status, findings } };
}
Stage 4: Split Assignment (dataset splitting)
A deterministic algorithm assigns each sample to the train/validation/test set, ensuring reproducibility:
import { createHash } from 'node:crypto';

function assignDeterministicSplit(options: {
seed: number;
groupKey: string | null;
}): CanonicalSplit {
const stableKey = `${options.seed}:${options.groupKey ?? 'ungrouped'}`;
const hash = createHash('sha256').update(stableKey).digest('hex');
const bucket = parseInt(hash.slice(0, 8), 16) % 100;
// 80% train / 10% validation / 10% test
let assignment: 'train' | 'validation' | 'test' = 'train';
if (bucket >= 80 && bucket < 90) assignment = 'validation';
if (bucket >= 90) assignment = 'test';
return { assignment, seed: options.seed, strategy: 'story_hash_v1', group_key: options.groupKey };
}
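Because the bucket is a pure function of the seed and the group key, every sample from the same story lands in the same split, which prevents near-duplicates of one story from leaking across train and test. A quick check, assuming the function above:

// samples from the same story always receive the same assignment
const a = assignDeterministicSplit({ seed: 42, groupKey: 'epic-3/story-12' });
const b = assignDeterministicSplit({ seed: 42, groupKey: 'epic-3/story-12' });
console.assert(a.assignment === b.assignment); // deterministic and machine-independent

// changing the seed reshuffles all groups, but still reproducibly
const c = assignDeterministicSplit({ seed: 7, groupKey: 'epic-3/story-12' });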
Multi-format export system
The system supports exporting to a variety of popular SFT training formats:
OpenAI Chat format
// export to OpenAI fine-tuning format
{
"messages": [
{ "role": "system", "content": "You are a senior coding agent." },
{ "role": "user", "content": "Fix the following SQL injection issue...\n\nCurrent implementation:\ndef query(user):\n sql = f\"SELECT * FROM users WHERE name = '{user}'\"" },
{ "role": "assistant", "content": "def query(user):\n sql = \"SELECT * FROM users WHERE name = %s\"\n return db.execute(sql, (user,))" }
],
"tools": [...], // optional tool definitions
"parallel_tool_calls": false
}
HuggingFace conversational format
// export to HuggingFace conversational format
{
"system": "You are a senior coding agent.",
"conversations": [
{ "from": "human", "value": "Fix the following SQL injection issue..." },
{ "from": "gpt", "value": "def query(user):\n sql = \"SELECT * FROM users WHERE name = %s\"..." }
]
}
HuggingFace tool-calling format
// export to HuggingFace tool-calling format
{
"system": "You are a coding assistant with tool access.",
"conversations": [
{ "from": "human", "value": "Analyze the complexity of this code" },
{ "from": "gpt", "value": "", "tool_calls": [...] },
{ "from": "tool", "value": "{\"complexity\": \"O(n^2)\", ...}" },
{ "from": "gpt", "value": "The time complexity of this code is O(n^2)..." }
],
"tools": [...]
}
CLI usage example
# basic extraction, default phase_score >= 90
npx ts-node scripts/sft-extract.ts
# specify the minimum score
npx ts-node scripts/sft-extract.ts --min-score 85
# specify the output path
npx ts-node scripts/sft-extract.ts --output ./custom-sft-data.jsonl
# complete example
npx ts-node scripts/sft-extract.ts \
--min-score 90 \
--output ./training-data/sft-v1.0.jsonl
Example of output summary:
Extracted 156 samples covering 12 stories; skipped 23 samples: missing source_path: 10, git diff failed: 8, phase_score below threshold: 5
Data construction process: from raw assets to training data
Phase 1: Asset Collection and Classification
Collection Scope:
- Assessment questions and quality answers
- Code review record (problem code + feedback + correction)
- Multi-turn conversation log
- Project documentation and code
- Error case analysis
Classification criteria:
DATA_CATEGORIES = {
"code_generation": {
"description": "code generation task",
"sources": ["evaluation_questions", "project_code"],
"format": "instruction-input-output"
},
"code_review": {
"description": "code review and repair",
"sources": ["review_feedback", "bug_fixes"],
"format": "problem-feedback-solution"
},
"conversation": {
"description": "multi-turn dialogue",
"sources": ["chat_logs"],
"format": "messages"
},
"architecture": {
"description": "architecture design",
"sources": ["design_docs", "system_docs"],
"format": "instruction-context-output"
}
}
Phase 2: Data Cleaning and Filtering
Cleaning Steps:
- Remove duplicates

def deduplicate(data):
    # deduplicate by code similarity
    # deduplicate by problem description
    # keep the higher-quality version
    ...

- Quality screening

def quality_filter(item):
    # code must pass tests
    # code quality score must exceed the threshold
    # problem-answer mapping must be explicit
    # no sensitive information such as passwords or keys
    ...

- Format standardization

def normalize_format(item, target_format):
    # normalize field names
    # normalize code style
    # add metadata
    ...
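Deduplication deserves the most care: near-identical answers often differ only in whitespace or casing, which defeats naive exact matching. A minimal sketch of hash-based deduplication after normalization, written in TypeScript to match the pipeline code; the normalization rules and the SftItem shape are illustrative:

import { createHash } from 'node:crypto';

interface SftItem {
  instruction: string;
  input: string;
  output: string;
  quality_score: number;
}

// collapse whitespace and casing so trivial formatting differences
// do not defeat exact-hash deduplication
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// keep the highest-quality sample per normalized (instruction, output) pair
function deduplicate(items: SftItem[]): SftItem[] {
  const best = new Map<string, SftItem>();
  for (const item of items) {
    const key = createHash('sha256')
      .update(`${normalize(item.instruction)}\u0000${normalize(item.output)}`)
      .digest('hex');
    const existing = best.get(key);
    if (!existing || item.quality_score > existing.quality_score) {
      best.set(key, item);
    }
  }
  return [...best.values()];
}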
Filtering criteria:
| Dimension | Standard | Weight |
|---|---|---|
| Functional correctness | Passes all test cases | 40% |
| Code quality | Static analysis score > 0.8 | 25% |
| Security | No high-risk vulnerabilities | 20% |
| Understandability | Has clear comments and explanations | 15% |
Phase 3: Data enhancement and augmentation
Enhancement Method 1: Variant Generation
# generate multiple phrasings for the same problem
def generate_variants(original_instruction, n=5):
variants = []
# use synonym replacement
# change sentence structure
# add or remove detail
return variants
Enhancement Method 2: Difficulty adjustment
# generate versions of the same problem at different difficulty levels
def adjust_difficulty(item, target_difficulty):
    if target_difficulty == "easy":
        # provide more hints
        # reduce constraints
        ...
    elif target_difficulty == "hard":
        # add boundary conditions
        # add performance requirements
        ...
    return item
Enhancement Method 3: Cross-language migration
# migrate Python problems to JavaScript, Go, and other languages
def migrate_language(item, target_language):
    # syntax conversion
    # idiom adjustment
    # ecosystem adaptation
    ...
Phase 4: Format conversion and export
Universal SFT format:
{
"dataset_info": {
"name": "company_coding_sft_v1",
"version": "1.0.0",
"created_at": "2024-04-01",
"total_samples": 10000,
"categories": {
"code_generation": 6000,
"code_review": 2000,
"conversation": 1500,
"architecture": 500
}
},
"data": [
{
"id": "cg_0001",
"type": "code_generation",
"instruction": "...",
"input": "...",
"output": "...",
"metadata": {
"source": "evaluation_suite",
"difficulty": "medium",
"language": "python",
"quality_score": 0.92,
"verified": true
}
}
]
}
Framework specific format:
- Alpaca format
- ShareGPT format
- OpenAI fine-tuning format
- HuggingFace datasets format
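Converting the universal format into these framework-specific targets is a mechanical field mapping. As an illustration, a minimal converter to Alpaca-style records, which keep only the instruction/input/output triplet; the UniversalSample type is an assumption mirroring the universal format above:

interface UniversalSample {
  id: string;
  instruction: string;
  input: string;
  output: string;
  metadata: Record<string, unknown>;
}

// Alpaca-style fine-tuning records keep only the bare triplet
function toAlpaca(sample: UniversalSample): {
  instruction: string;
  input: string;
  output: string;
} {
  return {
    instruction: sample.instruction,
    input: sample.input,
    output: sample.output,
  };
}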
Data quality assessment criteria
Assessment Dimensions
Dimension 1: Accuracy
- Does the code run correctly?
- Does it meet the requirement description?
- Test case pass rate
Dimension 2: Completeness
- Does it contain the necessary information?
- Does it include edge case handling?
- Does it include error handling?
Dimension 3: Consistency
- Clear correspondence between input and output
- Consistent style
- Consistent terminology
Dimension 4: Diversity
- Question type coverage
- Reasonable difficulty distribution
- Language/framework coverage
Dimension 5: Security
- No malicious code
- No sensitive information leaked
- Comply with safety regulations
Quality scoring model
def calculate_quality_score(item):
# the evaluate_* helpers are assumed to return scores in [0, 1]
scores = {
'correctness': evaluate_correctness(item), # 40%
'completeness': evaluate_completeness(item), # 25%
'consistency': evaluate_consistency(item), # 20%
'safety': evaluate_safety(item), # 15%
}
weights = {
'correctness': 0.40,
'completeness': 0.25,
'consistency': 0.20,
'safety': 0.15
}
total_score = sum(scores[k] * weights[k] for k in scores)
return total_score
Key points for manual review
Required review items:
- Code functional correctness (runtime verification)
- No security vulnerabilities (static scanning)
- No sensitive information (regex matching)
- Format compliance (automated checks)
Sampling items (20% sampling):
- Code quality (human assessment)
- Teaching value (expert evaluation)
- Scenario authenticity (confirmed by the business party)
SFT training basics: how to use this data
Training process overview
raw data
↓
data preprocessing: cleaning and formatting
↓
dataset construction: train/validation/test split
↓
SFT training
↓
model evaluation
↓
model deployment
Simple training example (using HuggingFace)
Source note (2026-03): the training API and LoRA configuration follow the TRL SFTTrainer and PEFT LoRA documentation; the base model ID below is a placeholder, since actual projects must first review model cards, licenses, training data boundaries, and the team's internal evaluation snapshots.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
# 1. Load a reviewed base model (example placeholder, March 2026).
base_model_id = "your-org/code-model-base"
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# 2. Configure LoRA to reduce training cost.
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# 3. Load the dataset.
dataset = load_dataset("json", data_files="company_coding_sft_v1.json")
# 4. Configure training.
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
save_steps=100,
logging_steps=10,
)
# 5. Train.
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
args=training_args,
)
trainer.train()
# 6. Save the fine-tuned adapter.
model.save_pretrained("./company_coding_model")
Training suggestions
Data volume recommendations:
- Minimum: 1,000 high-quality samples
- Recommended: 10,000+ samples
- Ideal: 100,000+ samples
Quality > Quantity:
- 1,000 high-quality samples beat 10,000 low-quality ones
- Better a small, refined set than a large, noisy one
Continuous iteration:
- First use a small data set to verify feasibility
- Gradually increase data based on performance
- Establish a closed loop of data-training-evaluation
Practical case: complete data construction process
Background
A team hopes to build a dedicated code generation model based on 6 months of engineering practice.
Asset inventory
| Asset type | Raw quantity | Usable quantity | Notes |
|---|---|---|---|
| Assessment questions | 300 | 250 | After filtering |
| Code review records | 500 | 400 | After deduplication |
| Conversation logs | 1,000 | 800 | After filtering |
| Project code | 100,000 lines | 20,000 lines | Selected modules |
Data processing
Step 1: Extraction and Cleaning
# extract from evaluation tasks
sft_items = []
for question in evaluation_questions:
    if question.quality_score > 0.8:
        sft_items.append({
            "instruction": question.description,
            "input": question.requirements,
            "output": question.reference_solution
        })

# extract from code review records
for review in code_reviews:
    if review.severity in ["high", "critical"]:
        sft_items.append({
            "instruction": f"Fix the {review.issue_type} issue in the following code",
            "input": review.problematic_code,
            "output": review.fixed_code + "\n\n" + review.explanation
        })
Step 2: Quality scoring
for item in sft_items:
    item["quality_score"] = calculate_quality_score(item)

# keep only high-quality samples
high_quality_items = [i for i in sft_items if i["quality_score"] > 0.85]
Step 3: Data enhancement
# augment high-quality samples with variants
augmented_items = []
for item in high_quality_items:
    augmented_items.append(item)
    # generate 3 variants for each item
    for variant in generate_variants(item, n=3):
        augmented_items.append(variant)
Step 4: Manual review
import random

# sample 20% of items for manual review
sample = random.sample(augmented_items, k=len(augmented_items) // 5)
for item in sample:
    review_result = manual_review(item)
    if not review_result.approved:
        augmented_items.remove(item)
Final dataset
## Dataset summary
- Samples: 5,000
- Category distribution:
  - code generation: 60% (3,000)
  - code review and repair: 25% (1,250)
  - architecture reasoning: 10% (500)
  - architecture design: 5% (250)
- Language distribution:
  - Python: 70%
  - JavaScript: 20%
  - Go: 10%
- Difficulty distribution:
  - Easy: 30%
  - Medium: 50%
  - Hard: 20%
- Average quality score: 0.88
Training results
- Baseline model: a code base model the team reviewed in 2026-03 against model cards, licenses, and internal benchmarks
- Training method: LoRA fine-tuning
- Training time: 4 hours (single A100 GPU)
Effect comparison:
| Metric | Baseline | After fine-tuning | Change |
|---|---|---|---|
| Team internal test set pass rate | 65% | 82% | +17% |
| Code style matching | 60% | 85% | +25% |
| Security vulnerability rate | 15% | 5% | -10% |
| Team satisfaction | 70% | 90% | +20% |
Data Capitalization: Long-term Operation Suggestions
Establish data collection mechanism
Automatic Collection Points:
- After every code review, ask if it can be used for training
- Export conversation logs regularly
- Screen high-quality modules when archiving project code
Data Pipeline:
engineering practice
↓
automatic collection
↓
initial filtering
↓
quality scoring
↓
manual review (sampled)
↓
training data pool
↓
model training and update
Version management
datasets/
├── v1.0.0_2024q1/
│ ├── train.jsonl
│ ├── validation.jsonl
│ └── metadata.json
├── v1.1.0_2024q2/
│ └── ...
└── v2.0.0_2024q3/
└── ...
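Each version directory should carry a metadata.json describing how the snapshot was produced. A hypothetical example of the fields worth recording (the field names are illustrative):

{
  "name": "company_coding_sft",
  "version": "1.1.0",
  "created_at": "2024-06-30",
  "gate_config": { "min_score": 90, "require_code_pair": true },
  "split_seed": 42,
  "counts": { "train": 8000, "validation": 1000, "test": 1000 }
}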
Continuous iteration
- Monthly: collect new data
- Quarterly: update the dataset version
- Every half year: retrain the model
- Annually: evaluate the overall effect and adjust strategy
Screening criteria and material selection: building high-quality training sets
When using BMAD-Speckit-SDD-Flow to automatically extract SFT data, it is crucial to establish scientific screening criteria. Here is a proven screening framework:
Three-tier screening system
Three-layer screening ensures data quality while maintaining diversity, ultimately retaining approximately 50% of high-quality samples.
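The exact tier definitions vary by team. A representative sketch chains hard rules, a score threshold, and composition balancing; the thresholds here are illustrative, and balanceComposition is sketched later in this section:

// illustrative three-tier screen over candidate samples
function threeTierScreen(samples: CanonicalSftSample[]): CanonicalSftSample[] {
  // tier 1: hard rules -- provenance and redaction must be intact
  const tier1 = samples.filter(s =>
    s.provenance.base_commit_hash !== null &&
    s.redaction.status !== 'blocked'
  );
  // tier 2: quality threshold (assumed floor of 85)
  const tier2 = tier1.filter(s => (s.quality.phase_score ?? 0) >= 85);
  // tier 3: diversity -- rebalance categories toward the target composition
  return balanceComposition(tier2);
}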
Material selection strategy
1. Scenario-based selection
BMAD-Speckit-SDD-Flow distinguishes different scenarios and gives priority to high-quality scenarios:
| Scenario | describe | priority | reason |
|---|---|---|---|
real_dev | Real development scenarios | high | From the actual coding process, highest quality |
evaluation | Assessment scenario | middle | Have clear standards of right and wrong |
synthetic | synthetic data | Low | May lack authenticity |
2. Selection based on Score Pattern
// score patterns observed in run records
const HIGH_VALUE_PATTERNS = [
  // high initial score + few iterations = clean, easily learned sample
  { initialScore: '>80', iterations: '<=2' },
  // low initial score + high final score + several iterations = valuable repair trajectory
  { initialScore: '<60', finalScore: '>90', iterations: '3-5' },
  // veto triggered but finally passed = high-value hard case
  { vetoTriggered: true, finalScore: '>90' },
];

// low-value patterns
const LOW_VALUE_PATTERNS = [
  // many iterations + still a low final score = low quality
  { iterations: '>5', finalScore: '<80' },
  // no code pair comparison = missing supervision signal
  { hasCodePair: false },
];
3. Balance based on Content Category
// target composition of the training set
const TARGET_COMPOSITION = {
  codeGeneration: 0.50, // 50% code generation
  codeReview: 0.25, // 25% code review
  bugFix: 0.15, // 15% bug fixing
  architecture: 0.10, // 10% architecture design
};

// subdivide within each category
const CODE_GEN_SUBCATEGORIES = {
  algorithm: 0.30, // algorithm implementation
  dataStructure: 0.25, // data structures
  apiDesign: 0.25, // API design
  utility: 0.20, // utility functions
};
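Enforcing these targets means downsampling over-represented categories. A minimal balancing sketch, assuming each sample's first metadata.tags entry names its content category (that tag convention is an assumption, not part of the schema):

// downsample so no category exceeds its target share of the final set
function balanceComposition(samples: CanonicalSftSample[]): CanonicalSftSample[] {
  const byCategory = new Map<string, CanonicalSftSample[]>();
  for (const s of samples) {
    const category = s.metadata.tags?.[0] ?? 'uncategorized';
    const group = byCategory.get(category) ?? [];
    group.push(s);
    byCategory.set(category, group);
  }
  const result: CanonicalSftSample[] = [];
  for (const [category, group] of byCategory) {
    const share = (TARGET_COMPOSITION as Record<string, number>)[category] ?? 0.05;
    const cap = Math.floor(samples.length * share);
    // within the cap, prefer the highest-scoring samples
    group.sort((a, b) => (b.quality.phase_score ?? 0) - (a.quality.phase_score ?? 0));
    result.push(...group.slice(0, Math.max(cap, 1)));
  }
  return result;
}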
Quality gate configuration recommendations
Adjust Quality Gates parameters according to different usage scenarios:
Scenario A: Building a basic model (high rigor)
const STRICT_GATES = {
minScore: 95, // select only the highest scores
maxIterations: 2, // at most two iterations
requireCodePair: true, // a code pair is required
maxTokens: 4096, // strict token limit
};
// expected retention: 20-30% of samples
Scenario B: Building an augmented dataset (moderately rigorous)
const BALANCED_GATES = {
minScore: 85, // a good score is sufficient
maxIterations: 4, // allow more iterations
requireCodePair: false, // a code pair is optional
maxTokens: 8192, // looser token limit
};
// expected retention: 50-60% of samples
Scenario C: Construct experimental data set (relaxed)
const EXPERIMENTAL_GATES = {
minScore: 70, // passing is sufficient
maxIterations: 10, // do not limit iterations
requireCodePair: false, // do not require a code pair
maxTokens: 16384, // loosest token limit
};
// expected retention: 80-90% of samples
Practical case: complete data generation process
Background
A team has been using BMAD-Speckit-SDD-Flow for AI-assisted development for three months, hoping to build a dedicated code repair model. They accumulated the following raw data:
| Asset type | Raw quantity | Source |
|---|---|---|
| Story implementation records | 48 | BMAD-Speckit workflow |
| Evaluation run records | 1,200 | Scoring system |
| Audit reports | 48 | Audit module |
| Code commits | 156 | Git history |
Step 1: Data extraction
# extract SFT data for the bug-fix model
npx ts-node scripts/sft-extract.ts \
--min-score 85 \
--output ./sft-training/bugfix-v1.0.jsonl
# extraction process log
[INFO] Loading scoring records from packages/scoring/data...
[INFO] Found 1,200 scoring records
[INFO] Filtering by scenario: real_dev
[INFO] Filtering by phase_score >= 85
[INFO] Building candidate samples...
[INFO] Applying quality gates...
[INFO] Applying redaction rules...
[WARNING] Blocked 3 samples containing private keys
[INFO] Assigning deterministic splits...
[INFO] Exporting to JSONL...
# output summary
Extracted 312 samples, covering 48 stories
- Accepted: 298
- Downgraded: 14
- Rejected: 888, of which:
  - phase_score below threshold: 720
  - missing source_path: 68
  - missing code pair comparison: 56
  - redaction blocked: 44

Dataset split:
- Train: 238 (76%)
- Validation: 37 (12%)
- Test: 37 (12%)
Step 2: Quality analysis
// dataset quality analysis
import { analyzeDataset } from './analytics';
const analysis = analyzeDataset('./sft-training/bugfix-v1.0.jsonl');
console.log(analysis.summary);
/*
{
totalSamples: 312,
avgPhaseScore: 91.5,
avgTokenCount: 2847,
categories: {
securityFix: 89, // SQL injection and XSS fixes
performanceFix: 67, // algorithm optimization
styleFix: 76, // code style fixes
logicFix: 80 // logic fixes
},
languages: {
typescript: 180,
python: 89,
go: 43
},
redactionSummary: {
clean: 308,
redacted: 4, // email addresses were redacted
blocked: 0 // blocked samples never reach the exported dataset
}
}
*/
Step 3: Format conversion and export
// convert to training formats
import { exportDataset } from './export';
// export to OpenAI format
await exportDataset({
input: './sft-training/bugfix-v1.0.jsonl',
output: './sft-training/openai-format/',
format: 'openai_chat',
filter: s => s.quality.acceptance_decision === 'accepted'
});
// export to HuggingFace format
await exportDataset({
input: './sft-training/bugfix-v1.0.jsonl',
output: './sft-training/hf-format/',
format: 'hf_conversational',
splits: ['train', 'validation', 'test']
});
Step 4: Training and evaluation
# train with the exported data (example sketch)
from transformers import AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
# Use a placeholder model ID; real runs must record model card, license, and eval snapshot.
model = AutoModelForCausalLM.from_pretrained("your-org/code-model-base")
# Configure training.
training_args = TrainingArguments(
output_dir="./bugfix-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-4,
)
# use the exported dataset
dataset = load_dataset("json", data_files={
"train": "./sft-training/hf-format/train.jsonl",
"validation": "./sft-training/hf-format/validation.jsonl",
})
# train
trainer = SFTTrainer(
model=model,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
args=training_args,
)
trainer.train()
Training effect comparison
| Metric | Baseline | After fine-tuning | Change |
|---|---|---|---|
| Security vulnerability repair success rate | 62% | 88% | +26% |
| Performance problem identification rate | 58% | 82% | +24% |
| Code style compliance rate | 71% | 89% | +18% |
| Average repair recommendation quality | 3.2/5 | 4.3/5 | +1.1 |
Conclusion: Get the training samples right first, and then talk about the training scale
The really difficult part of converting engineering collaboration data into SFT samples is never “what format to export”, but “which samples are worth training and which samples must be rejected”. This requires task contracts, error classification, manual feedback, verification evidence and governance gates to work together.
The core framework given in this article can be summarized into three points:
- Build the data contract first, then build the data volume. Samples must be traceable, verifiable, and interpretable before they can enter the training pool.
- Do quality gating first, then do automation. Without hard and soft gating, automation only amplifies noise.
- First preserve evaluation isolation, then pursue metric improvements. Mixing train and eval data turns "model progress" into a statistical illusion.
If the team is just starting out, the most practical sequence is still: first run the complete process on a small sample, then gradually expand the scale; first raise the bar for sample quality, then increase training frequency.
This article completes the mid-stage project of “from delivery trajectory to trainable samples”. The last article in the series will return to a longer-term issue: when this closed loop operates stably, how can organizations judge the evolution direction of future AI programming evaluation and collaboration paradigms, and avoid treating today’s processes as tomorrow’s upper limit.
References and Acknowledgments
- InstructGPT — Ouyang et al., OpenAI
- Flan Collection — Longpre et al., Google
- BMAD-Speckit-SDD-Flow — BMAD Method Team
- TRL SFTTrainer — Hugging Face
- PEFT LoRA — Hugging Face