Hualin Luan Cloud Native · Quant Trading · AI Engineering


From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering

Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.

Meta

Published: 3/30/2026
Category: Interpretation
Reading time: 23 min read

Copyright statement and disclaimer: This article is an interpretive synthesis based on InstructGPT, Flan, and other SFT training research, combined with hands-on engineering experience.

Originality: The reverse data generation framework, data quality assessment criteria, and construction process proposed in this article are the author's original work.


Beginning: from "many records" to "trainable samples"

In Part 7 we addressed the closed-loop entry problem: which AI collaboration trajectories are worth keeping, which must be discarded, and which should go into eval. What actually gets teams stuck next is a more specific problem:

These are all "records left behind by AI-assisted delivery". Why can some of them only be used for review, why can some enter the knowledge base, and why do only a few become training samples?

This article does one thing: explain that screening and processing pipeline clearly. The focus is not "how to export logs to JSONL", but how to define the quality boundaries, data contracts, and routing rules for trainable samples in an engineering context.

You can think of this article as a mid-stage pipeline:

  1. The inputs are the engineering assets already governed in Part 7 (task contracts, error types, review feedback, verification evidence).
  2. The middle stage is sample construction, quality gating, redaction, bucketing, export, and versioning.
  3. The outputs are candidate samples that can enter SFT, plus evaluation assets that can be regression-verified.

Therefore, this article does not discuss the strategic question of "should we train a model at all?"; it answers "once the team decides to train, how do we avoid baking low-quality engineering noise into the model's default behavior?"


Overall architecture: automated pipeline from development to training data

Before going into the details, let's look at the overall architecture. Ideally, SFT data is not curated by hand; it is generated automatically as a by-product of the development process:

The total process from project delivery to SFT sample export

This pipeline integrates the BMAD-Speckit-SDD-Flow development process with SFT data extraction to achieve "development as training".

Core concept: What is SFT training data?

Basic concepts of SFT

SFT (supervised fine-tuning) continues training a pre-trained model on labeled examples so that it learns to perform a specific task.

Basic form: input (instruction/question) → output (expected answer)

SFT data for programming scenarios:

{
    "instruction": "Implement a cache class with LRU eviction",
    "input": "Requirements: \n- support get and put operations\n- time complexity O(1)\n- configurable capacity",
    "output": "class LRUCache:\n    def __init__(self, capacity: int):\n        ...",
    "metadata": {
        "difficulty": "medium",
        "source": "engineering_project",
        "verified": true,
        "quality_score": 0.92
    }
}

Why engineering outputs are suitable for conversion to SFT data

Feature 1: Strong authenticity

  • From actual project requirements
  • Tested by practice
  • Not a theoretical hypothesis

Feature 2: Controllable quality

  • Manually reviewed
  • Clear pass/fail criteria
  • A traceable record of revisions

Feature 3: Domain related

  • Align with team technology stack
  • Suitable for business scenarios
  • Comply with coding standards

Feature 4: Continuous generation

  • Every project is generating new data
  • Data volume grows over time
  • Models can be continuously updated

Convertible Engineering Asset Types

Asset Type 1: Assessment Question Set

Source: questions accumulated while setting assessment tasks for the AI

Conversion method:

Original problem: description plus test cases

High-quality answer: human-written or selected AI answer

SFT data: instruction-input-output tuple

Example:

{
    "instruction": "Write a function that finds the kth largest element in an array",
    "input": "Requirements: \n- function signature: find_kth_largest(nums: List[int], k: int) -> int\n- time complexity: O(n) or O(n log n)\n- do not use built-in sorting",
    "output": "def find_kth_largest(nums: List[int], k: int) -> int:\n    def partition(left, right, pivot_idx):\n        ...",
    "metadata": {
        "algorithm": "quickselect",
        "difficulty": "medium",
        "source": "evaluation_suite",
        "pass_rate": 0.75
    }
}

Asset Type 2: Code Review Feedback

Source: Review comments on AI generated code

Conversion method:

Problematic code generated by AI

Review feedback: issues plus improvement suggestions

Corrected high-quality code after human revision

SFT data for training code-repair capability

Example:

{
    "instruction": "Fix the security issue in the following code",
    "input": "```python\ndef query_user(username):\n    sql = f\"SELECT * FROM users WHERE name = '{username}'\"\n    return db.execute(sql)\n```",
    "output": "```python\ndef query_user(username):\n    sql = \"SELECT * FROM users WHERE name = %s\"\n    return db.execute(sql, (username,))\n```\n\nFix explanation: \n1. use parameterized queries to prevent SQL injection\n2. never concatenate user input into SQL",
    "metadata": {
        "issue_type": "security",
        "severity": "high",
        "source": "code_review",
        " CWE_id": "CWE-89"
    }
}

Asset Type 3: Multi-turn conversation records

Source: full conversations from collaborating with AI to solve problems

Conversion method:

Full conversation: question -> AI answer -> feedback -> improvement ->...

Extract high-quality turns

SFT data for training dialogue capability

Example:

{
    "messages": [
        {
            "role": "user",
            "content": "Implement a thread-safe singleton pattern"
        },
        {
            "role": "assistant",
            "content": "class Singleton:\n    _instance = None\n    \n    def __new__(cls):\n        if cls._instance is None:\n            cls._instance = super().__new__(cls)\n        return cls._instance"
        },
        {
            "role": "user",
            "content": "This implementation is unsafe in a multithreaded environment. If two threads see _instance as None at the same time, two instances can be created."
        },
        {
            "role": "assistant",
            "content": "class Singleton:\n    _instance = None\n    _lock = threading.Lock()\n    \n    def __new__(cls):\n        if cls._instance is None:\n            with cls._lock:\n                if cls._instance is None:\n                    cls._instance = super().__new__(cls)\n        return cls._instance"
        }
    ],
    "metadata": {
        "conversation_type": "iterative_refinement",
        "topic": "concurrency",
        "source": "chat_logs"
    }
}

Asset Type 4: Successful Project Cases

Source: Complete project code completed with AI assistance

Conversion method:

Project code: modular, commented, and high quality

Break down into teaching cases

SFT data for training complex task decomposition

Example:

{
    "instruction": "Design and implement a JWT authentication system",
    "input": "Requirements: \n1. support token generation and verification\n2. support token refresh\n3. support token revocation\n4. use Redis to store sessions\n\nimplement it in the following steps: \n1. design the data model\n2. implement core functionality\n3. add error handling\n4. write test cases",
    "output": "## Step 1: data model design\n...\n\n## Step 2: core functionality implementation\n...\n\n## Step 3: error handling\n...\n\n## Step 4: test cases\n...",
    "metadata": {
        "project_type": "authentication_system",
        "complexity": "high",
        "source": "successful_project",
        "verified_in_production": true
    }
}

BMAD-Speckit-SDD-Flow architecture: automated SFT data generation system

In previous chapters, we discussed how to extract SFT data from engineering assets by hand. Manual methods, however, are inefficient and hard to scale. This section introduces the BMAD-Speckit-SDD-Flow architecture in depth: a systematic solution that automatically generates SFT training data during AI engineering development.

Architecture design concept

BMAD-Speckit-SDD-Flow combines the BMAD method (multi-agent agile development) with Spec-Driven Development to automatically capture and transform training data at the following points in the workflow:

BMAD-Speckit-SDD-Flow to SFT architecture mapping

Core data model: CanonicalSftSample

The core of the system is a standardized data model CanonicalSftSample, which unifies training data from all sources into a standardized format:

interface CanonicalSftSample {
  // unique sample identifier
  sample_id: string;
  sample_version: 'v1';

  // data source tracking
  source: {
    run_id: string;              // execution run ID
    stage: string;               // development stage
    flow: string;                // workflow type
    epic_id?: string;            // owning epic
    story_id?: string;           // owning story
    artifact_refs: Array<{       // original artifact references
      path: string;
      content_hash: string;
      kind: string;
    }>;
  };

  // messages compatible with OpenAI format
  messages: Array<{
    role: 'system' | 'user' | 'assistant' | 'tool';
    content: string;
    tool_calls?: ToolCall[];
    tool_call_id?: string;
    weight?: 0 | 1;
  }>;

  // tool definitions for tool-calling training
  tools?: Tool[];

  // metadata
  metadata: {
    schema_targets: string[];    // target formats
    language: string;            // language
    tags?: string[];
    notes?: string[];
  };

  // quality evaluation
  quality: {
    acceptance_decision: 'accepted' | 'rejected' | 'downgraded';
    phase_score: number | null;
    dimension_scores?: Record<string, number>;
    veto_triggered: boolean;
    iteration_count: number;
    has_code_pair: boolean;
    token_estimate: number;
    rejection_reasons: string[];
    warnings: string[];
  };

  // data source traceability
  provenance: {
    base_commit_hash: string | null;
    content_hash: string | null;
    source_path: string | null;
    patch_ref: string | null;
    lineage: string[];
    generated_at: string;
  };

  // dataset split
  split: {
    assignment: 'train' | 'validation' | 'test' | 'holdout';
    seed: number;
    strategy: string;
    group_key: string | null;
  };

  // data redaction information
  redaction: {
    status: 'clean' | 'redacted' | 'blocked';
    applied_rules: string[];
    findings: Array<{
      kind: string;
      severity: 'low' | 'medium' | 'high' | 'critical';
      field_path: string;
      action?: string;
    }>;
    redacted_fields: string[];
  };

  // export compatibility
  export_compatibility: {
    openai_chat: ExportDecision;
    hf_conversational: ExportDecision;
    hf_tool_calling: ExportDecision;
  };
}

Detailed explanation of SFT data extraction pipeline

The data extraction pipeline consists of four core stages:

Four stages of SFT candidate sample extraction

Stage 1: Candidate Builder (candidate construction)

// build candidate samples from run records
function buildCanonicalSample(
  record: RunScoreRecord,        // evaluation run record
  sourceContent: string,          // original artifact content
  codePair: { input: string; output: string },  // code pair comparison
  options: BuildOptions
): CanonicalSftSample {

  // 1. extract the instruction from the audit report: section 1 problem description plus section 4 fix plan
  const instruction = extractInstruction(sourceContent);

  // 2. build conversation messages
  const messages = buildCanonicalMessages(
    instruction,
    codePair.input,    // code before the change
    codePair.output    // code after the change
  );

  // 3. calculate a deterministic dataset split based on the story hash
  // (parseStoryRef is an assumed helper that pulls epic/story ids from the run record)
  const parsedStory = parseStoryRef(record);
  const split = assignDeterministicSplit({
    seed: options.splitSeed ?? 42,
    groupKey: parsedStory
      ? `epic-${parsedStory.epicId}/story-${parsedStory.storyId}`
      : record.run_id,
  });

  // 4. build the complete sample
  return {
    sample_id: buildCanonicalSampleId({...}),
    source: { run_id, stage, epic_id, story_id, artifact_refs },
    messages,
    quality: { phase_score, iteration_count, has_code_pair, ... },
    provenance: { base_commit_hash, content_hash, patch_ref, ... },
    split,
    // ... other fields
  };
}

Key Features:

  • Git Diff Extraction: Automatically extract code changes from base_commit to the current HEAD
  • Instruction Extraction: Extract training instructions from the standard sections of the audit report (§1 Questions, §4 Plans)
  • Caching mechanism: avoid repeated construction and improve performance
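To make the first two features concrete, here is a minimal sketch under assumed helpers and report layout: extractInstruction keeps only the problem-description and fix-plan sections of the audit report, and extractCodePair reads the before/after contents of a file from git. The section markers, helper names, and file-level granularity are illustrative assumptions, not the project's actual implementation.

import { execFileSync } from 'node:child_process';

// Sketch only: keep the "problem description" (§1) and "fix plan" (§4) sections of an audit report.
// The section numbering mirrors the convention described above and is an assumption.
function extractInstruction(auditReport: string): string {
  const sections = auditReport.split(/^## /m);
  const wanted = sections.filter(s => /^\s*(1|4)\./.test(s));
  return wanted.map(s => `## ${s.trim()}`).join('\n\n');
}

// Sketch only: read the before/after contents of one file between base_commit and HEAD.
function extractCodePair(baseCommit: string, filePath: string): { input: string; output: string } {
  const input = execFileSync('git', ['show', `${baseCommit}:${filePath}`], { encoding: 'utf8' });
  const output = execFileSync('git', ['show', `HEAD:${filePath}`], { encoding: 'utf8' });
  return { input, output };
}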

Stage 2: Quality Gates (Quality Gate Control)

The quality gating system evaluates candidate samples in multiple dimensions and decides whether to accept:

interface QualityGateOptions {
  minScore?: number;          // minimum score threshold, default 90
  maxIterations?: number;     // maximum iteration count
  maxTokens?: number;         // maximum token count
  requireCodePair?: boolean;  // whether a code pair is required
}

function applyQualityGates(
  sample: CanonicalSftSample,
  options: QualityGateOptions
): CanonicalSftSample {
  const minScore = options.minScore ?? 90;                 // default threshold from QualityGateOptions
  const maxIterations = options.maxIterations ?? Infinity; // no iteration limit unless configured
  const hardReasons: string[] = [];  // hard rejection reasons
  const softReasons: string[] = [];  // soft warning reasons

  // hard checks
  if (!sample.provenance.base_commit_hash) {
    hardReasons.push('prov_missing_hash');
  }
  if ((sample.quality.phase_score ?? 0) < minScore) {
    hardReasons.push('score_below_floor');
  }
  if (sample.quality.veto_triggered) {
    hardReasons.push('veto_triggered');  // critical audit item vetoed
  }
  if (sample.redaction.status === 'blocked') {
    hardReasons.push('redaction_blocked');  // redaction blocked
  }

  // soft checks
  if (sample.quality.iteration_count > maxIterations) {
    softReasons.push('too_many_iterations');
  }
  if (!sample.quality.has_code_pair) {
    softReasons.push('missing_code_pair');
  }

  // decision: accepted / rejected / downgraded
  const acceptanceDecision =
    hardReasons.length > 0 ? 'rejected' :
    softReasons.length > 0 ? 'downgraded' : 'accepted';

  return { ...sample, quality: { ...sample.quality,
    acceptance_decision: acceptanceDecision,
    rejection_reasons: [...hardReasons, ...softReasons]
  }};
}

Quality dimensions:

| Check item | Type | Description |
| --- | --- | --- |
| Source integrity | Hard | Must have a commit hash and source path |
| Score threshold | Hard | phase_score >= 90 (configurable) |
| Veto trigger | Hard | A critical audit item failed |
| Redaction block | Hard | Sensitive information such as private keys detected |
| Message integrity | Hard | Must contain both user and assistant messages |
| Iteration count | Soft | Downgraded beyond maxIterations |
| Code pair | Soft | Downgraded when there is no input/output code pair |

Stage 3: Redaction (data desensitization)

Automatically detect and redact sensitive information to keep the training data safe:

// redaction rules
const REDACTION_RULES = {
  // email address -> medium risk, redact it
  email: {
    pattern: /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi,
    severity: 'medium',
    action: 'redact',  // replace with [REDACTED_EMAIL]
  },

  // Secret Token -> high risk, redact it
  secretToken: {
    pattern: /\bsk-[A-Za-z0-9]{16,}\b/g,  // OpenAI API keys and similar secrets
    severity: 'high',
    action: 'redact',  // replace with [REDACTED_SECRET]
  },

  // private key -> critical risk, block the sample
  privateKey: {
    pattern: /BEGIN [A-Z ]+ PRIVATE KEY/,
    severity: 'critical',
    action: 'block',  // sample is rejected
  },
};

function applyCanonicalRedaction(sample: CanonicalSftSample): CanonicalSftSample {
  let status: 'clean' | 'redacted' | 'blocked' = 'clean';
  const findings: RedactionFinding[] = [];

  const messages = sample.messages.map((message, index) => {
    let content = message.content;

    // apply each rule
    for (const [ruleName, rule] of Object.entries(REDACTION_RULES)) {
      rule.pattern.lastIndex = 0;  // global regexes are stateful; reset before re-testing
      if (rule.pattern.test(content)) {
        if (rule.action === 'block') {
          status = 'blocked';
          findings.push({ kind: ruleName, severity: rule.severity, ... });
        } else if (rule.action === 'redact') {
          status = status === 'blocked' ? 'blocked' : 'redacted';
          content = content.replace(rule.pattern, `[REDACTED_${ruleName.toUpperCase()}]`);
          findings.push({ kind: ruleName, severity: rule.severity, action: 'redact' });
        }
      }
    }

    return { ...message, content };
  });

  return { ...sample, messages, redaction: { ...sample.redaction, status, findings } };
}

Stage 4: Split Assignment (dataset splitting)

Use a deterministic algorithm to allocate training/validation/test sets to ensure reproducibility:

import { createHash } from 'node:crypto';

function assignDeterministicSplit(options: {
  seed: number;
  groupKey: string | null;
}): CanonicalSplit {
  const stableKey = `${options.seed}:${options.groupKey ?? 'ungrouped'}`;
  const hash = createHash('sha256').update(stableKey).digest('hex');
  const bucket = parseInt(hash.slice(0, 8), 16) % 100;

  // 80% train / 10% validation / 10% test
  let assignment: 'train' | 'validation' | 'test' = 'train';
  if (bucket >= 80 && bucket < 90) assignment = 'validation';
  if (bucket >= 90) assignment = 'test';

  return { assignment, seed: options.seed, strategy: 'story_hash_v1', group_key: options.groupKey };
}

Multi-format export system

The system supports exporting to a variety of popular SFT training formats:

OpenAI Chat format

// export to OpenAI fine-tuning format
{
  "messages": [
    { "role": "system", "content": "You are a senior coding agent." },
    { "role": "user", "content": "Fix the following SQL injection issue...\n\nCurrent implementation:\ndef query(user):\n  sql = f\"SELECT * FROM users WHERE name = '{user}'\"" },
    { "role": "assistant", "content": "def query(user):\n  sql = \"SELECT * FROM users WHERE name = %s\"\n  return db.execute(sql, (user,))" }
  ],
  "tools": [...],  // optional tool definitions
  "parallel_tool_calls": false
}

HuggingFace Conversational format

// export to HuggingFace conversational format
{
  "system": "You are a senior coding agent.",
  "conversations": [
    { "from": "human", "value": "Fix the following SQL injection issue..." },
    { "from": "gpt", "value": "def query(user):\n  sql = \"SELECT * FROM users WHERE name = %s\"..." }
  ]
}

HuggingFace Tool Calling format

// export to HuggingFace tool-calling format
{
  "system": "You are a coding assistant with tool access.",
  "conversations": [
    { "from": "human", "value": "Analyze the complexity of this code" },
    { "from": "gpt", "value": "", "tool_calls": [...] },
    { "from": "tool", "value": "{\"complexity\": \"O(n^2)\", ...}" },
    { "from": "gpt", "value": "The time complexity of this code is O(n^2)..." }
  ],
  "tools": [...]
}

CLI usage example

# basic extraction, default phase_score >= 90
npx ts-node scripts/sft-extract.ts

# specify the minimum score
npx ts-node scripts/sft-extract.ts --min-score 85

# specify the output path
npx ts-node scripts/sft-extract.ts --output ./custom-sft-data.jsonl

# complete example
npx ts-node scripts/sft-extract.ts \
  --min-score 90 \
  --output ./training-data/sft-v1.0.jsonl

Example of output summary:

Extracted 156 samples covering 12 stories
Skipped 23 samples:
- missing source_path: 10
- git diff failed: 8
- phase_score below threshold: 5

Data construction process: from raw assets to training data

Phase 1: Asset Collection and Classification

Collection Scope:

  • Assessment questions and quality answers
  • Code review record (problem code + feedback + correction)
  • Multi-turn conversation log
  • Project documentation and code
  • Error case analysis

Classification criteria:

DATA_CATEGORIES = {
    "code_generation": {
        "description": "code generation task",
        "sources": ["evaluation_questions", "project_code"],
        "format": "instruction-input-output"
    },
    "code_review": {
        "description": "code review and repair",
        "sources": ["review_feedback", "bug_fixes"],
        "format": "problem-feedback-solution"
    },
    "conversation": {
        "description": "multi-turn dialogue",
        "sources": ["chat_logs"],
        "format": "messages"
    },
    "architecture": {
        "description": "architecture design",
        "sources": ["design_docs", "system_docs"],
        "format": "instruction-context-output"
    }
}

Phase 2: Data Cleaning and Filtering

Cleaning Steps:

  1. Remove duplicates

    def deduplicate(data):
        # deduplicate by code similarity
        # deduplicate by problem description
        # keep the higher-quality version
  2. Quality Screening

    def quality_filter(item):
        # code must pass tests
        # code quality score must exceed the threshold
        # problem-answer mapping must be explicit
        # no sensitive information such as passwords or keys
  3. Format Standardization

    def normalize_format(item, target_format):
        # normalize field names
        # normalize code style
        # add metadata

Filtering criteria:

| Dimension | Standard | Weight |
| --- | --- | --- |
| Functional correctness | Passes all test cases | 40% |
| Code quality | Static analysis score > 0.8 | 25% |
| Security | No high-risk vulnerabilities | 20% |
| Understandability | Clear comments and explanations | 15% |
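To show how this table could collapse into a single keep/drop decision, here is a small sketch; the field names and the 0.8 acceptance threshold are assumptions made for this example, not values prescribed by the pipeline.

interface CandidateScores {
  passesAllTests: boolean;      // functional correctness
  staticAnalysisScore: number;  // 0..1, code quality
  highRiskFindings: number;     // security
  hasExplanation: boolean;      // understandability
}

// Hypothetical weighted filter mirroring the criteria table above.
function passesFilter(c: CandidateScores, threshold = 0.8): boolean {
  const score =
    0.40 * (c.passesAllTests ? 1 : 0) +
    0.25 * (c.staticAnalysisScore > 0.8 ? 1 : 0) +
    0.20 * (c.highRiskFindings === 0 ? 1 : 0) +
    0.15 * (c.hasExplanation ? 1 : 0);
  return score >= threshold;
}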

Phase 3: Data enhancement and augmentation

Enhancement Method 1: Variant Generation

# generate multiple phrasings for the same problem
def generate_variants(original_instruction, n=5):
    variants = []
    # use synonym replacement
    # change sentence structure
    # add or remove detail
    return variants

Enhancement method 2: Difficulty adjustment

# generate versions of the same problem at different difficulty levels
def adjust_difficulty(item, target_difficulty):
    if target_difficulty == "easy":
        # provide more hints
        # reduce constraints
    elif target_difficulty == "hard":
        # add boundary conditions
        # add performance requirements

Enhancement method 3: Cross-language migration

# migrate Python problems to JavaScript, Go, and other languages
def migrate_language(item, target_language):
    # syntax conversion
    # idiom adjustment
    # ecosystem adaptation

Phase 4: Format conversion and export

Universal SFT format:

{
    "dataset_info": {
        "name": "company_coding_sft_v1",
        "version": "1.0.0",
        "created_at": "2024-04-01",
        "total_samples": 10000,
        "categories": {
            "code_generation": 6000,
            "code_review": 2000,
            "conversation": 1500,
            "architecture": 500
        }
    },
    "data": [
        {
            "id": "cg_0001",
            "type": "code_generation",
            "instruction": "...",
            "input": "...",
            "output": "...",
            "metadata": {
                "source": "evaluation_suite",
                "difficulty": "medium",
                "language": "python",
                "quality_score": 0.92,
                "verified": true
            }
        }
    ]
}

Framework-specific formats:

  • Alpaca format
  • ShareGPT format
  • OpenAI fine-tuning format
  • HuggingFace datasets format
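As one illustration of a framework-specific conversion, the sketch below maps the universal sample fields onto the Alpaca-style instruction/input/output layout; the UniversalSample type simply mirrors the JSON example above and is otherwise an assumption.

interface UniversalSample {
  id: string;
  type: string;
  instruction: string;
  input?: string;
  output: string;
  metadata: Record<string, unknown>;
}

// Hypothetical converter to the Alpaca-style layout used by many SFT trainers.
function toAlpaca(samples: UniversalSample[]): Array<{ instruction: string; input: string; output: string }> {
  return samples.map(s => ({
    instruction: s.instruction,
    input: s.input ?? '',
    output: s.output,
  }));
}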

Data quality assessment criteria

Assessment Dimensions

Dimension 1: Accuracy

  • Does the code run correctly?
  • Does it meet the requirement description?
  • Test case pass rate

Dimension 2: Completeness

  • Does it contain the necessary information?
  • Does it handle edge cases?
  • Does it include error handling?

Dimension 3: Consistency

  • Clear correspondence between input and output
  • Unified style
  • Terminology consistent

Dimension 4: Diversity

  • Question type coverage
  • Reasonable difficulty distribution
  • Language/framework coverage

Dimension 5: Security

  • No malicious code
  • No sensitive information leaked
  • Comply with safety regulations

Quality scoring model

def calculate_quality_score(item):
    scores = {
        'correctness': evaluate_correctness(item),      # 40%
        'completeness': evaluate_completeness(item),    # 25%
        'consistency': evaluate_consistency(item),      # 20%
        'safety': evaluate_safety(item),                # 15%
    }

    weights = {
        'correctness': 0.40,
        'completeness': 0.25,
        'consistency': 0.20,
        'safety': 0.15
    }

    total_score = sum(scores[k] * weights[k] for k in scores)
    return total_score

Key points for manual review

Required review items:

  • Code functional correctness (runtime verification)
  • No security vulnerabilities (static scanning)
  • No sensitive information (regex matching)
  • Format compliance (automated checks)

Sampled items (20% sampling):

  • Code quality (human assessment)
  • Teaching value (expert evaluation)
  • Scenario authenticity (confirmed by the business owner)
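The four required items can run as an automated gate before anything reaches the sampled human review. In the sketch below, the check functions are placeholders for the team's own test runner, static scanner, and format validator; only the secret patterns are spelled out, and all names are assumptions.

// Hypothetical gate for the required review items; each check stands in for real tooling.
interface ReviewChecks {
  runsCorrectly: (sample: string) => boolean;   // runtime verification
  staticScanClean: (sample: string) => boolean; // no security findings
  formatValid: (sample: string) => boolean;     // schema / format check
}

const SECRET_PATTERNS = [/sk-[A-Za-z0-9]{16,}/, /BEGIN [A-Z ]+ PRIVATE KEY/];

function passesRequiredReview(sampleText: string, checks: ReviewChecks): boolean {
  if (!checks.runsCorrectly(sampleText)) return false;
  if (!checks.staticScanClean(sampleText)) return false;
  if (SECRET_PATTERNS.some(p => p.test(sampleText))) return false;
  return checks.formatValid(sampleText);
}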

SFT training basics: how to use this data

Training process overview

raw data
  → data preprocessing: cleaning and formatting
  → dataset construction: train/validation/test split
  → SFT training
  → model evaluation
  → model deployment

Simple training example (using HuggingFace)

Source note (March 2026): the training API and LoRA configuration follow the TRL SFTTrainer and PEFT LoRA documentation; the base model ID is a placeholder, because a real project must first review model cards, licenses, training data boundaries, and the team's internal evaluation snapshots.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# 1. Load a reviewed base model (example placeholder, March 2026).
base_model_id = "your-org/code-model-base"
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# 2. Configure LoRA to reduce training cost.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# 3. Load the dataset exported from the pipeline.
dataset = load_dataset("json", data_files="company_coding_sft_v1.json")

# 4. Configure training.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    logging_steps=10,
)

# 5. Run training.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=training_args,
)
trainer.train()

# 6. Save the fine-tuned adapter.
model.save_pretrained("./company_coding_model")

Training suggestions

Data volume recommendations:

  • Minimum: 1,000 high-quality samples
  • Recommended: 10,000+ samples
  • Ideal: 100,000+ samples

Quality > Quantity:

  • 1,000 high-quality samples beat 10,000 low-quality ones
  • Better small and refined than large and noisy

Continuous iteration:

  • First use a small data set to verify feasibility
  • Gradually increase data based on performance
  • Establish a closed loop of data-training-evaluation

Practical case: complete data construction process

Background

A team hopes to build a dedicated code generation model based on 6 months of engineering practice.

Asset inventory

| Asset type | Original quantity | Available quantity | Remark |
| --- | --- | --- | --- |
| Assessment questions | 300 | 250 | After filtering |
| Code review records | 500 | 400 | After deduplication |
| Conversation logs | 1,000 items | 800 items | After filtering |
| Project code | 100,000 lines | 20,000 lines | Selected modules |

Data processing

Step 1: Extraction and Cleaning

# extract from evaluation tasks
sft_items = []
for question in evaluation_questions:
    if question.quality_score > 0.8:
        sft_items.append({
            "instruction": question.description,
            "input": question.requirements,
            "output": question.reference_solution
        })

# extract from code review records
for review in code_reviews:
    if review.severity in ["high", "critical"]:
        sft_items.append({
            "instruction": f"{review.issue_type}",
            "input": review.problematic_code,
            "output": review.fixed_code + "\n\n" + review.explanation
        })

Step 2: Quality Score

for item in sft_items:
    item["quality_score"] = calculate_quality_score(item)

# keep only high-quality samples
high_quality_items = [i for i in sft_items if i["quality_score"] > 0.85]

Step 3: Data enhancement

# data augmentation: add variants of each high-quality sample
augmented_items = []
for item in high_quality_items:
    augmented_items.append(item)
    # generate 3 variants per sample
    for variant in generate_variants(item, n=3):
        augmented_items.append(variant)

Step 4: Manual review

import random

# manual review of a 20% random sample
sample = random.sample(augmented_items, k=len(augmented_items)//5)
for item in sample:
    review_result = manual_review(item)
    if not review_result.approved:
        augmented_items.remove(item)

Final dataset

## Dataset summary

- Total samples: 5,000
- Category distribution:
  - Code generation: 60% (3,000)
  - Code review and repair: 25% (1,250)
  - Architecture reasoning: 10% (500)
  - Architecture design: 5% (250)
- Language distribution:
  - Python: 70%
  - JavaScript: 20%
  - Go: 10%
- Difficulty distribution:
  - Easy: 30%
  - Medium: 50%
  - Hard: 20%
- Average quality score: 0.88

Training results

Baseline model: a code base model the team reviewed in March 2026 against its model card, license, and internal benchmarks
Training method: LoRA fine-tuning
Training time: 4 hours (single A100)

Effect comparison:

| Metric | Baseline model | After fine-tuning | Improvement |
| --- | --- | --- | --- |
| Team internal test set pass rate | 65% | 82% | +17% |
| Code style match rate | 60% | 85% | +25% |
| Security vulnerability rate | 15% | 5% | -10% |
| Team satisfaction | 70% | 90% | +20% |

Data Capitalization: Long-term Operation Suggestions

Establish data collection mechanism

Automatic Collection Points:

  • After every code review, ask if it can be used for training
  • Export conversation logs regularly
  • Screen high-quality modules when archiving project code

Data Pipeline:

engineering practice
  → automatic collection
  → initial filtering
  → quality scoring
  → human review (sampled)
  → training set update
  → model training and update
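A minimal sketch of the "automatic collection" entry point: a post-review hook that appends an opted-in record to a JSONL queue for later filtering. The queue path and record fields are illustrative assumptions, not part of the pipeline's actual API.

import { appendFileSync } from 'node:fs';

// Hypothetical post-review hook: queue a candidate record for later filtering.
interface ReviewRecord {
  reviewId: string;
  problematicCode: string;
  fixedCode: string;
  feedback: string;
  approvedForTraining: boolean;
}

function collectCandidate(record: ReviewRecord, queuePath = './sft-queue/candidates.jsonl'): void {
  if (!record.approvedForTraining) return;  // only collect records the reviewer opted in
  appendFileSync(queuePath, JSON.stringify({ ...record, collectedAt: new Date().toISOString() }) + '\n');
}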

Version management

datasets/
├── v1.0.0_2024q1/
│   ├── train.jsonl
│   ├── validation.jsonl
│   └── metadata.json
├── v1.1.0_2024q2/
│   └── ...
└── v2.0.0_2024q3/
    └── ...
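One possible shape for the metadata.json in each version directory is sketched below; the field names are assumptions that mirror the dataset_info block shown earlier, not a fixed schema.

// Hypothetical manifest shape for each dataset version directory.
interface DatasetManifest {
  name: string;                         // e.g. "company_coding_sft"
  version: string;                      // semantic version, e.g. "1.1.0"
  createdAt: string;                    // ISO date
  totalSamples: number;
  splitCounts: { train: number; validation: number; test: number };
  categories: Record<string, number>;   // per-category sample counts
  sourceDatasetVersion?: string;        // previous version this one extends
}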

Continuous iteration

Monthly: collect new data
Quarterly: update the dataset version
Every six months: retrain the model
Annually: evaluate overall results and adjust strategy


Screening criteria and material selection: building high-quality training sets

When using BMAD-Speckit-SDD-Flow to automatically extract SFT data, it is crucial to establish scientific screening criteria. Here is a proven screening framework:

Three-tier screening system

Three-tier screening and quality gating

Three-layer screening ensures data quality while maintaining diversity, ultimately retaining approximately 50% of high-quality samples.
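A minimal sketch of the three-tier idea, assuming the tiers are hard gates, a score threshold, and a sampled human pass. The tier functions are placeholders for the team's own tooling; the roughly 50% retention comes from how the thresholds are tuned, not from the code itself.

// Hypothetical three-tier screen; each tier narrows the candidate pool.
function screenCandidates<T extends { score: number }>(
  candidates: T[],
  opts: {
    hardRules: (c: T) => boolean;   // tier 1: hard gates (provenance, redaction, vetoes)
    minScore: number;               // tier 2: quality score threshold
    humanReview: (c: T) => boolean; // tier 3: manual review on a sample
    sampleEvery?: number;           // review every Nth surviving candidate
  },
): T[] {
  const { hardRules, minScore, humanReview, sampleEvery = 5 } = opts;
  return candidates
    .filter(hardRules)
    .filter(c => c.score >= minScore)
    .filter((c, i) => i % sampleEvery !== 0 || humanReview(c));
}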

Material selection strategy

1. Scenario-based selection

BMAD-Speckit-SDD-Flow distinguishes different scenarios and gives priority to high-quality scenarios:

| Scenario | Description | Priority | Reason |
| --- | --- | --- | --- |
| real_dev | Real development scenarios | High | Comes from the actual coding process; highest quality |
| evaluation | Assessment scenarios | Medium | Has clear right/wrong criteria |
| synthetic | Synthetic data | Low | May lack authenticity |

2. Selection based on Score Pattern

// high-value score patterns in run records
const HIGH_VALUE_PATTERNS = [
  // high initial score + few iterations = clean one-shot sample
  { initialScore: '>80', iterations: '<=2' },

  // low initial score + high final score + moderate iterations = valuable repair trajectory
  { initialScore: '<60', finalScore: '>90', iterations: '3-5' },

  // veto triggered but finally passed = valuable hard-case sample
  { vetoTriggered: true, finalScore: '>90' },
];

// low-value patterns
const LOW_VALUE_PATTERNS = [
  // many iterations but still a low score = low-quality trajectory
  { iterations: '>5', finalScore: '<80' },

  // no code pair comparison = cannot form an input/output pair
  { hasCodePair: false },
];

3. Balance based on Content Category

// target composition of the training set
const TARGET_COMPOSITION = {
  codeGeneration: 0.50,     // 50% code generation
  codeReview: 0.25,         // 25% code review
  bugFix: 0.15,             // 15% bug fixes
  architecture: 0.10,       // 10% architecture design
};

// subdivide within each category
const CODE_GEN_SUBCATEGORIES = {
  algorithm: 0.30,          // 30% algorithm implementation
  dataStructure: 0.25,      // 25% data structures
  apiDesign: 0.25,          // 25% API design
  utility: 0.20,            // 20% utility functions
};
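The composition targets above are only quotas; one simple way to enforce them is to cap each category at its share of the total, as in the sketch below. The category labels and the grouping function are assumptions for illustration.

// Hypothetical quota-based rebalancing toward TARGET_COMPOSITION.
function rebalance<T>(
  samples: T[],
  categoryOf: (s: T) => string,
  targets: Record<string, number>,   // e.g. { codeGeneration: 0.5, ... }
): T[] {
  const total = samples.length;
  const kept: T[] = [];
  const counts: Record<string, number> = {};
  for (const s of samples) {
    const cat = categoryOf(s);
    const quota = Math.floor((targets[cat] ?? 0) * total);
    counts[cat] = counts[cat] ?? 0;
    if (counts[cat] < quota) {
      kept.push(s);
      counts[cat] += 1;
    }
  }
  return kept;
}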

Quality gate configuration recommendations

Adjust Quality Gates parameters according to different usage scenarios:

Scenario A: Building a basic model (high rigor)

const STRICT_GATES = {
  minScore: 95,             // select only the highest scores
  maxIterations: 2,         // at most two iterations
  requireCodePair: true,    // a code pair is required
  maxTokens: 4096,          // strict token limit
};
// retention rate: roughly 20-30%

Scenario B: Building an augmented dataset (moderately rigorous)

const BALANCED_GATES = {
  minScore: 85,             // a good score is sufficient
  maxIterations: 4,         // allow more iterations
  requireCodePair: false,   // a code pair is optional
  maxTokens: 8192,          // moderate token limit
};
// retention rate: roughly 50-60%

Scenario C: Construct experimental data set (relaxed)

const EXPERIMENTAL_GATES = {
  minScore: 70,             // passing is sufficient
  maxIterations: 10,        // do not limit iterations
  requireCodePair: false,   // do not require a code pair
  maxTokens: 16384,         // relaxed token limit
};
// retention rate: roughly 80-90%

Practical case: complete data generation process

Background

A team has been using BMAD-Speckit-SDD-Flow for AI-assisted development for three months, hoping to build a dedicated code repair model. They accumulated the following raw data:

| Asset type | Original quantity | Source |
| --- | --- | --- |
| Story implementation records | 48 | BMAD-Speckit workflow |
| Evaluation run records | 1,200 items | Scoring system |
| Audit reports | 48 | Audit module |
| Code submissions | 156 commits | Git history |

Step 1: Data extraction

# extract SFT samples for the bug-fix dataset
npx ts-node scripts/sft-extract.ts \
  --min-score 85 \
  --output ./sft-training/bugfix-v1.0.jsonl

# extraction process log
[INFO] Loading scoring records from packages/scoring/data...
[INFO] Found 1,200 scoring records
[INFO] Filtering by scenario: real_dev
[INFO] Filtering by phase_score >= 85
[INFO] Building candidate samples...
[INFO] Applying quality gates...
[INFO] Applying redaction rules...
[WARNING] Blocked 3 samples containing private keys
[INFO] Assigning deterministic splits...
[INFO] Exporting to JSONL...

# output summary
Extracted 312 samples covering 48 stories
- Accepted: 298
- Downgraded: 14
- Rejected: 888
  - phase_score below threshold: 720
  - missing source_path: 68
  - missing code pair comparison: 56
  - redaction blocked: 44

dataset split:
- Train: 238  (76%)
- Validation: 37  (12%)
- Test: 37  (12%)

Step 2: Quality analysis

// quality analysis of the exported dataset
import { analyzeDataset } from './analytics';

const analysis = analyzeDataset('./sft-training/bugfix-v1.0.jsonl');

console.log(analysis.summary);
/*
{
  totalSamples: 312,
  avgPhaseScore: 91.5,
  avgTokenCount: 2847,
  categories: {
    securityFix: 89,      // SQL injection and XSS fixes
    performanceFix: 67,   // algorithm optimization
    styleFix: 76,         // code style fixes
    logicFix: 80          // logic error fixes
  },
  languages: {
    typescript: 180,
    python: 89,
    go: 43
  },
  redactionSummary: {
    clean: 308,
    redacted: 4,          // email addresses were redacted
    blocked: 0            // no blocked samples remain in the export
  }
}
*/

Step 3: Format conversion and export

// convert to training formats
import { exportDataset } from './export';

// export to OpenAI format
await exportDataset({
  input: './sft-training/bugfix-v1.0.jsonl',
  output: './sft-training/openai-format/',
  format: 'openai_chat',
  filter: s => s.quality.acceptance_decision === 'accepted'
});

// export to HuggingFace format
await exportDataset({
  input: './sft-training/bugfix-v1.0.jsonl',
  output: './sft-training/hf-format/',
  format: 'hf_conversational',
  splits: ['train', 'validation', 'test']
});

Step 4: Training and evaluation

# Train with the exported data (illustrative sketch).
from transformers import AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer

# Use a placeholder model ID; real runs must record model card, license, and eval snapshot.
model = AutoModelForCausalLM.from_pretrained("your-org/code-model-base")

# Configure training.
training_args = TrainingArguments(
    output_dir="./bugfix-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
)

# use the exported dataset
dataset = load_dataset("json", data_files={
    "train": "./sft-training/hf-format/train.jsonl",
    "validation": "./sft-training/hf-format/validation.jsonl",
})

# training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=training_args,
)
trainer.train()

Training effect comparison

| Metric | Baseline model | After fine-tuning | Improvement |
| --- | --- | --- | --- |
| Security vulnerability repair success rate | 62% | 88% | +26% |
| Performance problem identification rate | 58% | 82% | +24% |
| Code style compliance rate | 71% | 89% | +18% |
| Average repair recommendation quality | 3.2/5 | 4.3/5 | +1.1 |

Conclusion: Get the training samples right first, and then talk about the training scale

The really difficult part of converting engineering collaboration data into SFT samples is never “what format to export”, but “which samples are worth training and which samples must be rejected”. This requires task contracts, error classification, manual feedback, verification evidence and governance gates to work together.

The core framework given in this article can be summarized into three points:

  1. Build the data contract first, then build the data volume. Samples must be traceable, verifiable, and interpretable before they can enter the training pool.
  2. Do quality gating first, then do automation. Without hard and soft gating, automation only amplifies noise.
  3. Maintain evaluation isolation first, then pursue metric gains. Mixing train and eval data turns "model progress" into a statistical illusion.

If the team is just starting out, the most practical sequence is still: run the complete process end to end on a small sample first, then gradually scale up; raise the bar for sample quality first, then increase training frequency.

This article completes the mid-stage work of going "from delivery trajectories to trainable samples". The final article in the series returns to a longer-term question: once this closed loop runs stably, how should organizations judge where AI programming evaluation and collaboration paradigms are heading, without treating today's processes as tomorrow's ceiling.


References and Acknowledgments

  • InstructGPT — Ouyang et al., OpenAI
  • Flan Collection — Longpre et al., Google
  • BMAD-Speckit-SDD-Flow — BMAD Method Team
  • TRL SFTTrainer — Hugging Face
  • PEFT LoRA — Hugging Face

Series context

You are reading: AI Coding Mentor Series

This is article 8 of 9.


Series Path

Current series chapters


9 chapters
  1. Part 1 Previous in path Why do you need to be a coding mentor for AI? When AI programming assistants become standard equipment, the real competitiveness is no longer whether they can use AI, but whether they can judge, calibrate and constrain the engineering output of AI. This article starts from trust gaps, feedback protocols, evaluation standards and closed-loop capabilities to establish the core framework of "Humans as Coding Mentors".
  2. Part 2 Previous in path Panorama of AI programming ability evaluation: from HumanEval to SWE-bench, the evolution and selection of benchmarks Public benchmarks are not a decoration for model rankings, but a measurement tool for understanding the boundaries of AI programming capabilities. This article starts from benchmarks such as HumanEval, APPS, CodeContests, SWE-bench, LiveCodeBench and Aider, and explains how to read the rankings, how to choose benchmarks, and how to convert public evaluations into the team's own Coding Mentor evaluation system.
  3. Part 3 Previous in path How to design high-quality programming questions: from question surface to evaluation contract High-quality programming questions are not longer prompts, but assessment contracts that can stably expose the boundaries of abilities. This article starts from Bloom level, difficulty calibration, task contract, test design and question bank management to explain how to build a reproducible question system for AI Coding Mentor.
  4. Part 4 Previous in path Four-step approach to AI capability assessment: from one test to continuous system evaluation Serving as a coding mentor for AI is not about doing a model evaluation, but establishing an evaluation operation system that can continuously expose the boundaries of capabilities, record failure evidence, drive special improvements, and support collaborative decision-making.
  5. Part 5 Previous in path Best Practices for Collaborating with AI: Task Agreement, Dialogue Control and Feedback Closed Loop The core skill of being a Coding Mentor for AI is not to write longer prompt words, but to design task protocols, control the rhythm of conversations, identify error patterns, and precipitate the collaboration process into verifiable and reusable feedback signals.
  6. Part 6 Previous in path Practical cases: feedback protocol, evaluation closed loop, code review and programming education data Case studies should not stop at “how to use AI tools better”. This article uses four engineering scenarios: model selection evaluation, feedback protocol design, code review signal precipitation, and programming education data closed loop to explain how humans can transform the AI ​​collaboration process into evaluable, trainable, and reusable mentor signals.
  7. Part 7 Previous in path From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop The real organizational value of AI programming assistants is not just to increase delivery speed, but to precipitate trainable, evaluable, and reusable mentor signals in every requirement disassembly, code generation, review and revision, test verification, and online review. This article reconstructs the closed-loop framework of AI training, AI-assisted product engineering delivery, high-quality SFT data precipitation, and model evaluation.
  8. Part 8 Current From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.
  9. Part 9 Future Outlook: Evolutionary Trends and Long-term Thinking of AI Programming Assessment As the final article in the series, this article reconstructs the future route of AI Coding Mentor from the perspective of engineering decision-making: how evaluation objects evolve, how organizational capabilities are layered, and how governance boundaries are advanced.
