Hualin Luan Cloud Native · Quant Trading · AI Engineering


From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering

Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.

Meta

Published: 3/30/2026
Category: Interpretation
Reading time: 23 min read

Copyright statement and disclaimer: This article is an interpretive synthesis based on InstructGPT, Flan, and other SFT training research, combined with hands-on engineering experience.

Originality: The reverse data generation framework, data quality assessment criteria, and construction process proposed in this article are the author's original work.


Beginning: from "many records" to "trainable samples"

In Part 7 we addressed the closed-loop entry problem: which AI collaboration trajectories are worth keeping, which must be discarded, and which should go into eval. What actually gets teams stuck next is a more specific problem:

These are all "records left behind by AI-assisted delivery". Why can some of them only be used for review, why can some enter the knowledge base, and why do only a few become training samples?

This article does one thing: explain that screening and processing pipeline clearly. The focus is not "how to export logs to JSONL", but how to define the quality boundaries, data contracts, and routing rules for trainable samples in an engineering context.

You can think of this article as a mid-stage pipeline:

  1. The inputs are the engineering assets already governed in Part 7 (task contracts, error types, review feedback, verification evidence).
  2. The middle stage is sample construction, quality gating, redaction, bucketing, export, and versioning.
  3. The outputs are candidate samples that can enter SFT, plus evaluation assets that can be regression-verified.

Therefore, this article does not discuss the strategic question of "should we train a model at all?"; it answers "once the team decides to train, how do we avoid baking low-quality engineering noise into the model's default behavior?"


Overall architecture: automated pipeline from development to training data

Before going into the details, let's look at the overall architecture. Ideally, SFT data is not curated by hand; it is generated automatically as a by-product of the development process:

The total process from project delivery to SFT sample export

This pipeline integrates the BMAD-Speckit-SDD-Flow development process with SFT data extraction to achieve "development as training".

Core concept: What is SFT training data?

Basic concepts of SFT

SFT (supervised fine-tuning) continues training a pre-trained model on labeled examples so that it learns to perform a specific task.

Basic form: input (instruction/question) → output (expected answer)

SFT data for programming scenarios:

{
    "instruction": "Implement a cache class with LRU eviction",
    "input": "Requirements: \n- support get and put operations\n- time complexity O(1)\n- configurable capacity",
    "output": "class LRUCache:\n    def __init__(self, capacity: int):\n        ...",
    "metadata": {
        "difficulty": "medium",
        "source": "engineering_project",
        "verified": true,
        "quality_score": 0.92
    }
}

Why engineering outputs are suitable for conversion to SFT data

Feature 1: Strong authenticity

  • From actual project requirements
  • Tested by practice
  • Not a theoretical hypothesis

Feature 2: Controllable quality

  • Manually reviewed
  • Clear pass/fail criteria
  • A traceable record of revisions

Feature 3: Domain related

  • Align with team technology stack
  • Suitable for business scenarios
  • Comply with coding standards

Feature 4: Continuous generation

  • Every project is generating new data
  • Data volume grows over time
  • Models can be continuously updated

Convertible Engineering Asset Types

Asset Type 1: Assessment Question Set

Source: questions accumulated while setting assessment tasks for the AI

Conversion method:

Original problem: description plus test cases

High-quality answer: human-written or selected AI answer

SFT data: instruction-input-output tuple

Example:

{
    "instruction": "Write a function that finds the kth largest element in an array",
    "input": "Requirements: \n- function signature: find_kth_largest(nums: List[int], k: int) -> int\n- time complexity: O(n) or O(n log n)\n- do not use built-in sorting",
    "output": "def find_kth_largest(nums: List[int], k: int) -> int:\n    def partition(left, right, pivot_idx):\n        ...",
    "metadata": {
        "algorithm": "quickselect",
        "difficulty": "medium",
        "source": "evaluation_suite",
        "pass_rate": 0.75
    }
}

Asset Type 2: Code Review Feedback

Source: Review comments on AI generated code

Conversion method:

Problematic code generated by AI

Review feedback: issues plus improvement suggestions

Corrected high-quality code after human revision

SFT data for training code-repair capability

Example:

{
    "instruction": "Fix the security issue in the following code",
    "input": "```python\ndef query_user(username):\n    sql = f\"SELECT * FROM users WHERE name = '{username}'\"\n    return db.execute(sql)\n```",
    "output": "```python\ndef query_user(username):\n    sql = \"SELECT * FROM users WHERE name = %s\"\n    return db.execute(sql, (username,))\n```\n\nFix explanation: \n1. use parameterized queries to prevent SQL injection\n2. never concatenate user input into SQL",
    "metadata": {
        "issue_type": "security",
        "severity": "high",
        "source": "code_review",
        " CWE_id": "CWE-89"
    }
}

Asset Type 3: Multi-turn conversation records

Source: full conversations from collaborating with AI to solve problems

Conversion method:

Full conversation: question -> AI answer -> feedback -> improvement ->...

Extract high-quality turns

SFT data for training dialogue capability

Example:

{
    "messages": [
        {
            "role": "user",
            "content": "Implement a thread-safe singleton pattern"
        },
        {
            "role": "assistant",
            "content": "class Singleton:\n    _instance = None\n    \n    def __new__(cls):\n        if cls._instance is None:\n            cls._instance = super().__new__(cls)\n        return cls._instance"
        },
        {
            "role": "user",
            "content": "This implementation is unsafe in a multithreaded environment. If two threads see _instance as None at the same time, two instances can be created."
        },
        {
            "role": "assistant",
            "content": "class Singleton:\n    _instance = None\n    _lock = threading.Lock()\n    \n    def __new__(cls):\n        if cls._instance is None:\n            with cls._lock:\n                if cls._instance is None:\n                    cls._instance = super().__new__(cls)\n        return cls._instance"
        }
    ],
    "metadata": {
        "conversation_type": "iterative_refinement",
        "topic": "concurrency",
        "source": "chat_logs"
    }
}

Asset Type 4: Successful Project Cases

Source: Complete project code completed with AI assistance

Conversion method:

Project code: modular, commented, and high quality

Break down into teaching cases

SFT data for training complex task decomposition

Example:

{
    "instruction": "Design and implement a JWT authentication system",
    "input": "Requirements: \n1. support token generation and verification\n2. support token refresh\n3. support token revocation\n4. use Redis to store sessions\n\nimplement it in the following steps: \n1. design the data model\n2. implement core functionality\n3. add error handling\n4. write test cases",
    "output": "## Step 1: data model design\n...\n\n## Step 2: core functionality implementation\n...\n\n## Step 3: error handling\n...\n\n## Step 4: test cases\n...",
    "metadata": {
        "project_type": "authentication_system",
        "complexity": "high",
        "source": "successful_project",
        "verified_in_production": true
    }
}

BMAD-Speckit-SDD-Flow architecture: automated SFT data generation system

In previous chapters, we discussed how to extract SFT data from engineering assets by hand. Manual methods, however, are inefficient and hard to scale. This section introduces the BMAD-Speckit-SDD-Flow architecture in depth: a systematic solution that automatically generates SFT training data during AI engineering development.

Architecture design concept

BMAD-Speckit-SDD-Flow combines the BMAD method (multi-agent agile development) with Spec-Driven Development to automatically capture and transform training data at the following points in the workflow:

BMAD-Speckit-SDD-Flow to SFT architecture mapping

Core data model: CanonicalSftSample

The core of the system is a standardized data model CanonicalSftSample, which unifies training data from all sources into a standardized format:

interface CanonicalSftSample {
  // unique sample identifier
  sample_id: string;
  sample_version: 'v1';

  // data source tracking
  source: {
    run_id: string;              // execution run ID
    stage: string;               // development stage
    flow: string;                // workflow type
    epic_id?: string;            // owning epic
    story_id?: string;           // owning story
    artifact_refs: Array<{       // original artifact references
      path: string;
      content_hash: string;
      kind: string;
    }>;
  };

  // messages compatible with OpenAI format
  messages: Array<{
    role: 'system' | 'user' | 'assistant' | 'tool';
    content: string;
    tool_calls?: ToolCall[];
    tool_call_id?: string;
    weight?: 0 | 1;
  }>;

  // tool definitions for tool-calling training
  tools?: Tool[];

  // metadata
  metadata: {
    schema_targets: string[];    // target formats
    language: string;            // language
    tags?: string[];
    notes?: string[];
  };

  // quality evaluation
  quality: {
    acceptance_decision: 'accepted' | 'rejected' | 'downgraded';
    phase_score: number | null;
    dimension_scores?: Record<string, number>;
    veto_triggered: boolean;
    iteration_count: number;
    has_code_pair: boolean;
    token_estimate: number;
    rejection_reasons: string[];
    warnings: string[];
  };

  // data source traceability
  provenance: {
    base_commit_hash: string | null;
    content_hash: string | null;
    source_path: string | null;
    patch_ref: string | null;
    lineage: string[];
    generated_at: string;
  };

  // dataset split
  split: {
    assignment: 'train' | 'validation' | 'test' | 'holdout';
    seed: number;
    strategy: string;
    group_key: string | null;
  };

  // data redaction information
  redaction: {
    status: 'clean' | 'redacted' | 'blocked';
    applied_rules: string[];
    findings: Array<{
      kind: string;
      severity: 'low' | 'medium' | 'high' | 'critical';
      field_path: string;
      action?: string;
    }>;
    redacted_fields: string[];
  };

  // export compatibility
  export_compatibility: {
    openai_chat: ExportDecision;
    hf_conversational: ExportDecision;
    hf_tool_calling: ExportDecision;
  };
}

Detailed explanation of SFT data extraction pipeline

The data extraction pipeline consists of four core stages:

Four stages of SFT candidate sample extraction

Stage 1: Candidate Builder (candidate construction)

// build candidate samples from run records
function buildCanonicalSample(
  record: RunScoreRecord,        // evaluation run record
  sourceContent: string,          // original artifact content
  codePair: { input: string; output: string },  // code pair comparison
  options: BuildOptions
): CanonicalSftSample {

  // 1. extract the instruction from the audit report: section 1 problem description plus section 4 fix plan
  const instruction = extractInstruction(sourceContent);

  // 2. build conversation messages
  const messages = buildCanonicalMessages(
    instruction,
    codePair.input,    // code before the change
    codePair.output    // code after the change
  );

  // 3. calculate a deterministic dataset split based on the story hash
  // (parseStoryRef is an assumed helper that pulls epic/story ids from the run record)
  const parsedStory = parseStoryRef(record);
  const split = assignDeterministicSplit({
    seed: options.splitSeed ?? 42,
    groupKey: parsedStory
      ? `epic-${parsedStory.epicId}/story-${parsedStory.storyId}`
      : record.run_id,
  });

  // 4. build the complete sample
  return {
    sample_id: buildCanonicalSampleId({...}),
    source: { run_id, stage, epic_id, story_id, artifact_refs },
    messages,
    quality: { phase_score, iteration_count, has_code_pair, ... },
    provenance: { base_commit_hash, content_hash, patch_ref, ... },
    split,
    // ... other fields
  };
}

Key Features:

  • Git Diff Extraction: Automatically extract code changes from base_commit to the current HEAD
  • Instruction Extraction: Extract training instructions from the standard sections of the audit report (§1 Questions, §4 Plans)
  • Caching mechanism: avoid repeated construction and improve performance
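To make the first two features concrete, here is a minimal sketch under assumed helpers and report layout: extractInstruction keeps only the problem-description and fix-plan sections of the audit report, and extractCodePair reads the before/after contents of a file from git. The section markers, helper names, and file-level granularity are illustrative assumptions, not the project's actual implementation.

import { execFileSync } from 'node:child_process';

// Sketch only: keep the "problem description" (§1) and "fix plan" (§4) sections of an audit report.
// The section numbering mirrors the convention described above and is an assumption.
function extractInstruction(auditReport: string): string {
  const sections = auditReport.split(/^## /m);
  const wanted = sections.filter(s => /^\s*(1|4)\./.test(s));
  return wanted.map(s => `## ${s.trim()}`).join('\n\n');
}

// Sketch only: read the before/after contents of one file between base_commit and HEAD.
function extractCodePair(baseCommit: string, filePath: string): { input: string; output: string } {
  const input = execFileSync('git', ['show', `${baseCommit}:${filePath}`], { encoding: 'utf8' });
  const output = execFileSync('git', ['show', `HEAD:${filePath}`], { encoding: 'utf8' });
  return { input, output };
}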

Stage 2: Quality Gates (Quality Gate Control)

The quality gating system evaluates candidate samples in multiple dimensions and decides whether to accept:

interface QualityGateOptions {
  minScore?: number;          // minimum score threshold, default 90
  maxIterations?: number;     // maximum iteration count
  maxTokens?: number;         // maximum token count
  requireCodePair?: boolean;  // whether a code pair is required
}

function applyQualityGates(
  sample: CanonicalSftSample,
  options: QualityGateOptions
): CanonicalSftSample {
  const minScore = options.minScore ?? 90;                 // default threshold from QualityGateOptions
  const maxIterations = options.maxIterations ?? Infinity; // no iteration limit unless configured
  const hardReasons: string[] = [];  // hard rejection reasons
  const softReasons: string[] = [];  // soft warning reasons

  // hard checks
  if (!sample.provenance.base_commit_hash) {
    hardReasons.push('prov_missing_hash');
  }
  if ((sample.quality.phase_score ?? 0) < minScore) {
    hardReasons.push('score_below_floor');
  }
  if (sample.quality.veto_triggered) {
    hardReasons.push('veto_triggered');  // critical audit item vetoed
  }
  if (sample.redaction.status === 'blocked') {
    hardReasons.push('redaction_blocked');  // redaction blocked
  }

  // soft checks
  if (sample.quality.iteration_count > maxIterations) {
    softReasons.push('too_many_iterations');
  }
  if (!sample.quality.has_code_pair) {
    softReasons.push('missing_code_pair');
  }

  // decision: accepted / rejected / downgraded
  const acceptanceDecision =
    hardReasons.length > 0 ? 'rejected' :
    softReasons.length > 0 ? 'downgraded' : 'accepted';

  return { ...sample, quality: { ...sample.quality,
    acceptance_decision: acceptanceDecision,
    rejection_reasons: [...hardReasons, ...softReasons]
  }};
}

Quality dimensions:

| Check item | Type | Description |
| --- | --- | --- |
| Source integrity | Hard | Must have a commit hash and source path |
| Score threshold | Hard | phase_score >= 90 (configurable) |
| Veto trigger | Hard | A critical audit item failed |
| Redaction block | Hard | Sensitive information such as private keys detected |
| Message integrity | Hard | Must contain both user and assistant messages |
| Iteration count | Soft | Downgraded beyond maxIterations |
| Code pair | Soft | Downgraded when there is no input/output code pair |

Stage 3: Redaction (data desensitization)

Automatically detect and redact sensitive information to keep the training data safe:

// redaction rules
const REDACTION_RULES = {
  // email address -> medium risk, redact it
  email: {
    pattern: /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi,
    severity: 'medium',
    action: 'redact',  // replace with [REDACTED_EMAIL]
  },

  // Secret Token -> high risk, redact it
  secretToken: {
    pattern: /\bsk-[A-Za-z0-9]{16,}\b/g,  // OpenAI API keys and similar secrets
    severity: 'high',
    action: 'redact',  // replace with [REDACTED_SECRET]
  },

  // private key -> critical risk, block the sample
  privateKey: {
    pattern: /BEGIN [A-Z ]+ PRIVATE KEY/,
    severity: 'critical',
    action: 'block',  // sample is rejected
  },
};

function applyCanonicalRedaction(sample: CanonicalSftSample): CanonicalSftSample {
  let status: 'clean' | 'redacted' | 'blocked' = 'clean';
  const findings: RedactionFinding[] = [];

  const messages = sample.messages.map((message, index) => {
    let content = message.content;

    // apply each rule
    for (const [ruleName, rule] of Object.entries(REDACTION_RULES)) {
      rule.pattern.lastIndex = 0;  // global regexes are stateful; reset before re-testing
      if (rule.pattern.test(content)) {
        if (rule.action === 'block') {
          status = 'blocked';
          findings.push({ kind: ruleName, severity: rule.severity, ... });
        } else if (rule.action === 'redact') {
          status = status === 'blocked' ? 'blocked' : 'redacted';
          content = content.replace(rule.pattern, `[REDACTED_${ruleName.toUpperCase()}]`);
          findings.push({ kind: ruleName, severity: rule.severity, action: 'redact' });
        }
      }
    }

    return { ...message, content };
  });

  return { ...sample, messages, redaction: { ...sample.redaction, status, findings } };
}

Stage 4: Split Assignment (dataset splitting)

Use a deterministic algorithm to allocate training/validation/test sets to ensure reproducibility:

import { createHash } from 'node:crypto';

function assignDeterministicSplit(options: {
  seed: number;
  groupKey: string | null;
}): CanonicalSplit {
  const stableKey = `${options.seed}:${options.groupKey ?? 'ungrouped'}`;
  const hash = createHash('sha256').update(stableKey).digest('hex');
  const bucket = parseInt(hash.slice(0, 8), 16) % 100;

  // 80% train / 10% validation / 10% test
  let assignment: 'train' | 'validation' | 'test' = 'train';
  if (bucket >= 80 && bucket < 90) assignment = 'validation';
  if (bucket >= 90) assignment = 'test';

  return { assignment, seed: options.seed, strategy: 'story_hash_v1', group_key: options.groupKey };
}

Multi-format export system

The system supports exporting to a variety of popular SFT training formats:

OpenAI Chat format

// export to OpenAI fine-tuning format
{
  "messages": [
    { "role": "system", "content": "You are a senior coding agent." },
    { "role": "user", "content": "Fix the following SQL injection issue...\n\nCurrent implementation:\ndef query(user):\n  sql = f\"SELECT * FROM users WHERE name = '{user}'\"" },
    { "role": "assistant", "content": "def query(user):\n  sql = \"SELECT * FROM users WHERE name = %s\"\n  return db.execute(sql, (user,))" }
  ],
  "tools": [...],  // optional tool definitions
  "parallel_tool_calls": false
}

HuggingFace Conversational format

// export to HuggingFace conversational format
{
  "system": "You are a senior coding agent.",
  "conversations": [
    { "from": "human", "value": "Fix the following SQL injection issue..." },
    { "from": "gpt", "value": "def query(user):\n  sql = \"SELECT * FROM users WHERE name = %s\"..." }
  ]
}

HuggingFace Tool Calling format

// export to HuggingFace tool-calling format
{
  "system": "You are a coding assistant with tool access.",
  "conversations": [
    { "from": "human", "value": "Analyze the complexity of this code" },
    { "from": "gpt", "value": "", "tool_calls": [...] },
    { "from": "tool", "value": "{\"complexity\": \"O(n^2)\", ...}" },
    { "from": "gpt", "value": "The time complexity of this code is O(n^2)..." }
  ],
  "tools": [...]
}

CLI usage example

# basic extraction, default phase_score >= 90
npx ts-node scripts/sft-extract.ts

# specify the minimum score
npx ts-node scripts/sft-extract.ts --min-score 85

# specify the output path
npx ts-node scripts/sft-extract.ts --output ./custom-sft-data.jsonl

# complete example
npx ts-node scripts/sft-extract.ts \
  --min-score 90 \
  --output ./training-data/sft-v1.0.jsonl

Example of output summary:

Extracted 156 samples covering 12 stories
Skipped 23 samples:
- missing source_path: 10
- git diff failed: 8
- phase_score below threshold: 5

Data construction process: from raw assets to training data

Phase 1: Asset Collection and Classification

Collection Scope:

  • Assessment questions and quality answers
  • Code review record (problem code + feedback + correction)
  • Multi-turn conversation log
  • Project documentation and code
  • Error case analysis

Classification criteria:

DATA_CATEGORIES = {
    "code_generation": {
        "description": "code generation task",
        "sources": ["evaluation_questions", "project_code"],
        "format": "instruction-input-output"
    },
    "code_review": {
        "description": "code review and repair",
        "sources": ["review_feedback", "bug_fixes"],
        "format": "problem-feedback-solution"
    },
    "conversation": {
        "description": "multi-turn dialogue",
        "sources": ["chat_logs"],
        "format": "messages"
    },
    "architecture": {
        "description": "architecture design",
        "sources": ["design_docs", "system_docs"],
        "format": "instruction-context-output"
    }
}

Phase 2: Data Cleaning and Filtering

Cleaning Steps:

  1. Remove duplicates

    def deduplicate(data):
        # deduplicate by code similarity
        # deduplicate by problem description
        # keep the higher-quality version
  2. Quality Screening

    def quality_filter(item):
        # code must pass tests
        # code quality score must exceed the threshold
        # problem-answer mapping must be explicit
        # no sensitive information such as passwords or keys
  3. Format Standardization

    def normalize_format(item, target_format):
        # normalize field names
        # normalize code style
        # add metadata

Filtering criteria:

| Dimension | Standard | Weight |
| --- | --- | --- |
| Functional correctness | Passes all test cases | 40% |
| Code quality | Static analysis score > 0.8 | 25% |
| Security | No high-risk vulnerabilities | 20% |
| Understandability | Clear comments and explanations | 15% |
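To show how this table could collapse into a single keep/drop decision, here is a small sketch; the field names and the 0.8 acceptance threshold are assumptions made for this example, not values prescribed by the pipeline.

interface CandidateScores {
  passesAllTests: boolean;      // functional correctness
  staticAnalysisScore: number;  // 0..1, code quality
  highRiskFindings: number;     // security
  hasExplanation: boolean;      // understandability
}

// Hypothetical weighted filter mirroring the criteria table above.
function passesFilter(c: CandidateScores, threshold = 0.8): boolean {
  const score =
    0.40 * (c.passesAllTests ? 1 : 0) +
    0.25 * (c.staticAnalysisScore > 0.8 ? 1 : 0) +
    0.20 * (c.highRiskFindings === 0 ? 1 : 0) +
    0.15 * (c.hasExplanation ? 1 : 0);
  return score >= threshold;
}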

Phase 3: Data enhancement and augmentation

Enhancement Method 1: Variant Generation

# generate multiple phrasings for the same problem
def generate_variants(original_instruction, n=5):
    variants = []
    # use synonym replacement
    # change sentence structure
    # add or remove detail
    return variants

Enhancement method 2: Difficulty adjustment

# generate versions of the same problem at different difficulty levels
def adjust_difficulty(item, target_difficulty):
    if target_difficulty == "easy":
        # provide more hints
        # reduce constraints
    elif target_difficulty == "hard":
        # add boundary conditions
        # add performance requirements

Enhancement method 3: Cross-language migration

# migrate Python problems to JavaScript, Go, and other languages
def migrate_language(item, target_language):
    # syntax conversion
    # idiom adjustment
    # ecosystem adaptation

Phase 4: Format conversion and export

Universal SFT format:

{
    "dataset_info": {
        "name": "company_coding_sft_v1",
        "version": "1.0.0",
        "created_at": "2024-04-01",
        "total_samples": 10000,
        "categories": {
            "code_generation": 6000,
            "code_review": 2000,
            "conversation": 1500,
            "architecture": 500
        }
    },
    "data": [
        {
            "id": "cg_0001",
            "type": "code_generation",
            "instruction": "...",
            "input": "...",
            "output": "...",
            "metadata": {
                "source": "evaluation_suite",
                "difficulty": "medium",
                "language": "python",
                "quality_score": 0.92,
                "verified": true
            }
        }
    ]
}

Framework-specific formats:

  • Alpaca format
  • ShareGPT format
  • OpenAI fine-tuning format
  • HuggingFace datasets format
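As one illustration of a framework-specific conversion, the sketch below maps the universal sample fields onto the Alpaca-style instruction/input/output layout; the UniversalSample type simply mirrors the JSON example above and is otherwise an assumption.

interface UniversalSample {
  id: string;
  type: string;
  instruction: string;
  input?: string;
  output: string;
  metadata: Record<string, unknown>;
}

// Hypothetical converter to the Alpaca-style layout used by many SFT trainers.
function toAlpaca(samples: UniversalSample[]): Array<{ instruction: string; input: string; output: string }> {
  return samples.map(s => ({
    instruction: s.instruction,
    input: s.input ?? '',
    output: s.output,
  }));
}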

Data quality assessment criteria

Assessment Dimensions

Dimension 1: Accuracy

  • Does the code run correctly?
  • Does it meet the requirement description?
  • Test case pass rate

Dimension 2: Completeness

  • Does it contain the necessary information?
  • Does it handle edge cases?
  • Does it include error handling?

Dimension 3: Consistency

  • Clear correspondence between input and output
  • Unified style
  • Terminology consistent

Dimension 4: Diversity

  • Question type coverage
  • Reasonable difficulty distribution
  • Language/framework coverage

Dimension 5: Security

  • No malicious code
  • No sensitive information leaked
  • Comply with safety regulations

Quality scoring model

def calculate_quality_score(item):
    scores = {
        'correctness': evaluate_correctness(item),      # 40%
        'completeness': evaluate_completeness(item),    # 25%
        'consistency': evaluate_consistency(item),      # 20%
        'safety': evaluate_safety(item),                # 15%
    }

    weights = {
        'correctness': 0.40,
        'completeness': 0.25,
        'consistency': 0.20,
        'safety': 0.15
    }

    total_score = sum(scores[k] * weights[k] for k in scores)
    return total_score

Key points for manual review

Required review items:

  • Code functional correctness (runtime verification)
  • No security vulnerabilities (static scanning)
  • No sensitive information (regex matching)
  • Format compliance (automated checks)

Sampled items (20% sampling):

  • Code quality (human assessment)
  • Teaching value (expert evaluation)
  • Scenario authenticity (confirmed by the business owner)
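The four required items can run as an automated gate before anything reaches the sampled human review. In the sketch below, the check functions are placeholders for the team's own test runner, static scanner, and format validator; only the secret patterns are spelled out, and all names are assumptions.

// Hypothetical gate for the required review items; each check stands in for real tooling.
interface ReviewChecks {
  runsCorrectly: (sample: string) => boolean;   // runtime verification
  staticScanClean: (sample: string) => boolean; // no security findings
  formatValid: (sample: string) => boolean;     // schema / format check
}

const SECRET_PATTERNS = [/sk-[A-Za-z0-9]{16,}/, /BEGIN [A-Z ]+ PRIVATE KEY/];

function passesRequiredReview(sampleText: string, checks: ReviewChecks): boolean {
  if (!checks.runsCorrectly(sampleText)) return false;
  if (!checks.staticScanClean(sampleText)) return false;
  if (SECRET_PATTERNS.some(p => p.test(sampleText))) return false;
  return checks.formatValid(sampleText);
}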

SFT training basics: how to use this data

Training process overview

raw data
  → data preprocessing: cleaning and formatting
  → dataset construction: train/validation/test split
  → SFT training
  → model evaluation
  → model deployment

Simple training example (using HuggingFace)

Source note (March 2026): the training API and LoRA configuration follow the TRL SFTTrainer and PEFT LoRA documentation; the base model ID is a placeholder, because a real project must first review model cards, licenses, training data boundaries, and the team's internal evaluation snapshots.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# 1. Load a reviewed base model (example placeholder, March 2026).
base_model_id = "your-org/code-model-base"
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# 2. Configure LoRA to reduce training cost.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# 3. Load the dataset exported from the pipeline.
dataset = load_dataset("json", data_files="company_coding_sft_v1.json")

# 4. Configure training.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    logging_steps=10,
)

# 5. Run training.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=training_args,
)
trainer.train()

# 6. Save the fine-tuned adapter.
model.save_pretrained("./company_coding_model")

Training suggestions

Data volume recommendations:

  • Minimum: 1,000 high-quality samples
  • Recommended: 10,000+ samples
  • Ideal: 100,000+ samples

Quality > Quantity:

  • 1,000 high-quality samples beat 10,000 low-quality ones
  • Better small and refined than large and noisy

Continuous iteration:

  • First use a small data set to verify feasibility
  • Gradually increase data based on performance
  • Establish a closed loop of data-training-evaluation

Practical case: complete data construction process

Background

A team hopes to build a dedicated code generation model based on 6 months of engineering practice.

Asset inventory

| Asset type | Original quantity | Available quantity | Remark |
| --- | --- | --- | --- |
| Assessment questions | 300 | 250 | After filtering |
| Code review records | 500 | 400 | After deduplication |
| Conversation logs | 1,000 items | 800 items | After filtering |
| Project code | 100,000 lines | 20,000 lines | Selected modules |

Data processing

Step 1: Extraction and Cleaning

# extract from evaluation tasks
sft_items = []
for question in evaluation_questions:
    if question.quality_score > 0.8:
        sft_items.append({
            "instruction": question.description,
            "input": question.requirements,
            "output": question.reference_solution
        })

# extract from code review records
for review in code_reviews:
    if review.severity in ["high", "critical"]:
        sft_items.append({
            "instruction": f"{review.issue_type}",
            "input": review.problematic_code,
            "output": review.fixed_code + "\n\n" + review.explanation
        })

Step 2: Quality Score

for item in sft_items:
    item["quality_score"] = calculate_quality_score(item)

# keep only high-quality samples
high_quality_items = [i for i in sft_items if i["quality_score"] > 0.85]

Step 3: Data enhancement

# data augmentation: add variants of each high-quality sample
augmented_items = []
for item in high_quality_items:
    augmented_items.append(item)
    # generate 3 variants per sample
    for variant in generate_variants(item, n=3):
        augmented_items.append(variant)

Step 4: Manual review

import random

# manual review of a 20% random sample
sample = random.sample(augmented_items, k=len(augmented_items)//5)
for item in sample:
    review_result = manual_review(item)
    if not review_result.approved:
        augmented_items.remove(item)

Final dataset

## Dataset summary

- Total samples: 5,000
- Category distribution:
  - Code generation: 60% (3,000)
  - Code review and repair: 25% (1,250)
  - Architecture reasoning: 10% (500)
  - Architecture design: 5% (250)
- Language distribution:
  - Python: 70%
  - JavaScript: 20%
  - Go: 10%
- Difficulty distribution:
  - Easy: 30%
  - Medium: 50%
  - Hard: 20%
- Average quality score: 0.88

Training results

Baseline model: a code base model the team reviewed in March 2026 against its model card, license, and internal benchmarks
Training method: LoRA fine-tuning
Training time: 4 hours (single A100)

Effect comparison:

| Metric | Baseline model | After fine-tuning | Improvement |
| --- | --- | --- | --- |
| Team internal test set pass rate | 65% | 82% | +17% |
| Code style match rate | 60% | 85% | +25% |
| Security vulnerability rate | 15% | 5% | -10% |
| Team satisfaction | 70% | 90% | +20% |

Data Capitalization: Long-term Operation Suggestions

Establish data collection mechanism

Automatic Collection Points:

  • After every code review, ask if it can be used for training
  • Export conversation logs regularly
  • Screen high-quality modules when archiving project code

Data Pipeline:

engineering practice
  → automatic collection
  → initial filtering
  → quality scoring
  → human review (sampled)
  → training set update
  → model training and update
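A minimal sketch of the "automatic collection" entry point: a post-review hook that appends an opted-in record to a JSONL queue for later filtering. The queue path and record fields are illustrative assumptions, not part of the pipeline's actual API.

import { appendFileSync } from 'node:fs';

// Hypothetical post-review hook: queue a candidate record for later filtering.
interface ReviewRecord {
  reviewId: string;
  problematicCode: string;
  fixedCode: string;
  feedback: string;
  approvedForTraining: boolean;
}

function collectCandidate(record: ReviewRecord, queuePath = './sft-queue/candidates.jsonl'): void {
  if (!record.approvedForTraining) return;  // only collect records the reviewer opted in
  appendFileSync(queuePath, JSON.stringify({ ...record, collectedAt: new Date().toISOString() }) + '\n');
}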

Version management

datasets/
├── v1.0.0_2024q1/
│   ├── train.jsonl
│   ├── validation.jsonl
│   └── metadata.json
├── v1.1.0_2024q2/
│   └── ...
└── v2.0.0_2024q3/
    └── ...
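One possible shape for the metadata.json in each version directory is sketched below; the field names are assumptions that mirror the dataset_info block shown earlier, not a fixed schema.

// Hypothetical manifest shape for each dataset version directory.
interface DatasetManifest {
  name: string;                         // e.g. "company_coding_sft"
  version: string;                      // semantic version, e.g. "1.1.0"
  createdAt: string;                    // ISO date
  totalSamples: number;
  splitCounts: { train: number; validation: number; test: number };
  categories: Record<string, number>;   // per-category sample counts
  sourceDatasetVersion?: string;        // previous version this one extends
}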

Continuous iteration

Monthly: collect new data
Quarterly: update the dataset version
Every six months: retrain the model
Annually: evaluate overall results and adjust strategy


Screening criteria and material selection: building high-quality training sets

When using BMAD-Speckit-SDD-Flow to automatically extract SFT data, it is crucial to establish scientific screening criteria. Here is a proven screening framework:

Three-tier screening system

Three-tier screening and quality gating

Three-layer screening ensures data quality while maintaining diversity, ultimately retaining approximately 50% of high-quality samples.
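A minimal sketch of the three-tier idea, assuming the tiers are hard gates, a score threshold, and a sampled human pass. The tier functions are placeholders for the team's own tooling; the roughly 50% retention comes from how the thresholds are tuned, not from the code itself.

// Hypothetical three-tier screen; each tier narrows the candidate pool.
function screenCandidates<T extends { score: number }>(
  candidates: T[],
  opts: {
    hardRules: (c: T) => boolean;   // tier 1: hard gates (provenance, redaction, vetoes)
    minScore: number;               // tier 2: quality score threshold
    humanReview: (c: T) => boolean; // tier 3: manual review on a sample
    sampleEvery?: number;           // review every Nth surviving candidate
  },
): T[] {
  const { hardRules, minScore, humanReview, sampleEvery = 5 } = opts;
  return candidates
    .filter(hardRules)
    .filter(c => c.score >= minScore)
    .filter((c, i) => i % sampleEvery !== 0 || humanReview(c));
}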

Material selection strategy

1. Scenario-based selection

BMAD-Speckit-SDD-Flow distinguishes different scenarios and gives priority to high-quality scenarios:

| Scenario | Description | Priority | Reason |
| --- | --- | --- | --- |
| real_dev | Real development scenarios | High | Comes from the actual coding process; highest quality |
| evaluation | Assessment scenarios | Medium | Has clear right/wrong criteria |
| synthetic | Synthetic data | Low | May lack authenticity |

2. Selection based on Score Pattern

// high-value score patterns in run records
const HIGH_VALUE_PATTERNS = [
  // high initial score + few iterations = clean one-shot sample
  { initialScore: '>80', iterations: '<=2' },

  // low initial score + high final score + moderate iterations = valuable repair trajectory
  { initialScore: '<60', finalScore: '>90', iterations: '3-5' },

  // veto triggered but finally passed = valuable hard-case sample
  { vetoTriggered: true, finalScore: '>90' },
];

// low-value patterns
const LOW_VALUE_PATTERNS = [
  // many iterations but still a low score = low-quality trajectory
  { iterations: '>5', finalScore: '<80' },

  // no code pair comparison = cannot form an input/output pair
  { hasCodePair: false },
];

3. Balance based on Content Category

// target composition of the training set
const TARGET_COMPOSITION = {
  codeGeneration: 0.50,     // 50% code generation
  codeReview: 0.25,         // 25% code review
  bugFix: 0.15,             // 15% bug fixes
  architecture: 0.10,       // 10% architecture design
};

// subdivide within each category
const CODE_GEN_SUBCATEGORIES = {
  algorithm: 0.30,          // 30% algorithm implementation
  dataStructure: 0.25,      // 25% data structures
  apiDesign: 0.25,          // 25% API design
  utility: 0.20,            // 20% utility functions
};
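The composition targets above are only quotas; one simple way to enforce them is to cap each category at its share of the total, as in the sketch below. The category labels and the grouping function are assumptions for illustration.

// Hypothetical quota-based rebalancing toward TARGET_COMPOSITION.
function rebalance<T>(
  samples: T[],
  categoryOf: (s: T) => string,
  targets: Record<string, number>,   // e.g. { codeGeneration: 0.5, ... }
): T[] {
  const total = samples.length;
  const kept: T[] = [];
  const counts: Record<string, number> = {};
  for (const s of samples) {
    const cat = categoryOf(s);
    const quota = Math.floor((targets[cat] ?? 0) * total);
    counts[cat] = counts[cat] ?? 0;
    if (counts[cat] < quota) {
      kept.push(s);
      counts[cat] += 1;
    }
  }
  return kept;
}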

Quality gate configuration recommendations

Adjust Quality Gates parameters according to different usage scenarios:

Scenario A: Building a basic model (high rigor)

const STRICT_GATES = {
  minScore: 95,             // select only the highest scores
  maxIterations: 2,         // at most two iterations
  requireCodePair: true,    // a code pair is required
  maxTokens: 4096,          // strict token limit
};
// retention rate: roughly 20-30%

Scenario B: Building an augmented dataset (moderately rigorous)

const BALANCED_GATES = {
  minScore: 85,             // a good score is sufficient
  maxIterations: 4,         // allow more iterations
  requireCodePair: false,   // a code pair is optional
  maxTokens: 8192,          // moderate token limit
};
// retention rate: roughly 50-60%

Scenario C: Construct experimental data set (relaxed)

const EXPERIMENTAL_GATES = {
  minScore: 70,             // passing is sufficient
  maxIterations: 10,        // do not limit iterations
  requireCodePair: false,   // do not require a code pair
  maxTokens: 16384,         // relaxed token limit
};
// retention rate: roughly 80-90%

Practical case: complete data generation process

Background

A team has been using BMAD-Speckit-SDD-Flow for AI-assisted development for three months, hoping to build a dedicated code repair model. They accumulated the following raw data:

| Asset type | Original quantity | Source |
| --- | --- | --- |
| Story implementation records | 48 | BMAD-Speckit workflow |
| Evaluation run records | 1,200 items | Scoring system |
| Audit reports | 48 | Audit module |
| Code submissions | 156 commits | Git history |

Step 1: Data extraction

# extract SFT samples for the bug-fix dataset
npx ts-node scripts/sft-extract.ts \
  --min-score 85 \
  --output ./sft-training/bugfix-v1.0.jsonl

# extraction process log
[INFO] Loading scoring records from packages/scoring/data...
[INFO] Found 1,200 scoring records
[INFO] Filtering by scenario: real_dev
[INFO] Filtering by phase_score >= 85
[INFO] Building candidate samples...
[INFO] Applying quality gates...
[INFO] Applying redaction rules...
[WARNING] Blocked 3 samples containing private keys
[INFO] Assigning deterministic splits...
[INFO] Exporting to JSONL...

# output summary
Extracted 312 samples covering 48 stories
- Accepted: 298
- Downgraded: 14
- Rejected: 888
  - phase_score below threshold: 720
  - missing source_path: 68
  - missing code pair comparison: 56
  - redaction blocked: 44

dataset split:
- Train: 238  (76%)
- Validation: 37  (12%)
- Test: 37  (12%)

Step 2: Quality analysis

// quality analysis of the exported dataset
import { analyzeDataset } from './analytics';

const analysis = analyzeDataset('./sft-training/bugfix-v1.0.jsonl');

console.log(analysis.summary);
/*
{
  totalSamples: 312,
  avgPhaseScore: 91.5,
  avgTokenCount: 2847,
  categories: {
    securityFix: 89,      // SQL injection and XSS fixes
    performanceFix: 67,   // algorithm optimization
    styleFix: 76,         // code style fixes
    logicFix: 80          // logic error fixes
  },
  languages: {
    typescript: 180,
    python: 89,
    go: 43
  },
  redactionSummary: {
    clean: 308,
    redacted: 4,          // email addresses were redacted
    blocked: 0            // no blocked samples remain in the export
  }
}
*/

Step 3: Format conversion and export

// convert to training formats
import { exportDataset } from './export';

// export to OpenAI format
await exportDataset({
  input: './sft-training/bugfix-v1.0.jsonl',
  output: './sft-training/openai-format/',
  format: 'openai_chat',
  filter: s => s.quality.acceptance_decision === 'accepted'
});

// export to HuggingFace format
await exportDataset({
  input: './sft-training/bugfix-v1.0.jsonl',
  output: './sft-training/hf-format/',
  format: 'hf_conversational',
  splits: ['train', 'validation', 'test']
});

Step 4: Training and evaluation

# Train with the exported data (illustrative sketch).
from transformers import AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer

# Use a placeholder model ID; real runs must record model card, license, and eval snapshot.
model = AutoModelForCausalLM.from_pretrained("your-org/code-model-base")

# Configure training.
training_args = TrainingArguments(
    output_dir="./bugfix-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
)

# use the exported dataset
dataset = load_dataset("json", data_files={
    "train": "./sft-training/hf-format/train.jsonl",
    "validation": "./sft-training/hf-format/validation.jsonl",
})

# training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=training_args,
)
trainer.train()

Training effect comparison

| Metric | Baseline model | After fine-tuning | Improvement |
| --- | --- | --- | --- |
| Security vulnerability repair success rate | 62% | 88% | +26% |
| Performance problem identification rate | 58% | 82% | +24% |
| Code style compliance rate | 71% | 89% | +18% |
| Average repair recommendation quality | 3.2/5 | 4.3/5 | +1.1 |

Conclusion: Get the training samples right first, and then talk about the training scale

The really difficult part of converting engineering collaboration data into SFT samples is never “what format to export”, but “which samples are worth training and which samples must be rejected”. This requires task contracts, error classification, manual feedback, verification evidence and governance gates to work together.

The core framework given in this article can be summarized into three points:

  1. Build the data contract first, then build the data volume. Samples must be traceable, verifiable, and interpretable before they can enter the training pool.
  2. Do quality gating first, then do automation. Without hard and soft gating, automation only amplifies noise.
  3. Maintain evaluation isolation first, then pursue metric gains. Mixing train and eval data turns "model progress" into a statistical illusion.

If the team is just starting out, the most practical sequence is still: run the complete process end to end on a small sample first, then gradually scale up; raise the bar for sample quality first, then increase training frequency.

This article completes the mid-stage work of going "from delivery trajectories to trainable samples". The final article in the series returns to a longer-term question: once this closed loop runs stably, how should organizations judge where AI programming evaluation and collaboration paradigms are heading, without treating today's processes as tomorrow's ceiling.


References and Acknowledgments

  • InstructGPT — Ouyang et al., OpenAI
  • Flan Collection — Longpre et al., Google
  • BMAD-Speckit-SDD-Flow — BMAD Method Team
  • TRL SFTTrainer — Hugging Face
  • PEFT LoRA — Hugging Face

Series context

You are reading: AI Coding Mentor Series

This is article 8 of 9.


Series Path

Current series chapters


9 chapters
  1. Part 1 Previous in path Why do you need to be a coding mentor for AI? When AI programming assistants become standard equipment, the real competitiveness is no longer whether they can use AI, but whether they can judge, calibrate and constrain the engineering output of AI. This article starts from trust gaps, feedback protocols, evaluation standards and closed-loop capabilities to establish the core framework of "Humans as Coding Mentors".
  2. Part 2 Previous in path Panorama of AI programming ability evaluation: from HumanEval to SWE-bench, the evolution and selection of benchmarks Public benchmarks are not a decoration for model rankings, but a measurement tool for understanding the boundaries of AI programming capabilities. This article starts from benchmarks such as HumanEval, APPS, CodeContests, SWE-bench, LiveCodeBench and Aider, and explains how to read the rankings, how to choose benchmarks, and how to convert public evaluations into the team's own Coding Mentor evaluation system.
  3. Part 3 Previous in path How to design high-quality programming questions: from question surface to evaluation contract High-quality programming questions are not longer prompts, but assessment contracts that can stably expose the boundaries of abilities. This article starts from Bloom level, difficulty calibration, task contract, test design and question bank management to explain how to build a reproducible question system for AI Coding Mentor.
  4. Part 4 Previous in path Four-step approach to AI capability assessment: from one test to continuous system evaluation Serving as a coding mentor for AI is not about doing a model evaluation, but establishing an evaluation operation system that can continuously expose the boundaries of capabilities, record failure evidence, drive special improvements, and support collaborative decision-making.
  5. Part 5 Previous in path Best Practices for Collaborating with AI: Task Agreement, Dialogue Control and Feedback Closed Loop The core skill of being a Coding Mentor for AI is not to write longer prompt words, but to design task protocols, control the rhythm of conversations, identify error patterns, and precipitate the collaboration process into verifiable and reusable feedback signals.
  6. Part 6 Previous in path Practical cases: feedback protocol, evaluation closed loop, code review and programming education data Case studies should not stop at “how to use AI tools better”. This article uses four engineering scenarios: model selection evaluation, feedback protocol design, code review signal precipitation, and programming education data closed loop to explain how humans can transform the AI ​​collaboration process into evaluable, trainable, and reusable mentor signals.
  7. Part 7 Previous in path From delivery to training: How to turn AI programming collaboration into a Coding Mentor data closed loop The real organizational value of AI programming assistants is not just to increase delivery speed, but to precipitate trainable, evaluable, and reusable mentor signals in every requirement disassembly, code generation, review and revision, test verification, and online review. This article reconstructs the closed-loop framework of AI training, AI-assisted product engineering delivery, high-quality SFT data precipitation, and model evaluation.
  8. Part 8 Current From engineering practice to training data: a systematic method for automatically generating SFT data in AI engineering Following the data closed loop in Part 7, this article focuses on how to process the screened engineering assets into high-quality SFT samples and connect them to a manageable, evaluable, and iterable training pipeline.
  9. Part 9 Future Outlook: Evolutionary Trends and Long-term Thinking of AI Programming Assessment As the final article in the series, this article reconstructs the future route of AI Coding Mentor from the perspective of engineering decision-making: how evaluation objects evolve, how organizational capabilities are layered, and how governance boundaries are advanced.
