

Quantitative trading system development record (7): AI engineering implementation - from speckit to BMAD

Using the trading calendar and daily-bar aggregation requirement as a running case, this article explains how AI engineering can enter the delivery of a real quantitative system through specification-driven development, BMAD role handoffs, and human-controlled quality gates.

Published: 2026-03-28 · Category: guide · Reading time: 45 min

Readers arriving at this article have already worked through system boundaries, real defects, test defense lines, performance management, and architecture evolution. AI engineering should not replace these disciplines; it should organize them into a trackable, auditable delivery loop that can be signed off.

In a quantitative trading system, the greatest value of AI is not writing a piece of code faster, but forming a stable link between specifications, testing, implementation, review, documentation, and acceptance evidence. The trading calendar, overnight attribution, daily-bar aggregation, half-day sessions, layered normalization, incremental indicator calculation, and chart performance optimization are not single-point function problems. As long as the business semantics are not written down clearly, the faster AI generates code, the faster rework enters the core of the system.

Series reading order

The recommended reading path is Part 1 -> Part 2 -> Part 3 -> Part 4 -> Part 5 -> Part 6 -> Part 7. AI engineering comes last so that readers first understand the system, its defects, its tests, its performance, and its refactoring history, and only then decide which constrained links AI should enter.

This article focuses on four questions:

  • How specification-driven development connects business semantics, interfaces, testing, and acceptance evidence.
  • How BMAD multi-agent collaboration reduces the blind spots of a single AI output.
  • How to upgrade prompt engineering from "write in more detail" to "boundaries, inputs, outputs, and acceptance can all be checked".
  • How human-machine division of labor keeps business-semantics sign-off, architectural trade-offs, and risk acceptance out of black boxes.

Readers can use a simple test to judge whether AI engineering is really working: after an AI-assisted delivery, can the source of requirements, the clarified decisions, the implementation plan, the test evidence, the performance evidence, and the maintenance notes all be found in the code base? If all that remains is a piece of code and a few conversation logs, this is still ad-hoc collaboration; if every key judgment can be traced back through the artifact chain, AI has truly entered the engineering system.

Introduction: when AI writes faster, why maintenance can get slower

Trading systems are prone to a counter-intuitive phenomenon: AI greatly increases the speed of code generation, but without specifications, tests, and quality gates, maintenance costs rise just as fast. A "K-line aggregation function" can be generated quickly, yet three months later it may have become a performance bottleneck, a test blind spot, and a piece of business-semantics debt.

Common signs of loss of control include:

  • The function works, but there is no documentation, and the next maintainer cannot confirm the trading-day attribution rules.
  • There are no tests, so overnight sessions, half-day sessions, holidays, and data gaps can only be re-checked by hand.
  • The code covers only the happy path; Friday night sessions, pre-holiday night sessions, and temporary closures require continuous patching.
  • The prompt only asks for "a working function" and does not require type annotations, complexity bounds, boundary conditions, or acceptance evidence.
  • The AI-generated plan looks complete but does not say which decisions require human sign-off.

This is not a problem with the AI itself, but a lack of engineering boundaries in the delivery chain. AI can be responsible for candidate implementations, test drafts, first drafts of documents, and structural checks, but humans must own the business semantics, risk acceptance, go-live windows, rollback conditions, and quality sign-off in a trading system. What really matters is whether AI output has passed through specifications, tests, reviews, and evidence before it enters the code base.

Open source background: Spec Kit and BMAD provide two types of complementary capabilities

If readers want to migrate this set of methods to their own engineering systems, they first need to distinguish the positioning of the two open source projects.

Spec Kit is GitHub’s open source Spec-Driven Development toolkit. It emphasizes moving the starting point of software development from “writing code directly” to “executable specifications”, allowing the team to focus on product scenarios and predictable results before entering implementation. For quantitative trading systems, the inspiration of Spec Kit is that issues such as trading day attribution, half-day market, consistency between backtesting and real trading, and performance budgets should first enter the specification, clarification, planning, and acceptance links, instead of relying on prompts to supplement on the spot during the implementation stage.

BMAD-METHOD is an open source Agile AI-driven Development method and module ecosystem. Its core value is not to let AI automatically replace the team, but to turn requirement clarification, architecture, implementation, review, testing and documentation into a collaborative process through role-based Agents and structured workflows. For trading systems, the inspiration of BMAD is that the roles of Architect, Developer, QA, Reviewer, and Documenter can be participated by AI, but each role must deliver traceable artifacts and cannot just leave the conclusion of the conversation.

The two projects solve complementary problems: Spec Kit makes specifications the entrance to implementation, and BMAD makes multi-role collaboration a means of quality control. For production-level delivery, specifications alone are not enough, because the implementation process can still drift; multiple agents alone are not enough, because the roles may lack a shared specification. The BMAD-Speckit-SDD-Flow discussed later combines the two capabilities into one governed AI delivery flow.

Part 1: speckit—Specification-driven AI development

speckit can be understood as a set of specification-driven development processes. It is not a single tool, but organizes requirements, clarification, planning, implementation and acceptance into a stable artifact chain:

  • Specify: Clarify business requirements and write them into quotable specifications.
  • Clarify: Eliminate ambiguity, record technical decisions and pending issues.
  • Plan: Break down the implementation steps, dependencies, risks and verification methods.
  • Implement: Generate or write code within specification constraints.
  • Checklist: Complete acceptance against specifications, tests and evidence.

The biggest difference between this process and traditional waterfall is that each step can be accelerated using AI, but the decision-making power still lies with the human. AI does not directly determine the definition of trading days, does not directly accept half-day market risks, and does not directly sign off on indicator results; AI is responsible for structuring candidate content, and humans are responsible for confirming semantics and boundaries.

For quantitative trading systems, specification-driven development has an extra value: it separates business correctness from code correctness. Code can pass type checks and unit tests while the business semantics are still wrong. A function returning a date object does not mean it returns the correct trading day; a daily aggregation that computes OHLC does not mean it kept Friday night out of Friday. This semantic gap is exactly what speckit targets.

Figure 1: Speckit artifact flow. The path from business semantics to implementation tasks must be traceable; you cannot jump directly from prompt to code.

This figure answers a key question: why AI engineering cannot jump directly from prompt to code. A trading-system requirement usually lands first as business semantics, such as "Friday night trading belongs to the following Monday's trading day". That semantics enters Specify, and through Clarify becomes a trading-day definition, half-day rules, holiday sources, and exception-handling strategies. Only after these artifacts are stable can Plan split tasks, Implement generate code, and Checklist verify the results.

If a step is skipped, the consequences are usually concrete. Without Specify, AI treats calendar days as trading days; without Clarify, half-day sessions and temporary halts get hard-coded; without Plan, the implementation order may be UI first, data semantics later; without Checklist, a passing test still cannot prove that real market boundaries are covered. Specification-driven development is not meant to slow things down, but to stop speed from turning directly into rework.

Practical Case 1: Trading Calendar and Daily Line Aggregation

The trading calendar and daily aggregation is a typical AI-engineering sample because it looks like an ordinary aggregation function while actually involving business semantics, time zones, exchange rules, data integrity, and UI interpretation.

Business complexity includes at least four categories:

  • Overnight sessions: the Friday night session from 17:15 to 03:00 the next day belongs to the following Monday's trading day, not to Friday the calendar day.
  • Holiday closures: the market is fully closed on public holidays such as Christmas, New Year's Day, and the Spring Festival, and no misleading K-lines may be generated.
  • Half-day sessions: on Lunar New Year's Eve, Christmas Eve, and similar days only the morning session runs, typically 09:15 to 12:30, and no K-lines are generated in the afternoon.
  • Forced closes: half-day sessions are forced to close at 12:30, and the night session before a financial holiday is forced to close at 03:00.
Figure 2: Requirements-to-implementation sequence. Every business constraint must land in interfaces, fixtures, tests, and acceptance evidence.

The point of Figure 2 is not that "the process is more complex" but that every business constraint must have a landing point. Friday-night ownership is not just a sentence in a document: it must correspond to the semantics of get_trading_day(timestamp), fixtures covering Friday 23:30 and Monday 00:30, tests that verify daily OHLC, and acceptance evidence that can be reviewed. Otherwise, AI-generated aggregation logic that passes ordinary samples can still contaminate indicators in live trading.

Specify: write business requirements into acceptance-ready specifications

The specifications for trading calendar and daily aggregation can be broken down into four user stories.

| No. | Priority | Business goal | Core acceptance scenario |
| --- | --- | --- | --- |
| US-001 | P0 | Determine the trading day from the Hong Kong futures trading calendar | 2024-01-08 returns trading day; 2024-01-06 (weekend) returns non-trading day; 2024-01-01 (public holiday) returns non-trading day |
| US-002 | P0 | Friday night session data belongs to the following Monday | The 1-minute K-line at 23:30 on Friday belongs to 2024-01-08, not 2024-01-05; Monday 10:00 belongs to Monday |
| US-003 | P1 | Handle half-day sessions correctly | On Lunar New Year's Eve only 09:15-12:30 is returned; no K-lines after 12:30; no K-lines after 03:00 on the eve of a holiday |
| US-004 | P0 | Aggregate daily OHLC correctly | Open from the first minute bar, high from the maximum of minute bars, low from the minimum, close from the last minute bar, volume summed |

Functional requirements need to be further refined:

  • FR-001: Support the Hong Kong Futures Exchange trading calendar, covering 2020-2030.
  • FR-002: Distinguish between full trading days and half-day trading days.
  • FR-003: The night session from 17:15 to 03:00 the next day belongs to the next trading day.
  • FR-004: Day sessions are 09:15-12:00 and 13:00-16:30 on a full trading day; 09:15-12:30 on a half day.
  • FR-005: Cross-midnight data from 00:00-03:00 belongs to the same trading day.
  • FR-006: No K-lines are generated for the afternoon session of a half day.
  • FR-007: The night session on the eve of a holiday is forced to close at 03:00.

Non-functional requirements must also be written into specifications rather than left as verbal expectations:

  • SC-001: Calendar query latency less than 1ms.
  • SC-002: 1000 minute lines aggregate into daily lines less than 50ms.
  • SC-003: Memory usage is less than 100MB when caching 10-year calendar data.

Technical constraints cannot be omitted either:

  • Based on Python 3.10+.
  • You can use pandas-market-calendars as the base calendaring capability.
  • Need to support HKFE customization.
  • It needs to be compatible with the existing BarData structure of the system.
  • Time zone conversions and daylight saving time boundaries must be handled.

The acceptance criteria ultimately come down to a test list (a pytest sketch follows the list):

  • Normal trading days are correctly identified.
  • Weekends are recognized as non-trading days.
  • Public holidays are recognized correctly.
  • Friday night trading belongs to the following Monday.
  • During the half-day market, trading is normal in the morning and closed in the afternoon.
  • Data across midnight 00:00-03:00 is attributed correctly.
  • Calendar query satisfies SC-001.
  • Daily aggregation satisfies SC-002.
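
A minimal pytest sketch of how this list becomes executable acceptance evidence. The module name hkfe_calendar matches the Phase 1 plan below; the exact function signatures are assumptions to be aligned with the team's calendar interface.

from datetime import date, datetime
from zoneinfo import ZoneInfo

import pytest

# Assumed module layout from the plan below; align with your own interface.
from hkfe_calendar import get_trading_day, is_trading_day

HKT = ZoneInfo("Asia/Hong_Kong")

def test_normal_trading_day():
    assert is_trading_day(date(2024, 1, 8))        # ordinary Monday

def test_weekend_is_not_trading_day():
    assert not is_trading_day(date(2024, 1, 6))    # Saturday

def test_public_holiday_is_not_trading_day():
    assert not is_trading_day(date(2024, 1, 1))    # New Year's Day

@pytest.mark.parametrize("ts, expected", [
    (datetime(2024, 1, 5, 23, 30, tzinfo=HKT), date(2024, 1, 8)),  # Friday night
    (datetime(2024, 1, 8, 0, 30, tzinfo=HKT), date(2024, 1, 8)),   # cross-midnight
    (datetime(2024, 1, 8, 10, 0, tzinfo=HKT), date(2024, 1, 8)),   # Monday day session
])
def test_night_session_ownership(ts, expected):
    assert get_trading_day(ts) == expected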

Clarify: Turning ambiguity into a record of technical decisions

After the specifications are written, they cannot be implemented immediately. Ambiguity in trading systems usually hides where everyone assumes they mean the same thing, especially the definition of trading days, the handling of half-day sessions, and the source of holidays.

The first decision is the trading-day definition. The system adopts the exchange's official definition: the trading day is keyed to the day session's date, the previous evening's night session and the next day's day session belong to the same trading day, and Friday's night session belongs to the following Monday. This decision affects daily aggregation, indicator windows, backtest slices, and UI annotation.

| Physical time | Trading day | Note |
| --- | --- | --- |
| 2024-01-05 09:15 | 2024-01-05 | Friday day session opens |
| 2024-01-05 17:15 | 2024-01-08 | Friday night session opens; belongs to the following Monday |
| 2024-01-05 23:30 | 2024-01-08 | Friday night session belongs to the following Monday |
| 2024-01-08 00:30 | 2024-01-08 | Crossing midnight stays in the same trading day |
| 2024-01-08 03:00 | 2024-01-08 | Night session ends |
| 2024-01-08 09:15 | 2024-01-08 | Day session opens |

The key implementation logic can stay minimally schematic, so long code blocks do not drown the business semantics:

from datetime import date, datetime, time

def get_trading_day(timestamp: datetime) -> date:
    """Return the trading day a bar timestamp belongs to (schematic)."""
    # HONG_KONG_TZ and the calendar helpers come from the calendar module.
    hk_time = timestamp.astimezone(HONG_KONG_TZ)
    current_date = hk_time.date()
    current_time = hk_time.time()

    if current_time >= time(17, 15):
        # Night session: belongs to the next trading day strictly after today.
        return get_next_trading_day(current_date)
    if current_time < time(3, 0):
        # Cross-midnight tail of a night session: first trading day on or
        # after the current date (so Monday 00:30 still maps to Monday).
        return get_first_trading_day_on_or_after(current_date)
    return current_date

The second decision is half-day closes. Half-day dates vary from year to year and can change at short notice because of epidemics, typhoons, or special exchange arrangements, so the rules must not be hard-coded in the aggregation function. A more reliable approach is a configuration table that records each half-day date, the reason, the morning close time, and the forced night-session close time; 2024-02-09 (Lunar New Year's Eve), 2024-12-24 (Christmas Eve), and 2024-12-31 (New Year's Eve) each need their own entry, as sketched below. The benefit of configuration is that when rules change, the calendar and its tests are updated first, which then triggers re-verification of the aggregation logic.
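
A sketch of what such a configuration table might look like; it mirrors the calendar_config.json file from the plan below, and the field names are illustrative rather than a fixed schema.

# Mirrors calendar_config.json from the plan below; field names are illustrative.
HALF_DAY_SESSIONS = {
    "2024-02-09": {                      # Lunar New Year's Eve
        "reason": "lunar_new_year_eve",
        "morning_close": "12:30",
        "night_force_close": None,       # no night session that evening
    },
    "2024-12-24": {                      # Christmas Eve
        "reason": "christmas_eve",
        "morning_close": "12:30",
        "night_force_close": None,
    },
    "2024-12-31": {                      # New Year's Eve
        "reason": "new_year_eve",
        "morning_close": "12:30",
        "night_force_close": None,
    },
}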

The third decision is the holiday calendar source. The HKFE calendar can be customized on top of pandas-market-calendars, then supplemented with Hong Kong-specific holidays, half-day markers, and an entry point for typhoons and other special closures. The point is not which library to rely on, but to split the calendar into three layers (official calendar, local corrections, manual special events) so that exchange rules do not end up scattered across aggregators, data importers, and UI prompts.

The Clarify stage also records two issues that are easily overlooked:

  • Unanticipated closures: provide a manual entry point such as add_holiday(date, reason); K-lines already generated are not deleted but marked as an "incomplete trading day", and the UI shows a warning (see the sketch after this list).
  • Data gaps: missing trading days must not produce misleading daily bars; the UI shows a "missing data" placeholder and keeps an entry point for backfilling.
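
A minimal sketch of the manual-entry idea, assuming a calendar object that owns the holiday set and an incomplete-day marker; class and attribute names are illustrative.

from datetime import date

class HKFECalendar:
    """Calendar owning the holiday set; names here are illustrative."""

    def __init__(self) -> None:
        self.extra_holidays: dict[date, str] = {}
        self.incomplete_days: set[date] = set()

    def add_holiday(self, day: date, reason: str) -> None:
        """Record an unanticipated closure (typhoon, special arrangement)."""
        self.extra_holidays[day] = reason
        # Bars already generated that day are kept, not deleted; flagging the
        # day lets the UI show an "incomplete trading day" warning.
        self.incomplete_days.add(day)

    def is_incomplete(self, day: date) -> bool:
        return day in self.incomplete_days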

Plan: Break the specification into deliverable stages

Planning is not a pile of task lists; it pins down dependencies, risks, and verification methods.

| Stage | Target file or module | Main content | Verification focus |
| --- | --- | --- | --- |
| Phase 1 | hkfe_calendar.py, calendar_config.json | is_trading_day, is_half_day, get_trading_session, get_trading_day | Full days, half days, weekends, public holidays |
| Phase 2 | daily_bar_aggregator.py | aggregate, handle_night_session, handle_half_day, validate_completeness | Multi-scenario daily aggregation, incomplete-day marking |
| Phase 3 | vnpy_datamanager, vnpy_datarecorder | Automatic trading-day recognition during import and recording | Night-session ownership and data-layer compatibility |
| Phase 4 | trading_day_indicator.py, chart UI | Half-day, holiday, and incomplete-data prompts plus a backfill entry | User-visible status and re-recording path |

The plan's dependency order is Phase 1 -> Phase 2 -> Phase 3 -> Phase 4. Do not write the UI before the calendar semantics are stable; do not wire up the data layer before the aggregator has tests; do not enter the shared backtest/live path before the data layer has ownership evidence.

The planning stage must also state exit conditions for each stage. Phase 1 exits not when "the calendar class is written" but when all key date samples pass. Phase 2 exits not when "the aggregation function runs" but when full days, Friday night sessions, cross-midnight data, half days, and missing data all have assertions. Phase 3 exits not when "the data layer is connected" but when import, recording, and query all return the same trading-day semantics. Phase 4 exits not when "the interface shows a prompt" but when users can distinguish complete, half-day, holiday, and incomplete trading days.

Implement: generate code under constraints

Once the specifications and decisions are stable, the input to the AI needs to be upgraded from "help me write an aggregation function" to executable constraints:

Implement DailyBarAggregator for trading-day ownership and daily aggregation.

must:
1. Use type annotations for all parameters and return values.
2. Accept minute bars as input and return trading-day-owned daily bars.
3. Map night-session data to the correct trading day.
4. Generate half-day bars for 12:30 market-close sessions.
5. Mark incomplete trading days and never fabricate OHLC values.
6. Every public method must have unit tests.
7. Aggregating 1,000 minute bars must complete within 50 ms.

Tests must cover:
normal trading day, night session, cross-midnight ownership, half-day session,
holiday, missing data, and incomplete trading day.

Output must include:
implementation code, pytest tests, complexity analysis, and boundary-case notes.

A prompt like this is not merely longer; it puts the AI output inside checkable boundaries. The key business logic is already fixed in the specification and in Clarify, leaving the AI no room to freely interpret which day the night session belongs to.

A common misunderstanding also needs calling out here: writing constraints into the prompt does not mean the constraints have been satisfied. Each constraint must be backed by evidence. "Every public method must have a unit test" corresponds to test files and a coverage report; "aggregating 1,000 minute bars within 50 ms" corresponds to benchmark output (see the sketch below); "incomplete trading days never fabricate OHLC" corresponds to exception samples and assertions; "Friday night belongs to the following Monday" corresponds to fixed-date fixtures. A constraint without evidence is just a wish.
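
A sketch of what benchmark evidence for the 50 ms constraint might look like. DailyBarAggregator and its constructor are assumptions based on the Phase 2 plan, and benchmark_bars is a hypothetical fixture providing 1,000 one-minute bars.

import time

def test_aggregation_meets_sc002(benchmark_bars):
    """benchmark_bars: hypothetical fixture with 1,000 one-minute bars."""
    aggregator = DailyBarAggregator(calendar=HKFECalendar())  # assumed API
    start = time.perf_counter()
    daily = aggregator.aggregate(benchmark_bars)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert daily, "aggregation should produce at least one daily bar"
    assert elapsed_ms < 50, f"SC-002 violated: {elapsed_ms:.1f} ms for 1,000 bars"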

Extended specification: layered data normalization

After the trading calendar, the system runs into layered data normalization. If the main period, auxiliary periods, indicator periods, and UI display periods each keep their own boundary references, strategies and charts drift apart easily.

A workable decision is to use the main period, the one displayed on the chart, as the baseline and align other periods to it by rule. The candidate plans compare as follows:

| Plan | Complexity | Accuracy | Applicability |
| --- | --- | --- | --- |
| Main-period baseline + interpolation | Medium | High | Suitable as the primary plan; UI and strategies share the same time semantics |
| Independent timelines | Low | Medium | Simple to implement, but hard to guarantee that cross-period references stay consistent |
| Resample everything | High | High | Most accurate, but higher computational cost and boundary complexity |

When data is missing it must not be filled in automatically, because filling creates quotes that never existed. A more prudent policy is to show the empty slots, mark them "data not available" in the UI, and tell the strategy layer explicitly that the period has no data, as in the sketch below. This raises the user's cognitive cost slightly, but it prevents missing data from being silently treated as a real signal.
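
A minimal sketch of this policy, assuming bars are keyed by timestamp; the function name and shapes are illustrative.

def align_to_main_period(main_times: list, other_bars: dict) -> list:
    """Project a secondary period onto the main-period timeline.

    other_bars maps timestamp -> BarData. Missing slots stay None instead of
    being interpolated, so the UI can render a "data not available"
    placeholder and the strategy layer sees the gap explicitly.
    """
    return [other_bars.get(t) for t in main_times]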

The layered plan can be broken down into four phases:

  • Phase 1: Implement trading_calendar.py and bar_aggregator.py to provide trading calendar, trading period, daily aggregation and arbitrary period aggregation.
  • Phase 2: Implement multi_timeframe_data.py and timeframe.py, encapsulate fetch_bars, align_timeframes, period definition and conversion.
  • Phase 3: Make chart_window.py, chart_widget.py and period_selector.py support multi-SubChart and period switching.
  • Phase 4: Complete the integration of the data layer and UI layer, add caching, E2E scenarios and performance verification.

The input to the AI implementation must likewise be constrained:

Implement BarAggregator.

must:
1. Use type annotations and docstrings.
2. Accept minute-bar input and return period-aligned aggregated bars.
3. Every public method must have unit tests.
4. Aggregating 10,000 minute bars must complete within 50 ms.
5. Trading-session rules must be explicit and reviewable.

Output must include:
implementation code, pytest tests, and complexity analysis.

This example shows that the role of speckit is not to generate more documents, but to pin down the chain Requirements -> Decision -> Plan -> Implementation -> Evidence.

When readers implement speckit in their own systems, they can start with a high-risk rule instead of covering all modules at once. Rules suitable for entry usually have three characteristics: business semantics are easily misunderstood, errors can contaminate indicators or orders, and it is difficult to restore by logs alone during manual review. For example, overnight vesting, half-day market, main contract switching, fee model, slippage model, risk control freeze and matching delay are all suitable to be standardized first.

Practical Case 2: Chart Performance Optimization

Another typical case: the chart takes 3 seconds to load 10,000 K-lines, and the goal is to get below 100 ms. Asking AI directly to "optimize performance" tends to produce a mixed bag of caching, concurrency, NumPy, and async-loading suggestions, with no way to tell which one targets the real bottleneck.

The Clarify stage first records three decisions:

  • D1 data caching: a three-level cache (in-memory LRU, disk, database), with SlidingWindowBarManager controlling the window data.
  • D2 rendering: compute only the visible range via VisibleAreaTracker; the virtual list renders only K-lines inside the viewport.
  • D3 picture caching: cache static K-lines as bitmaps and reuse unchanged graphics objects through LRUPictureCache (sketched below).
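
A minimal sketch of the LRU eviction idea behind D3. The real LRUPictureCache would hold Qt picture objects; the capacity and key type here are illustrative.

from collections import OrderedDict
from typing import Any, Optional

class LRUPictureCache:
    """Keep recently used static K-line pictures; evict the least recently used."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._cache: OrderedDict = OrderedDict()   # bar index -> cached picture

    def get(self, key: int) -> Optional[Any]:
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)               # mark as most recently used
        return self._cache[key]

    def put(self, key: int, picture: Any) -> None:
        self._cache[key] = picture
        self._cache.move_to_end(key)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)        # evict least recently used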

There are two more trade-offs that need to be documented:

| Trade-off | Candidates | Rationale |
| --- | --- | --- |
| Memory vs speed | Full caching, on-demand loading, sliding window + bitmap cache | Full caching uses too much memory and on-demand loading interacts slowly; sliding window plus bitmap cache balances memory and response |
| Real-time vs smoothness | Full redraw on every live update, or redraw only the last K-line | Redraw only the last K-line, so every tick does not trigger a full repaint |

The key benefit of this kind of performance specification is that it turns an "optimization direction" into a testable hypothesis. Readers can see exactly what each measure targets: caching removes repeated data access, virtualization limits drawing to the visible area, bitmap caching removes repeated drawing of static graphics, and the live-update strategy bounds the refresh cost of the last K-line.

Performance optimization should also avoid "AI solution stacking". If the prompt demands caching, parallelism, NumPy, async, and rendering optimization all at once, the AI will happily produce a combination that looks comprehensive but cannot be verified. A more reliable approach is to first split the 3-second load into data reading, data conversion, indicator calculation, graphics-object construction, and draw refresh, then build hypotheses piece by piece. Each hypothesis needs an entry condition and an exit condition: the entry condition is a profile or benchmark proving the bottleneck exists, and the exit condition is visible before/after data, correctness assertions, and a fallback plan.

The benefits brought by speckit can be observed from four perspectives:

| Metric | Before | After |
| --- | --- | --- |
| Requirements rework rate | High | Significantly reduced |
| Time spent on architecture disputes | Hours | Resolved by documented decisions |
| Onboarding to the requirements | 1-2 days | About 30 minutes reading the specs |
| AI code quality | On the low side | Significantly improved |

Part 2: BMAD—Multi-Agent Collaborative Development

Speckit solves traceability from requirements to code; BMAD solves code quality and multi-perspective review. A single AI tends to run straight to the first working solution, with no cross-checking between architecture, implementation, review, testing, and documentation.

BMAD can be broken down into five roles:

| Role | Input | Output | Main quality-control point |
| --- | --- | --- | --- |
| Architect | Requirements specification | Architecture plan, interface definitions, core-algorithm pseudocode | Are the boundaries clear and the complexity reasonable? |
| Implementer | Architecture plan | Runnable code | Complies with interface, type, exception, and performance constraints |
| Reviewer | Implementation code | Review report, issue list | Any boundary errors, floating-point errors, or implicit state? |
| Tester | Code and requirements | Test code, test report | Covers normal, boundary, abnormal, and performance scenarios? |
| Documenter | Final code | API documentation, usage examples, notes | Can the next maintainer understand it independently? |

Readers should be wary of BMAD in form only: multiple roles are invoked, but each simply restates the previous round's output. A BMAD that actually works must create friction. The Architect should question boundaries, the Implementer should expose implementation limits, the Reviewer should find bugs, the Tester should turn bugs into repeatable samples, and the Documenter should enable maintainers to take over independently. An absence of conflict and correction between roles usually means the process is not really working.

Figure 3: BMAD artifact-handoff swim lanes. Specifications, tests, and evidence are passed between roles, not verbal judgments.

The focus of this figure is artifact handoff, not "more agents look livelier". The Architect delivers boundaries and interfaces, the Implementer delivers constrained code, the Reviewer delivers issues and fix suggestions, the Tester delivers reproducible tests, and the Documenter delivers maintenance notes. If any role outputs only natural-language conclusions without artifacts, downstream tracking breaks.

Complete case: refactoring indicator calculation

The indicator module initially used Pandas DataFrame batch calculation, and performance became a bottleneck. The goal is to cut the RSI calculation over 10,000 K-lines from about 150 ms to under 5 ms, while letting backtesting and live trading share the same indicator implementation.

The Architect phase focuses on boundaries and complexity. The input describes the current Pandas batch computation, the need for incremental updates, a 50% memory reduction, and sub-10 ms compute latency; the required outputs are core abstractions, data flows, incremental MA/EMA/RSI pseudocode, file-organization recommendations, and a risk assessment.

The core abstraction can be kept in a short code block:

from abc import ABC, abstractmethod
from collections import deque
from typing import Optional

# BarData and IndicatorValue are the project's types from core/data.py.

class IncrementalIndicator(ABC):
    """Base class for incrementally updated indicators."""

    @abstractmethod
    def update(self, bar: BarData) -> Optional[IndicatorValue]:
        pass

    @abstractmethod
    def reset(self) -> None:
        pass

    @property
    @abstractmethod
    def is_ready(self) -> bool:
        pass

class IncrementalMA(IncrementalIndicator):
    """Moving average with O(1) updates over a fixed window."""

    def __init__(self, period: int):
        self.period = period
        self.values = deque(maxlen=period)  # sliding window of closes
        self.total = 0.0

    def update(self, bar: BarData) -> Optional[float]:
        if len(self.values) == self.period:
            self.total -= self.values[0]    # drop the close leaving the window
        self.values.append(bar.close)
        self.total += bar.close
        return None if len(self.values) < self.period else self.total / self.period

    def reset(self) -> None:
        self.values.clear()
        self.total = 0.0

    @property
    def is_ready(self) -> bool:
        return len(self.values) == self.period
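
A short usage sketch under the same assumptions: the instance is created once and fed one bar at a time, so historical replay and the live feed share the same path. bar_stream and handle_indicator are hypothetical stand-ins.

ma20 = IncrementalMA(period=20)
for bar in bar_stream:              # hypothetical: replay or live feed, same path
    value = ma20.update(bar)
    if value is not None:           # None until the window holds 20 bars
        handle_indicator(value)     # hypothetical downstream hook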

This abstraction conveys two important principles: an indicator instance keeps only the necessary state, and update processes one new bar at a time; once the state boundaries are clear, backtesting and live market data can share the same calculation path.

In this case the Architect's value is not naming IncrementalIndicator but exposing the state boundary of indicator calculation. Pandas batch calculation is concise to express, but it tends to rescan the history window on every update in the real-time path. Incremental indicators have a stable per-update cost, but state initialization, replay recovery, reset, and numerical stability must be managed explicitly. The Architect must write down both the benefits and the costs; otherwise the Implementer only gets the vague task "replace Pandas with a handwritten loop".

The Implementer stage implements IncrementalRSI under these constraints:

  • Strictly follow the IncrementalIndicator interface.
  • Support Wilder smoothing: SMMA(today) = (SMMA(yesterday) * (period - 1) + value) / period.
  • O(1) memory: retain only the necessary state.
  • Include type annotations, docstrings, and full pytest coverage.
  • Return None while fewer than period bars have been received.
  • Return an RSI of 50 when the price is unchanged.
  • Guard numerical stability during all-up runs, all-down runs, and long runs of tiny moves.

The Reviewer stage found three typical classes of problems (a sketch incorporating the fixes follows the list):

  • Warm-up miscalculation: the first period bars should be seeded with a simple average, not fed directly into Wilder smoothing.
  • Missing division-by-zero guard: when average gain and average loss are both 0, the formula hits 0/0; unchanged prices should return 50.
  • Insufficient numerical stability: during long runs of tiny gains or losses, the SMMA can be dominated by floating-point error and needs an epsilon or an explicit branch.
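
A sketch that folds the three review findings back into the implementation. It follows this article's warm-up convention (first valid value on the period-th bar, seeded by a simple average) and the IncrementalIndicator interface above; treat it as a candidate implementation, not a signed-off one.

from typing import Optional

class IncrementalRSI(IncrementalIndicator):
    """Wilder RSI with O(1) state; warm-up seeded by a simple average."""

    def __init__(self, period: int = 14):
        self.period = period
        self.reset()

    def reset(self) -> None:
        self.prev_close: Optional[float] = None
        self.avg_gain = 0.0
        self.avg_loss = 0.0
        self.bar_count = 0

    @property
    def is_ready(self) -> bool:
        return self.bar_count >= self.period

    def update(self, bar: BarData) -> Optional[float]:
        self.bar_count += 1
        if self.prev_close is None:
            self.prev_close = bar.close
            return None                          # first bar: no price change yet

        change = bar.close - self.prev_close
        self.prev_close = bar.close
        gain, loss = max(change, 0.0), max(-change, 0.0)

        if self.bar_count < self.period:
            self.avg_gain += gain                # warm-up: accumulate raw sums
            self.avg_loss += loss
            return None
        if self.bar_count == self.period:
            n = self.period - 1                  # fix 1: seed with a simple average
            self.avg_gain = (self.avg_gain + gain) / n
            self.avg_loss = (self.avg_loss + loss) / n
        else:
            p = self.period                      # Wilder smoothing thereafter
            self.avg_gain = (self.avg_gain * (p - 1) + gain) / p
            self.avg_loss = (self.avg_loss * (p - 1) + loss) / p

        total = self.avg_gain + self.avg_loss
        if total < 1e-12:                        # fixes 2/3: flat prices, avoid 0/0
            return 50.0
        return 100.0 * self.avg_gain / total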

The Tester stage must turn these findings into tests rather than leave them as review-report reminders. The test list includes at least the following (a pytest sketch follows):

  • The first bar returns None.
  • Bar period - 1 still returns None.
  • Bar period returns a valid value between 0 and 100.
  • RSI equals 50 when prices are continuously unchanged.
  • RSI is above 50 on a rising streak.
  • RSI is below 50 on a falling streak.
  • Wilder-smoothed results stay within reasonable bounds and match a reference implementation.
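
A pytest sketch of the first six items, assuming the IncrementalRSI sketch above and a minimal stand-in for the project's BarData:

from collections import namedtuple

import pytest

Bar = namedtuple("Bar", "close")  # minimal stand-in for the project's BarData

def feed(rsi, closes):
    return [rsi.update(Bar(c)) for c in closes]

def test_warmup_returns_none():
    values = feed(IncrementalRSI(period=14), [100.0] * 13)
    assert all(v is None for v in values)        # bars 1 .. period - 1

def test_flat_prices_return_50():
    values = feed(IncrementalRSI(period=14), [100.0] * 20)
    assert values[13] == pytest.approx(50.0)     # bar `period`: first valid value

def test_rising_streak_is_above_50():
    values = feed(IncrementalRSI(period=14), [100.0 + i for i in range(20)])
    assert values[-1] is not None and 50 < values[-1] <= 100

def test_falling_streak_is_below_50():
    values = feed(IncrementalRSI(period=14), [100.0 - i for i in range(20)])
    assert values[-1] is not None and 0 <= values[-1] < 50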

The Documenter stage completes the usage notes, algorithm description, caveats, and performance characteristics:

  • Usage: create IncrementalRSI(period=14) and call update for each bar; None means there is not yet enough data.
  • Algorithm: a simple average seeds the warm-up stage, followed by Wilder smoothing.
  • Caveats: the first period - 1 bars produce no value; unchanged prices return 50; memory usage is independent of history length.
  • Performance: O(1) per update, fixed memory, about 3 ms for 10,000 K-lines versus roughly 150 ms for Pandas batch calculation.

The closed loop formed by these five rounds matters more than any single prompt. The Architect keeps interface boundaries from drifting, the Implementer turns boundaries into code, the Reviewer finds algorithmic and numerical flaws, the Tester freezes those flaws into regression protection, and the Documenter passes usage constraints to future maintainers. Without the Reviewer, the warm-up miscalculation could have landed on the mainline; without the Tester, "unchanged prices return 50" would have stayed verbal; without the Documenter, the next maintainer would not know why the warm-up stage cannot use Wilder smoothing directly.

BMAD quality gating

BMAD only makes sense together with quality gates. Otherwise, multiple agents just take turns generating text without changing the quality of the system.

Figure 4: Human-machine quality-gate swim lanes. AI produces candidate output; humans own semantic sign-off, risk acceptance, and evidence closure.

Quality gates can be set at five levels:

  • Architect solutions must pass interface completeness and complexity analysis.
  • Implementer code must pass unit testing, type checking, and interface constraints.
  • Reviewer’s review must have no serious problems and the code quality must reach the agreed threshold.
  • Tester tests must cover edge cases and performance constraints.
  • Documenter documentation must contain API descriptions, usage examples, and notes.

The benefit of BMAD is not that delivery necessarily gets faster. A more realistic outcome: average delivery time may grow from 1 day to 1.5 days, while the bug escape rate drops, boundary coverage improves, documentation completeness rises, and maintenance cost falls. For trading systems this trade-off is usually worth it, because live incidents and wrong indicators cost far more than an extra half day of delivery.

Quality gates must also keep their rejection paths. If the Architect's plan does not say whether backtesting and live trading share an interface, it cannot enter implementation; if the Implementer's code lacks type and exception boundaries, it cannot enter review; if the Reviewer finds serious problems, the Tester cannot "prove it runs" with more tests; if the Tester has not covered boundary samples, the Documenter must not describe the capability as stable. The value of gating is to surface failures early, not to make the process look smooth.

Part 3: Best Practices for Prompt Engineering

Prompt engineering is not about writing requirements at greater length; it turns roles, constraints, inputs, outputs, examples, clarifying questions, and context into checkable structure.

A practical criterion: does each sentence in the prompt affect the output? If a sentence influences none of the interface, tests, performance, error handling, documentation, or acceptance, it is probably just tone. Prompts for a trading system should cut un-checkable phrases such as "try to be high quality" and "write elegantly", and add verifiable constraints such as "what is returned on empty input", "how is overnight ownership decided", "what is the performance target", and "how does it degrade on failure".

Principle 1: Role Activation

Roles should be specific about domain, experience, and review style. For example: "a quantitative trading system architect with 10 years of high-performance computing experience, specializing in low-latency system design, Python performance optimization, and financial data structures, whose style emphasizes edge cases, concise implementations, and testability". This role makes the AI far more inclined to discuss complexity, state, and testing than a bare "help me optimize an indicator".

Principle 2: Constraint List

Constraints must be separated into hard constraints and suggestions. Hard constraints: O(1) memory, O(1) per update, type annotations required, a unit test for every public method. Suggestions: use a deque for the sliding window, consider floating-point error, add a benchmark. Hard constraints are for acceptance; suggestions guide the implementation.

Principle 3: Input and output formats

Input and output formats should pin down the data structures. For example: the input is List[BarData] with open/high/low/close/volume, and period is an integer. The output includes implementation code, complexity analysis, and pytest tests. With the format fixed, the AI more reliably produces acceptable code instead of explanatory prose that cannot be executed.

Principle 4: Ask and confirm

Complex tasks should start by having the AI ask 3 to 5 clarifying questions. For a K-line aggregation function, the questions that most need answers include:

  • Is the input K-line a dict, a class, or a DataFrame?
  • Does the target period only cover minutes to hours, or arbitrary periods?
  • On missing data, skip, fill, or raise an error?
  • How is the overnight session attributed: does Friday night count as Friday or the following Monday?
  • Is the performance target real-time streaming or batch processing?

These questions look basic, but they are exactly where accidents come from.

Principle 5: Iterative refinement

Vague feedback makes the AI rewrite at random. Better feedback names the location, the problem, and the required fix: "the division lacks a zero guard; when the denominator is 0 and the price is unchanged, return 50", "the current time complexity is O(n); switch to incremental O(1) state", "the period > len(data) case is unhandled". Structured feedback turns iteration into a converging process.

Principle 6: Example-driven

Providing normal, boundary, and abnormal examples significantly reduces misunderstanding. An EMA example: input [100, 101, 102, 101, 100] with period=3 and alpha=2/(3+1)=0.5 yields [100, 100.5, 101.25, 101.125, 100.5625]. This single example pins down the initialization, the recurrence, and the expected values.
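
The example can be checked mechanically, which is what makes it a reliable prompt anchor. A few lines reproduce the expected sequence under the stated initialization (seed with the first price):

def ema_series(prices: list[float], period: int) -> list[float]:
    alpha = 2 / (period + 1)             # alpha = 0.5 for period = 3
    out = [prices[0]]                    # seed with the first price
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out

assert ema_series([100, 101, 102, 101, 100], 3) == [100, 100.5, 101.25, 101.125, 100.5625]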

Principle 7: Context Management

Large projects need enough context, but you cannot simply dump the whole code base into the AI at once. More useful context includes:

  • Technology stack: Python 3.10, PyQt6, SQLite or other actual stack.
  • Existing files: core/data.py defines BarData, core/indicators.py defines the indicator base class.
  • Architecture constraint: All indicators must inherit IncrementalIndicator.
  • Current task: Implement IncrementalEMA or IncrementalRSI.

A complete prompt can be organized like this:

Role: senior Python performance engineer for a trading system.

Task: implement an incremental EMA indicator.

Background:
- The project already has an IncrementalIndicator base class.
- The runtime scenario updates indicators one bar at a time.
- EMA formula: EMA(today) = alpha * price + (1 - alpha) * EMA(yesterday).

Constraints:
1. Implement the IncrementalIndicator interface.
2. Return the first bar's price as the initial EMA value.
3. Keep memory usage O(1); do not retain full history.
4. Include type annotations, docstrings, and unit tests.

Example:
input [100, 101, 102, 101, 100], period=3.
output [100, 100.5, 101.25, 101.125, 100.5625].

Provide 2-3 implementation notes about initialization and numerical behavior.

Part 4: Boundaries and Strategies of AI Assistance

AI can enter many engineering links, but sign-off rights differ by link and must not be mixed together.

Tasks that AI is good at include: boilerplate code, data classes, configuration parsing, logging, renaming, extracting functions, formatting, unit tests for a given function, explanations of complex code, docstrings, and README drafts. These tasks have clear boundaries and easily verified output.

Tasks that AI can take on as candidate implementations, subject to human review, include: well-specified concrete functions, simple bug fixes with clear reproduction steps, and performance optimization with clear target numbers. This type of task requires testing and review before entering the mainline.

Tasks that AI is not suitable for direct leadership include: core architecture design, requirements analysis, complex debugging and technology selection. The reason is not that AI cannot give advice, but that it lacks a true sense of business context, team constraints, historical debt, rollout risks, and long-term maintenance costs.

The division of labor between man and machine can fall into the following table:

| Stage | Lead | AI's role | Human responsibility |
| --- | --- | --- | --- |
| Requirements analysis | Human | Ask clarifying questions, compile draft specifications | Confirm business semantics and priorities |
| Architecture design | Human | Provide candidate lists and risks | Choose boundaries, accept costs |
| Specification writing | Human + AI | Draft user stories, acceptance criteria, constraints | Sign off on acceptance semantics |
| Code implementation | AI can lead | Generate candidate code | Review, test, merge |
| Test writing | AI can lead | Extend test cases | Judge whether an assertion reflects a real risk |
| Documentation | AI can lead | Generate first drafts and change notes | Confirm the documentation does not mislead maintainers |

The bottom line of this division of labor: AI can generate candidates, but it cannot independently sign off on semantic risk.

The boundary strategy must also land in everyday code review. The review checklist can require: is AI-generated code marked with its source and constraints; is there a corresponding specification; are there boundary tests; is there performance evidence; has a human signed off on the business semantics; is there a rollback path. This is not about excluding AI, but about holding AI output and human code to the same engineering discipline.

Part 5: Refactoring legacy code with AI

AI-assisted refactoring is best started on legacy modules with clear boundaries and controllable risk. A typical scenario is a 2000-line chart_widget.py that mixes data acquisition, chart rendering, and user interaction. As features accumulate, such a file becomes a no-go zone for modification: changing an indicator display may affect data loading, and changing an interaction event may trigger a rendering regression.

The first step is to have the AI analyze the current state, with the prompt requiring structured output:

  • Locations that violate single responsibility.
  • Circular dependencies.
  • The parts that are hard to test.
  • Refactoring priorities.
  • For each problem: location, description, recommended fix, and priority.

If the AI only outputs "recommend splitting the module", the value is low; if it can point to the coupling points between data access, rendering state, user interaction, and event subscription, there is an executable entry point for the next step.

The second step is to have the AI design the new architecture, under constraints that must include:

  • No external interface changes; maintain backward compatibility.
  • Each layer can be tested independently.
  • Use the Observer pattern to decouple data changes from UI updates (sketched below).
  • Support new chart types in the future.
  • Output file-organization suggestions, key class definitions, and a data-flow description.
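
A minimal sketch of the Observer constraint from the list above: the data layer notifies subscribers without knowing anything about the UI. Class and callback names are illustrative.

from typing import Callable, List

class BarSeries:
    """Data layer: owns bars and notifies observers on change."""

    def __init__(self) -> None:
        self._bars: List["BarData"] = []
        self._observers: List[Callable[["BarData"], None]] = []

    def subscribe(self, callback: Callable[["BarData"], None]) -> None:
        self._observers.append(callback)

    def append(self, bar: "BarData") -> None:
        self._bars.append(bar)
        for notify in self._observers:      # UI reacts; data layer stays UI-free
            notify(bar)

# The chart widget subscribes instead of polling the data layer:
#   series.subscribe(chart_widget.on_new_bar)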

The third step is to have the AI generate a migration plan: staged migration, each stage independently runnable, each stage with its own tests, and effort, dependencies, and rollback strategy written down. The AI may propose four stages and the team may compress them to three based on actual risk; what matters is not the number of stages but that each stage can be verified independently.

The fourth step is to implement it step by step. Each stage should follow the same rhythm: AI generates candidate code based on specifications, humans review, run tests, fix problems, and then move on to the next stage.

The fifth step is verification. A complete test suite proves at least three things: the functional output of the old and new versions is consistent, there is no performance degradation, and the code coverage reaches the agreed threshold.

Refactoring legacy code especially requires limiting the AI's modification radius. If one refactoring changes the file structure, event model, rendering logic, data access, and UI interaction all at once, reviewers cannot tell which change introduced a risk. A more reliable sequence: extract the interface first and keep the old path; run the new path in parallel with the old; confirm consistency with tests and user scenarios; only then remove the old link. AI can generate the migration steps, but whether each step can be rolled back and whether old behavior is preserved still requires human judgment.

The refactoring results can be expressed in numbers:

  • The code went from 2000 lines in a single file to about 1500 lines in 3 files.
  • Test coverage improved from about 30% to about 80%.
  • Modification response time dropped from “it takes 2 days to change one place” to “it takes 2 hours to change one place”.
  • More importantly, team members dared to maintain this module again.

This shows that the goal of AI-assisted refactoring is not to make the directory structure look prettier, but to reduce the amount of context a maintainer must hold in mind at once. Once the data, rendering, and control layers are separated, testing, review, and performance work all have clearer entry points.

Part 6: BMAD-Speckit-SDD-Flow: turning practice into a governed delivery flow

Everything above is method. Putting it into practice surfaces finer-grained engineering problems: the specification is written, but how do you judge that the implementation entrance is ready; multiple agents are running, but who decides the next global step; the review happened, but can a failure actually be blocked; the tests ran, but can the evidence flow back into dashboards, scoring, and training data. These problems recur in AI-engineered delivery of quantitative trading systems, especially in tasks such as trading calendars, indicator refactoring, chart performance, test completion, and legacy-code splitting, which constantly expose the risk that the process runs but governance never closes the loop.

BMAD-Speckit-SDD-Flow is an open source project distilled from this kind of practice. Built on BMAD-METHOD and Spec Kit, it integrates requirement specification, audit flows, runtime monitoring, and scoring feedback into one delivery link. It does not invent yet another prompt template; it solidifies specify -> plan -> audit -> readiness gate -> runtime governance -> close-out into an engineering surface that can be installed, inspected, and observed.

From the perspective of technical architecture, this project can be understood as a five-layer delivery architecture:

| Layer | Focus | Meaning for a quantitative trading system |
| --- | --- | --- |
| Product Def | Product definition and business semantics | Trading-day attribution, half-day sessions, and risk-control boundaries are pinned down first |
| Epic Planning | Epics and cross-module plans | Layered data, a unified backtest/live interface, staged decomposition of chart performance |
| Story Dev | Story life cycle | Every story has a specification, tasks, tests, and an acceptance path |
| Technical Implementation | Technical execution | Sub-agents execute only bounded packets and cannot change the global route on their own |
| Finish | Closure and evidence | Pass, required-fixes, blocked, and rerun outcomes all land in execution records and the dashboard |

The key to this architecture is not the layer names but that both "before implementation" and "after implementation" are brought under governance. Many AI engineering flows focus only on the code generation in the middle, while the real problems sit at both ends: no readiness baseline in front, no close-out evidence behind.

From a module perspective, the project provides several types of core components:

  • _bmad/: Canonical source for workflow modules, hooks, prompts, routes, and host-side assets.
  • packages/scoring/: Scoring engine, readiness drift assessment, kanban projection, diagnostic input and training data extraction.
  • dashboard: The default runtime observable layer for viewing runtime status, snapshots and scoring projections.
  • runtime-mcp: Optional MCP tool interface, only enabled when runtime data needs to be exposed to the Agent tool surface.
  • speckit-workflow: Covers Specify, Plan, GAPS, Tasks, TDD, with mandatory audit loop.
  • bmad-story-assistant, bmad-bug-assistant, bmad-standalone-tasks: serve Story, Bug and independent task paths respectively, but the global route is still determined by the main Agent after reading inspect.

It focuses on managing five common AI coding failure modes.

The first category is requirement hallucination. When requirements are incomplete, AI happily fills in the business semantics: natural days default to trading days, half days default to having no night session, missing data defaults to being fillable. In a trading system this kind of plausible-looking completion directly pollutes indicators and backtests. BMAD-Speckit-SDD-Flow binds requirements, clarifying questions, assumptions, and tasks together with Specify, Clarify, GAPS, and cross-document traceability. Semantics not recorded in a specification or clarification cannot be turned directly into implementation.

The second category is implementation drift. Even with correct initial specifications, an agent can gradually depart from the original constraints: interfaces changed to make a test pass, performance budgets bypassed to simplify code, backtesting and live trading forked into two sets of logic to hit the schedule. The project controls the implementation radius with story packets, bounded tasks, main-agent inspect, and task-level traceability. A sub-agent executes only the authorized packet and does not set the global route; results must trace back to specifications, tasks, tests, and the evidence chain, not merely to whether the code runs.

The third category is pseudo-implementation without E2E integration evidence. This is the most insidious failure: the code, unit tests, and documentation all exist, but there is no end-to-end evidence and the feature never really joins the live path. A common symptom in quantitative systems: the aggregator's unit tests pass while the UI, data import, backtest engine, and live monitoring do not share the same semantics. The project introduces Smoke E2E Readiness and an evidence proof chain, requiring critical paths to have a minimum smoke scenario, an evidence chain, and an acceptance record. An implementation without E2E evidence counts only as a candidate, not as done.

The fourth category is one-pass execution without critical iteration. One-shot execution squeezes design, implementation, testing, fixing, and closure into a single round. The AI may produce something runnable on the first pass, but it has never faced the Reviewer's counterexamples, the Tester's boundary samples, or a redesign. The project makes audit loops, required-fixes, blocked, rerun, and scoring part of the process so that failures are not swallowed. For issues like RSI warm-up, half-day attribution, and performance budgets, the process must be able to fall back to clarification, planning, or implementation rather than closing after one round.

The fifth category is premature closure without reviewable delivery artifacts. Much AI delivery ends at "the code is changed and the tests ran", leaving later maintainers unable to see the specifications, decisions, failure fixes, E2E evidence, and closing conclusions. The project wires close-out, packet execution records, the dashboard, the coach, and SFT extraction together so that pass, required-fixes, blocked, and rerun outcomes become reviewable assets. For a trading system maintained over years, this means failure samples, repair processes, audit conclusions, and final evidence can feed the next round of diagnosis instead of scattering across chat logs, terminal output, and personal memory.

A practical way to adopt it is to install from the consuming project rather than modifying the framework source:

npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit version
npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit-init . --agent codex --full --no-package-json
npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit dashboard-status

This type of tool is best suited to delivery scenarios with high risk, long chains, and strong evidence requirements, such as trading calendars, matching engines, indicator refactoring, unified backtest interfaces, risk-control state machines, and performance management. For small everyday fixes it may feel too heavy; where real financial risk, multi-person collaboration, and long-term maintenance are involved, mandatory specifications, audits, readiness, runtime control, and evidence closure actually reduce long-term cost.

Summary: Three core principles of AI engineering

First, specifications come first. AI needs constraints, and specifications are the most effective constraint. Spending 30 minutes writing down trading days, half-day sessions, data gaps, and acceptance criteria often saves hours of rework.

Second, multi-agent collaboration. A single AI easily falls into a local optimum. BMAD's Architect, Implementer, Reviewer, Tester, and Documenter reduce blind spots from five perspectives: architecture, implementation, review, testing, and documentation.

Third, human-machine division of labor. AI generates candidates, expands coverage, and organizes material; humans own semantics, risk, and sign-off. A trading system faces real markets, real money, and long-term maintenance; every piece of AI output must pass through specifications, tests, review, and evidence closure.

Tools and Resources

Open source project entry points:

  • GitHub Spec Kit: GitHub's open source Spec-Driven Development toolkit, connecting product scenarios, specifications, and implementation entry points.
  • BMAD-METHOD: an open source agile AI-driven development method and module ecosystem, using role-based agents and workflows to raise collaboration quality.
  • BMAD-Speckit-SDD-Flow: a governed SDD delivery flow built on BMAD-METHOD and Spec Kit, providing runtime control, mandatory audits, a kanban, and an npm installation path.

Speckit commands can be organized by function: /speckit.specify <feature-name> creates specifications, /speckit.clarify <decision-topic> records technical decisions, /speckit.plan generates implementation plans, /speckit.checklist does acceptance checks.

BMAD Agents can be activated by role: /bmad-agent-bmm-architect for architectural scenarios, /bmad-agent-bmm-dev for candidate implementations, /bmad-agent-bmm-qa for testing and QA.

Common tools include the Cursor AI editor, Claude Code CLI, Codex, speckit spec templates, the BMAD multi-agent workflow, and bmad-speckit-sdd-flow. The tool itself is not the point; the point is that every output enters specifications, tests, reviews, runtime control, and acceptance evidence.

Series ending

AI engineering is the capability layer on top of this series. The earlier system boundaries, real defects, test defense lines, performance management, and architecture evolution ultimately need a delivery process to carry them. Back in their own systems, readers can start with a minimal loop: pick one function with clear semantic risk, write the specification and clarify the decisions, break down the plan, generate candidate implementations, add tests and documentation, and sign off through a quality gate.

If AI makes code faster but evidence scarcer, system risk rises. Only when AI makes specifications, tests, reviews, and documentation more traceable has it truly entered engineering.
