Article
Quantitative trading system development record (4): test-driven agile development (AI Agent assistance)
Starting from a cross-night trading day boundary bug, we reconstruct the test defense line of the quantitative trading system: defect-oriented testing pyramid, AI TDD division of labor, boundary time, data lineage and CI Gate.
Readers can treat this article as a defect-oriented testing defense line for a quantitative trading system. The 100 real pitfalls of Part2/Part3 should not stay in an issue list; they need to be converted into executable tests, regression cases, test data lineage and CI Gates. Testing is not about piling up case counts, but about making sure real defects cannot cross the system boundaries again.
Series reading order
The recommended reading path is Part1 -> Part2 -> Part3 -> Part4 -> Part5 -> Part6 -> Part7. Part 4 is placed before performance optimization, because the trading system must first prove that the results are reliable before discussing speed, throughput and rendering performance.
This article focuses on six questions:
- Why quantitative system testing should work backwards from real defect and risk populations, rather than from templated coverage.
- How RED-GREEN-REFACTOR protects API semantics, boundary times, and refactoring safety in trading systems.
- When AI generates tests, which responsibilities can be handed over to AI and which acceptance criteria must be signed off by humans.
- How boundary time, attribute testing, fuzz testing, backtest consistency and test data lineage are combined into a trading system testing defense line.
- How AI assists integration testing and E2E testing while identifying implementation drift through requirements contract tracing.
- How CI Gate turns unit tests, integration tests, property tests, boundary time tests, and coverage reports into merge gates.
Introduction: an edge case not covered by tests
The most dangerous flaws before going live are usually not errors that occur every day, but errors that are only triggered on specific dates, in specific trading sessions, and with specific data combinations. A typical scenario: the same strategy ends up with an extra K-line on Monday, causing the indicator window to be misaligned. All ordinary test cases passed, but the real trading-day semantics have been contaminated.
The root cause comes from the overnight trading mechanism of Hong Kong futures. Friday night trading and Monday morning trading may belong to the same trading day in terms of trading semantics, but the physical time spans Friday, Saturday, Sunday and Monday. If the aggregation logic is only segmented according to natural days, Friday night trading, cross-midnight data and Monday morning trading may be mistakenly split, and the daily lines, indicators and backtest results will be contaminated.
What this flaw exposes is not “insufficient number of tests”, but that the test design is not aligned with the real risks. Effective testing must answer at least three questions: does the risk belong to syntax, state, concurrency, time, data, GUI, network or security boundary; what is the minimum recurrence data; should it be intercepted by unit testing, property testing, integration testing, end-to-end testing or CI Gate.
Part One: Why Quantitative Systems Are Difficult to Test
Quantitative systems are difficult to test not because the test framework is complex, but because the object under test simultaneously depends on data, time, state, randomness and external dependencies.
The first type of challenge is data dependence. The strategy logic relies on historical market conditions, but the complete historical data is large and cannot be submitted to version control; the real data may contain sensitive information and cannot be shared publicly; the exchange may also adjust historical data, causing old samples to become invalid. Therefore, test data cannot only rely on “a certain local CSV”, but needs to be supported by synthetic data, desensitized snapshots, contract testing and data lineage.
The second type of challenge is randomness. Network delay is uncertain, market pushes jitter, floating-point calculations carry precision errors, and backtesting and simulated trading may differ because events arrive in a different order. Tests cannot just assert that one value exactly equals another; they also need to handle tolerances, event ordering, and repeatable random seeds explicitly.
The third type of challenge is complex states. Position status, order status, funding status, layered data normalization status, and strategy internal indicator status may all carry over across events. If a test only looks at the return value of a single function, it is easy to miss state life cycle issues, such as cancellation after partial transactions, recovery of the indicator window after night trading, duplication of subscription status after reconnection, etc.
The fourth type of challenge is the density of edge cases. Opening, closing, lunch break, night trading, half-day market, holidays, cross-week, cross-day K-lines, data loss, network interruption, market delay and risk control freeze are all high-risk boundaries. A truly effective testing pyramid should infer the testing levels from these defect entrances.
The emphasis of this pyramid is "defect orientation". The bottom-level tests cover pure functions, indicator calculations, and time semantics, because these errors are repeatedly amplified by the upper-level strategies; the middle-level tests cover data sources, aggregators, strategy events, and test doubles, because multiple modules are wired together here; the upper-level tests cover backtest, simulated trading, and live trading consistency, because this verifies whether the system really runs according to business semantics; the top-level CI Gate is responsible for blocking merges, not for replacing the lower-level tests.
Part 2: TDD Basics: RED-GREEN-REFACTOR
The core loop of TDD is RED-GREEN-REFACTOR. For a trading system, its value is not to formally “write tests first”, but to use failed tests to define behavioral boundaries, use minimum implementation to verify that the tests are effective, and then refactor under test protection.
The goal of the RED phase is to define correctness with tests. Taking the simple moving average as an example, the test first defines [1, 2, 3, 4, 5] and the result when period=3 is [None, None, 2.0, 3.0, 4.0]. This test should fail first, either because the class does not exist, the method does not exist, or the behavior is not as expected. Failure itself is a valid signal: it proves that the test actually caught the current gap.
def test_sma_calculates_correctly():
    prices = [1, 2, 3, 4, 5]
    sma = SimpleMovingAverage(period=3)
    result = sma.calculate(prices)
    assert result == [None, None, 2.0, 3.0, 4.0]
The goal of the GREEN phase is to use the minimum implementation to make the test pass. The minimal implementation can even be hardcoded first, as long as it proves that the test itself works. Then enter the actual implementation: the first period - 1 result is None, the first window uses the initial sum, and subsequent incremental updates are through sliding windows.
class SimpleMovingAverage:
    def __init__(self, period: int):
        self.period = period

    def calculate(self, prices: list[float]) -> list[float | None]:
        if len(prices) < self.period:
            return [None] * len(prices)
        result: list[float | None] = [None] * (self.period - 1)
        window_sum = sum(prices[: self.period])
        result.append(window_sum / self.period)
        for index in range(self.period, len(prices)):
            window_sum += prices[index] - prices[index - self.period]
            result.append(window_sum / self.period)
        return result
The goal of the REFACTOR phase is to improve the structure without changing the external behavior. Refactoring in trading systems especially requires test protection, since indicator calculations, data attribution and strategy signals may be affected by local changes. If the test remains green, it means that the refactoring has not changed the defined behavior.
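As a concrete illustration, the sketch below shows one possible refactoring of the moving average under green tests: padding and the rolling mean are separated into named helpers while calculate keeps exactly the same inputs and outputs. The helper names _padding and _rolling_means are illustrative, not part of the original code.
# illustrative code, not production code; one possible refactoring under green tests
class SimpleMovingAverage:
    def __init__(self, period: int):
        self.period = period

    def calculate(self, prices: list[float]) -> list[float | None]:
        # External behavior is unchanged: the existing test must stay green.
        if len(prices) < self.period:
            return [None] * len(prices)
        return self._padding() + self._rolling_means(prices)

    def _padding(self) -> list[None]:
        # The first period - 1 positions have no complete window.
        return [None] * (self.period - 1)

    def _rolling_means(self, prices: list[float]) -> list[float]:
        # Incremental sliding-window sum, identical to the GREEN implementation.
        window_sum = sum(prices[: self.period])
        means = [window_sum / self.period]
        for index in range(self.period, len(prices)):
            window_sum += prices[index] - prices[index - self.period]
            means.append(window_sum / self.period)
        return means
If test_sma_calculates_correctly still passes after this change, the refactoring has preserved the defined behavior.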
The direct benefits of TDD to the trading system can be compressed into a table:
| Dimensions | No TDD | Have TDD |
|---|---|---|
| code coverage | Commonly stays around 30% | Can be stably advanced to more than 80% |
| Bug escape rate | Boundary defects are easy to penetrate | Real defects can be deposited into regression use cases |
| Refactoring confidence | Afraid to touch core modules | Can refactor in stages with continuous verification |
| API design | Working backward from implementation details | Driven by test scenarios |
| Documentation value | Documentation lags behind code | Tests become runnable documents |
Part 3: Test Patterns and Structure
Test structure determines whether readers can quickly understand what a use case is verifying. The three most commonly used structures in quantitative systems are AAA, Given-When-Then, and table-driven testing. They solve problems at different levels: AAA makes function behavior clear, Given-When-Then makes business scenarios clear, and table-driven makes boundary collections clear.
AAA is suitable for functional and service level testing. Arrange prepares data and dependencies, Act performs the operation under test, and Assert verifies the results. The following skeleton, although simple, prevents tests from mixing preparations, actions, and assertions.
# illustrative code, not production code
def test_xxx():
    """Test description"""
    # Arrange: prepare test data and environment
    input_data = ...
    expected_output = ...

    # Act: execute the operation under test
    actual_output = function_under_test(input_data)

    # Assert: verify the result
    assert actual_output == expected_output
When readers look at AAA tests, they should first check three questions: whether the input is minimal, whether there is only one action, and whether the assertion directly verifies the business behavior. Profit and loss calculation is the best example for AAA because its input, action and output boundaries are very clear.
# illustrative code, not production code
def test_calculate_profit():
    """Profit equals price difference multiplied by position size"""
    # Arrange
    entry_price = 25000
    exit_price = 25100
    position_size = 10
    expected_profit = 1000  # (25100 - 25000) * 10

    # Act
    actual_profit = calculate_profit(
        entry=entry_price,
        exit=exit_price,
        size=position_size,
    )

    # Assert
    assert actual_profit == expected_profit
The value of this code does not lie in the complexity of the formula, but in the fact that it pins down the basic semantics that the trading system amplifies most easily. Without independent tests for profit-and-loss calculation, subsequent positions, risk control, backtest statistics and performance attribution all rest on an unreliable foundation.
Given-When-Then is more suitable for strategic behavior and acceptance scenarios. It makes the test read like a business specification: given a certain market state, when the strategy handles the event, it should produce a certain trading action.
# illustrative code, not production code
def test_user_story_scenario():
    """Scenario description"""
    # Given: the initial state
    setup_conditions()

    # When: the action under test happens
    perform_action()

    # Then: the expected results hold
    verify_expected_outcomes()
The strategic golden cross buying is a typical scenario. The test should not only verify the change of an internal variable, but also verify whether the strategy generates an executable business signal when facing the last Bar.
# illustrative code, not production code
def test_strategy_generates_buy_signal():
    """A golden cross should generate a BUY signal"""
    # Given: historical data in which MA5 has just crossed above MA10
    strategy = MovingAverageCrossStrategy(fast=5, slow=10)
    historical_data = load_data_with_golden_cross()

    # When: the strategy handles the last Bar
    signal = strategy.on_bar(historical_data[-1])

    # Then: a BUY signal is generated at the close price
    assert signal.action == Action.BUY
    assert signal.price == historical_data[-1].close
This code turns “Golden Cross” from a verbal description into a runnable specification. Readers should pay attention to whether the Given part actually constructs a golden cross, rather than just relying on the function name to imply the scene; whether the Then part verifies the action and price, rather than just verifying that the return object is not empty.
Table-driven testing is suitable for a large number of boundary values and equivalence classes. To add a new scenario, you only need to add one row of data. pytest parameterization can ensure that each use case runs independently and displays the specific scenario when it fails.
# illustrative code, not production code
import pytest

TEST_CASES = [
    # input, expected, description
    ([1, 2, 3], 2.0, "positive integers"),
    ([-1, -2, -3], -2.0, "negative integers"),
    ([1.5, 2.5, 3.5], 2.5, "floating-point values"),
    ([5, 5, 5], 5.0, "identical values"),
    ([1], 1.0, "single element"),
]

@pytest.mark.parametrize("input_data,expected,desc", TEST_CASES)
def test_average_calculation(input_data, expected, desc):
    """Each row of the table is an independent average-calculation case"""
    result = calculate_average(input_data)
    assert result == expected, f"Failed on {desc}"
The key to table-driven testing is to explicitly write out “what input categories are there”. Quantitative systems can put price series, trading periods, order statuses, missing data patterns, and abnormal inputs into tables, rather than duplicating multiple nearly identical tests.
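As a quantitative-flavored sketch of this idea, the table below enumerates trading-session categories instead of numeric averages. get_trading_session and the session labels are hypothetical names used only to illustrate the table layout; the session boundaries are assumptions, not the project's actual calendar.
# illustrative code, not production code; get_trading_session and the session
# labels are hypothetical, and the session boundaries are assumed for illustration
import pytest
from datetime import datetime

SESSION_CASES = [
    # timestamp, expected_session, description
    (datetime(2024, 1, 8, 9, 30), "day", "regular morning session"),
    (datetime(2024, 1, 8, 12, 30), "lunch_break", "lunch break"),
    (datetime(2024, 1, 8, 17, 30), "night", "night session open"),
    (datetime(2024, 1, 9, 1, 30), "night", "night session after midnight"),
    (datetime(2024, 1, 8, 4, 0), "closed", "market closed"),
]

@pytest.mark.parametrize("timestamp,expected_session,desc", SESSION_CASES)
def test_trading_session_classification(timestamp, expected_session, desc):
    """Each row pins one trading-session boundary instead of one happy path"""
    assert get_trading_session(timestamp) == expected_session, f"Failed on {desc}"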
| model | Applicable scenarios | risk interception point |
|---|---|---|
| AAA | Pure functions, indicators, profit and loss calculations | Clear separation of input, action, and output |
| Given-When-Then | Strategy signals, user stories, acceptance scenarios | Binding business semantics and behavioral results |
| table driven | Multiple sets of boundary values, equivalence classes, abnormal input | Prevent only testing one happy path |
Part 4: Test Doubles: Isolating External Dependencies
Quantitative systems rely heavily on external data sources, order managers, databases, logs, and trading gateways. The purpose of a test double is not to “mock everything”, but to isolate uncontrollable dependencies while retaining the behavior you want to verify. Dummy, Stub, Spy, Mock and Fake solve different problems, and mixing them will make the test brittle.
Dummy is a placeholder object and is only used to meet parameter requirements. The logger can be an object that will not be used when the current test does not verify logging behavior.
# illustrative code, not production code
def test_dummy_logger_does_not_affect_fetch():
    """A Dummy logger only fills the parameter and is never asserted on"""
    dummy_logger = object()
    service = DataService(logger=dummy_logger)
    result = service.fetch_data("HSI")
    assert result is not None
This code only proves that DataService.fetch_data does not rely on logger behavior. Readers should not assert logging calls in Dummy tests, otherwise Dummy will be mistakenly upgraded to Mock.
Stub is a fixed return value object, suitable for replacing the real market source. When the strategy only requires a certain price, Stub can ensure that the test does not rely on real APIs, network delays, and market permissions.
# illustrative code, not production code
class StubDataFeed:
    """Stub that returns a fixed price"""
    def __init__(self, fixed_price: float):
        self.fixed_price = fixed_price

    def get_price(self, symbol: str) -> float:
        return self.fixed_price

def test_strategy_with_stub_price():
    """Test the strategy against a Stub price feed"""
    stub_feed = StubDataFeed(fixed_price=25000)
    strategy = SimpleStrategy(data_feed=stub_feed)
    signal = strategy.check_signal()
    assert signal is not None
The reader value of Stub is stable input. As long as the test goal is strategy logic, the real market service should not be a source of failure.
Spy records the interaction between the object under test and collaborators, and is suitable for verifying the number, direction and quantity of orders. It does not preset complex expectations and only records the interaction for use in assertions.
# illustrative code, not production code
class SpyOrderManager:
    """Spy that records every order call"""
    def __init__(self):
        self.orders_placed = []

    def place_order(self, symbol: str, side: str, quantity: int):
        self.orders_placed.append({
            "symbol": symbol,
            "side": side,
            "quantity": quantity,
        })

def test_strategy_places_buy_order_once():
    """Verify that exactly one BUY order is placed"""
    spy = SpyOrderManager()
    strategy = SignalStrategy(order_manager=spy)
    strategy.on_signal(Signal.BUY)
    assert len(spy.orders_placed) == 1
    assert spy.orders_placed[0]["side"] == "BUY"
This code protects the order side-effect boundary. Readers should pay attention to the two assertions, "only one order is placed" and "the direction is correct", because duplicate orders and wrong directions can lead to real losses.
Mock presets expectations and verifies calling behavior, and is suitable for checking whether a method is called according to specified parameters. Overuse of mocks binds tests to implementation details, so only use them when the interaction itself is part of the behavior.
# illustrative code, not production code
from unittest.mock import Mock

def test_strategy_calls_data_feed_with_symbol():
    """Verify that the strategy requests the HSI contract exactly once"""
    mock_feed = Mock()
    mock_feed.get_price.return_value = 25000
    strategy = Strategy(data_feed=mock_feed)
    strategy.update()
    mock_feed.get_price.assert_called_once_with("HSI")
    assert mock_feed.get_price.call_count == 1
The key assertion in this code is assert_called_once_with("HSI"). If the policy requests the wrong contract, duplicate requests, or missed requests, Mock can immediately expose interface interaction errors.
Fake is a simplified but real usable implementation, such as an in-memory database. Repository testing uses Fake to preserve read and write semantics while avoiding the cost of real database connections, transactions, and cleanup.
# illustrative code, not production code
class FakeDatabase:
    """In-memory Fake database"""
    def __init__(self):
        self.data = {}

    def save(self, key: str, value: dict):
        self.data[key] = value

    def get(self, key: str) -> dict:
        return self.data.get(key)

    def delete(self, key: str):
        if key in self.data:
            del self.data[key]

def test_repository_with_fake_db():
    """Test the Repository against the Fake database"""
    fake_db = FakeDatabase()
    repo = BarRepository(database=fake_db)
    bar = BarData(symbol="HSI", close=25000)
    repo.save(bar)
    retrieved = repo.get("HSI")
    assert retrieved.close == 25000
The value of Fake is to preserve the true semantics. Readers can understand it as a “lightweight but runnable alternative implementation”, which is more suitable than Mock for testing the reading and writing process and Repository boundaries.
| Stand type | use | Quantitative system example |
|---|---|---|
| Dummy | Fill in parameters that will not be used | Logger that does not participate in assertions |
| Stub | Return fixed response | Fixed price quote source |
| Spy | Record interaction | Record the number of strategic orders placed |
| Mock | Verify calling behavior | Verify market interface call parameters |
| Fake | Simplify real implementation | In-memory database or local order book |
Part 5: AI-assisted TDD process
AI-assisted TDD should not let AI take over acceptance criteria. A more secure division of labor is: humans define defects, business semantics, and acceptance boundaries; AI expands the test framework, generates candidate implementations, and makes refactoring suggestions; and humans ultimately review whether the tests really cover risks.
The three steps of traditional TDD are manual writing of tests, manual writing of implementation, and manual refactoring. AI-assisted TDD can be changed to: manually write specifications and key boundaries, AI generates a test framework; AI generates candidate implementations, and humans review; AI provides refactoring suggestions, and humans judge whether to accept them. This improves speed but does not hand over business semantic sign-off to the AI.
K-line period aggregation is a complete example. The specification is defined first: given 1-minute K-line data, generate 5-minute K-lines; the input contains open/high/low/close/volume/turnover; the output is a list of 5-minute K-lines. The core rules include: boundaries normalize to period boundaries such as 09:00, 09:05, 09:10; open takes the first bar's open, high takes the maximum, low takes the minimum, close takes the last bar's close, and volume and turnover are summed. Boundary cases include: fewer than 5 bars generate nothing, data crossing a 5-minute boundary is segmented correctly, and timestamps not on a boundary are rounded down.
Specifications should be locked down to business rules in natural language before being handed over to AI for extended testing. The following specification is not a decorative document, it qualifies that AI cannot interpret aggregation semantics at will.
## Specification: K-line period aggregation
### Requirements
Given 1-minute bar data, generate 5-minute bar data.
### Input
- A list of 1-minute bars with open, high, low, close, volume, turnover
### Output
- A list of 5-minute bars
### Rules
1. Boundary normalization: 5-minute bars align to 09:00, 09:05, 09:10...
2. Price aggregation:
   - open = the open of the first 1-minute bar
   - high = max(1-minute high)
   - low = min(1-minute low)
   - close = the close of the last 1-minute bar
3. Volume aggregation: volume = sum(1-minute volume), turnover = sum(1-minute turnover)
### Boundary cases
- Fewer than 5 bars of data: do not generate a K-line
- Data crossing a 5-minute boundary: split into the correct periods
- Timestamp not on a boundary: round down to the 5-minute boundary
Readers can think of this specification as an input contract for AI-generated testing. Without it, it is easy for AI to generate tests that “seem reasonable but are semantically uncertain”, such as ignoring boundary normalization or aggregating data for insufficient periods.
Prompt generation for AI testing should clearly define the testing strategy, sample factory, and assertion goals.
Generate pytest tests for the aggregation behavior described below.
Requirements:
1. Use the AAA structure.
2. Cover normal cases, boundary cases, and error cases.
3. Use input fixtures that expose trading-session ownership.
4. Generate test data with factory functions.
5. Give each test a docstring describing the verified behavior.
The following test framework retains key details in the original version: the factory function is responsible for generating time-continuous Bar, normal use cases verify OHLCV aggregation, edge use cases verify that insufficient cycles are not generated, and parameterized use cases verify the relationship between the number of inputs and the number of outputs.
# illustrative code, not production code
import pytest
from datetime import datetime, timedelta
from typing import List

from core.aggregation import aggregate_bars
from core.data import BarData

class TestAggregateBars:
    """Tests for K-line period aggregation"""

    @staticmethod
    def create_bars(count: int, start_time: datetime) -> List[BarData]:
        """Factory function: create time-continuous test bars"""
        bars = []
        for i in range(count):
            bars.append(
                BarData(
                    symbol="HSI",
                    timestamp=start_time + timedelta(minutes=i),
                    open_price=100 + i,
                    high_price=105 + i,
                    low_price=95 + i,
                    close_price=100 + i + 0.5,
                    volume=1000,
                    turnover=100000,
                )
            )
        return bars

    def test_normal_case(self):
        """Normal case: 5 one-minute bars aggregate into 1 five-minute bar"""
        start_time = datetime(2024, 1, 8, 9, 0)
        bars = self.create_bars(5, start_time)
        result = aggregate_bars(bars, period=5)
        assert len(result) == 1
        assert result[0].open_price == 100
        assert result[0].high_price == 109
        assert result[0].low_price == 95
        assert result[0].close_price == 104.5
        assert result[0].volume == 5000

    def test_insufficient_data(self):
        """Boundary case: fewer than 5 bars generate nothing"""
        start_time = datetime(2024, 1, 8, 9, 0)
        bars = self.create_bars(3, start_time)
        result = aggregate_bars(bars, period=5)
        assert len(result) == 0

    @pytest.mark.parametrize("input_count,expected_count", [
        (5, 1),
        (10, 2),
        (12, 2),  # the trailing 2 bars form an incomplete period
        (0, 0),
    ])
    def test_various_counts(self, input_count, expected_count):
        """Parameterized: output count follows from input count"""
        start_time = datetime(2024, 1, 8, 9, 0)
        bars = self.create_bars(input_count, start_time)
        result = aggregate_bars(bars, period=5)
        assert len(result) == expected_count
The architectural value of this test is that it first defines the output semantics and then allows implementation changes. As long as these tests exist, the same business results must be preserved whether subsequent use of loops, NumPy, vectorization, or incremental state.
The RED phase should fail first. Failure is not a bad thing, it proves that the test does catch the current gap.
$ pytest tests/test_aggregation.py -v
tests/test_aggregation.py::TestAggregateBars::test_normal_case FAILED
tests/test_aggregation.py::TestAggregateBars::test_insufficient_data FAILED
tests/test_aggregation.py::TestAggregateBars::test_various_counts FAILED
# Failure reason: aggregate_bars is not implemented yet
This output confirms to the reader that the tests are not a post-hoc patch. If a test passes on its very first run, it is likely that the test does not cover the real gap, or that the implementation already exists but has never been constrained to the business semantics.
The implementation of the GREEN phase only needs to meet the defined tests and do not add too many abstractions in advance. The following implementation retains the core aggregation logic: slice by complete cycle and calculate open, high, low, close, volume and turnover respectively.
# illustrative code, not production code
from typing import List

def aggregate_bars(bars: List[BarData], period: int) -> List[BarData]:
    """Aggregate 1-minute bars into bars of the given period"""
    if len(bars) < period:
        return []
    result = []
    for i in range(0, len(bars) // period * period, period):
        group = bars[i:i + period]
        aggregated = BarData(
            symbol=group[0].symbol,
            timestamp=group[0].timestamp,
            open_price=group[0].open_price,
            high_price=max(b.high_price for b in group),
            low_price=min(b.low_price for b in group),
            close_price=group[-1].close_price,
            volume=sum(b.volume for b in group),
            turnover=sum(b.turnover for b in group),
        )
        result.append(aggregated)
    return result
This implementation only addresses behavior covered by the current test. Readers should note that there is still room for improvement. For example, the timestamp alignment logic has not yet been independent, and the remaining insufficient period processing strategy still relies on test constraints.
The REFACTOR phase then extracts timestamp alignment and aggregated-bar construction into dedicated helpers. The precondition of refactoring is that the tests keep passing, and the goal is to make the boundary semantics easier to understand for the next maintainer.
# illustrative code, not production code
def aggregate_bars(bars: List[BarData], period: int) -> List[BarData]:
    """Aggregate 1-minute bars into bars of the given period"""
    if len(bars) < period:
        return []
    result = []
    end = len(bars) - len(bars) % period
    for i in range(0, end, period):
        group = bars[i:i + period]
        aligned_time = _align_timestamp(group[0].timestamp, period)
        result.append(_create_aggregated_bar(group, aligned_time))
    return result

def _align_timestamp(timestamp: datetime, period: int) -> datetime:
    """Round the timestamp down to the period boundary"""
    minute = (timestamp.minute // period) * period
    return timestamp.replace(minute=minute, second=0, microsecond=0)

def _create_aggregated_bar(group: List[BarData], timestamp: datetime) -> BarData:
    """Create one aggregated K-line from a group of 1-minute bars"""
    return BarData(
        symbol=group[0].symbol,
        timestamp=timestamp,
        open_price=group[0].open_price,
        high_price=max(b.high_price for b in group),
        low_price=min(b.low_price for b in group),
        close_price=group[-1].close_price,
        volume=sum(b.volume for b in group),
        turnover=sum(b.turnover for b in group),
    )
The refactored code makes boundary normalization explicit. Readers should focus on whether _align_timestamp covers the real exchange sessions; if night session, half-day trading, or cross-week attribution is needed later, this function becomes a key test entry.
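A minimal sketch of that test entry, exercising the _align_timestamp helper above, pins the rounding-down rule from the specification before any session-calendar logic is added:
# illustrative code, not production code; exercises the _align_timestamp helper
# sketched above under the rounding-down rule from the specification
import pytest
from datetime import datetime

@pytest.mark.parametrize("raw_minute,expected_minute", [
    (0, 0),    # already on a boundary
    (3, 0),    # mid-period rounds down, never up
    (7, 5),
    (59, 55),  # last period of the hour
])
def test_align_timestamp_rounds_down(raw_minute, expected_minute):
    """Timestamps not on a 5-minute boundary must round down, never up"""
    raw = datetime(2024, 1, 8, 9, raw_minute, 42)
    aligned = _align_timestamp(raw, period=5)
    assert aligned == datetime(2024, 1, 8, 9, expected_minute, 0)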
Part 6: Quantifying System-Specific Testing Strategies
Quantitative testing cannot rely solely on handwritten samples. Handwriting samples are good for illustrating business intent, but rarely exhaust the input space. Trading systems require at least four types of proprietary strategies: attribute-based testing, fuzz testing, backtesting consistency verification, and boundary time testing.
Property-based testing is suitable for verifying mathematical properties. The core properties of the moving average include: the result length is related to the input length and period; each mean value should fall between the minimum and maximum values of the corresponding window; when the input sequence does not decrease monotonically, the moving average should also not decrease monotonically. Hypothesis can automatically generate a large number of price series and periods, helping to find boundaries missed by manual samples.
# illustrative code, not production code
from hypothesis import assume, given, strategies as st

@given(
    prices=st.lists(
        st.floats(min_value=1, max_value=100000),
        min_size=10,
        max_size=100,
    ),
    period=st.integers(min_value=2, max_value=20),
)
def test_ma_properties(prices, period):
    """Mathematical properties of the moving average over generated inputs"""
    assume(period <= len(prices))  # skip windows longer than the series
    result = calculate_ma(prices, period)
    assert len(result) == len(prices) - period + 1
    for value in result:
        assert min(prices) <= value <= max(prices)
    if all(prices[i] <= prices[i + 1] for i in range(len(prices) - 1)):
        assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
The value of this test is upgraded from “giving a few examples” to “verifying mathematical properties”. Readers should note that the second assertion is written as a global scope constraint; if you want to be more strict, you can limit each mean to min/max of the corresponding sliding window.
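A stricter version of that property, sketched under the same assumed calculate_ma signature (one value per complete window, aligned with the window start), checks each mean against its own window rather than the global range:
# illustrative code, not production code; assumes calculate_ma returns one value
# per complete window, with result[i] corresponding to prices[i:i + period]
from hypothesis import assume, given, strategies as st

@given(
    prices=st.lists(st.floats(min_value=1, max_value=100000), min_size=10, max_size=100),
    period=st.integers(min_value=2, max_value=10),
)
def test_ma_bounded_by_each_window(prices, period):
    """Each mean must lie within the min/max of its own sliding window"""
    assume(period <= len(prices))
    result = calculate_ma(prices, period)
    for i, value in enumerate(result):
        window = prices[i:i + period]
        assert min(window) <= value <= max(window)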
Fuzz testing is suitable for verifying the robustness of parsing logic. When the market parser faces any byte input, it is allowed to throw a clear ParseError, but no unexpected exceptions should occur; if the parsing is successful, it must also meet basic constraints such as high >= low, high >= open, high >= close and other basic constraints.
# illustrative code, not production code
import atheris
import sys

@atheris.instrument_func
def test_parse_bar(input_bytes):
    """Any byte input must either parse cleanly or raise ParseError"""
    fdp = atheris.FuzzedDataProvider(input_bytes)
    data = fdp.ConsumeBytes(len(input_bytes))
    try:
        bar = parse_bar_data(data)
        if bar:
            assert bar.high >= bar.low
            assert bar.high >= bar.open
            assert bar.high >= bar.close
    except ParseError:
        pass  # an explicit parse failure is acceptable behavior
    except Exception:
        raise  # any other exception is a defect and must surface

atheris.Setup(sys.argv, test_parse_bar)
atheris.Fuzz()
This code keeps dirty data hitting the parser. Readers should pay attention to exception boundaries: business-acceptable parsing failures should enter ParseError, and unknown exceptions are defects that need to be fixed.
Backtest consistency verification is used to ensure that the strategy performs consistently in backtests and simulated trading. What this test protects is “the same set of business semantics for backtesting and real trading”, not a single function.
# illustrative code, not production code
def test_strategy_consistency():
    """Backtest and simulation must emit consistent signals on the same data"""
    historical_data = load_data("HSI", "2024-01-01", "2024-01-31")
    backtest_result = run_backtest(
        strategy=MyStrategy(),
        data=historical_data,
        mode="backtest",
    )
    simulated_result = run_backtest(
        strategy=MyStrategy(),
        data=historical_data,
        mode="simulation",
    )
    assert len(backtest_result.signals) == len(simulated_result.signals)
    for b_sig, s_sig in zip(backtest_result.signals, simulated_result.signals):
        assert b_sig.timestamp == s_sig.timestamp
        assert b_sig.action == s_sig.action
        assert abs(b_sig.price - s_sig.price) < 0.01
This code should not be understood as "backtest returns and simulated-trading returns are exactly equal". A more precise goal is: under the same event input, signal count, timestamps, actions and price semantics should be consistent, and the allowed tolerance must be written explicitly into the assertion.
Boundary time testing is one of the most important regression defense lines for quantitative systems. There should be fixed fixtures from Friday night trading to Monday morning trading, half-day trading, lunch break, the first opening bar, the last closing bar, and night trading on the eve of holidays.
# illustrative code, not production code
class TestHKFEBoundaryCases:
    """HKFE boundary-time regression cases"""

    def test_friday_night_to_monday(self):
        """Friday night session and Monday morning belong to the same trading day"""
        friday_night = datetime(2024, 1, 5, 23, 0)
        monday_morning = datetime(2024, 1, 8, 9, 15)
        bars = [
            BarData(timestamp=friday_night, close=25000),
            BarData(timestamp=monday_morning, close=25100),
        ]
        result = aggregate_daily(bars)
        assert len(result) == 1
        assert result[0].date == date(2024, 1, 5)

    def test_half_day_holiday(self):
        """No bars may appear after a half-day session closes at noon"""
        half_day_close = datetime(2024, 2, 9, 12, 0)
        bars = generate_bars_until(half_day_close)
        assert all(bar.timestamp <= half_day_close for bar in bars)
This code turns overnight, weekly, and half-day markets into fixed regression samples. Readers should focus on the trading day attribution rules, not whether the physical dates are consecutive.
Boundary times should be thought of as state transitions rather than lists of dates. The tests need to cover the transition from day trading to lunch break, recovery from lunch break, day trading to night trading, crossing midnight, cross-week attribution, early close of a half-day session and market closure on holidays. As long as these states are not explicitly tested, aggregators and strategies will continue to rely on implicit assumptions.
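One way to pin those transitions, sketched with a hypothetical trading_day_of(timestamp) calendar helper, is a parameterized transition table in which every row names the state change it guards; the expected dates follow the attribution rule used in TestHKFEBoundaryCases above.
# illustrative code, not production code; trading_day_of is a hypothetical calendar
# helper, and the expected dates follow the attribution rule used in the class above
import pytest
from datetime import datetime, date

TRANSITION_CASES = [
    # timestamp, expected trading day, guarded transition
    (datetime(2024, 1, 5, 16, 0), date(2024, 1, 5), "day session close"),
    (datetime(2024, 1, 5, 23, 30), date(2024, 1, 5), "day -> night session"),
    (datetime(2024, 1, 6, 1, 0), date(2024, 1, 5), "night session past midnight"),
    (datetime(2024, 1, 8, 9, 15), date(2024, 1, 5), "weekend gap, pre-open ownership"),
]

@pytest.mark.parametrize("timestamp,expected_day,transition", TRANSITION_CASES)
def test_trading_day_ownership_across_transitions(timestamp, expected_day, transition):
    """Every session transition must map to an explicit trading-day owner"""
    assert trading_day_of(timestamp) == expected_day, f"Failed on {transition}"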
Part 7: Best Practices for AI Generated Testing
The key to getting AI to generate better tests is to give it a testing strategy, rather than just saying “make up the test.”
First, clarify the testing method. You can ask AI to use boundary value analysis to generate normal ranges, boundary values, and abnormal inputs respectively; you can also ask AI to use equivalence class division to split price, time, trading day, and order status into different categories.
Use boundary value analysis to generate tests for the following function:
- normal case: input is within the valid range
- boundary case: input is exactly on the boundary
- error case: input is outside the valid range
The value of this prompt is to force the AI to organize the test space first. The reader can go on to add that “each use case must describe the source of risk and assertion target” to prevent AI from giving only superficial coverage.
Second, ask the AI to interpret the intent of the test. Each test should state the verified business hypothesis, such as “the night trading bar at 23:30 on Friday should be attributed to the trading day of the following Monday, not the natural day Friday”.
For each test, add a comment or docstring stating which business assumption it verifies and how the assertion verifies it.
If the AI’s explanations are unclear, the tests it generates are often unreliable. Explaining test intent is not about adding comments for their own sake, but about making the business assumptions reviewable.
Third, let AI identify omissions. Give the implementation, specifications, and existing tests to the AI and let it list the scenarios it may have missed. It will usually find problems such as null input, maximum values, minimum values, concurrent access, resource exhaustion, and floating point precision.
Given the implementation and requirement notes, identify missing test scenarios.
Classify them as normal, boundary, error, performance, or output-validation cases.
Each scenario must state the risk source, minimal reproduction data, and assertion target.
This prompt turns “make-up testing” into a risk classification task. Readers should treat the AI output as a candidate list, and then decide whether to adopt it based on real defects and system boundaries.
Fourth, use mutation testing to check test quality. Tools such as mutmut will modify the source code, such as changing > to >=, and then run the tests. If the test still passes, the test does not truly protect the behavior.
# illustrative code, not production code
# Install: pip install mutmut
# Run:     mutmut run
# mutmut mutates the source, for example changing > to >=, then reruns the tests.
# If the tests still pass, they do not actually constrain the mutated behavior.
Mutation testing is not suitable for running every submission, but it is suitable for use before and after refactoring key modules. It can expose the illusion of “high coverage but weak assertions”.
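The sketch below illustrates, with hypothetical names, the kind of mutant that survives weak assertions: the threshold comparison is mutated from > to >=, and a test that never exercises the exact boundary stays green either way.
# illustrative code, not production code; should_trigger and the threshold
# values are hypothetical, used only to show a surviving mutant
def should_trigger(price: float, threshold: float) -> bool:
    return price > threshold          # mutant: price >= threshold

def test_trigger_far_from_boundary():
    """Weak test: passes against both the original and the mutant"""
    assert should_trigger(25100, 25000) is True
    assert should_trigger(24900, 25000) is False

def test_trigger_at_boundary():
    """Boundary test: kills the >= mutant, because 25000 > 25000 is False"""
    assert should_trigger(25000, 25000) is False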
Fifth, when letting AI generate integration tests and E2E tests, the system boundaries must first be defined. The most common problem is not that AI cannot write test code, but that it writes E2E as a happy path that “can run through pages or commands”, but does not verify whether the business link is really closed. Integration testing of the quantitative system must at least go through data sources, trading day attribution, period aggregation, indicator calculations, strategy signals and order adapters; E2E testing must also verify the results that readers really care about, such as signal timestamps, order direction, quantity, price tolerance, backtest reports and abnormal degradation paths.
Generate integration tests and E2E tests based on the following requirements contract.
Requirements:
1. Each test must trace back to a REQ/INV/E2E ID.
2. Tests must cover DataFeed -> Calendar -> Aggregator -> Strategy -> OrderAdapter.
3. E2E tests must verify user-visible business results, not only that a page opens or a command exits with code 0.
4. Use fakes or stubs only at explicit external boundaries, and keep assertions on the real integration path.
5. Test output must include the evidence path, assertion target, and failure message.
The key to this prompt is to change “Generate E2E” to “Generate traceable E2E according to contract”. Readers should avoid asking AI to write only browser clicks, CLI smoke tests, or mock-only assertions. A qualified E2E must at least prove that a certain requirement enters the system from input data, passes through the real integration boundary, and finally produces auditable business results.
Sixth, after the execution of the requirements contract is completed, there needs to be an independent tracking artifact, instead of just seeing whether the test passes. Trace Matrix should put requirements, constraints, invariants, tests, evidence and states in the same table. In this way, implementation deviations can be identified: the code is written but does not correspond to the requirements, the test passes but there is no evidence, E2E only covers the happy path, or the implementation path bypasses the boundaries defined in the contract.
trace:
  - id: TRACE-HKFE-001
    requirement: REQ-TRADING-DAY-OWNERSHIP
    invariant: INV-NATURAL-DAY-MUST-NOT-LEAK
    integration_test: tests/integration/test_hkfe_daily_aggregation.py
    e2e_test: tests/e2e/test_strategy_signal_calendar_boundary.py
    evidence:
      - reports/integration/hkfe_daily_aggregation.xml
      - reports/e2e/strategy_signal_calendar_boundary.json
    drift_signals:
      - implementation_without_requirement
      - passing_test_without_business_assertion
      - mock_only_without_integration_boundary
      - missing_evidence_artifact
    status: PASS
The value of this tracking artifact is to turn "done" into a reviewable state. Readers can follow TRACE-HKFE-001 to see where the requirement comes from, what the invariants are, where the integration tests and E2E tests live, what the evidence files are, and which drift signals need to be intercepted. If any item is missing, the status should not be changed to PASS.
There are typically five types of implementation drift signals. First, a code change cannot be traced to a corresponding REQ or TRACE, indicating that the implementation may have exceeded the requirements boundary. Second, the test only verifies a function return or a page's existence, but does not verify business semantics. Third, E2E relies on a mock-only path that does not cross critical integration boundaries. Fourth, an AI-generated fix makes the tests green but removes, relaxes, or bypasses the original assertion. Fifth, the final report only has a "passed" conclusion and no reviewable evidence artifacts.
Quick correction should not start with “continue to let AI change the code”, but should start with the failure evidence package. The minimum closed loop is: locate the failed TRACE line, read the corresponding REQ/INV/E2E, confirm the offset type, fill in the minimum recurrence or assertion, and then let AI only modify the code and tests related to the TRACE. After repair, re-run the corresponding gate and update the evidence path and status. This process can condense a large-scale rework into iterations around a single requirements contract.
| Drift signal | Risk | Corrective action |
|---|---|---|
| Code without a REQ/TRACE | Implementation runs beyond the requirements boundary | Add the requirement mapping or withdraw the unfounded implementation |
| Test without business assertions | Happy-path pseudo pass | Add assertions on business-visible results |
| E2E relies only on mocks | Integration boundaries not verified | Connect a Fake gateway and the real data path |
| PASS without evidence | Completion status cannot be reviewed | Generate and record reports, screenshots or JSON evidence |
| AI relaxes assertions | Test quality declines | Restore the original assertion and fix the failing case |
The way to use this table is straightforward: each time the AI claims completion, walk through the drift signals one by one before reading the code diff. Trading systems especially require this kind of discipline, since many errors are not exposed in unit tests but in the combination of data paths, trading-day semantics, strategy events and order adapters.
Part 8: Test Data Management
Test data management determines whether a test is reproducible. The quantitative system cannot directly plug real market conditions into all tests, nor can it rely solely on randomly generated data. A more reliable combination is: synthetic data covers structural scenarios, data snapshots save real defect samples, and contract testing protects data source boundaries.
Synthetic data is suitable for covering trends, volatility, gaps and extreme values. The generator can control start and end times, trend direction, volatility and volume range.
# illustrative code, not production code
import random
from datetime import datetime, timedelta
from typing import List

def generate_synthetic_bars(
    symbol: str,
    start: datetime,
    end: datetime,
    trend: str = "random",
    volatility: float = 0.02,
) -> List[BarData]:
    """Generate synthetic K-line data with a controllable trend and volatility"""
    bars = []
    current_time = start
    price = 25000
    while current_time <= end:
        if trend == "up":
            change = abs(random.gauss(0.001, volatility))
        elif trend == "down":
            change = -abs(random.gauss(0.001, volatility))
        else:
            change = random.gauss(0, volatility)
        price *= (1 + change)
        # A production generator should also enforce high >= max(open, close)
        # and low <= min(open, close) so the output satisfies the data contract.
        bar = BarData(
            symbol=symbol,
            timestamp=current_time,
            open_price=price * (1 + random.gauss(0, 0.001)),
            high_price=price * (1 + abs(random.gauss(0, 0.002))),
            low_price=price * (1 - abs(random.gauss(0, 0.002))),
            close_price=price,
            volume=random.randint(1000, 10000),
        )
        bars.append(bar)
        current_time += timedelta(minutes=1)
    return bars
This code is suitable for constructing trend, volatility, and volume ranges, but should not be used as a substitute for real defect snapshots. Readers should fix the random seed or record generation parameters, otherwise the synthetic data itself will become a source of instability.
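A minimal way to make such data reproducible, assuming the generate_synthetic_bars sketch above (which draws from the module-level random generator), is to pin an explicit seed inside a fixture and record the generation parameters alongside it:
# illustrative code, not production code; assumes the generate_synthetic_bars
# sketch above, which draws from the module-level random generator
import random
from datetime import datetime
import pytest

@pytest.fixture
def reproducible_uptrend_bars():
    """Synthetic uptrend data pinned to a fixed seed and recorded parameters"""
    random.seed(20240108)  # fixed seed: the same bars on every run
    return generate_synthetic_bars(
        symbol="HSI",
        start=datetime(2024, 1, 8, 9, 0),
        end=datetime(2024, 1, 8, 12, 0),
        trend="up",
        volatility=0.02,
    )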
Data snapshots are suitable for saving real defect samples. The desensitized fixtures/hsi_2024_01.json can be used to reproduce the Friday night trading attribution error or the half-day market closing abnormality.
# illustrative code, not production code
@pytest.fixture
def real_market_data():
    """Use a redacted real-market data snapshot"""
    return load_test_data("fixtures/hsi_2024_01.json")
Snapshots should be small and stable, focusing on retaining the minimum data required to reproduce the defect. Readers should record the source, the desensitization method, the trading-day attribution rules and the corresponding defect number.
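One lightweight way to record that lineage is a metadata sidecar stored next to the snapshot itself. The sketch below is illustrative only; the field names and values are hypothetical, not the project's actual files.
# illustrative metadata, not a real project file; field names and values are hypothetical
fixture: fixtures/hsi_2024_01.json
source: redacted HKFE HSI 1-minute snapshot, January 2024
redaction: account and order identifiers removed, prices and timestamps kept
trading_day_rule: night session owned by the same trading day as the preceding day session
defect_id: ISSUE-XXX  # the regression issue this snapshot reproduces
assertion_target: daily aggregation produces exactly one bar per trading day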
Contract testing is suitable for validating data source boundaries. For example, the data returned by get_bars("HSI", "1m", count=100) must be sorted by time; each Bar must satisfy high >= open/close/low; the timestamp must have a time zone.
# illustrative code, not production code
def test_datafeed_contract():
    """Verify the structural contract of the upstream data source"""
    feed = DataFeed()
    bars = feed.get_bars("HSI", "1m", count=100)
    timestamps = [bar.timestamp for bar in bars]
    assert timestamps == sorted(timestamps)
    for bar in bars:
        assert bar.high >= bar.open
        assert bar.high >= bar.close
        assert bar.high >= bar.low
        assert bar.timestamp.tzinfo is not None  # timestamps must carry a time zone
Contract testing prevents upstream data changes from quietly breaking downstream policies. Readers should place it near the data entry instead of waiting for the strategy test to fail before troubleshooting data problems.
The focus here is data lineage. Every step of the test data, from raw market data through cleaning, trading-day attribution, aggregation, fixtures, assertions and regression records, should be traceable. Otherwise, when a test fails, readers cannot determine whether the raw data changed, the cleaning logic changed, the trading-day attribution changed, or the assertion itself has expired.
Part 9: Automated Testing in CI/CD
The CI Gate’s responsibility is not to run all tests, but to block incorrect merges at the appropriate stage. Different test layers bear different feedback speeds and risk interception responsibilities.
A typical CI can be executed in layers: unit testing is the fastest, covering the core algorithm; integration testing verifies data source, aggregator, strategy and order module collaboration; attribute testing uses a fixed seed to ensure reproducibility; boundary time testing covers HKFE special trading days; coverage reports are used to prevent core modules from losing test protection; failed tests must prevent merging.
name: Tests
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt -r requirements-test.txt
      - run: pytest tests/unit -v --cov=core --cov-report=xml
      - run: pytest tests/integration -v
      - run: pytest tests/property -v --hypothesis-seed=0
      - uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
          fail_ci_if_error: true
This configuration breaks fast feedback, module collaboration, property testing, and coverage reporting into independent steps. Readers should decide which tests go into each PR and which tests go into nightly tasks based on the size of the project, but failing tests must be able to prevent high-risk merges.
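One common way to make the PR/nightly split explicit, sketched here as an assumption rather than the project's actual configuration, is a pytest marker that nightly-only suites carry and that the PR job excludes:
# illustrative configuration, not the project's actual setup
# pytest.ini
[pytest]
markers =
    nightly: slow suites (fuzzing, large property runs) executed only in the nightly job

# In the PR workflow, exclude nightly-marked tests:
#   pytest -m "not nightly" tests
# In the scheduled nightly workflow, run the full suite:
#   pytest tests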
The test report needs to be able to answer “which levels have been verified.” The output example below shows the information the report should carry, rather than asking the reader to copy specific numbers.
tests/unit/test_aggregation.py PASSED
tests/unit/test_indicators.py PASSED
tests/unit/test_datafeed.py PASSED
tests/unit/test_strategy.py PASSED
tests/integration/test_end_to_end.py PASSED
tests/property/test_properties.py PASSED
tests/boundary/test_hkfe_cases.py PASSED
tests/regression/test_issue_*.py PASSED
core/aggregation.py 97%
core/indicators.py 94%
core/datafeed.py 91%
core/strategy.py 88%
total 95%
These numbers are not goals per se, but they give the reader an idea of where risk interception occurs. What’s more important is: whether boundary timing, property testing, and real defect regressions are included in the report, not whether the overall coverage is pretty.
Part 10: TDD Anti-Patterns and Pitfalls
The first anti-pattern is testing implementation rather than behavior. Testing internal_cache.size() == 5 would tie the test to the internal structure; it would be safer to assert external behavior.
# illustrative code, not production code
def test_implementation():
    """Anti-pattern: asserting on the implementation"""
    result = calculate()
    assert result.internal_cache.size() == 5
The problem with this code is that the test relies on internal cache structures. As soon as the implementation is changed to a generator, array, or incremental state, the test will fail, even if the external behavior has not changed.
# illustrative code, not production code
def test_behavior():
    """Better: asserting on the externally visible behavior"""
    result = calculate([1, 2, 3, 4, 5])
    assert result == 3.0
Behavioral testing leaves room for refactoring. Trading systems require this kind of testing because performance optimization, data structure replacement, and execution domain splitting all change the internal implementation.
The second anti-pattern is that a test verifies too many things. A test calls func1, func2, func3 at the same time, and it is difficult to locate the root cause when it fails.
# illustrative code, not production code
def test_everything():
    """Anti-pattern: one test covers too many behaviors"""
    result1 = func1()
    result2 = func2()
    result3 = func3()
    assert result1 == expected1
    assert result2 == expected2
    assert result3 == expected3
The problem with this code is that failure location is difficult. Readers cannot quickly determine whether input preparation, a certain function behavior, or pre-state pollutes subsequent assertions.
# illustrative code, not production code
def test_func1():
    """Better: one test per behavior, verifying func1"""
    assert func1() == expected1

def test_func2():
    """Better: one test per behavior, verifying func2"""
    assert func2() == expected2
Split tests make failure signals clearer. In quantitative systems especially, avoid having one test simultaneously verify data loading, indicator calculations, strategy signals and order execution.
The third anti-pattern is ignoring edge cases. Simply testing that divide(10, 2) == 5 is not enough to prove that the function is reliable; you also need to test division by zero, small fractional results, and floating-point approximations.
# illustrative code, not production code
def test_normal_case():
    """Anti-pattern: only the happy path is tested"""
    assert divide(10, 2) == 5
This test only covers the happy path. The boundaries in the trading system are more complex and must cover opening, closing, lunch break, night trading, half-day trading and cross-week trading.
# illustrative code, not production code
def test_normal_case():
    assert divide(10, 2) == 5

def test_divide_by_zero():
    """Boundary: division by zero must raise"""
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)

def test_small_numbers():
    """Fractional results need a floating-point tolerance"""
    assert divide(1, 3) == pytest.approx(0.333, rel=1e-3)
The value of boundary testing is to pin down situations that “do not happen often but have a huge impact once they occur.” In quantitative systems, many real accidents occur at low-frequency boundaries rather than on daily paths.
The fourth anti-pattern is that test data has no lineage. When a test fails without knowing where the fixture came from and what cleaning and attribution logic it went through, it’s difficult to tell whether the failure is valid. All real defect samples should have documented source, desensitization method, attribution rules, and assertion purpose.
The fifth anti-pattern is treating AI-generated tests as acceptance results. AI can expand scenarios, but it cannot prove that scenarios represent real business risks. Acceptance rights still come from specifications, defect reproduction, manual review and CI Gate.
Summary: Test checklist
The test design needs to confirm:
- Whether to infer testing from real defect and risk groups instead of just pursuing coverage.
- Whether to cover normal, boundary, and abnormal scenarios.
- Whether to use attribute testing to uncover hidden assumptions.
- Whether there are boundary time tests for quantitative scenarios.
Test implementation needs to confirm:
- The tests are independent and do not depend on the order of execution.
- Whether the tests are fast enough; ordinary unit tests should keep feedback within seconds.
- Whether the test is readable, you can see at a glance what is being verified.
- Is there a clear Arrange-Act-Assert or Given-When-Then construct.
Test doubles need to be confirmed:
- Whether external dependencies are isolated using Stub, Mock, Spy or Fake.
- Whether there is a simplified but realistic Fake for complex objects.
- Whether Mock-based interaction verification is used only when really needed.
Test data needs to be confirmed:
- Whether to use synthetic data to ensure reproducibility.
- Whether to keep desensitized snapshots of real defects.
- Whether there is a data contract test protecting the upstream boundary.
- Whether sensitive data is desensitized.
CI Gate requires confirmation:
- Whether to run critical tests on every commit.
- Whether the core module coverage is higher than the agreed threshold.
- Whether failing the test prevents the merge.
- Whether regression defects are converted into fixed tests.
Next article preview
The next article enters the practical practice of Python performance tuning. Readers will see how profilers locate bottlenecks, how incremental computation and virtualized rendering reduce latency, and why performance optimization must preserve evidence of correctness.
Reference resources
- pytest documentation: https://docs.pytest.org/
- Hypothesis property test: https://hypothesis.readthedocs.io/
- mutmut mutation testing: https://mutmut.readthedocs.io/
- Test-Driven Development: By Example (Kent Beck)
Series context
You are reading: Quantitative trading system development record
This is article 4 of 7.
Current series chapters
- Quantitative trading system development record (1): five key decisions in project startup and architecture design Taking Micang Trader as an example, this article starts from system boundaries, data flow, trading-session ownership, unified backtesting/live-trading interfaces, and AI collaboration boundaries to establish the architecture thread for the quantitative trading system series.
- Quantitative trading system development record (2): Python Pitfalls practical pitfall avoidance guide (1) Reorganize Python traps from a long list into an engineering risk reference for quantitative trading systems: how to amplify the three types of risks, syntax and scope, type and state, concurrency and state, into real trading system problems.
- Record of Quantitative Trading System Development (Part 3): Python Pitfalls Practical Pitfalls Avoidance Guide (Part 2) Continuing to reorganize Python risks into a reference piece: how GUI lifecycles, asynchronous network failures, security boundaries, and deployment infrastructure affect the long-term stability of quantitative trading systems.
- Quantitative trading system development record (4): test-driven agile development (AI Agent assistance) Starting from a cross-night trading day boundary bug, we reconstruct the test defense line of the quantitative trading system: defect-oriented testing pyramid, AI TDD division of labor, boundary time, data lineage and CI Gate.
- Quantitative trading system development record (5): Python performance tuning practice Transform performance optimization from empirical guesswork into a verifiable investigation process: start from the 3-second chart delay, locate the real bottleneck, compare optimization solutions, and establish benchmarks and rollback strategies.
- Record of Quantitative Trading System Development (6): Architecture Evolution and Reconstruction Decisions Review the five refactorings of Micang Trader, explaining how the system evolved from the initial snapshot to a clearer target architecture, and incorporated technical debt and ADR decisions into long-term governance.
- Quantitative trading system development record (7): AI engineering implementation - from speckit to BMAD Taking the trading calendar and daily aggregation requirements as a single case, explain how AI engineering can enter the delivery of real quantitative systems through specification drive, BMAD role handover and manual quality gate control.