Article
Quantitative trading system development record (4): test-driven agile development (AI Agent assistance)
Starting from a cross-night trading day boundary bug, we reconstruct the test defense line of the quantitative trading system: defect-oriented testing pyramid, AI TDD division of labor, boundary time, data lineage and CI Gate.
Readers can treat this article as a defect-oriented testing defense line for a quantitative trading system. The 100 real pitfalls of Part2/Part3 should not stay in an issue list; they need to be converted into executable tests, regression cases, test data lineage and CI Gates. Testing is not about piling up case counts, but about making sure real defects cannot cross the system boundaries again.
Series reading order
The recommended reading path is Part1 -> Part2 -> Part3 -> Part4 -> Part5 -> Part6 -> Part7. Part 4 is placed before performance optimization, because the trading system must first prove that the results are reliable before discussing speed, throughput and rendering performance.
This article focuses on six questions:
- Why quantitative system testing should work backwards from real defect and risk populations, rather than from templated coverage.
- How RED-GREEN-REFACTOR protects API semantics, boundary times, and refactoring safety in trading systems.
- When AI generates tests, which responsibilities can be handed over to AI and which acceptance criteria must be signed off by humans.
- How boundary time, attribute testing, fuzz testing, backtest consistency and test data lineage are combined into a trading system testing defense line.
- How AI assists integration testing and E2E testing while identifying implementation drift through requirements contract tracing.
- How CI Gate turns unit tests, integration tests, property tests, boundary time tests, and coverage reports into merge gates.
Introduction: an edge case not covered by tests
The most dangerous flaws before going live are usually not errors that occur every day, but errors that are only triggered on specific dates, in specific trading sessions, and with specific data combinations. A typical scenario: the same strategy ends up with an extra K-line on Monday, causing the indicator window to be misaligned. All ordinary test cases passed, but the real trading-day semantics have been contaminated.
The root cause comes from the overnight trading mechanism of Hong Kong futures. Friday night trading and Monday morning trading may belong to the same trading day in terms of trading semantics, but the physical time spans Friday, Saturday, Sunday and Monday. If the aggregation logic is only segmented according to natural days, Friday night trading, cross-midnight data and Monday morning trading may be mistakenly split, and the daily lines, indicators and backtest results will be contaminated.
What this flaw exposes is not “insufficient number of tests”, but that the test design is not aligned with the real risks. Effective testing must answer at least three questions: does the risk belong to syntax, state, concurrency, time, data, GUI, network or security boundary; what is the minimum recurrence data; should it be intercepted by unit testing, property testing, integration testing, end-to-end testing or CI Gate.
Part One: Why Quantitative Systems Are Difficult to Test
Quantitative systems are difficult to test not because the test framework is complex, but because the object under test simultaneously depends on data, time, state, randomness and external dependencies.
The first type of challenge is data dependence. The strategy logic relies on historical market conditions, but the complete historical data is large and cannot be submitted to version control; the real data may contain sensitive information and cannot be shared publicly; the exchange may also adjust historical data, causing old samples to become invalid. Therefore, test data cannot only rely on “a certain local CSV”, but needs to be supported by synthetic data, desensitized snapshots, contract testing and data lineage.
The second type of challenge is randomness. Network delay is uncertain, market pushes jitter, floating-point calculations carry precision errors, and backtesting and simulated trading may differ because events arrive in a different order. Tests cannot just assert that one value exactly equals another; they also need to handle tolerances, event ordering, and repeatable random seeds explicitly.
The third type of challenge is complex states. Position status, order status, funding status, layered data normalization status, and strategy internal indicator status may all carry over across events. If a test only looks at the return value of a single function, it is easy to miss state life cycle issues, such as cancellation after partial transactions, recovery of the indicator window after night trading, duplication of subscription status after reconnection, etc.
The fourth type of challenge is the density of edge cases. Opening, closing, lunch break, night trading, half-day market, holidays, cross-week, cross-day K-lines, data loss, network interruption, market delay and risk control freeze are all high-risk boundaries. A truly effective testing pyramid should infer the testing levels from these defect entrances.
The emphasis of this pyramid is "defect orientation". The bottom-level tests cover pure functions, indicator calculations, and time semantics, because these errors are repeatedly amplified by the upper-level strategies; the middle-level tests cover data sources, aggregators, strategy events, and test doubles, because multiple modules are wired together here; the upper-level tests cover backtest, simulated trading, and live trading consistency, because this verifies whether the system really runs according to business semantics; the top-level CI Gate is responsible for blocking merges, not for replacing the lower-level tests.
Part 2: TDD Basics: RED-GREEN-REFACTOR
The core loop of TDD is RED-GREEN-REFACTOR. For a trading system, its value is not to formally “write tests first”, but to use failed tests to define behavioral boundaries, use minimum implementation to verify that the tests are effective, and then refactor under test protection.
The goal of the RED phase is to define correctness with tests. Taking the simple moving average as an example, the test first defines [1, 2, 3, 4, 5] and the result when period=3 is [None, None, 2.0, 3.0, 4.0]. This test should fail first, either because the class does not exist, the method does not exist, or the behavior is not as expected. Failure itself is a valid signal: it proves that the test actually caught the current gap.
def test_sma_calculates_correctly():
    prices = [1, 2, 3, 4, 5]
    sma = SimpleMovingAverage(period=3)
    result = sma.calculate(prices)
    assert result == [None, None, 2.0, 3.0, 4.0]
The goal of the GREEN phase is to use the minimum implementation to make the test pass. The minimal implementation can even be hardcoded first, as long as it proves that the test itself works. Then enter the actual implementation: the first period - 1 result is None, the first window uses the initial sum, and subsequent incremental updates are through sliding windows.
class SimpleMovingAverage:
    def __init__(self, period: int):
        self.period = period

    def calculate(self, prices: list[float]) -> list[float | None]:
        if len(prices) < self.period:
            return [None] * len(prices)
        result: list[float | None] = [None] * (self.period - 1)
        window_sum = sum(prices[: self.period])
        result.append(window_sum / self.period)
        for index in range(self.period, len(prices)):
            window_sum += prices[index] - prices[index - self.period]
            result.append(window_sum / self.period)
        return result
The goal of the REFACTOR phase is to improve the structure without changing the external behavior. Refactoring in trading systems especially requires test protection, since indicator calculations, data attribution and strategy signals may be affected by local changes. If the test remains green, it means that the refactoring has not changed the defined behavior.
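As a concrete illustration, the sketch below shows one possible refactoring of the moving average under green tests: padding and the rolling mean are separated into named helpers while calculate keeps exactly the same inputs and outputs. The helper names _padding and _rolling_means are illustrative, not part of the original code.
# illustrative code, not production code; one possible refactoring under green tests
class SimpleMovingAverage:
    def __init__(self, period: int):
        self.period = period

    def calculate(self, prices: list[float]) -> list[float | None]:
        # External behavior is unchanged: the existing test must stay green.
        if len(prices) < self.period:
            return [None] * len(prices)
        return self._padding() + self._rolling_means(prices)

    def _padding(self) -> list[None]:
        # The first period - 1 positions have no complete window.
        return [None] * (self.period - 1)

    def _rolling_means(self, prices: list[float]) -> list[float]:
        # Incremental sliding-window sum, identical to the GREEN implementation.
        window_sum = sum(prices[: self.period])
        means = [window_sum / self.period]
        for index in range(self.period, len(prices)):
            window_sum += prices[index] - prices[index - self.period]
            means.append(window_sum / self.period)
        return means
If test_sma_calculates_correctly still passes after this change, the refactoring has preserved the defined behavior.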
The direct benefits of TDD to the trading system can be compressed into a table:
| Dimensions | No TDD | Have TDD |
|---|---|---|
| code coverage | Commonly stays around 30% | Can be stably advanced to more than 80% |
| Bug escape rate | Boundary defects are easy to penetrate | Real defects can be deposited into regression use cases |
| Refactoring confidence | Afraid to touch core modules | Can refactor in stages with continuous verification |
| API design | Working backward from implementation details | Driven by test scenarios |
| Documentation value | Documentation lags behind code | Tests become runnable documents |
Part 3: Test Patterns and Structure
Test structure determines whether readers can quickly understand what a use case is verifying. The three most commonly used structures in quantitative systems are AAA, Given-When-Then, and table-driven testing. They solve problems at different levels: AAA makes function behavior clear, Given-When-Then makes business scenarios clear, and table-driven makes boundary collections clear.
AAA is suitable for functional and service level testing. Arrange prepares data and dependencies, Act performs the operation under test, and Assert verifies the results. The following skeleton, although simple, prevents tests from mixing preparations, actions, and assertions.
# illustrative code, not production code
def test_xxx():
    """Test description"""
    # Arrange: prepare test data and environment
    input_data = ...
    expected_output = ...

    # Act: execute the operation under test
    actual_output = function_under_test(input_data)

    # Assert: verify the result
    assert actual_output == expected_output
When readers look at AAA tests, they should first check three questions: whether the input is minimal, whether there is only one action, and whether the assertion directly verifies the business behavior. Profit and loss calculation is the best example for AAA because its input, action and output boundaries are very clear.
# illustrative code, not production code
def test_calculate_profit():
    """Profit equals price difference multiplied by position size"""
    # Arrange
    entry_price = 25000
    exit_price = 25100
    position_size = 10
    expected_profit = 1000  # (25100 - 25000) * 10

    # Act
    actual_profit = calculate_profit(
        entry=entry_price,
        exit=exit_price,
        size=position_size,
    )

    # Assert
    assert actual_profit == expected_profit
The value of this code does not lie in the complexity of the formula, but in the fact that it pins down the basic semantics that the trading system amplifies most easily. Without independent tests for profit-and-loss calculation, subsequent positions, risk control, backtest statistics and performance attribution all rest on an unreliable foundation.
Given-When-Then is more suitable for strategic behavior and acceptance scenarios. It makes the test read like a business specification: given a certain market state, when the strategy handles the event, it should produce a certain trading action.
# illustrative code, not production code
def test_user_story_scenario():
    """Scenario description"""
    # Given: the initial state
    setup_conditions()

    # When: the action under test happens
    perform_action()

    # Then: the expected results hold
    verify_expected_outcomes()
The strategic golden cross buying is a typical scenario. The test should not only verify the change of an internal variable, but also verify whether the strategy generates an executable business signal when facing the last Bar.
# illustrative code, not production code
def test_strategy_generates_buy_signal():
    """A golden cross should generate a BUY signal"""
    # Given: historical data in which MA5 has just crossed above MA10
    strategy = MovingAverageCrossStrategy(fast=5, slow=10)
    historical_data = load_data_with_golden_cross()

    # When: the strategy handles the last Bar
    signal = strategy.on_bar(historical_data[-1])

    # Then: a BUY signal is generated at the close price
    assert signal.action == Action.BUY
    assert signal.price == historical_data[-1].close
This code turns “Golden Cross” from a verbal description into a runnable specification. Readers should pay attention to whether the Given part actually constructs a golden cross, rather than just relying on the function name to imply the scene; whether the Then part verifies the action and price, rather than just verifying that the return object is not empty.
Table-driven testing is suitable for a large number of boundary values and equivalence classes. To add a new scenario, you only need to add one row of data. pytest parameterization can ensure that each use case runs independently and displays the specific scenario when it fails.
# illustrative code, not production code
import pytest

TEST_CASES = [
    # input, expected, description
    ([1, 2, 3], 2.0, "positive integers"),
    ([-1, -2, -3], -2.0, "negative integers"),
    ([1.5, 2.5, 3.5], 2.5, "floating-point values"),
    ([5, 5, 5], 5.0, "identical values"),
    ([1], 1.0, "single element"),
]

@pytest.mark.parametrize("input_data,expected,desc", TEST_CASES)
def test_average_calculation(input_data, expected, desc):
    """Each row of the table is an independent average-calculation case"""
    result = calculate_average(input_data)
    assert result == expected, f"Failed on {desc}"
The key to table-driven testing is to explicitly write out “what input categories are there”. Quantitative systems can put price series, trading periods, order statuses, missing data patterns, and abnormal inputs into tables, rather than duplicating multiple nearly identical tests.
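As a quantitative-flavored sketch of this idea, the table below enumerates trading-session categories instead of numeric averages. get_trading_session and the session labels are hypothetical names used only to illustrate the table layout; the session boundaries are assumptions, not the project's actual calendar.
# illustrative code, not production code; get_trading_session and the session
# labels are hypothetical, and the session boundaries are assumed for illustration
import pytest
from datetime import datetime

SESSION_CASES = [
    # timestamp, expected_session, description
    (datetime(2024, 1, 8, 9, 30), "day", "regular morning session"),
    (datetime(2024, 1, 8, 12, 30), "lunch_break", "lunch break"),
    (datetime(2024, 1, 8, 17, 30), "night", "night session open"),
    (datetime(2024, 1, 9, 1, 30), "night", "night session after midnight"),
    (datetime(2024, 1, 8, 4, 0), "closed", "market closed"),
]

@pytest.mark.parametrize("timestamp,expected_session,desc", SESSION_CASES)
def test_trading_session_classification(timestamp, expected_session, desc):
    """Each row pins one trading-session boundary instead of one happy path"""
    assert get_trading_session(timestamp) == expected_session, f"Failed on {desc}"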
| model | Applicable scenarios | risk interception point |
|---|---|---|
| AAA | Pure functions, indicators, profit and loss calculations | Clear separation of input, action, and output |
| Given-When-Then | Strategy signals, user stories, acceptance scenarios | Binding business semantics and behavioral results |
| table driven | Multiple sets of boundary values, equivalence classes, abnormal input | Prevent only testing one happy path |
Part 4: Test Doubles: Isolating External Dependencies
Quantitative systems rely heavily on external data sources, order managers, databases, logs, and trading gateways. The purpose of a test double is not to “mock everything”, but to isolate uncontrollable dependencies while retaining the behavior you want to verify. Dummy, Stub, Spy, Mock and Fake solve different problems, and mixing them will make the test brittle.
Dummy is a placeholder object and is only used to meet parameter requirements. The logger can be an object that will not be used when the current test does not verify logging behavior.
# illustrative code, not production code
def test_dummy_logger_does_not_affect_fetch():
    """A Dummy logger only fills the parameter and is never asserted on"""
    dummy_logger = object()
    service = DataService(logger=dummy_logger)
    result = service.fetch_data("HSI")
    assert result is not None
This code only proves that DataService.fetch_data does not rely on logger behavior. Readers should not assert logging calls in Dummy tests, otherwise Dummy will be mistakenly upgraded to Mock.
Stub is a fixed return value object, suitable for replacing the real market source. When the strategy only requires a certain price, Stub can ensure that the test does not rely on real APIs, network delays, and market permissions.
# illustrative code, not production code
class StubDataFeed:
    """Stub that returns a fixed price"""
    def __init__(self, fixed_price: float):
        self.fixed_price = fixed_price

    def get_price(self, symbol: str) -> float:
        return self.fixed_price

def test_strategy_with_stub_price():
    """Test the strategy against a Stub price feed"""
    stub_feed = StubDataFeed(fixed_price=25000)
    strategy = SimpleStrategy(data_feed=stub_feed)
    signal = strategy.check_signal()
    assert signal is not None
The reader value of Stub is stable input. As long as the test goal is strategy logic, the real market service should not be a source of failure.
Spy records the interaction between the object under test and collaborators, and is suitable for verifying the number, direction and quantity of orders. It does not preset complex expectations and only records the interaction for use in assertions.
# illustrative code, not production code
class SpyOrderManager:
    """Spy that records every order call"""
    def __init__(self):
        self.orders_placed = []

    def place_order(self, symbol: str, side: str, quantity: int):
        self.orders_placed.append({
            "symbol": symbol,
            "side": side,
            "quantity": quantity,
        })

def test_strategy_places_buy_order_once():
    """Verify that exactly one BUY order is placed"""
    spy = SpyOrderManager()
    strategy = SignalStrategy(order_manager=spy)
    strategy.on_signal(Signal.BUY)
    assert len(spy.orders_placed) == 1
    assert spy.orders_placed[0]["side"] == "BUY"
This code protects the order side-effect boundary. Readers should pay attention to the two assertions, "only one order is placed" and "the direction is correct", because duplicate orders and wrong directions can lead to real losses.
Mock presets expectations and verifies calling behavior, and is suitable for checking whether a method is called according to specified parameters. Overuse of mocks binds tests to implementation details, so only use them when the interaction itself is part of the behavior.
# illustrative code, not production code
from unittest.mock import Mock

def test_strategy_calls_data_feed_with_symbol():
    """Verify that the strategy requests the HSI contract exactly once"""
    mock_feed = Mock()
    mock_feed.get_price.return_value = 25000
    strategy = Strategy(data_feed=mock_feed)
    strategy.update()
    mock_feed.get_price.assert_called_once_with("HSI")
    assert mock_feed.get_price.call_count == 1
The key assertion in this code is assert_called_once_with("HSI"). If the policy requests the wrong contract, duplicate requests, or missed requests, Mock can immediately expose interface interaction errors.
Fake is a simplified but real usable implementation, such as an in-memory database. Repository testing uses Fake to preserve read and write semantics while avoiding the cost of real database connections, transactions, and cleanup.
# illustrative code, not production code
class FakeDatabase:
    """In-memory Fake database"""
    def __init__(self):
        self.data = {}

    def save(self, key: str, value: dict):
        self.data[key] = value

    def get(self, key: str) -> dict:
        return self.data.get(key)

    def delete(self, key: str):
        if key in self.data:
            del self.data[key]

def test_repository_with_fake_db():
    """Test the Repository against the Fake database"""
    fake_db = FakeDatabase()
    repo = BarRepository(database=fake_db)
    bar = BarData(symbol="HSI", close=25000)
    repo.save(bar)
    retrieved = repo.get("HSI")
    assert retrieved.close == 25000
The value of Fake is to preserve the true semantics. Readers can understand it as a “lightweight but runnable alternative implementation”, which is more suitable than Mock for testing the reading and writing process and Repository boundaries.
| Stand type | use | Quantitative system example |
|---|---|---|
| Dummy | Fill in parameters that will not be used | Logger that does not participate in assertions |
| Stub | Return fixed response | Fixed price quote source |
| Spy | Record interaction | Record the number of strategic orders placed |
| Mock | Verify calling behavior | Verify market interface call parameters |
| Fake | Simplify real implementation | In-memory database or local order book |
Part 5: AI-assisted TDD process
AI-assisted TDD should not let AI take over acceptance criteria. A more secure division of labor is: humans define defects, business semantics, and acceptance boundaries; AI expands the test framework, generates candidate implementations, and makes refactoring suggestions; and humans ultimately review whether the tests really cover risks.
The three steps of traditional TDD are manual writing of tests, manual writing of implementation, and manual refactoring. AI-assisted TDD can be changed to: manually write specifications and key boundaries, AI generates a test framework; AI generates candidate implementations, and humans review; AI provides refactoring suggestions, and humans judge whether to accept them. This improves speed but does not hand over business semantic sign-off to the AI.
K-line period aggregation is a complete example. The specification is defined first: given 1-minute K-line data, generate 5-minute K-lines; the input contains open/high/low/close/volume/turnover; the output is a list of 5-minute K-lines. The core rules include: boundaries normalize to period boundaries such as 09:00, 09:05, 09:10; open takes the first bar's open, high takes the maximum, low takes the minimum, close takes the last bar's close, and volume and turnover are summed. Boundary cases include: fewer than 5 bars generate nothing, data crossing a 5-minute boundary is segmented correctly, and timestamps not on a boundary are rounded down.
Specifications should be locked down to business rules in natural language before being handed over to AI for extended testing. The following specification is not a decorative document, it qualifies that AI cannot interpret aggregation semantics at will.
## Specification: K-line period aggregation
### Requirements
Given 1-minute bar data, generate 5-minute bar data.
### Input
- A list of 1-minute bars with open, high, low, close, volume, turnover
### Output
- A list of 5-minute bars
### Rules
1. Boundary normalization: 5-minute bars align to 09:00, 09:05, 09:10...
2. Price aggregation:
   - open = the open of the first 1-minute bar
   - high = max(1-minute high)
   - low = min(1-minute low)
   - close = the close of the last 1-minute bar
3. Volume aggregation: volume = sum(1-minute volume), turnover = sum(1-minute turnover)
### Boundary cases
- Fewer than 5 bars of data: do not generate a K-line
- Data crossing a 5-minute boundary: split into the correct periods
- Timestamp not on a boundary: round down to the 5-minute boundary
Readers can think of this specification as an input contract for AI-generated testing. Without it, it is easy for AI to generate tests that “seem reasonable but are semantically uncertain”, such as ignoring boundary normalization or aggregating data for insufficient periods.
Prompt generation for AI testing should clearly define the testing strategy, sample factory, and assertion goals.
Generate pytest tests for the aggregation behavior described below.
Requirements:
1. Use the AAA structure.
2. Cover normal cases, boundary cases, and error cases.
3. Use input fixtures that expose trading-session ownership.
4. Generate test data with factory functions.
5. Give each test a docstring describing the verified behavior.
The following test framework retains key details in the original version: the factory function is responsible for generating time-continuous Bar, normal use cases verify OHLCV aggregation, edge use cases verify that insufficient cycles are not generated, and parameterized use cases verify the relationship between the number of inputs and the number of outputs.
# illustrative code, not production code
import pytest
from datetime import datetime, timedelta
from typing import List

from core.aggregation import aggregate_bars
from core.data import BarData

class TestAggregateBars:
    """Tests for K-line period aggregation"""

    @staticmethod
    def create_bars(count: int, start_time: datetime) -> List[BarData]:
        """Factory function: create time-continuous test bars"""
        bars = []
        for i in range(count):
            bars.append(
                BarData(
                    symbol="HSI",
                    timestamp=start_time + timedelta(minutes=i),
                    open_price=100 + i,
                    high_price=105 + i,
                    low_price=95 + i,
                    close_price=100 + i + 0.5,
                    volume=1000,
                    turnover=100000,
                )
            )
        return bars

    def test_normal_case(self):
        """Normal case: 5 one-minute bars aggregate into 1 five-minute bar"""
        start_time = datetime(2024, 1, 8, 9, 0)
        bars = self.create_bars(5, start_time)
        result = aggregate_bars(bars, period=5)
        assert len(result) == 1
        assert result[0].open_price == 100
        assert result[0].high_price == 109
        assert result[0].low_price == 95
        assert result[0].close_price == 104.5
        assert result[0].volume == 5000

    def test_insufficient_data(self):
        """Boundary case: fewer than 5 bars generate nothing"""
        start_time = datetime(2024, 1, 8, 9, 0)
        bars = self.create_bars(3, start_time)
        result = aggregate_bars(bars, period=5)
        assert len(result) == 0

    @pytest.mark.parametrize("input_count,expected_count", [
        (5, 1),
        (10, 2),
        (12, 2),  # the trailing 2 bars form an incomplete period
        (0, 0),
    ])
    def test_various_counts(self, input_count, expected_count):
        """Parameterized: output count follows from input count"""
        start_time = datetime(2024, 1, 8, 9, 0)
        bars = self.create_bars(input_count, start_time)
        result = aggregate_bars(bars, period=5)
        assert len(result) == expected_count
The architectural value of this test is that it first defines the output semantics and then allows implementation changes. As long as these tests exist, the same business results must be preserved whether subsequent use of loops, NumPy, vectorization, or incremental state.
The RED phase should fail first. Failure is not a bad thing, it proves that the test does catch the current gap.
$ pytest tests/test_aggregation.py -v
tests/test_aggregation.py::TestAggregateBars::test_normal_case FAILED
tests/test_aggregation.py::TestAggregateBars::test_insufficient_data FAILED
tests/test_aggregation.py::TestAggregateBars::test_various_counts FAILED
# Failure reason: aggregate_bars is not implemented yet
This output confirms to the reader that the tests are not a post-hoc patch. If a test passes on its very first run, it is likely that the test does not cover the real gap, or that the implementation already exists but has never been constrained to the business semantics.
The implementation of the GREEN phase only needs to meet the defined tests and do not add too many abstractions in advance. The following implementation retains the core aggregation logic: slice by complete cycle and calculate open, high, low, close, volume and turnover respectively.
# illustrative code, not production code
from typing import List

def aggregate_bars(bars: List[BarData], period: int) -> List[BarData]:
    """Aggregate 1-minute bars into bars of the given period"""
    if len(bars) < period:
        return []
    result = []
    for i in range(0, len(bars) // period * period, period):
        group = bars[i:i + period]
        aggregated = BarData(
            symbol=group[0].symbol,
            timestamp=group[0].timestamp,
            open_price=group[0].open_price,
            high_price=max(b.high_price for b in group),
            low_price=min(b.low_price for b in group),
            close_price=group[-1].close_price,
            volume=sum(b.volume for b in group),
            turnover=sum(b.turnover for b in group),
        )
        result.append(aggregated)
    return result
This implementation only addresses behavior covered by the current test. Readers should note that there is still room for improvement. For example, the timestamp alignment logic has not yet been independent, and the remaining insufficient period processing strategy still relies on test constraints.
The REFACTOR phase then extracts timestamp alignment and aggregated-bar construction into dedicated helpers. The precondition of refactoring is that the tests keep passing, and the goal is to make the boundary semantics easier to understand for the next maintainer.
# illustrative code, not production code
def aggregate_bars(bars: List[BarData], period: int) -> List[BarData]:
    """Aggregate 1-minute bars into bars of the given period"""
    if len(bars) < period:
        return []
    result = []
    end = len(bars) - len(bars) % period
    for i in range(0, end, period):
        group = bars[i:i + period]
        aligned_time = _align_timestamp(group[0].timestamp, period)
        result.append(_create_aggregated_bar(group, aligned_time))
    return result

def _align_timestamp(timestamp: datetime, period: int) -> datetime:
    """Round the timestamp down to the period boundary"""
    minute = (timestamp.minute // period) * period
    return timestamp.replace(minute=minute, second=0, microsecond=0)

def _create_aggregated_bar(group: List[BarData], timestamp: datetime) -> BarData:
    """Create one aggregated K-line from a group of 1-minute bars"""
    return BarData(
        symbol=group[0].symbol,
        timestamp=timestamp,
        open_price=group[0].open_price,
        high_price=max(b.high_price for b in group),
        low_price=min(b.low_price for b in group),
        close_price=group[-1].close_price,
        volume=sum(b.volume for b in group),
        turnover=sum(b.turnover for b in group),
    )
The refactored code makes boundary normalization explicit. Readers should focus on whether _align_timestamp covers the real exchange sessions; if night session, half-day trading, or cross-week attribution is needed later, this function becomes a key test entry.
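A minimal sketch of that test entry, exercising the _align_timestamp helper above, pins the rounding-down rule from the specification before any session-calendar logic is added:
# illustrative code, not production code; exercises the _align_timestamp helper
# sketched above under the rounding-down rule from the specification
import pytest
from datetime import datetime

@pytest.mark.parametrize("raw_minute,expected_minute", [
    (0, 0),    # already on a boundary
    (3, 0),    # mid-period rounds down, never up
    (7, 5),
    (59, 55),  # last period of the hour
])
def test_align_timestamp_rounds_down(raw_minute, expected_minute):
    """Timestamps not on a 5-minute boundary must round down, never up"""
    raw = datetime(2024, 1, 8, 9, raw_minute, 42)
    aligned = _align_timestamp(raw, period=5)
    assert aligned == datetime(2024, 1, 8, 9, expected_minute, 0)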
Part 6: Quantifying System-Specific Testing Strategies
Quantitative testing cannot rely solely on handwritten samples. Handwriting samples are good for illustrating business intent, but rarely exhaust the input space. Trading systems require at least four types of proprietary strategies: attribute-based testing, fuzz testing, backtesting consistency verification, and boundary time testing.
Property-based testing is suitable for verifying mathematical properties. The core properties of the moving average include: the result length is related to the input length and period; each mean value should fall between the minimum and maximum values of the corresponding window; when the input sequence does not decrease monotonically, the moving average should also not decrease monotonically. Hypothesis can automatically generate a large number of price series and periods, helping to find boundaries missed by manual samples.
# illustrative code, not production code
from hypothesis import assume, given, strategies as st

@given(
    prices=st.lists(
        st.floats(min_value=1, max_value=100000),
        min_size=10,
        max_size=100,
    ),
    period=st.integers(min_value=2, max_value=20),
)
def test_ma_properties(prices, period):
    """Mathematical properties of the moving average over generated inputs"""
    assume(period <= len(prices))  # skip windows longer than the series
    result = calculate_ma(prices, period)
    assert len(result) == len(prices) - period + 1
    for value in result:
        assert min(prices) <= value <= max(prices)
    if all(prices[i] <= prices[i + 1] for i in range(len(prices) - 1)):
        assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
The value of this test is upgraded from “giving a few examples” to “verifying mathematical properties”. Readers should note that the second assertion is written as a global scope constraint; if you want to be more strict, you can limit each mean to min/max of the corresponding sliding window.
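A stricter version of that property, sketched under the same assumed calculate_ma signature (one value per complete window, aligned with the window start), checks each mean against its own window rather than the global range:
# illustrative code, not production code; assumes calculate_ma returns one value
# per complete window, with result[i] corresponding to prices[i:i + period]
from hypothesis import assume, given, strategies as st

@given(
    prices=st.lists(st.floats(min_value=1, max_value=100000), min_size=10, max_size=100),
    period=st.integers(min_value=2, max_value=10),
)
def test_ma_bounded_by_each_window(prices, period):
    """Each mean must lie within the min/max of its own sliding window"""
    assume(period <= len(prices))
    result = calculate_ma(prices, period)
    for i, value in enumerate(result):
        window = prices[i:i + period]
        assert min(window) <= value <= max(window)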
Fuzz testing is suitable for verifying the robustness of parsing logic. When the market parser faces any byte input, it is allowed to throw a clear ParseError, but no unexpected exceptions should occur; if the parsing is successful, it must also meet basic constraints such as high >= low, high >= open, high >= close and other basic constraints.
# illustrative code, not production code
import atheris
import sys

@atheris.instrument_func
def test_parse_bar(input_bytes):
    """Any byte input must either parse cleanly or raise ParseError"""
    fdp = atheris.FuzzedDataProvider(input_bytes)
    data = fdp.ConsumeBytes(len(input_bytes))
    try:
        bar = parse_bar_data(data)
        if bar:
            assert bar.high >= bar.low
            assert bar.high >= bar.open
            assert bar.high >= bar.close
    except ParseError:
        pass  # an explicit parse failure is acceptable behavior
    except Exception:
        raise  # any other exception is a defect and must surface

atheris.Setup(sys.argv, test_parse_bar)
atheris.Fuzz()
This code keeps dirty data hitting the parser. Readers should pay attention to exception boundaries: business-acceptable parsing failures should enter ParseError, and unknown exceptions are defects that need to be fixed.
Backtest consistency verification is used to ensure that the strategy performs consistently in backtests and simulated trading. What this test protects is “the same set of business semantics for backtesting and real trading”, not a single function.
# illustrative code, not production code
def test_strategy_consistency():
    """Backtest and simulation must emit consistent signals on the same data"""
    historical_data = load_data("HSI", "2024-01-01", "2024-01-31")
    backtest_result = run_backtest(
        strategy=MyStrategy(),
        data=historical_data,
        mode="backtest",
    )
    simulated_result = run_backtest(
        strategy=MyStrategy(),
        data=historical_data,
        mode="simulation",
    )
    assert len(backtest_result.signals) == len(simulated_result.signals)
    for b_sig, s_sig in zip(backtest_result.signals, simulated_result.signals):
        assert b_sig.timestamp == s_sig.timestamp
        assert b_sig.action == s_sig.action
        assert abs(b_sig.price - s_sig.price) < 0.01
This code should not be understood as "backtest returns and simulated-trading returns are exactly equal". A more precise goal is: under the same event input, signal count, timestamps, actions and price semantics should be consistent, and the allowed tolerance must be written explicitly into the assertion.
Boundary time testing is one of the most important regression defense lines for quantitative systems. There should be fixed fixtures from Friday night trading to Monday morning trading, half-day trading, lunch break, the first opening bar, the last closing bar, and night trading on the eve of holidays.
# illustrative code, not production code
class TestHKFEBoundaryCases:
    """HKFE boundary-time regression cases"""

    def test_friday_night_to_monday(self):
        """Friday night session and Monday morning belong to the same trading day"""
        friday_night = datetime(2024, 1, 5, 23, 0)
        monday_morning = datetime(2024, 1, 8, 9, 15)
        bars = [
            BarData(timestamp=friday_night, close=25000),
            BarData(timestamp=monday_morning, close=25100),
        ]
        result = aggregate_daily(bars)
        assert len(result) == 1
        assert result[0].date == date(2024, 1, 5)

    def test_half_day_holiday(self):
        """No bars may appear after a half-day session closes at noon"""
        half_day_close = datetime(2024, 2, 9, 12, 0)
        bars = generate_bars_until(half_day_close)
        assert all(bar.timestamp <= half_day_close for bar in bars)
This code turns overnight, weekly, and half-day markets into fixed regression samples. Readers should focus on the trading day attribution rules, not whether the physical dates are consecutive.
Boundary times should be thought of as state transitions rather than lists of dates. The tests need to cover the transition from day trading to lunch break, recovery from lunch break, day trading to night trading, crossing midnight, cross-week attribution, early close of a half-day session and market closure on holidays. As long as these states are not explicitly tested, aggregators and strategies will continue to rely on implicit assumptions.
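One way to pin those transitions, sketched with a hypothetical trading_day_of(timestamp) calendar helper, is a parameterized transition table in which every row names the state change it guards; the expected dates follow the attribution rule used in TestHKFEBoundaryCases above.
# illustrative code, not production code; trading_day_of is a hypothetical calendar
# helper, and the expected dates follow the attribution rule used in the class above
import pytest
from datetime import datetime, date

TRANSITION_CASES = [
    # timestamp, expected trading day, guarded transition
    (datetime(2024, 1, 5, 16, 0), date(2024, 1, 5), "day session close"),
    (datetime(2024, 1, 5, 23, 30), date(2024, 1, 5), "day -> night session"),
    (datetime(2024, 1, 6, 1, 0), date(2024, 1, 5), "night session past midnight"),
    (datetime(2024, 1, 8, 9, 15), date(2024, 1, 5), "weekend gap, pre-open ownership"),
]

@pytest.mark.parametrize("timestamp,expected_day,transition", TRANSITION_CASES)
def test_trading_day_ownership_across_transitions(timestamp, expected_day, transition):
    """Every session transition must map to an explicit trading-day owner"""
    assert trading_day_of(timestamp) == expected_day, f"Failed on {transition}"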
Part 7: Best Practices for AI Generated Testing
The key to getting AI to generate better tests is to give it a testing strategy, rather than just saying “make up the test.”
First, clarify the testing method. You can ask AI to use boundary value analysis to generate normal ranges, boundary values, and abnormal inputs respectively; you can also ask AI to use equivalence class division to split price, time, trading day, and order status into different categories.
Use boundary value analysis to generate tests for the following function:
- normal case: input is within the valid range
- boundary case: input is exactly on the boundary
- error case: input is outside the valid range
The value of this prompt is to force the AI to organize the test space first. The reader can go on to add that “each use case must describe the source of risk and assertion target” to prevent AI from giving only superficial coverage.
Second, ask the AI to interpret the intent of the test. Each test should state the verified business hypothesis, such as “the night trading bar at 23:30 on Friday should be attributed to the trading day of the following Monday, not the natural day Friday”.
For each test, add a comment or docstring stating which business assumption it verifies and how the assertion verifies it.
If the AI’s explanations are unclear, the tests it generates are often unreliable. Explaining test intent is not about adding comments for their own sake, but about making the business assumptions reviewable.
Third, let AI identify omissions. Give the implementation, specifications, and existing tests to the AI and let it list the scenarios it may have missed. It will usually find problems such as null input, maximum values, minimum values, concurrent access, resource exhaustion, and floating point precision.
Given the implementation and requirement notes, identify missing test scenarios.
Classify them as normal, boundary, error, performance, or output-validation cases.
Each scenario must state the risk source, minimal reproduction data, and assertion target.
This prompt turns “make-up testing” into a risk classification task. Readers should treat the AI output as a candidate list, and then decide whether to adopt it based on real defects and system boundaries.
Fourth, use mutation testing to check test quality. Tools such as mutmut will modify the source code, such as changing > to >=, and then run the tests. If the test still passes, the test does not truly protect the behavior.
# illustrative code, not production code
# Install: pip install mutmut
# Run:     mutmut run
# mutmut mutates the source, for example changing > to >=, then reruns the tests.
# If the tests still pass, they do not actually constrain the mutated behavior.
Mutation testing is not suitable for running every submission, but it is suitable for use before and after refactoring key modules. It can expose the illusion of “high coverage but weak assertions”.
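The sketch below illustrates, with hypothetical names, the kind of mutant that survives weak assertions: the threshold comparison is mutated from > to >=, and a test that never exercises the exact boundary stays green either way.
# illustrative code, not production code; should_trigger and the threshold
# values are hypothetical, used only to show a surviving mutant
def should_trigger(price: float, threshold: float) -> bool:
    return price > threshold          # mutant: price >= threshold

def test_trigger_far_from_boundary():
    """Weak test: passes against both the original and the mutant"""
    assert should_trigger(25100, 25000) is True
    assert should_trigger(24900, 25000) is False

def test_trigger_at_boundary():
    """Boundary test: kills the >= mutant, because 25000 > 25000 is False"""
    assert should_trigger(25000, 25000) is False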
Fifth, when letting AI generate integration tests and E2E tests, the system boundaries must first be defined. The most common problem is not that AI cannot write test code, but that it writes E2E as a happy path that “can run through pages or commands”, but does not verify whether the business link is really closed. Integration testing of the quantitative system must at least go through data sources, trading day attribution, period aggregation, indicator calculations, strategy signals and order adapters; E2E testing must also verify the results that readers really care about, such as signal timestamps, order direction, quantity, price tolerance, backtest reports and abnormal degradation paths.
Generate integration tests and E2E tests based on the following requirements contract.
Requirements:
1. Each test must trace back to a REQ/INV/E2E ID.
2. Tests must cover DataFeed -> Calendar -> Aggregator -> Strategy -> OrderAdapter.
3. E2E tests must verify user-visible business results, not only that a page opens or a command exits with code 0.
4. Use fakes or stubs only at explicit external boundaries, and keep assertions on the real integration path.
5. Test output must include the evidence path, assertion target, and failure message.
The key to this prompt is to change “Generate E2E” to “Generate traceable E2E according to contract”. Readers should avoid asking AI to write only browser clicks, CLI smoke tests, or mock-only assertions. A qualified E2E must at least prove that a certain requirement enters the system from input data, passes through the real integration boundary, and finally produces auditable business results.
Sixth, after the execution of the requirements contract is completed, there needs to be an independent tracking artifact, instead of just seeing whether the test passes. Trace Matrix should put requirements, constraints, invariants, tests, evidence and states in the same table. In this way, implementation deviations can be identified: the code is written but does not correspond to the requirements, the test passes but there is no evidence, E2E only covers the happy path, or the implementation path bypasses the boundaries defined in the contract.
trace:
  - id: TRACE-HKFE-001
    requirement: REQ-TRADING-DAY-OWNERSHIP
    invariant: INV-NATURAL-DAY-MUST-NOT-LEAK
    integration_test: tests/integration/test_hkfe_daily_aggregation.py
    e2e_test: tests/e2e/test_strategy_signal_calendar_boundary.py
    evidence:
      - reports/integration/hkfe_daily_aggregation.xml
      - reports/e2e/strategy_signal_calendar_boundary.json
    drift_signals:
      - implementation_without_requirement
      - passing_test_without_business_assertion
      - mock_only_without_integration_boundary
      - missing_evidence_artifact
    status: PASS
The value of this tracking artifact is to turn "done" into a reviewable state. Readers can follow TRACE-HKFE-001 to see where the requirement comes from, what the invariants are, where the integration tests and E2E tests live, what the evidence files are, and which drift signals need to be intercepted. If any item is missing, the status should not be changed to PASS.
There are typically five types of implementation drift signals. First, a code change cannot be traced to a corresponding REQ or TRACE, indicating that the implementation may have exceeded the requirements boundary. Second, the test only verifies a function return or a page's existence, but does not verify business semantics. Third, E2E relies on a mock-only path that does not cross critical integration boundaries. Fourth, an AI-generated fix makes the tests green but removes, relaxes, or bypasses the original assertion. Fifth, the final report only has a "passed" conclusion and no reviewable evidence artifacts.
Quick correction should not start with “continue to let AI change the code”, but should start with the failure evidence package. The minimum closed loop is: locate the failed TRACE line, read the corresponding REQ/INV/E2E, confirm the offset type, fill in the minimum recurrence or assertion, and then let AI only modify the code and tests related to the TRACE. After repair, re-run the corresponding gate and update the evidence path and status. This process can condense a large-scale rework into iterations around a single requirements contract.
| Drift signal | Risk | Corrective action |
|---|---|---|
| Code without a REQ/TRACE | Implementation runs beyond the requirements boundary | Add the requirement mapping or withdraw the unfounded implementation |
| Test without business assertions | Happy-path pseudo pass | Add assertions on business-visible results |
| E2E relies only on mocks | Integration boundaries not verified | Connect a Fake gateway and the real data path |
| PASS without evidence | Completion status cannot be reviewed | Generate and record reports, screenshots or JSON evidence |
| AI relaxes assertions | Test quality declines | Restore the original assertion and fix the failing case |
The way to use this table is straightforward: each time the AI claims completion, walk through the drift signals one by one before reading the code diff. Trading systems especially require this kind of discipline, since many errors are not exposed in unit tests but in the combination of data paths, trading-day semantics, strategy events and order adapters.
Part 8: Test Data Management
Test data management determines whether a test is reproducible. The quantitative system cannot directly plug real market conditions into all tests, nor can it rely solely on randomly generated data. A more reliable combination is: synthetic data covers structural scenarios, data snapshots save real defect samples, and contract testing protects data source boundaries.
Synthetic data is suitable for covering trends, volatility, gaps and extreme values. The generator can control start and end times, trend direction, volatility and volume range.
# illustrative code, not production code
import random
from datetime import datetime, timedelta
from typing import List

def generate_synthetic_bars(
    symbol: str,
    start: datetime,
    end: datetime,
    trend: str = "random",
    volatility: float = 0.02,
) -> List[BarData]:
    """Generate synthetic K-line data with a controllable trend and volatility"""
    bars = []
    current_time = start
    price = 25000
    while current_time <= end:
        if trend == "up":
            change = abs(random.gauss(0.001, volatility))
        elif trend == "down":
            change = -abs(random.gauss(0.001, volatility))
        else:
            change = random.gauss(0, volatility)
        price *= (1 + change)
        # A production generator should also enforce high >= max(open, close)
        # and low <= min(open, close) so the output satisfies the data contract.
        bar = BarData(
            symbol=symbol,
            timestamp=current_time,
            open_price=price * (1 + random.gauss(0, 0.001)),
            high_price=price * (1 + abs(random.gauss(0, 0.002))),
            low_price=price * (1 - abs(random.gauss(0, 0.002))),
            close_price=price,
            volume=random.randint(1000, 10000),
        )
        bars.append(bar)
        current_time += timedelta(minutes=1)
    return bars
This code is suitable for constructing trend, volatility, and volume ranges, but should not be used as a substitute for real defect snapshots. Readers should fix the random seed or record generation parameters, otherwise the synthetic data itself will become a source of instability.
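A minimal way to make such data reproducible, assuming the generate_synthetic_bars sketch above (which draws from the module-level random generator), is to pin an explicit seed inside a fixture and record the generation parameters alongside it:
# illustrative code, not production code; assumes the generate_synthetic_bars
# sketch above, which draws from the module-level random generator
import random
from datetime import datetime
import pytest

@pytest.fixture
def reproducible_uptrend_bars():
    """Synthetic uptrend data pinned to a fixed seed and recorded parameters"""
    random.seed(20240108)  # fixed seed: the same bars on every run
    return generate_synthetic_bars(
        symbol="HSI",
        start=datetime(2024, 1, 8, 9, 0),
        end=datetime(2024, 1, 8, 12, 0),
        trend="up",
        volatility=0.02,
    )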
Data snapshots are suitable for saving real defect samples. The desensitized fixtures/hsi_2024_01.json can be used to reproduce the Friday night trading attribution error or the half-day market closing abnormality.
# illustrative code, not production code
@pytest.fixture
def real_market_data():
    """Use a redacted real-market data snapshot"""
    return load_test_data("fixtures/hsi_2024_01.json")
Snapshots should be small and stable, focusing on retaining the minimum data required to reproduce the defect. Readers should record the source, the desensitization method, the trading-day attribution rules and the corresponding defect number.
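One lightweight way to record that lineage is a metadata sidecar stored next to the snapshot itself. The sketch below is illustrative only; the field names and values are hypothetical, not the project's actual files.
# illustrative metadata, not a real project file; field names and values are hypothetical
fixture: fixtures/hsi_2024_01.json
source: redacted HKFE HSI 1-minute snapshot, January 2024
redaction: account and order identifiers removed, prices and timestamps kept
trading_day_rule: night session owned by the same trading day as the preceding day session
defect_id: ISSUE-XXX  # the regression issue this snapshot reproduces
assertion_target: daily aggregation produces exactly one bar per trading day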
Contract testing is suitable for validating data source boundaries. For example, the data returned by get_bars("HSI", "1m", count=100) must be sorted by time; each Bar must satisfy high >= open/close/low; the timestamp must have a time zone.
# illustrative code, not production code
def test_datafeed_contract():
    """Verify the structural contract of the upstream data source"""
    feed = DataFeed()
    bars = feed.get_bars("HSI", "1m", count=100)
    timestamps = [bar.timestamp for bar in bars]
    assert timestamps == sorted(timestamps)
    for bar in bars:
        assert bar.high >= bar.open
        assert bar.high >= bar.close
        assert bar.high >= bar.low
        assert bar.timestamp.tzinfo is not None  # timestamps must carry a time zone
Contract testing prevents upstream data changes from quietly breaking downstream policies. Readers should place it near the data entry instead of waiting for the strategy test to fail before troubleshooting data problems.
The focus here is data lineage. Every step of the test data, from raw market data through cleaning, trading-day attribution, aggregation, fixtures, assertions and regression records, should be traceable. Otherwise, when a test fails, readers cannot determine whether the raw data changed, the cleaning logic changed, the trading-day attribution changed, or the assertion itself has expired.
Part 9: Automated Testing in CI/CD
The CI Gate’s responsibility is not to run all tests, but to block incorrect merges at the appropriate stage. Different test layers bear different feedback speeds and risk interception responsibilities.
A typical CI can be executed in layers: unit testing is the fastest, covering the core algorithm; integration testing verifies data source, aggregator, strategy and order module collaboration; attribute testing uses a fixed seed to ensure reproducibility; boundary time testing covers HKFE special trading days; coverage reports are used to prevent core modules from losing test protection; failed tests must prevent merging.
name: Tests
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt -r requirements-test.txt
      - run: pytest tests/unit -v --cov=core --cov-report=xml
      - run: pytest tests/integration -v
      - run: pytest tests/property -v --hypothesis-seed=0
      - uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
          fail_ci_if_error: true
This configuration breaks fast feedback, module collaboration, property testing, and coverage reporting into independent steps. Readers should decide which tests go into each PR and which tests go into nightly tasks based on the size of the project, but failing tests must be able to prevent high-risk merges.
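One common way to make the PR/nightly split explicit, sketched here as an assumption rather than the project's actual configuration, is a pytest marker that nightly-only suites carry and that the PR job excludes:
# illustrative configuration, not the project's actual setup
# pytest.ini
[pytest]
markers =
    nightly: slow suites (fuzzing, large property runs) executed only in the nightly job

# In the PR workflow, exclude nightly-marked tests:
#   pytest -m "not nightly" tests
# In the scheduled nightly workflow, run the full suite:
#   pytest tests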
The test report needs to be able to answer “which levels have been verified.” The output example below shows the information the report should carry, rather than asking the reader to copy specific numbers.
tests/unit/test_aggregation.py PASSED
tests/unit/test_indicators.py PASSED
tests/unit/test_datafeed.py PASSED
tests/unit/test_strategy.py PASSED
tests/integration/test_end_to_end.py PASSED
tests/property/test_properties.py PASSED
tests/boundary/test_hkfe_cases.py PASSED
tests/regression/test_issue_*.py PASSED
core/aggregation.py 97%
core/indicators.py 94%
core/datafeed.py 91%
core/strategy.py 88%
total 95%
These numbers are not goals per se, but they give the reader an idea of where risk interception occurs. What’s more important is: whether boundary timing, property testing, and real defect regressions are included in the report, not whether the overall coverage is pretty.
Part 10: TDD Anti-Patterns and Pitfalls
The first anti-pattern is testing implementation rather than behavior. Testing internal_cache.size() == 5 would tie the test to the internal structure; it would be safer to assert external behavior.
# illustrative code, not production code
def test_implementation():
    """Anti-pattern: asserting on the implementation"""
    result = calculate()
    assert result.internal_cache.size() == 5
The problem with this code is that the test relies on internal cache structures. As soon as the implementation is changed to a generator, array, or incremental state, the test will fail, even if the external behavior has not changed.
# illustrative code, not production code
def test_behavior():
    """Better: asserting on the externally visible behavior"""
    result = calculate([1, 2, 3, 4, 5])
    assert result == 3.0
Behavioral testing leaves room for refactoring. Trading systems require this kind of testing because performance optimization, data structure replacement, and execution domain splitting all change the internal implementation.
The second anti-pattern is that a test verifies too many things. A test calls func1, func2, func3 at the same time, and it is difficult to locate the root cause when it fails.
# illustrative code, not production code
def test_everything():
    """Anti-pattern: one test covers too many behaviors"""
    result1 = func1()
    result2 = func2()
    result3 = func3()
    assert result1 == expected1
    assert result2 == expected2
    assert result3 == expected3
The problem with this code is that failure location is difficult. Readers cannot quickly determine whether input preparation, a certain function behavior, or pre-state pollutes subsequent assertions.
# illustrative code, not production code
def test_func1():
    """Better: one test per behavior, verifying func1"""
    assert func1() == expected1

def test_func2():
    """Better: one test per behavior, verifying func2"""
    assert func2() == expected2
Split tests make failure signals clearer. In quantitative systems especially, avoid having one test simultaneously verify data loading, indicator calculations, strategy signals and order execution.
The third anti-pattern is ignoring edge cases. Simply testing that divide(10, 2) == 5 is not enough to prove that the function is reliable; you also need to test division by zero, small fractional results, and floating-point approximations.
# illustrative code, not production code
def test_normal_case():
    """Anti-pattern: only the happy path is tested"""
    assert divide(10, 2) == 5
This test only covers the happy path. The boundaries in the trading system are more complex and must cover opening, closing, lunch break, night trading, half-day trading and cross-week trading.
# illustrative code, not production code
def test_normal_case():
    assert divide(10, 2) == 5

def test_divide_by_zero():
    """Boundary: division by zero must raise"""
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)

def test_small_numbers():
    """Fractional results need a floating-point tolerance"""
    assert divide(1, 3) == pytest.approx(0.333, rel=1e-3)
The value of boundary testing is to pin down situations that “do not happen often but have a huge impact once they occur.” In quantitative systems, many real accidents occur at low-frequency boundaries rather than on daily paths.
The fourth anti-pattern is that test data has no lineage. When a test fails without knowing where the fixture came from and what cleaning and attribution logic it went through, it’s difficult to tell whether the failure is valid. All real defect samples should have documented source, desensitization method, attribution rules, and assertion purpose.
The fifth anti-pattern is treating AI-generated tests as acceptance results. AI can expand scenarios, but it cannot prove that scenarios represent real business risks. Acceptance rights still come from specifications, defect reproduction, manual review and CI Gate.
Summary: Test checklist
The test design needs to confirm:
- Whether to infer testing from real defect and risk groups instead of just pursuing coverage.
- Whether to cover normal, boundary, and abnormal scenarios.
- Whether to use attribute testing to uncover hidden assumptions.
- Whether there are boundary time tests for quantitative scenarios.
Test implementation needs to confirm:
- The tests are independent and do not depend on the order of execution.
- Whether the tests are fast enough; ordinary unit tests should keep feedback within seconds.
- Whether the test is readable, you can see at a glance what is being verified.
- Is there a clear Arrange-Act-Assert or Given-When-Then construct.
Test doubles need to be confirmed:
- Whether external dependencies are isolated using Stub, Mock, Spy or Fake.
- Whether there is a simplified but realistic Fake for complex objects.
- Whether Mock-based interaction verification is used only when really needed.
Test data needs to be confirmed:
- Whether to use synthetic data to ensure reproducibility.
- Whether to keep desensitized snapshots of real defects.
- Whether there is a data contract test protecting the upstream boundary.
- Whether sensitive data is desensitized.
CI Gate requires confirmation:
- Whether to run critical tests on every commit.
- Whether the core module coverage is higher than the agreed threshold.
- Whether failing the test prevents the merge.
- Whether regression defects are converted into fixed tests.
Next article preview
The next article enters the practical practice of Python performance tuning. Readers will see how profilers locate bottlenecks, how incremental computation and virtualized rendering reduce latency, and why performance optimization must preserve evidence of correctness.
Reference resources
- pytest documentation: https://docs.pytest.org/
- Hypothesis property test: https://hypothesis.readthedocs.io/
- mutmut mutation testing: https://mutmut.readthedocs.io/
- Test-Driven Development: By Example (Kent Beck)
Series context
You are reading: Quantitative trading system development record
This is article 4 of 7.
Current series chapters
- Quantitative trading system development record (1): five key decisions in project startup and architecture design Taking Micang Trader as an example, this article starts from system boundaries, data flow, trading-session ownership, unified backtesting/live-trading interfaces, and AI collaboration boundaries to establish the architecture thread for the quantitative trading system series.
- Quantitative trading system development record (2): Python Pitfalls practical pitfall avoidance guide (1) Reorganize Python traps from a long list into an engineering risk reference for quantitative trading systems: how to amplify the three types of risks, syntax and scope, type and state, concurrency and state, into real trading system problems.
- Record of Quantitative Trading System Development (Part 3): Python Pitfalls Practical Pitfalls Avoidance Guide (Part 2) Continuing to reorganize Python risks into a reference piece: how GUI lifecycles, asynchronous network failures, security boundaries, and deployment infrastructure affect the long-term stability of quantitative trading systems.
- Quantitative trading system development record (4): test-driven agile development (AI Agent assistance) Starting from a cross-night trading day boundary bug, we reconstruct the test defense line of the quantitative trading system: defect-oriented testing pyramid, AI TDD division of labor, boundary time, data lineage and CI Gate.
- Quantitative trading system development record (5): Python performance tuning practice Transform performance optimization from empirical guesswork into a verifiable investigation process: start from the 3-second chart delay, locate the real bottleneck, compare optimization solutions, and establish benchmarks and rollback strategies.
- Record of Quantitative Trading System Development (6): Architecture Evolution and Reconstruction Decisions Review the five refactorings of Micang Trader, explaining how the system evolved from the initial snapshot to a clearer target architecture, and incorporated technical debt and ADR decisions into long-term governance.
- Quantitative trading system development record (7): AI engineering implementation - from speckit to BMAD Taking the trading calendar and daily aggregation requirements as a single case, explain how AI engineering can enter the delivery of real quantitative systems through specification drive, BMAD role handover and manual quality gate control.