Record of Quantitative Trading System Development (6): Architecture Evolution and Refactoring Decisions
A review of the five refactorings of Micang Trader: how the system evolved from its initial snapshot toward a clearer target architecture, and how technical debt and ADR decisions were folded into long-term governance.
Readers can treat this article as a review of architecture evolution and technical debt management: the five refactorings were not pursued for formal elegance, but as a systematic response to real defects, performance bottlenecks, testing pressure, and collaboration costs.
Series reading order
Part 1 -> Part 2 -> Part 3 -> Part 4 -> Part 5 -> Part 6 -> Part 7. Part 6 comes after the performance chapter because refactoring is not an exercise in abstract aesthetics but a systematic response to real defects, testing pressure, and performance bottlenecks.
The following five refactorings revolve around one question: when a trading system moves from merely running to being maintainable, verifiable, and scalable, which boundaries must be redrawn, and which technical debt must be managed explicitly? Readers do not need to memorize class names and implementation details up front; it is enough to hold one thread: every refactoring should start from a real risk, change boundaries in a verifiable way, and record the costs the new boundaries introduce.
How to read these five refactorings
In a trading system, refactoring is not about making the directory tree prettier; it is about keeping the system easy to reason about under real pressure: as market data keeps streaming in, K-line attribution must stay stable; as indicators keep updating, results must remain verifiable; when the user drags the chart, the interface must not be buried under historical data; and when backtesting, live monitoring, and data loading run concurrently, a failure in one execution domain must not take down the whole terminal.
When reading the five refactorings, the most important thing is not to memorize class names but to see the ordering between them. Once data boundaries are clear, tests can bypass the GUI and verify trading semantics directly; once computation state is explicit, performance optimization no longer means blindly stacking hardware; once chart responsibilities are separated, virtualized rendering does not keep piling into a God class; once execution domains are isolated, long-term governance has stable fault boundaries and review entry points.
The first step must address data responsibilities. If ChartWidget is simultaneously responsible for database access, K-line conversion, indicator calculation, rendering, and interaction handling, any small change ripples through multiple layers: switching a data source touches the UI, adjusting a trading session changes the chart, and fixing an indicator defect may alter window state. The first refactoring separates the data layer from the UI layer; the real question it answers is who owns the data semantics and who is only responsible for presentation. Until this is solved, unit tests, backtest verification, and performance comparisons lack a stable entry point.
The second step deals with computation state. Full recalculation with Pandas is convenient on small samples, but once layered indicators, sliding windows, and real-time incremental quotes stack up, every new K-line can trigger an unnecessary replay of history. The value of IncrementalMA is not a fancier hand-written indicator class; it is reframing the question as: when a new K-line arrives, what is the minimal state the system needs to update? Only when computation state is under control does performance optimization have a clear target; otherwise it just makes wasted work run faster.
The third step separates chart responsibilities. Even after the data layer is independent, the chart can become a new monolith: when the data model, coordinate transforms, drawing logic, mouse interaction, and indicator overlays all crowd together, every new requirement expands the component's cognitive load. The third refactoring applies MVC ideas to separate the data model, the renderer, and the interaction controller, so that "what the data is", "how it is drawn", and "how user actions change the view" can be discussed independently. This step does not look like a performance optimization, but it determines whether the subsequent rendering optimization lands on the right boundary.
The fourth step attacks rendering cost. Once the number of K-lines grows from a few thousand to a hundred thousand, the bottleneck shifts from data loading to the view layer. VirtualizedCandleRenderer is not simply "draw faster": it splits the full data pool, the visible window, the buffer, the offscreen cache, and texture reuse into separate concepts. What is on screen is processed first; data that is temporarily invisible stays in the data layer and no longer participates in every full redraw. It is the trading-chart counterpart of a front-end virtual list, except the items are not DOM rows but K-lines, volumes, indicator lines, and interaction markers.
The fifth step deals with execution domains. When the UI, indicator calculation, backtesting, and data loading all live in one Python process, the GIL, long-running tasks, and blocking I/O amplify one another, eventually showing up as interface freezes, delayed live-trading monitoring, or exceptions that are hard to localize. The goal of multiprocessing plus shared_memory is not parallelism for its own sake, but splitting heavy computation, shared data, and UI responsiveness into isolated failure domains: when an indicator worker crashes, the main interface must not lose responsiveness; when a backtest saturates the CPU, live monitoring must not slow down; when a shared-memory write goes wrong, the system must be able to locate which stage of the data life cycle failed.
Finally comes the loss of control over decisions. The larger a system grows, the more easily "let's just write it like this for now" becomes the hardest debt to repay. The point of ADRs, DEBT-* records, and a Code Review checklist is to preserve the context, benefits, costs, and review conditions of each architectural choice. They do not prove a choice was always right; they let the next maintainer answer three questions: why it was done then, whether the conditions have changed now, and where to start if the boundary needs to move.
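As a concrete illustration of what such a record can hold, here is a minimal sketch of a DEBT entry expressed as a data structure; the field names and example IDs are assumptions for illustration, not the project's actual schema:
# illustrative code, not production code
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class DebtRecord:
    """Minimal technical-debt entry; field names are hypothetical."""
    debt_id: str            # e.g. "DEBT-017" (hypothetical)
    context: str            # why the shortcut was taken at the time
    tradeoff: str           # what was gained, and what was deferred
    review_trigger: str     # the condition under which this must be revisited
    related_adr: str = ""   # e.g. "ADR-0007" (hypothetical), if a decision applies
    opened: date = field(default_factory=date.today)
    tags: List[str] = field(default_factory=list)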
This causal chain also explains why the refactoring order cannot be shuffled at will. Blurred responsibilities make testing hard; hard testing leaves performance optimization without evidence of correctness; optimization without evidence amplifies refactoring risk; growing risk pushes the team to keep piling on local patches; and once the debt piles high enough, new features, defect fixes, and fault localization all slow down. Conversely: the first refactoring establishes boundaries, the second controls computational complexity, the third removes view responsibilities, the fourth cuts rendering cost, the fifth isolates execution failures, and long-term governance distills these changes into a repeatable mechanism.
The following reading frame helps locate the key points in this long article:
| Reading question | Refactoring | Judgment criterion | Evidence to watch |
|---|---|---|---|
| Are data semantics polluted by the UI? | First | Are data access, transformation, and display layered? | Whether ChartWidget gives up the data-owner role |
| Are indicators slowed by full recalculation? | Second | Does a new K-line update only the necessary state? | Whether IncrementalMA keeps only minimal state |
| Can chart responsibilities evolve independently? | Third | Are data model, renderer, and interaction separated? | Whether the MVC split makes testing easier |
| Does the chart stay smooth at large data volumes? | Fourth | Does a visible window replace full drawing? | Whether VirtualizedCandleRenderer handles only the viewport |
| Does recalculation drag down the UI and live monitoring? | Fifth | Are process boundaries and shared memory explicit? | Whether shared_memory reads/writes have life-cycle constraints |
| Can architectural choices be reviewed? | Debt governance | Are there ADRs, DEBT records, and review conditions? | Whether the next maintainer can understand the decision |
If readers remember only one principle, let it be this: refactoring a trading system should start not from "where the code is ugly" but from "which boundary is producing errors, latency, or collaboration cost". Code style can be handled by a linter. Problems that genuinely require architectural refactoring usually share three traits: they span multiple modules, they have already produced testing, performance, or operational risk, and local patches cannot suppress them for long. Only when these signals appear together is refactoring worth a formal decision.
This is why the five refactorings must be viewed together with their evidence. After the data layer and UI are decoupled, tests must prove that strategy logic no longer depends on window objects; after incremental indicators replace full Pandas recalculation, benchmarks must prove the computation time dropped and regression tests must prove the results are identical; after VirtualizedCandleRenderer is wired in, interaction tests and frame-rate data must prove dragging is smooth; after the multi-process architecture goes live, fault injection must prove a crashed child process cannot bring down the main interface. Without this evidence, a refactoring is just a massive code move.
Looking at the five refactorings together, they read less like five isolated cases and more like a maintenance mechanism. The first makes data input trustworthy; the second makes computation state controllable; the third makes interface boundaries detachable; the fourth makes interaction performance measurable; the fifth makes execution failures isolable. Once these capabilities exist, the technical-debt list, ADRs, and Code Review checklist are not mere documents but operating mechanisms that keep constraining how the system evolves. The real reliability of a long-running quantitative system is not that it is never refactored, but that every refactoring comes with evidence, boundaries, rollbacks, and review conditions.
This main line also helps readers judge which stage their own system is at: if switching a data source still touches the UI, don't talk about multiprocessing; if indicator results do not yet have a reference implementation, don't rush into incremental optimization; if the chart component is still a monolith, don't treat virtualized rendering as a silver bullet. The order of architecture evolution is itself part of risk control: later optimizations depend on the boundaries and tests established earlier.
In other words, the refactoring order is not the layout order of this article; it is the order in which system risks are peeled apart, layer by layer.
Technical debt itself is not frightening; what is frightening is not knowing where the debt is, why it formed, and when it must be repaid. A trading system can accept short-term trade-offs, but not unrecorded ones. Temporary code should map to a DEBT-* record, architecture choices to an ADR, performance optimizations to a benchmark, and interface changes to regression tests. This makes the early pace look slower, but it prevents the system from suddenly losing control under the pressure of real markets, real users, and real money.
We must also see the cost of each refactoring. After the first splits out the data layer, the life cycle of data objects must be redefined; after the second introduces incremental indicators, state initialization and replay recovery must be handled more carefully; after the third splits the chart component, event subscription and render-refresh order must be reorganized; after the fourth introduces VirtualizedCandleRenderer, visible windows, cache invalidation, and mouse interaction need a consistent protocol; after the fifth introduces multiprocessing, serialization, shared-memory release, exception propagation, and log correlation all become new objects of governance. Benefits and costs must enter the judgment together, or refactoring easily turns from risk control into a new source of complexity.
Therefore every refactoring should have entry and exit conditions. Entry conditions: the defect has recurred, local patches only move the problem around, test or performance data proves the risk is real, and the team knows the cost of not refactoring. Exit conditions: old behavior is protected by regression tests, key metrics are compared before and after, abnormal paths have degradation plans, the new boundary is documented, and the next maintainer can understand it independently. Without entry conditions, refactoring degenerates into technology preference; without exit conditions, it degenerates into a boundless project.
If readers maintain their own quantitative system, they can use the following checklist directly to evaluate whether to start a refactoring:
- Does the problem span more than two modules, rather than being a local flaw in a single function?
- Is there an impact real users can perceive, such as UI freezes, misaligned indicators, inconsistent backtests, or faults that are hard to localize?
- Is there already a minimal reproduction, performance data, or a failing test case, rather than a judgment made only by reading code?
- Can the change be rolled out in stages without altering external behavior, with a rollback path preserved?
- Can the refactoring outputs be written into an ADR, the DEBT list, the Code Review checklist, and a test plan?
- Has the cost of not refactoring been made explicit, including the blast radius of future feature work, testing cost, and the probability of incidents?
The value of this list is to turn "should we refactor" from a subjective debate into an evidence-driven judgment. Trading systems especially need this discipline, because they face not a one-off page delivery but long-term operation, continuous iteration, and real financial risk. Any architectural change should make the system easier to reason about, not merely make the directory structure look tidier.
The five refactorings also share one trait: each reduces the amount of context a human brain must hold at once. After the data layer is independent, readers no longer have to understand data cleaning inside UI code; after incremental indicators are independent, readers no longer have to replay the full historical window each time; after virtualized rendering is independent, readers no longer have to conflate the hundreds of thousands of K-lines in storage with the few hundred on screen; after multi-process boundaries exist, readers no longer have to put UI responsiveness and CPU-intensive computation in the same failure domain. Good architecture is not about making code look sophisticated; it is about letting maintainers make correct judgments under pressure.
To migrate this method to your own system, start with a minimal closed loop: pick a module that fails repeatedly, record its current symptoms, add a test or benchmark that reproduces the problem, then write an ADR explaining why it must change, how to roll back, and when to review. Move boundaries only when this chain of evidence is complete. This order avoids the risk of "big changes first, evidence later" and lets the team argue about facts rather than personal taste during code review.
This is also where Part 6 connects to the earlier articles: the defect catalog supplies problem samples, the testing article supplies the safety net, the performance article supplies measurement methods, and this refactoring article turns that evidence into boundary adjustments. Without the earlier evidence, architecture evolution becomes an abstract slogan; without the later evolution, the earlier fixes gradually pile up new debt.
This article therefore works best as an architecture checklist for the long-term maintenance phase, and it applies equally to backtesting platforms, risk-control platforms, and desktop trading terminals.
Introduction: Refactoring Decisions and Technical Debt Management
In the development of a quantitative trading system, architecture refactoring is an unavoidable stage. As the code base grows and components become more tightly coupled, technical-debt management becomes a key factor in the project's success.
This part is built around practical experience from the micang-trader project. Readers can see how refactoring decisions grow out of defects, performance, testing, and collaboration costs, and what evidence a sustainable architecture-evolution mechanism needs.
Part One: A record of micang-trader's five refactorings
The first refactoring: decoupling the data layer and UI layer
Architectural dilemma before refactoring
Some time after the project started, obvious architectural problems had emerged.
Typical code smell:
class ChartWidget(QWidget): # illustrative code, not production code
"""Chart component that mixes data access, processing, and rendering."""
def load_kline_data(self, symbol: str, days: int = 30):
"""read data directly from the database"""
conn = sqlite3.connect(self.db_path)
cursor = conn.execute(
"""SELECT datetime, open, high, low, close, volume
FROM bar_data
WHERE symbol = ? AND datetime > date('now', '-{} days')
ORDER BY datetime""".format(days),
(symbol,)
)
rows = cursor.fetchall()
conn.close()
        # build the UI-facing data structures directly inside the widget
self.kline_data = [
{
'datetime': row[0],
'open': float(row[1]),
'high': float(row[2]),
'low': float(row[3]),
'close': float(row[4]),
'volume': int(row[5])
}
for row in rows
]
self.update() # trigger repaint
The problems with this code:
- Mixed responsibilities: a UI component operates the database directly
- Hard to test: unit tests must mock SQLite
- Duplicated code: 8 components contain similar SQL queries
- Not reusable: data-acquisition logic is bound to Qt components
The before/after architecture diagram answers the boundary question first: before the refactoring, ChartWidget both fetches and interprets data, tying GUI code to data semantics; after it, DataService becomes the single entry point for data semantics, and the UI only consumes prepared K-line and indicator inputs.
Pain points:
- Changing the data source required modifying many files
- The same "fetch the last 30 days of K-lines" logic was duplicated in several places
- Test coverage was low (the database was hard to mock)
- New-feature development was slow
Refactoring decision analysis
Signals that triggered the refactoring (several thresholds crossed at once):
| Signal | Status | Threshold | Triggered |
|---|---|---|---|
| Code duplication | High | > 20% | ✅ |
| Lines in a single file | Exceeds threshold | > 1,000 | ✅ |
| Test difficulty | Must mock the DB | Should be testable in isolation | ✅ |
| Blast radius of a change | Many files | < 5 files | ✅ |
Refactoring Goals:
- Centralize data-access logic into one or two files
- Chart components obtain data through an interface and never touch the database directly
- Support unit testing without a real database
- Switching the data source requires changing only one file
Input-output assessment (rough numbers, in hours of work):
- refactoring cost: ≈ 40 hours
- expected benefit: ≈ 120 hours saved (faster changes, easier testing, fewer data bugs)
- risk cost: ≈ 10 hours (fixing regressions introduced by the move)
ROI ≈ 120 / (40 + 10) = 2.4, comfortably above the 1.5 threshold, so the refactoring was worth doing.
Post-refactoring architecture
Core code comparison:
Before refactoring (ChartWidget accesses the database directly):
class ChartWidget(QWidget): # illustrative code, not production code
def load_kline_data(self, symbol: str, days: int = 30):
conn = sqlite3.connect(self.db_path)
cursor = conn.execute(SQL_QUERY, (symbol,))
rows = cursor.fetchall()
# ... data processing
After refactoring (getting data through DataService):
class ChartWidget(QWidget): # illustrative code, not production code
    def __init__(self, data_service: DataService):
        super().__init__()
self.data_service = data_service
def load_kline_data(self, symbol: str, days: int = 30):
# retrieve data through the interface without depending on the underlying implementation
self.kline_data = self.data_service.get_kline(
symbol=symbol,
days=days,
interval='1m'
)
self.update()
# data service interface
class DataService(ABC):
@abstractmethod
def get_kline(self, symbol: str, days: int, interval: str) -> List[KLine]:
pass
# implementation
class SQLiteDataService(DataService):
def get_kline(self, symbol: str, days: int, interval: str) -> List[KLine]:
# database access logic is centralized here
...
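To make the testability claim concrete, here is a minimal sketch of a windowless test built on the interfaces above. FakeDataService and make_kline_fixture are hypothetical names, and a Qt test harness (such as pytest-qt, so that a QWidget can be instantiated) is assumed:
# illustrative code, not production code
class FakeDataService(DataService):
    """In-memory stand-in: verifies data semantics without SQLite or a real window."""
    def __init__(self, kline_fixture: List[KLine]):
        self._fixture = kline_fixture
    def get_kline(self, symbol: str, days: int, interval: str) -> List[KLine]:
        return self._fixture

def test_chart_consumes_kline_without_database(qtbot):  # qtbot from pytest-qt
    bars = make_kline_fixture(n=30)  # hypothetical fixture helper
    widget = ChartWidget(data_service=FakeDataService(bars))
    widget.load_kline_data("HSI", days=30)
    assert len(widget.kline_data) == 30  # the UI only consumes; it never queries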
Refactoring results
Quantitative benefits:
| Metric | Before | After | Change |
|---|---|---|---|
| Files affected by a data-source change | Many | 1 | Significantly reduced |
| Unit test coverage | Lower | Higher | Substantial improvement |
| Data-related bugs (per month) | More | Fewer | Significantly reduced |
| New-feature development speed | Slower | Faster | Significant improvement |
Non-quantified benefits:
- The team dares to change the code (psychological safety improved)
- Newcomer onboarding time dropped from 2 weeks to 3 days
- Data-layer issues no longer distract code review
Architect Review: Symptoms, Triggers, Verification and Residual Costs
| Review field | Judgment readers should take away |
|---|---|
| Symptom | ChartWidget handles database access, K-line conversion, redraw, and interaction at once; any GUI adjustment can change data semantics. |
| Trigger signal | Switching a data source means touching multiple UI files, data-related bugs are hard to reproduce, and unit tests must mock the GUI, SQLite, and the window life cycle. |
| Before refactoring | The UI is the de facto data owner; data-source swapping, trading-session attribution, and indicator inputs are scattered across widgets. |
| After refactoring | DataService owns data semantics and data-source adaptation; ChartWidget consumes structured data through the interface only; tests bypass the GUI and directly verify trading sessions, gap filling, and data-source swaps. |
| Decision basis | This is not directory tidying but moving "can the data be trusted" out of interface code. As long as data boundaries are not independent, backtesting, live trading, and performance optimization keep contaminating one another inside the same UI component. |
| Verification result | The impact of a data-source swap converges from many UI files into the DataService implementation; core data-conversion logic is covered by windowless unit tests; Code Review can examine data boundaries in isolation. |
| Residual cost | Abstract interfaces add initialization and dependency-injection cost, and the team must uphold the interface contract so business shortcuts do not creep back into ChartWidget. |
| Rollback strategy | Keep the old read path for a while, confirm DataService output matches via double-read comparison, then remove the old SQL entry point. |
The second refactoring: indicator calculation moves from Pandas to incremental updates
The emergence of performance bottlenecks
After the backtest feature launched, performance problems surfaced: running one month of data took 3 minutes, which seriously hurt strategy-verification efficiency.
Profiling results:
Total time: 180.5s
Breakdown:
- indicator calculation: 145.2s (80.4%)
- MA5/MA10: 42.1s
- RSI: 38.7s
- MACD: 35.4s
- other: 29.0s
- data loading: 25.3s (14.0%)
- signal generation: 6.8s (3.8%)
- other: 3.2s (1.8%)
The offending code:
def calculate_indicators(bars: List[Bar]) -> pd.DataFrame: # illustrative code, not production code
"""full calculation of all indicators - performance bottleneck"""
df = pd.DataFrame(bars)
# recompute every historical indicator on each backtest
df['ma5'] = df['close'].rolling(window=5).mean()
df['ma10'] = df['close'].rolling(window=10).mean()
df['ma20'] = df['close'].rolling(window=20).mean()
df['rsi'] = talib.RSI(df['close'], timeperiod=14)
df['macd'], df['macd_signal'], df['macd_hist'] = talib.MACD(
df['close'], fastperiod=12, slowperiod=26, signalperiod=9
)
return df
The problem: the backtest advances bar by bar, and every new candlestick triggers a recalculation of all historical indicators. 10,000 candlesticks × 10 indicators = 100,000 full recalculations.
Refactoring plan: an incremental computation architecture
Core Insight:
Indicator calculations fall into two broad categories:
- History-dependent (such as RSI): needs long history and cannot be made fully incremental
- Sliding-window (such as MA): needs only the latest N K-lines and can be made incremental
The state diagram here answers "what state does an incremental indicator maintain?" Readers need not memorize every technical indicator; it is enough to grasp the core of IncrementalMA: window-not-yet-full, window sliding, continuous update, gap recovery, and rebuild-after-exception are distinct states that a plain function call cannot paper over.
Core code implementation:
from abc import ABC, abstractmethod # illustrative code, not production code
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class IndicatorState:
"""indicator state base class"""
timestamp: datetime
value: float
class IncrementalIndicator(ABC):
"""incremental indicator interface"""
@abstractmethod
def update(self, bar: Bar) -> IndicatorState:
"""accept a new bar and return the latest indicator value"""
pass
@abstractmethod
def reset(self):
"""Reset internal indicator state."""
pass
class IncrementalMA(IncrementalIndicator):
"""incremental moving average - sliding-window implementation"""
def __init__(self, period: int):
self.period = period
self.window: List[float] = []
self.sum = 0.0
def update(self, bar: Bar) -> IndicatorState:
price = bar.close
self.window.append(price)
self.sum += price
# window slides forward
if len(self.window) > self.period:
self.sum -= self.window.pop(0)
        # full window: exact MA; otherwise the mean of the bars seen so far
if len(self.window) == self.period:
ma = self.sum / self.period
else:
ma = self.sum / len(self.window)
return IndicatorState(
timestamp=bar.datetime,
value=ma
)
def reset(self):
self.window.clear()
self.sum = 0.0
class IncrementalRSI(IncrementalIndicator):
"""RSI implemented with incremental gain/loss state."""
def __init__(self, period: int = 14):
self.period = period
self.prev_close: Optional[float] = None
self.gain_sum = 0.0
self.loss_sum = 0.0
self.gains: List[float] = []
self.losses: List[float] = []
def update(self, bar: Bar) -> IndicatorState:
if self.prev_close is None:
self.prev_close = bar.close
return IndicatorState(bar.datetime, 50.0) # neutral value
change = bar.close - self.prev_close
gain = max(change, 0)
loss = abs(min(change, 0))
self.gains.append(gain)
self.losses.append(loss)
        # accumulate plain sums until the first window fills, then switch to Wilder's smoothing
if len(self.gains) <= self.period:
self.gain_sum += gain
self.loss_sum += loss
else:
# Wilder's smoothing
self.gain_sum = (self.gain_sum * (self.period - 1) + gain) / self.period
self.loss_sum = (self.loss_sum * (self.period - 1) + loss) / self.period
self.gains.pop(0)
self.losses.pop(0)
self.prev_close = bar.close
if self.loss_sum == 0:
rsi = 100.0
else:
rs = self.gain_sum / self.loss_sum
rsi = 100 - (100 / (1 + rs))
return IndicatorState(bar.datetime, rsi)
def reset(self):
self.prev_close = None
self.gain_sum = 0.0
self.loss_sum = 0.0
self.gains.clear()
self.losses.clear()
Benefit comparison of refactoring
Performance test results:
| Scenario | Before | After | Improvement |
|---|---|---|---|
| 10,000 K-lines | 180.5s | 3.2s | 56x |
| 50,000 K-lines | 892.3s | 14.8s | 60x |
| Memory usage | 1.2GB | 180MB | 6.7x less |
Complexity analysis:
- Old complexity: O(n × m × k), where n = number of K-lines, m = number of indicators, k = average history length scanned per recalculation
- New complexity: O(n × m) — each bar updates each indicator exactly once against maintained state
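A micro-benchmark makes this complexity gap visible. The sketch below reuses names from the snippets in this section (calculate_indicators, IncrementalMA, load_test_data); absolute times depend on hardware, so only the shape of the comparison matters:
# illustrative code, not production code
import time

def benchmark(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

bars = load_test_data('hsi_1m_10000.csv')

def run_full():
    # O(n^2 * m) shape: recompute the whole history for every new bar
    for i in range(1, len(bars) + 1):
        calculate_indicators(bars[:i])

def run_incremental():
    # O(n) shape per indicator: each bar updates a small state exactly once
    ma = IncrementalMA(5)
    for bar in bars:
        ma.update(bar)

benchmark("full recalculation", run_full)
benchmark("incremental update", run_incremental)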
Key decision points for refactoring
**Decision 1: Should all indicators become incremental?**
No. Some indicators (such as Bollinger Band width or ATR) are simpler to compute in full and contribute little to the total cost. Optimize only the hotspots that profiling reveals.
**Decision 2: How do we verify that the refactoring is correct?**
def test_indicator_consistency(): # illustrative code, not production code
"""verify that incremental calculation matches full recalculation"""
bars = load_test_data('hsi_1m_10000.csv')
    # full recalculation (reference implementation)
df = pd.DataFrame(bars)
expected_ma = df['close'].rolling(5).mean()
# incremental calculation
ma = IncrementalMA(5)
actual_ma = [ma.update(bar).value for bar in bars]
    # compare within a small tolerance
for i, (exp, act) in enumerate(zip(expected_ma, actual_ma)):
if not pd.isna(exp):
assert abs(exp - act) < 1e-10, f"Bar {i}: expected {exp}, got {act}"
**Decision 3: When do we give up on increments and fall back to full recalculation?**
If the data has gaps (such as missing bars), the incremental state may become invalid. The strategy (see the sketch after this list):
- Check data continuity
- Trigger a full recalculation when the data is discontinuous
- Record how often recalculation triggers; frequent triggers indicate data-quality problems
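A minimal sketch of that fallback strategy, under the assumption of a fixed bar interval (a real system must also whitelist legitimate session gaps such as overnight breaks); GuardedIndicator and its parameters are hypothetical names:
# illustrative code, not production code
from datetime import timedelta

class GuardedIndicator:
    """Wraps an incremental indicator with a continuity check; names are hypothetical."""
    def __init__(self, indicator: IncrementalIndicator,
                 bar_interval: timedelta = timedelta(minutes=1)):
        self.indicator = indicator
        self.bar_interval = bar_interval
        self.prev_ts = None
        self.recalc_count = 0  # frequent recalculations signal data-quality problems

    def update(self, bar: Bar, history: List[Bar]) -> IndicatorState:
        if self.prev_ts is not None and bar.datetime - self.prev_ts != self.bar_interval:
            # gap detected: incremental state is invalid, replay from history
            self.recalc_count += 1
            self.indicator.reset()
            for old_bar in history:
                self.indicator.update(old_bar)
        self.prev_ts = bar.datetime
        return self.indicator.update(bar)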
Architect Review: Symptoms, Triggers, Verification and Residual Costs
| Review field | Judgment readers should take away |
|---|---|
| Symptom | Profiling shows indicator calculation dominating run time; each K-line advanced in the backtest rescans the historical window, so the problem grows with sample size. |
| Trigger signal | A one-month backtest takes minutes, so strategy parameter tuning gets no fast feedback; Pandas full recalculation is simple, but it turns each new K-line into a recomputation of the entire history. |
| Before refactoring | The indicator function converts bars into a DataFrame and rolls over it in full; state hides inside the DataFrame computation, with no shared model for replay recovery or true incremental updates. |
| After refactoring | IncrementalMA maintains the window, running sum, and current state explicitly; a new bar only updates the sliding window; gaps, replay recovery, and reset become testable. |
| Decision basis | Optimize only the hottest paths profiling reveals; do not force every indicator into incremental form. Sliding-window indicators are incrementalized first, while strongly history-dependent indicators with little benefit stay as reference implementations. |
| Verification result | Run the Pandas reference and the incremental implementation over the same historical bars and compare results bar by bar; use benchmarks to prove 10,000 K-lines dropped from minutes to seconds. |
| Residual cost | Incremental state raises implementation complexity: initialization, resume-from-breakpoint, missing bars, and replay recovery all need independent tests, or the performance win hides correctness risk. |
| Rollback strategy | Keep the full Pandas implementation as the reference path; when the continuity check fails or incremental state misbehaves, fall back to full recalculation and record how often it triggers. |
The third refactoring: splitting the chart component
The “God class” disaster
As the project grew, chart_widget.py became a “God class”: data acquisition, chart rendering, and user interaction were all mixed together, making it hard to maintain and test.
The responsibility diagram answers "who owns what after the split?" If a large file is merely cut into small files while event subscription, refresh order, and state ownership remain unclear, readers see directory changes rather than architectural improvement.
Mixed Responsibilities:
- Data acquisition (database query, cache management)
- Chart rendering (K-line, indicator, order, grid)
- User interaction (mouse, keyboard, scroll wheel)
- Business logic (price formatting, alignment calculation)
Team Impact:
- Long change lead time (a lot of code has to be understood first)
- High bug-introduction rate (changes easily break other features)
- Low test coverage (too much coupled logic, hard to test)
Refactoring plan: MVC layered architecture
File splitting:
| Original file | After split | Responsibility |
|---|---|---|
| chart_widget.py (large file) | chart_widget.py | Component coordination |
| | data_manager.py | Data management |
| | indicator_manager.py | Indicator calculation |
| | chart_renderer.py | Chart rendering |
| | overlay_renderer.py | Order rendering |
| | interaction_controller.py | User interaction |
| | navigator.py | View navigation |
| Total | Multiple focused files | Clearer responsibilities |
Core code refactoring example:
Before refactoring (ChartWidget does everything):
class ChartWidget(QWidget): # illustrative code, not production code
def __init__(self):
self.data = []
self.cached_indicators = {}
self.zoom_level = 1.0
self.pan_offset = 0
def mousePressEvent(self, event):
        # interpret the click as a price selection
if event.button() == Qt.LeftButton:
price = self.y_to_price(event.y())
self.selected_price = price
self.update()
def paintEvent(self, event):
        # draw the candles
painter = QPainter(self)
for i, bar in enumerate(self.data):
x = self.index_to_x(i)
self.draw_candle(painter, x, bar)
        # draw the indicator overlays
for name, values in self.cached_indicators.items():
self.draw_line_indicator(painter, values)
def load_data(self, symbol):
        # query the database directly
conn = sqlite3.connect('data.db')
cursor = conn.execute("SELECT * FROM bars WHERE symbol = ?", (symbol,))
self.data = cursor.fetchall()
self.calculate_indicators()
After refactoring (separation of duties):
# chart_widget.py - coordination layer # illustrative code, not production code
class ChartWidget(QWidget):
    def __init__(self):
        super().__init__()
self.data_manager = DataManager()
self.indicator_manager = IndicatorManager()
self.renderer = ChartRenderer()
self.controller = InteractionController(self)
def set_symbol(self, symbol: str):
data = self.data_manager.load(symbol)
indicators = self.indicator_manager.calculate(data)
self.renderer.set_data(data, indicators)
self.update()
# data_manager.py - data management
class DataManager:
def __init__(self):
self.cache = DataCache()
self.source = SQLiteDataSource()
def load(self, symbol: str, timeframe: str = '1m') -> List[Bar]:
if self.cache.has(symbol, timeframe):
return self.cache.get(symbol, timeframe)
data = self.source.fetch(symbol, timeframe)
self.cache.set(symbol, timeframe, data)
return data
# chart_renderer.py - rendering
class ChartRenderer:
    def __init__(self):
        self.candle_renderer = CandleRenderer()
        self.indicator_renderer = IndicatorRenderer()
        self.data = []
        self.indicators = {}
    def set_data(self, data, indicators):
        # the coordination layer pushes prepared data in; the renderer never fetches it
        self.data = data
        self.indicators = indicators
    def render(self, painter: QPainter, rect: QRect):
        self.candle_renderer.render(painter, self.data)
        self.indicator_renderer.render(painter, self.indicators)
# interaction_controller.py - user interaction
class InteractionController:
def __init__(self, widget: ChartWidget):
self.widget = widget
self.navigator = ViewNavigator()
def handle_mouse_press(self, pos: QPoint):
if self.widget.hit_test(pos):
self.navigator.start_drag(pos)
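The glue between these modules deserves one more sketch, because the coordination layer is where event order stays explicit. The following is a minimal assumption of how ChartWidget could forward Qt events without owning any drawing or interaction logic itself (on_data_changed is a hypothetical slot name):
# illustrative code, not production code
class ChartWidget(QWidget):
    # ... __init__ as above ...
    def paintEvent(self, event):
        # the widget only forwards; drawing decisions live in the renderer
        painter = QPainter(self)
        self.renderer.render(painter, self.rect())
        painter.end()

    def mousePressEvent(self, event):
        # interaction is interpreted by the controller, never by paint code
        self.controller.handle_mouse_press(event.pos())

    def on_data_changed(self):
        # a single refresh entry point keeps the event -> render order explicit
        self.update()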
Refactoring results
Quantitative benefits:
| Metric | Before | After | Change |
|---|---|---|---|
| Total lines per file | More | Leaner | Clearer code |
| Average function length | Longer | Shorter | Better readability |
| Unit test coverage | Lower | Higher | Substantial improvement |
| Change lead time | Longer | Shorter | Higher efficiency |
| Bug introduction rate | Higher | Lower | Higher quality |
Non-quantified benefits:
- Team psychological safety: from "afraid to change" to "daring to change"
- Code-review time: reduced from 1 hour to 15 minutes
- Newcomer orientation time: reduced from 3 days to 2 hours
Architect Review: Symptoms, Triggers, Verification and Residual Costs
| Review field | Judgment readers should take away |
|---|---|
| Symptom | chart_widget.py became a God object, mixing Model, Renderer, Controller, and business formatting; a local change requires understanding drawing, data, and interaction all at once. |
| Trigger signal | Code reviews take longer, newcomers cannot locate bugs quickly, event subscription and refresh order keep getting touched, and test coverage is blocked by the God class. |
| Before refactoring | ChartWidget directly manages data, cached indicators, zoom, pan, mouse events, and paint events, with no boundary expressed as an explicit interface. |
| After refactoring | DataManager owns data, IndicatorManager owns indicators, ChartRenderer owns drawing, InteractionController owns user actions; event subscription and refresh order are wired explicitly through the coordination layer. |
| Decision basis | The split is not about file count but about giving each kind of change its own landing point: data changes don't touch mouse interaction, interaction changes don't touch indicator calculation, and rendering optimization doesn't change business semantics. |
| Verification result | Model and Renderer are unit-tested separately; the interaction controller is tested with event sequences; Code Review can be split along boundaries, shrinking the context each review must load. |
| Residual cost | With more interfaces, event order, cache invalidation, and refresh throttling must form a protocol, or the system turns from "one big class that is hard to understand" into "a cluster of small classes with implicit dependencies". |
| Rollback strategy | Keep the old ChartWidget external API and let the new modules take over internally first; if interaction regressions fail, switch back to the old rendering path while keeping the data-model split. |
The fourth refactoring: chart virtualized-rendering optimization
Performance bottleneck: stuttering charts at large data volumes
After the MVC split, the chart component's structure was clearer, but new performance bottlenecks appeared when handling large volumes of K-line data:
Problem scenario:
- Loading 10,000 K-lines took 800ms of initial rendering; users perceived obvious lag
- Dragging through history re-rendered every K-line in the visible area on each redraw
- During zooming, full redraws dropped the frame rate below 15 FPS
- Memory usage grew linearly with data volume, exceeding 500MB after long sessions
Performance profiling data (10,000-bar K-line chart):
- initial render time: 780ms
- drag latency: 120ms per frame
- zoom repaint time: 350ms
- memory usage: 485MB
- GPU texture upload: 120ms (bottleneck)
Root causes:
- Full rendering: all data is computed and rendered regardless of how many K-lines the viewport shows
- No caching: candlestick geometry is recomputed every frame
- GPU texture churn: huge K-line counts cause textures to be allocated and released constantly
Refactoring plan: sliding window + virtualized rendering
Core strategies:
- Sliding window: keep only the visible area plus a buffer in memory (for example, the viewport shows 200 bars while 400 are actually loaded)
- Virtualized rendering: compute and render only the K-lines inside the viewport
- Offscreen cache: pre-render a fixed region and pan instead of redrawing while dragging
The pipeline diagram answers why virtualized rendering is more than "drawing a few fewer K-lines": the user's drag or zoom changes the viewport, the viewport determines the buffer, the buffer determines whether the offscreen cache can be reused, and texture reuse determines whether frequent GPU texture allocation is avoided.
Core code implementation:
1. Sliding Window Manager
# illustrative code, not production code
@dataclass
class WindowConfig:
"""configuration"""
viewport_size: int = 200 # visible area K
buffer_ratio: float = 0.5 # buffer zone( 50%)
min_buffer: int = 50 # buffer zone
class SlidingWindowManager:
"""sliding window manager - decides which data should be loaded into memory"""
def __init__(self, config: WindowConfig = None):
self.config = config or WindowConfig()
self._full_data: List[Bar] = []
self._window_start = 0
self._window_end = 0
def set_data(self, data: List[Bar]):
"""set the full data source"""
self._full_data = data
        self._recalculate_window(0)  # start the window at the beginning of the data
def move_to_index(self, center_index: int):
"""location"""
self._recalculate_window(center_index)
def get_window_data(self) -> Tuple[List[Bar], int, int]:
"""get the current window data and its offset in the full dataset"""
return (
self._full_data[self._window_start:self._window_end],
self._window_start,
self._window_end
)
def _recalculate_window(self, center_index: int):
"""range"""
total = len(self._full_data)
buffer_size = max(
int(self.config.viewport_size * self.config.buffer_ratio),
self.config.min_buffer
)
        # window = viewport plus buffer on each side of the center
half_viewport = self.config.viewport_size // 2
self._window_start = max(0, center_index - half_viewport - buffer_size)
self._window_end = min(total, center_index + half_viewport + buffer_size)
def should_reload(self, new_center: int) -> bool:
"""needdata"""
buffer_threshold = self.config.min_buffer // 2
current_center = (self._window_start + self._window_end) // 2
        # reload once the center drifts beyond half the minimum buffer
return abs(new_center - current_center) > buffer_threshold
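How the pieces interact on a drag can be sketched in a few lines; on_user_drag is a hypothetical handler, and a real implementation would also invalidate the offscreen cache when the window moves:
# illustrative code, not production code
def on_user_drag(window_manager: SlidingWindowManager, new_center_index: int):
    """Drag-handler sketch: reload window data only when the center drifts far enough."""
    if window_manager.should_reload(new_center_index):
        window_manager.move_to_index(new_center_index)  # slide the window + buffer
        # the old range's cached geometry/pixmaps are now stale and must be invalidated
    # actual drawing happens on the next paint event via the virtualized renderer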
2. Virtualized renderer
# illustrative code, not production code
class VirtualizedChartRenderer:
"""virtualized chart renderer - render only the visible area"""
def __init__(self, window_manager: SlidingWindowManager):
self.window_manager = window_manager
self._offscreen_cache = OffscreenCache()
self._geometry_cache: Dict[int, CandleGeometry] = {}
def render(self, painter: QPainter, rect: QRect, offset_x: float):
"""
render the chart
:param offset_x: ()
"""
        # current window data and its offset into the full dataset
window_data, data_start_idx, _ = self.window_manager.get_window_data()
        # index range of the visible area
visible_indices = self._calculate_visible_indices(
offset_x, len(window_data), rect.width()
)
        # check whether the offscreen cache can be reused
if self._offscreen_cache.is_valid(visible_indices, offset_x):
            # fast path: draw straight from the cache
self._offscreen_cache.draw(painter, rect, offset_x)
return
# rerender the visible area when the cache is invalid
self._render_visible_area(
painter, rect, window_data, visible_indices, data_start_idx
)
        # refresh the offscreen cache
self._offscreen_cache.update(
painter.device(), visible_indices, offset_x
)
def _calculate_visible_indices(self, offset_x: float,
data_count: int, viewport_width: int) -> slice:
"""calculate the index range for the current visible area"""
        candle_width = 8  # each candle is 8 px wide
        spacing = 2  # 2 px gap between candles
total_width = candle_width + spacing
# account for the shifted start index
start_idx = max(0, int(-offset_x / total_width))
        visible_count = int(viewport_width / total_width) + 2  # +2 covers partially visible edge candles
end_idx = min(data_count, start_idx + visible_count)
return slice(start_idx, end_idx)
def _render_visible_area(self, painter: QPainter, rect: QRect,
data: List[Bar], visible: slice, data_offset: int):
"""render only the visible area"""
for i in range(visible.start, visible.stop):
if i >= len(data):
break
bar = data[i]
geometry = self._get_or_create_geometry(
i + data_offset, bar, rect.height()
)
self._draw_candle(painter, geometry, i - visible.start)
def _get_or_create_geometry(self, global_idx: int, bar: Bar,
height: int) -> CandleGeometry:
"""create K (cache)"""
if global_idx not in self._geometry_cache:
self._geometry_cache[global_idx] = self._calculate_geometry(bar, height)
return self._geometry_cache[global_idx]
3. Off-screen caching system
# illustrative code, not production code
class OffscreenCache:
"""offscreen cache - pre-render and store as a texture, then translate directly during dragging"""
def __init__(self, cache_size: int = 2048):
self.cache_size = cache_size
self._pixmap: Optional[QPixmap] = None
self._valid_range: Optional[slice] = None
self._cached_offset: float = 0.0
def update(self, source: QPaintDevice, visible_range: slice, offset: float):
"""update cached content"""
if self._pixmap is None or self._pixmap.size().width() != self.cache_size:
self._pixmap = QPixmap(self.cache_size, source.height())
# pre-render a range larger than the visible area
painter = QPainter(self._pixmap)
# ... rendering logic...
painter.end()
self._valid_range = visible_range
self._cached_offset = offset
def is_valid(self, current_range: slice, current_offset: float) -> bool:
"""checkcache"""
if self._pixmap is None or self._valid_range is None:
return False
        # the cache is reusable only while the viewport stays near the cached offset
offset_diff = abs(current_offset - self._cached_offset)
return offset_diff < 50 # reuse cache only for small viewport shifts
def draw(self, painter: QPainter, rect: QRect, offset: float):
"""cache(support)"""
if self._pixmap is None:
return
# source area to copy from the cached pixmap
source_x = int(offset - self._cached_offset)
source_rect = QRect(source_x, 0, rect.width(), rect.height())
# draw cached content
painter.drawPixmap(rect, self._pixmap, source_rect)
4. GPU texture management
# illustrative code, not production code
class GPUTextureManager:
"""GPU texture manager for allocation and release."""
def __init__(self, max_textures: int = 10):
self.max_textures = max_textures
self._texture_pool: List[QOpenGLTexture] = []
self._active_textures: Dict[str, QOpenGLTexture] = {}
self._lru_order: List[str] = []
def acquire_texture(self, key: str, width: int, height: int) -> QOpenGLTexture:
"""texture(reuse)"""
if key in self._active_textures:
            # refresh LRU order (mark this key as most recently used)
self._lru_order.remove(key)
self._lru_order.append(key)
return self._active_textures[key]
        # no cached texture: reuse one from the pool or create a new one
if len(self._texture_pool) > 0:
texture = self._texture_pool.pop()
texture.setSize(width, height)
texture.allocateStorage()
else:
texture = QOpenGLTexture(QOpenGLTexture.Target2D)
texture.setSize(width, height)
texture.setFormat(QOpenGLTexture.RGBA8_UNorm)
texture.allocateStorage()
self._active_textures[key] = texture
self._lru_order.append(key)
        # over the limit: evict the least-recently-used texture back to the pool
if len(self._active_textures) > self.max_textures:
lru_key = self._lru_order.pop(0)
old_texture = self._active_textures.pop(lru_key)
self._texture_pool.append(old_texture)
return texture
Refactoring results
Performance improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Initial render time | 780ms | 45ms | 17x |
| Drag latency | 120ms/frame | 8ms/frame | 15x |
| Zoom redraw time | 350ms | 25ms | 14x |
| Memory usage | 485MB | 85MB | 5.7x |
| Frame rate (while dragging) | 15 FPS | 60 FPS | 4x |
Optimization strategy comparison:
| Optimization | Implementation | Effect |
|---|---|---|
| sliding window | Only load viewport + buffer | Memory reduced by 5.7x |
| Virtualized rendering | Only render visible area | Render time reduced by 17x |
| Off-screen caching | Pre-render a large range and pan when dragging | Smooth dragging 60 FPS |
| Geometry cache | Cache K-line shape calculation | 60% reduction in CPU usage |
| GPU texture pool | Reuse texture objects | Reduce GC pauses |
Core insight:
The essence of chart performance optimization is eliminating wasted computation:
- Data level: Use sliding windows to control the amount of data in memory
- Rendering Level: Use virtualization to draw only the visible area
- Interactive level: Use caching to avoid double calculations
- Hardware level: Use texture pools to reduce GPU memory allocation
This solution lets the chart handle 100,000+ K-lines smoothly, laying the groundwork for high-frequency real-time display.
Architect Review: Symptoms, Triggers, Verification and Residual Costs
| Review field | Judgment readers should take away |
|---|---|
| Symptom | The structure is clearer after the MVC split, but initial rendering, dragging, zooming, and real-time updates still trigger far too much drawing; users see lag, not clean boundaries. |
| Trigger signal | Dragging drops frames at 10,000 K-lines, zoom redraws take hundreds of milliseconds, and GPU texture uploads plus geometry computation become the new bottleneck. |
| Before refactoring | The renderer conflates the full data pool with the visible screen area, reprocessing masses of invisible candles on every viewport change. |
| After refactoring | VirtualizedCandleRenderer starts from the viewport and loads only viewport + buffer; the offscreen cache supports pan reuse, and texture pooling cuts GPU allocation cost. |
| Decision basis | The bottleneck has moved from the data layer to the view layer; further optimizing indicator calculation cannot improve the drag experience. The rendering path must revolve around the visible window, not the full history. |
| Verification result | Initial render time, drag latency, zoom redraw, memory usage, and frame rate all enter the benchmark together; success is judged by user-perceivable interaction latency. |
| Residual cost | Virtualization introduces cache invalidation, edge-candle truncation, coordinate mapping, and real-time refresh-ordering issues; tests must cover window boundaries, rapid zooming, and live-quote appends. |
| Rollback strategy | Keep the non-virtualized renderer as a low-volume fallback; when viewport math misbehaves or the cache thrashes, temporarily switch back to direct rendering and record the trigger conditions. |
The fifth refactoring: separating the multi-process architecture
Performance bottlenecks from Python's GIL
At a certain stage the project hit an inherent Python bottleneck: the GIL (Global Interpreter Lock).
Problem scenario:
- Indicator calculation pegs the CPU at 100%, but only one core is usable
- The UI freezes during data recording because SQLite writes block the main thread
- ATR and daily-period calculations slow down real-time quote processing
Performance analysis data:
- 8-core CPU, but the Python process uses only 12.5% overall: a single saturated core
- indicator calculation: 150ms, blocking the UI thread
- tick recording: at roughly 50 ticks per second, 3-5 ticks are dropped every second
Multi-process architecture design
**Core decision: offload CPU-intensive tasks to independent child processes.**
The process-topology diagram answers why multithreading is not enough: before the refactoring, the UI, indicators, recording, and backtesting were split across threads but still contended for the GIL, event loop, and logging context inside one Python process; after the refactoring, UI, computation, recording, and backtesting live in separate processes, and IPC + shared_memory carries only the necessary data exchange and control signals.
Detailed explanation of child process responsibilities
| Child process | Responsibility | Communication | Trigger |
|---|---|---|---|
| ComputeClient | ATR calculation, daily-period minute calculation | Pipe + shared memory | Driven by real-time quotes |
| IndicatorWorkerPool | Parallel indicator calculation (N workers) | Queue + shared memory | On data update |
| OfflineWorkerPool | Offline tasks (full calculation, precomputation) | Queue + LMDB | User-triggered / scheduled |
| DataRecorder | Tick/K-line recording to the database | Pipe + shared memory | Starts automatically on quote subscription |
| Backtest | Independent backtesting engine | Queue + shared memory | On backtest request |
| Trading | Live trading module | Queue | On trading instruction |
Inter-process communication mechanism
1. Shared Memory
High-frequency data exchange achieves zero copy through shared memory:
# ComputeClient starts the Worker Process # illustrative code, not production code
class ComputeClient:
def start_worker(self, gds_1m_shm_name: Optional[str] = None):
        # create the parent-exit watch Pipe (REQ-NF-13)
parent_reader, parent_writer = Pipe(duplex=False)
self._parent_pipe_writer = parent_writer
# create IPC Pipe
ipc_child_conn, ipc_parent_conn = Pipe(duplex=True)
self._ipc_parent_conn = ipc_parent_conn
        # spawn the worker process (a Qt event loop runs inside the worker)
self._worker_process = Process(
target=run_compute_worker_with_qt,
args=(prefix, self._worker_stop_event),
kwargs={
"gds_1m_shm_name": self._gds_1m_shm_name,
"ipc_conn": ipc_child_conn,
"parent_pipe_reader": parent_reader,
},
            daemon=True,  # the worker exits together with the main process
)
self._worker_process.start()
2. ATR calculation process
Once ATR calculation moves out of the main process, the point is not that it runs "faster" but that real-time quotes are no longer blocked by CPU-intensive work. The main process only submits tasks, reads results, and handles exceptions; the child process does the computing; shared_memory carries the high-frequency data exchange; and the logging chain ties request, process, and result together.
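A minimal sketch of that round trip from the main-process side; the slot names, payload layout, and control messages here are assumptions modeled on the shared-memory slots shown later in this section, not the project's real protocol:
# illustrative code, not production code
import struct
from multiprocessing import shared_memory

def submit_atr_request(values, shm_prefix: str, ipc_conn):
    """Write inputs into shared memory, signal the worker over the Pipe, read the result."""
    atr_in = shared_memory.SharedMemory(name=f"{shm_prefix}_atr_in")
    atr_out = shared_memory.SharedMemory(name=f"{shm_prefix}_atr_out")
    try:
        payload = struct.pack(f"<{len(values)}d", *values)  # must fit the slot size
        atr_in.buf[:len(payload)] = payload                 # zero-copy handoff of inputs
        ipc_conn.send(("atr", len(values)))                 # control signal via the Pipe
        if ipc_conn.poll(1.0):                              # wait briefly for the worker
            ipc_conn.recv()                                 # worker ack: result is ready
            (atr_value,) = struct.unpack_from("<d", atr_out.buf)
            return atr_value
        return None  # timeout: the caller degrades instead of blocking the UI
    finally:
        atr_in.close()
        atr_out.close()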
3. Data recording sub-process
Data recording runs independently and receives Ticks through the shared memory RingBuffer:
# Recorder subprocess # illustrative code, not production code
def run_recorder_subprocess(
event_queue_main_to_sub: Queue,
event_queue_sub_to_main: Queue,
tick_shm_symbols: List[str],
):
# 1. create QApplication
app = QApplication([])
# 2. create RecorderMainFacade
facade = RecorderMainFacade(sub_to_main_queue=event_queue_sub_to_main)
    # 3. start the TickShmReader thread
tick_thread = Thread(
target=_run_tick_shm_thread,
args=(recorder_engine, tick_shm_symbols),
daemon=True
)
tick_thread.start()
    # 4. start the event-forwarding thread
forward_thread = Thread(
target=_run_event_forward_thread,
args=(event_queue_main_to_sub, event_engine, facade),
daemon=True
)
forward_thread.start()
    # 5. run the Qt command loop until stopped
_run_command_loop(app, recorder_engine, facade)
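The tick-reader thread referenced above can be sketched as a polling loop; TickShmReader, try_read, on_tick, and stopped are assumed interfaces, since the real ring-buffer layout is not shown here:
# illustrative code, not production code
import time

def _run_tick_shm_thread(recorder_engine, tick_shm_symbols):
    """Poll per-symbol ring buffers and hand ticks to the recording pipeline."""
    readers = {sym: TickShmReader(sym) for sym in tick_shm_symbols}  # assumed helper
    while not recorder_engine.stopped:
        idle = True
        for sym, reader in readers.items():
            tick = reader.try_read()  # non-blocking read from the shared-memory ring buffer
            if tick is not None:
                recorder_engine.on_tick(tick)
                idle = False
        if idle:
            time.sleep(0.001)  # back off briefly when no new ticks arrive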
Refactoring benefits
Performance improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| CPU utilization | 12.5% (single core) | 85% (multi-core) | 6.8x |
| Indicator calculation latency | 150ms | 25ms | 6x |
| UI frame rate | 15 FPS | 60 FPS | 4x |
| Tick loss rate | 6-10% | < 1% | 10x |
| Impact of offline tasks | Blocks the UI | Runs in the background | Imperceptible |
Architectural advantages:
- GIL bypass: CPU-intensive tasks run in child processes and fully use multiple cores
- UI responsiveness: the main process focuses on the UI, unaffected by background tasks
- Fault isolation: a crashed child process does not take down the main program
- Independent scaling: each child process scales up or down with its own load
Key points of technical implementation
1. Parent-process exit watchdog (REQ-NF-13)
def _run_parent_exit_watchdog(parent_pipe_reader: Connection, stop_event: Event): # illustrative code, not production code
"""process, process"""
try:
# block until the parent process closes the pipe
parent_pipe_reader.recv()
except EOFError:
        # pipe EOF: the parent exited, signal the worker to stop
stop_event.set()
2. Shared memory data format
class ComputeShmBackend:  # illustrative code, not production code
    """Shared-memory slots for exchanging data between the main process and the Worker."""
    def create_slots(self):
        # daily-period calculation input/output
        self._daily_in = shared_memory.SharedMemory(
            name=f"{self.name_prefix}_daily_in", create=True, size=64
        )
        self._daily_out = shared_memory.SharedMemory(
            name=f"{self.name_prefix}_daily_out", create=True, size=8
        )
        # ATR input/output
        self._atr_in = shared_memory.SharedMemory(
            name=f"{self.name_prefix}_atr_in", create=True, size=1024
        )
        self._atr_out = shared_memory.SharedMemory(
            name=f"{self.name_prefix}_atr_out", create=True, size=16
        )
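Given these slot sizes, the worker decodes the input slot and writes a fixed-width result back. A sketch with an assumed field layout (length-prefixed doubles in, one double out) and a placeholder calculation:
# illustrative code, not production code; the field layout is an assumption
import struct

def handle_atr_request(shm_in, shm_out):
    """Read closes from the 1024-byte input slot, write one float to the output slot."""
    n = struct.unpack_from("<I", shm_in.buf, 0)[0]       # 4 + n*8 must fit in 1024 bytes
    closes = struct.unpack_from(f"<{n}d", shm_in.buf, 4)
    # placeholder: the real worker computes true range / ATR here
    result = sum(abs(closes[i] - closes[i - 1]) for i in range(1, n)) / max(n - 1, 1)
    struct.pack_into("<d", shm_out.buf, 0, result)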
3. Worker process pool management
class IndicatorWorkerPool:  # illustrative code, not production code
    def __init__(self, shm_store, num_workers: int = 4):
        self._workers: List[IndicatorWorker] = []
        self._supervisor: Optional[WorkerSupervisor] = None
        for i in range(num_workers):
            worker = IndicatorWorker(
                worker_id=f"worker_{i}",
                shared_memory_store=shm_store,
            )
            self._workers.append(worker)
        # start health monitoring; dead workers are restarted, not ignored
        self._supervisor = WorkerSupervisor(
            workers=self._workers,
            on_worker_restart=self._restart_worker,
        )
        self._supervisor.start()
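The supervisor itself only needs a liveness loop. A minimal sketch of what WorkerSupervisor might do internally (the is_alive accessor and restart-callback contract are assumptions):
# illustrative code, not production code
import time
from threading import Thread

class WorkerSupervisor:
    def __init__(self, workers, on_worker_restart, interval: float = 1.0):
        self._workers = workers
        self._on_worker_restart = on_worker_restart
        self._interval = interval
        self._thread = Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def _loop(self):
        while True:
            for worker in self._workers:
                if not worker.is_alive():            # assumed liveness accessor
                    self._on_worker_restart(worker)  # restart, don't take the pool down
            time.sleep(self._interval)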
Summary of architecture evolution
First refactor: decouple data layer and UI layer
↓
Second refactor: incremental indicator updates
↓
Third refactor: split chart components with MVC
↓
Fourth refactor: virtualized chart rendering optimization
↓
Fifth refactor: separate multi-process architecture
↓
Future plan:
- backtest subprocess, isolating backtest from live trading
- live-trading subprocess, independent trading logic
- possible microservice split, cross-machine deployment
Core takeaway:
Python's GIL is not a shackle, but a reminder to use multiple processes where multiple processes belong. The multi-process architecture of micang-trader is not over-design, but the inevitable answer to real performance problems.
Architect Review: Symptoms, Triggers, Verification and Residual Costs
| Review field | What readers should take away |
|---|---|
| Symptom | UI, indicator calculation, data recording and backtesting compete for the same Python process. The GIL prevents CPU-intensive tasks from fully utilizing multiple cores. What users see is interface lag and tick loss. |
| Trigger signal | Indicator calculation latency reached a perceptible level, the recording thread blocked the main interface, offline tasks affected live-trading monitoring, and single-thread optimization could no longer reduce the latency. |
| Before refactoring | All tasks share the main process event loop; calculation, I/O, drawing and user operations amplify each other's faults, and it is hard to tell from the logs which type of task caused a delay. |
| After refactoring | Computing, recording, backtesting and UI are separated into different failure domains; shared_memory handles high-frequency data exchange, IPC handles command and control, and logs are correlated by pid, worker_id and trace_id. |
| Decision basis | Multi-processing is not about chasing a distributed feel, but because the main process could no longer satisfy UI responsiveness, real-time market data and CPU-intensive calculation at the same time. Whenever live monitoring is slowed down by backtesting, execution domains must be isolated. |
| Verification results | CPU utilization, indicator latency, UI frame rate, tick loss rate and child-process recovery are verified together; fault injection must prove that a child-process exit cannot bring down the main interface. |
| Residual cost | Process serialization, shared_memory life cycle, exception propagation, log correlation and resource release all become new governance objects; complexity shifts from function calls to process collaboration. |
| Rollback strategy | Child-process pools are enabled one by one; offline calculations move out of the main process first, then real-time indicators; on any worker exception the main process can degrade to synchronous calculation or suspend the corresponding feature. |
Part 2: The Refactoring Decision Framework
When to refactor
Based on the experience of multiple refactorings, refactoring signals fall into three categories: must be handled immediately, should be scheduled, and can be recorded as debt. The classification looks not only at whether the code is ugly, but at whether it affects trading correctness, real-time responsiveness, backtest consistency and team maintenance risk.
Signals that demand immediate refactoring
Signal 1: change fear index > 7
Measurement method (see the sketch below):
change fear index = (failed tests after change + unexpectedly affected features) / changed lines × 100
- > 7: refactor immediately
- 3-7: schedule refactoring
- < 3: acceptable
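The formula is trivial to encode, which also makes it easy to track per change; a sketch:
# illustrative code, not production code
def change_fear_index(failed_tests: int, affected_features: int, changed_lines: int) -> float:
    """Higher means each changed line causes more breakage elsewhere."""
    if changed_lines == 0:
        return 0.0
    return (failed_tests + affected_features) / changed_lines * 100

# 3 failed tests + 2 surprise regressions across 40 changed lines -> 12.5: refactor now
print(change_fear_index(failed_tests=3, affected_features=2, changed_lines=40))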
Signal 2: single-point-of-failure risk
Checklist:
- Only one person can change a module
- When that person is on vacation, no one dares to fix bugs in the module
- The departure of a key person would stall the project
Signal 3: Tests fail to cover core logic
Reasons usually include:
- Code coupling is too high and cannot be tested independently
- Depends on external services (database, network)
- Too many side effects and unpredictable state
Signals that warrant scheduled refactoring
| Signal | Measurement method | Threshold | Action |
|---|---|---|---|
| Code duplication | jscpd or similar tool | > 20% | Extract common code |
| File line count | wc -l | > 1,000 | Split the module |
| Cyclomatic complexity | radon | > 10 | Simplify the logic |
| Function length | average line count | > 50 | Extract functions |
| Performance bottleneck | profiling | > 50% of the time | Optimize the algorithm |
Signals that can be deferred (record as debt)
- The code is ugly but works, and changes less than once per quarter
- No tests, but not core logic (e.g. one-off scripts)
- Performance is acceptable and the optimization benefit is < 10%
5 questions you must answer before refactoring
A refactoring request enters formal execution only through these gates: without a reproduction, tests, a rollback path and team capacity, you should not refactor code just because it does not look pleasing.
Question 1: What is the goal of refactoring?
Vague goals (precursors of failure):
- “Make code better”
- “Improve code quality”
- “Reduce technical debt”
Clear Goals (Measurable):
- “Split chart_widget.py from line 3847 into 4 files, < 800 lines each”
- “Reduce indicator calculation time from 180s to less than 5s”
- “Increase unit test coverage from 31% to 80%”
SMART Principles of Goal Setting:
| Principle | Example |
|---|---|
| Specific | Not “optimize performance”, but “reduce backtest time to 5s” |
| Measurable | Have clear numerical indicators |
| Achievable | Can be completed based on existing resources |
| Relevant | Relevant to business goals |
| Time-bound | Specify completion time |
Question 2: How to verify that the reconstruction is successful?
Functional Verification:
# record all test case results before refactoring # illustrative code, not production code
pre_refactor_results = run_all_tests()
# comparison after refactoring
post_refactor_results = run_all_tests()
assert post_refactor_results == pre_refactor_results
Performance Verification:
# establish the performance baseline  # illustrative code, not production code
benchmark = {
    'backtest_10k_bars': 180.5,  # seconds
    'memory_peak': 1200,         # MB
    'cpu_usage': 85              # %
}
# compare after refactoring (new_performance is measured the same way)
assert new_performance['backtest_10k_bars'] < 5
assert new_performance['memory_peak'] < 200
Code Metrics Verification:
- Duplication reduced
- Coverage increased
- Complexity reduced
- File lengths reasonable
Question 3: What are the rollback options?
Three-layer rollback strategy:
Level 1: Git (development stage)
- commit in small steps
- `git revert` on test failure
Level 2: Feature branch (integration stage)
- refactor on a dedicated branch
- merge only after the full test suite passes
- drop the branch to roll back
Level 3: Feature flag (release stage)
- old and new code paths coexist
- switch via configuration
- flip the flag back on error
Question 4: What is the input-output ratio?
Refactoring ROI formula:
refactoring benefit = saved maintenance time × expected change count
= (old maintenance time - new maintenance time) × change count in the next N months
refactoring cost = development time + testing time + risk cost
= refactoring days + testing days + (bug-fix time × bug probability)
ROI = refactoring benefit / refactoring cost
Rule of Thumb:
- ROI > 3: Highly recommended
- ROI 1.5-3: Worth considering
- ROI < 1.5: On hold
ROI calculation example for the micang-trader chart component refactoring (specific figures elided):
Benefit:
- each change becomes faster (maintenance time saved per change)
- multiple changes expected per month, recurring across the year
- annual saving: significant
Cost:
- development time: several days
- testing time: several days
- risk cost (bug-fix time × bug probability): some time
- total cost: controlled
ROI = benefit / cost > 3 (recommended)
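Encoding the ROI formula keeps the assessment honest across projects. A sketch with hypothetical figures (not the project's actual numbers):
# illustrative code, not production code; the figures are hypothetical
def refactor_roi(old_hours: float, new_hours: float, changes: int,
                 dev_days: float, test_days: float,
                 bugfix_days: float, bug_prob: float) -> float:
    """ROI per the formula above, normalizing saved hours to 8-hour days."""
    benefit_days = (old_hours - new_hours) * changes / 8
    cost_days = dev_days + test_days + bugfix_days * bug_prob
    return benefit_days / cost_days

# each change drops from 3h to 1h, 60 changes expected over the horizon,
# 5 dev days + 3 test days, 20% chance of a 2-day bug hunt
print(refactor_roi(3, 1, 60, 5, 3, 2, 0.2))  # -> ~1.8: worth considering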
Question 5: Is the team ready?
Technical preparation:
- Has complete test coverage
- There are code specifications (lint, format)
- Has CI/CD process
Process preparation:
- There is a Code Review mechanism
- There is code ownership division
- Have document maintenance habits
Mental preparation:
- The team understands the value of refactoring
- The product team accepts a short-term slowdown in feature delivery
- Have management support
Part Three: Refactoring Strategies and Techniques
Strategy 1: Strangler Fig Pattern
Core idea: Rather than rewriting the entire module at once, replace it incrementally, with the old and new code running in parallel.
Implementation steps:
Small-step migration suits trading systems better than a one-time rewrite: for high-risk boundaries such as data services, chart renderers and indicator interfaces, extract the interface first, run old and new in parallel, and close out gradually behind regression tests and rollback switches.
Practical Case: Data Service Reconstruction
Phase 1: Extracting the interface
from abc import ABC, abstractmethod  # illustrative code, not production code
from typing import List

class DataService(ABC):
    """Data service interface (Bar is the project's K-line data type)"""
    @abstractmethod
    def get_kline(self, symbol: str, days: int) -> List[Bar]:
        pass
Stage 2: Coexistence of old and new
class ChartWidget:  # illustrative code, not production code
    def __init__(self, use_new_service: bool = False):
        if use_new_service:
            self.data_service: DataService = NewDataService()
        else:
            self.data_service = LegacyDataAccess()  # adapter pattern
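The adapter in the else branch wraps the legacy access path behind the new interface, so callers never notice the difference. A sketch where the legacy class and its methods are assumed names:
class LegacyDataAccess(DataService):  # illustrative code, not production code
    """Adapter: expose the old access path through the new DataService interface."""
    def __init__(self):
        self._legacy = OldDatabaseHelper()  # assumed legacy class

    def get_kline(self, symbol: str, days: int) -> List[Bar]:
        rows = self._legacy.query_bars(symbol, days)  # assumed legacy method
        return [Bar.from_row(r) for r in rows]        # assumed converter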
Phase 3: Gradual migration
# roll out gradually, widget by widget  # illustrative code, not production code
widget1 = ChartWidget(use_new_service=True)   # migrated to the new service
widget2 = ChartWidget(use_new_service=False)  # keeps the old path for now
Phase 4: Remove old code
# after confirming the migration is complete  # illustrative code, not production code
class ChartWidget:
    def __init__(self):
        self.data_service = NewDataService()  # only the new service remains
Strategy 2: Test first
Test preparation before refactoring:
# 1. ensure all current tests pass # illustrative code, not production code
$ pytest --tb=short
# 127 passed, 0 failed
# 2. establish the performance baseline
$ python benchmark.py --save-baseline
# Baseline saved to .benchmark/baseline.json
# 3. add missing tests, especially around the refactored area
$ python -m coverage run -m pytest
$ python -m coverage report --show-missing
# add tests for modules below 80% coverage
Test Guard in Refactoring:
# use pytest-watch to rerun tests on every save ('notify' is a placeholder command)
$ pip install pytest-watch
$ ptw --onpass "notify 'Tests passed'" --onfail "notify 'Tests failed'"
Test verification after refactoring:
# behavior compatibility verification # illustrative code, not production code
def test_functional_parity():
old_result = run_with_old_code(input_data)
new_result = run_with_new_code(input_data)
assert old_result == new_result
# performance verification
def test_performance_regression():
new_time = benchmark_new_code()
baseline = load_baseline()
assert new_time < baseline * 1.1 # allow a 10% tolerance
Strategy 3: AI-assisted refactoring
Scenarios where AI is effective:
Scenario 1: Code smell analysis
Prompt:
Analyze the following code and identify 3-5 refactoring issues:
1. Point out specific issues
2. Explain why each issue matters
3. Provide refactoring suggestions
Code:
[paste code]
Scenario 2: Refactoring plan generation
Prompt:
Design a refactoring plan based on these requirements:
- Goal: split a 3000-line ChartWidget
- constraint: preserve behavior and do not break existing features
- Requirements: provide a staged implementation plan where each stage can be rolled back independently
Scenario 3: Test Generation
Prompt:
Generate unit tests for the following function:
- cover normal cases
- cover boundary cases
- cover error cases
- use pytest
Function:
[paste function]
Notes on using AI:
| ✅ Good uses | ❌ Be wary of |
|---|---|
| Identify code smells | blindly accept all suggestions |
| Generate refactoring template | Over-engineering (e.g. unnecessary factory pattern) |
| Generate test cases | Not verifying the correctness of the test |
| Explain complex code | Let AI make architectural decisions |
Part 4: Technical Debt Management
Debt Classification and Assessment
Technical Debt Matrix:
The matrix answers "which debts should be paid off first". Debt in a quantitative trading system is not an ordinary TODO list: it must weigh impact against repair cost, and debts that affect real-time safety, data correctness and backtest consistency rank above pure code-style issues.
Debt Priority Assessment Form:
| Debt item | Scope of impact | Change frequency | Repair cost | Priority |
|---|---|---|---|---|
| ChartWidget takes too much responsibility | All charting features | 4 times a week | 14 days | P0 (immediately) |
| Indicator calculation performance bottleneck | Backtest function | 20 times a day | 10 days | P0 (immediately) |
| Data layer coupling | Data related functions | 2 times a week | 5 days | P1 (this month) |
| Utility function missing type annotation | Development experience | 1 time per month | 5 days | P2 (quarterly) |
| No tests for old scripts | Stabilized | 1 time per quarter | 3 days | P3 (record) |
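One way to make the priority column reproducible is a simple score combining change frequency, blast radius and repair cost. The weighting and thresholds below are an illustration of the idea, not a formula the project prescribes:
# illustrative code, not production code; weighting and thresholds are assumptions
def debt_priority(changes_per_week: float, blast_radius: int, repair_days: float) -> str:
    """Debt touched often, with wide impact and modest repair cost, floats to the top."""
    score = changes_per_week * blast_radius / max(repair_days, 1)
    if score >= 1.0:
        return "P0"
    if score >= 0.3:
        return "P1"
    if score >= 0.05:
        return "P2"
    return "P3"

# ChartWidget: 4 changes/week, touches ~5 chart features, 14 days to fix -> P0
print(debt_priority(4, 5, 14))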
Debt visualization and tracking
technical_debt.md template:
# Technical debt list
## P0 - handle immediately, blocks development
### DEBT-001: ChartWidget has too many responsibilities
- **location**: ui/chart_widget.py
- **symptom**: one class handles data access, K-line conversion, indicator calculation, rendering and interaction
- **impact**: any small change ripples across all charting features
- **suggested solution**: split into DataManager / ChartRenderer / ChartController
- **estimated cost**: 14 days
- **expected benefit**: significantly lower change cost and regression risk
- **created date**: 2025-10-15
- **status**: 🟡 in progress
- **owner**: milome
### DEBT-002: indicator calculation performance bottleneck
- **location**: core/indicators.py
- **symptom**: backtesting 10,000 K-line bars takes 180 seconds
- **impact**: low strategy validation efficiency
- **suggested solution**: implement the IncrementalIndicator interface
- **estimated cost**: 10 days
- **expected benefit**: ~10x performance improvement
- **created date**: 2025-11-20
- **status**: 🔴 to be scheduled
- **owner**: unassigned
## P1 - handle this month, affects efficiency
### DEBT-003: data layer coupled to the UI
- **location**: ui/*.py
- **symptom**: changing a data source requires modifying UI code
- **impact**: data source migration is blocked
- **suggested solution**: introduce a DataService interface
- **estimated cost**: 5 days
- **created date**: 2025-10-25
- **status**: 🟢 recorded
## P2 - handle this quarter, improves experience
### DEBT-004: insufficient type annotation coverage
- **location**: utils/*.py
- **symptom**: weak IDE completion; type errors surface only at runtime
- **impact**: developer experience
- **suggested solution**: add type annotations gradually
- **estimated cost**: 5 days
- **created date**: 2025-10-28
- **status**: 🟢 recorded
Develop a debt repayment plan
Q2 2025 Technical Debt Repayment Plan
April (focus on P0)
- DEBT-001: ChartWidget split
  - Person in charge: milome
  - Time: Weeks 1-3
  - Acceptance criteria:
    - Unit test coverage > 80%
    - Code review passed
    - Functional regression tests passed
- DEBT-002: incremental indicator calculation
  - Person in charge: milome
  - Time: Week 4
  - Acceptance criteria:
    - Backtest completes in < 5 seconds
    - Results match full recalculation
May (handle P1)
- DEBT-003: data layer decoupling
  - Person in charge: to be assigned
  - Time: Weeks 1-2
June (reserved buffer)
- DEBT-004: type annotations (time permitting)
- or handle newly discovered P0/P1 debt
Part 5: Architecture Evolution Roadmap
The complete evolution timeline of micang-trader
The roadmap shows the evolution of micang-trader from prototype to maintenance. Readers should focus on which critical system constraint each stage resolves, rather than pulling later complexity forward into the prototype stage.
Key decisions at each stage
Prototype stage → Growth stage:
- Decision: invest time in architecture, or keep piling on features?
- Our choice: do the first refactoring at the right time
- Result: avoided much larger technical debt later
Growth stage → Performance stage:
- Decision: prioritize performance optimization or feature development?
- Our choice: pause feature work for 2 weeks and do the performance refactoring
- Result: user satisfaction increased significantly
Performance stage → Stability stage:
- Decision: does the chart component need a complete rewrite?
- Our choice: split rather than rewrite (Strangler Fig pattern)
- Result: smooth transition, no functional regression
Stability stage → Expansion stage:
- Decision: how to fix chart performance, and is virtualization worth the investment?
- Our choice: sliding window + virtualized rendering + off-screen caching
- Result: smooth display of 100,000+ K-lines, memory usage reduced by 5.7x
Expansion stage → Maintenance stage:
- Decision: how to break through the Python GIL bottleneck?
- Our choice: multi-process architecture separation, offloading CPU-intensive tasks
- Result: CPU utilization rose from 12.5% to 85%; indicator calculation latency dropped from 150ms to 25ms
Maintenance stage:
- Decision: how to prevent technical debt from accumulating again?
- Our choice: a debt tracking mechanism and a regular payoff plan
- Result: debt stays controllable and development speed stays stable
Architecture Decision Record (ADR)
ADR Template:
Think of an ADR as a transaction log at the architecture level: the issue, the candidate solutions, the final decision, the verification results and the review triggers must all remain traceable for later maintainers.
# ADR-005: Switch indicator calculation from Pandas to an incremental implementation
## status
- date: 2025-12-15
- status: Accepted
- decision maker: milome
## context
Backtest performance became a serious bottleneck: 10,000 K-line bars took 180 seconds.
Profiling showed indicator calculation accounted for 80% of the time.
## decision
Implement the IncrementalIndicator interface: per-bar updates, with full recalculation kept as a cross-check.
## consequences
### positive
- 10x performance improvement (180s → 3s)
- 80% lower memory usage (1.2GB → 200MB)
- unified calculation path for real-time and backtesting
### negative
- implementation complexity increases
- requires state maintenance, which is harder to debug
- some indicators, such as Bollinger Bands, are harder to implement incrementally
## alternatives
| Alternative | pros | cons | decision |
|------|------|------|------|
| Numba | small change | only ~3x improvement, insufficient | ❌ |
| Cython | best raw performance | high development cost, hard maintenance | ❌ |
| incremental calculation | balanced | moderate complexity | ✅ |
## References
- performance test report: docs/benchmarks/indicator-perf.md
- implementation code: core/indicators/incremental/
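To make the decision concrete, here is a minimal sketch of an indicator under such an interface (an illustration of the idea; the project's actual IncrementalIndicator interface may differ):
# illustrative code, not production code
from collections import deque
from typing import List, Optional

class IncrementalSMA:
    """O(1) per-bar simple moving average: keep a running window sum instead of replaying history."""
    def __init__(self, period: int):
        self._period = period
        self._window = deque(maxlen=period)
        self._sum = 0.0

    def update(self, close: float) -> Optional[float]:
        if len(self._window) == self._period:
            self._sum -= self._window[0]  # the deque evicts this oldest close on append
        self._window.append(close)
        self._sum += close
        if len(self._window) < self._period:
            return None  # not enough bars yet
        return self._sum / self._period

    def recalculate(self, closes: List[float]) -> Optional[float]:
        """Full recalculation fallback, used to cross-check incremental state."""
        self._window.clear()
        self._sum = 0.0
        result = None
        for c in closes:
            result = self.update(c)
        return result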
Part Six: Team and Culture Building
Resistance and coping strategies for reconstruction
Resistance 1: “As long as it can run, don’t move it”
symptom:
- Although the code is bad, it “works”
- Worry about refactoring introducing new bugs
- Emotional attachment to existing code (“I wrote this”)
Coping strategies:
1. Let data speak
   "This module produced 23 bugs in the past three months, 40% of all project bugs. After refactoring, that is expected to drop below 5."
2. Verify in small steps
   - Refactor a small module first (1-2 days of work)
   - Demonstrate the benefits (faster development, fewer bugs)
   - Earn the team's trust before expanding scope
3. Establish a sense of safety
   - Thorough tests
   - A clear rollback plan
   - Incremental replacement rather than a big-bang rewrite
Resistance 2: “I don’t have time to refactor”
symptom:
- Products rush to launch new features
- Consider refactoring to be “extra work”
Coping strategies:
1. Refactoring is an investment, not a cost
   "Spend 5 days on refactoring and save 2 hours on each future change. With 4 changes per month, the investment pays back in 3 months."
2. Reserve 20% of time for technical debt
   - Set aside 20% of each sprint for technical debt
   - Include "debt repayment time" in new-feature estimates
3. Make technical debt visible
   - Regularly present the debt list to the product team
   - Explain how debt slows development
Resistance 3: “Refactoring is too risky”
symptom:
- Worry about system instability caused by reconstruction
- Fear of affecting online users
Coping strategies:
1. Test first
   - Bring coverage up to 80% before refactoring
   - Establish a performance baseline
2. Gradual (canary) release
   # use a feature flag  # illustrative code, not production code
   if feature_flags.enable_new_chart:
       render_with_new_code()
   else:
       render_with_old_code()
3. Monitoring and rollback
   - Monitor key indicators (error rate, performance)
   - Automatic rollback on anomalies
Code Review Checklist
Architecture level:
- Follows the single responsibility principle (a class/function does one thing)
- Module dependencies are reasonable (no circular dependencies)
- Interface design is clear (easy for callers to understand)
- No unnecessary complexity introduced
Code level:
- Necessary tests added (unit tests, integration tests)
- Documentation updated (docstring, README)
- No new technical debt introduced (temporary solutions must carry a TODO)
- Code style complies with the project conventions (lint, format)
Refactoring changes:
- Behavior preserved (functionality unchanged)
- Performance baseline established (no performance regression)
- Rollback plan in place (feature flag or branch)
- Implemented in phases (each step can be verified)
Build a culture of refactoring
1. Refactoring as a routine
It is not "wait until the code rots before refactoring", but "clean up as soon as you smell something bad".
Boy Scout Rule:
“Leave the campground cleaner than you found it.”
Every time you submit code, make it a little better than before.
2. Technical Debt Meeting
Monthly 30-minute meeting:
- Review the debt discovered this month
- Evaluate priorities
- Assign repayment tasks for next month
3. Refactoring Sharing Session
Quarterly 1-hour sharing:
- Share refactoring cases
- Discuss challenges encountered
- Summarize best practices
4. Incentive mechanism
- Recognize refactoring contributions (not just feature development)
- Incorporate code quality into performance reviews
- Establish a “Cleanest Code Award”
Summary: Complete Checklist for Refactoring
Before refactoring
- Clear goal: The refactoring goal is quantifiable (such as “split the file from 3000 lines to 4 files of 800 lines”)
- Test Improvement: Current test coverage > 80%, all passed
- ROI Assessment: Input-output ratio > 1.5
- Rollback plan: Have a clear rollback strategy (Git/Feature Flag)
- Team Ready: The team understands and supports refactoring
- Time Reserve: Have enough time to complete without being interrupted by urgent needs
During refactoring
- Small steps: < 100 lines per change, commit frequently
- Test guard: run the tests immediately after each commit
- Code review: critical changes require review
- Documentation update: update documents and comments in sync
- Performance monitoring: compare performance indicators before and after
After refactoring
- Functional verification: 100% test passed, no function regression
- Performance Verification: Achieve preset performance goals
- Code Metrics: Reduced duplication, increased coverage, reduced complexity
- Team Synchronization: Share refactoring experiences and best practices to the team
- Debt Update: Update the technical debt list and mark paid items
Series Review and Outlook
The core ideas of the seven articles
| Part | Core theme | Key takeaways |
|---|---|---|
| 1 | Architecture design | Independent data layer, unified multi-period abstraction, backtest/live consistency |
| 2 | Python practice | Floating-point precision, time zone handling, memory management, concurrency safety |
| 2 (supplement) | Python pitfalls | 50 deep-trap analyses, AI-assisted pitfall avoidance |
| 3 | AI engineering | Specifications first, multi-agent collaboration, human-machine division of labor |
| 4 | Performance optimization | Profiler-driven diagnosis, algorithm optimization, compilation acceleration, caching strategy |
| 5 | Testing strategy | AI-assisted TDD, property-based testing, boundary-time testing |
| 6 | Architecture evolution | Five refactoring records, decision framework, technical debt management, multi-process architecture |
A core understanding
**The development of a quantitative system is not a one-off event, but a continuous process of evolution.**
Good architecture is not designed; it evolves. The keys are:
- Make the right choice at every decision point (using the decision framework from earlier)
- Pay off technical debt promptly (don’t accumulate it beyond your means)
- Let the architecture grow with the business (the architecture serves the business, not the other way around)
Tips for readers
If you are a quantitative system developer:
- Don’t pursue a perfect initial architecture - let the system run first, and then gradually optimize it
- Build technical debt awareness - record debt, pay off regularly, and prevent accumulation
- Invest in Testing - Testing is a safety net for refactoring and a source of confidence
- Make good use of multiple processes - Python's GIL is not a shackle; reach for multiple processes when the workload calls for them
If you are a technical lead:
- Give the team time to refactor - Set aside 20% time for technical debt
- Establish a culture of refactoring - Recognize refactoring contributions, not just feature development
- Let data speak - ROI assessment, performance benchmarks, code metrics
- Architecture grows with business - from single process to multi-process, from single to distributed
Reference resources
Books
- Refactoring: Improving the Design of Existing Code (Martin Fowler)
- Clean Architecture (Robert C. Martin)
- Code Complete (Steve McConnell)
- Release It!: Design and Deploy Production-Ready Software (Michael T. Nygard)
Articles
- "Technical Debt Quadrant" (Martin Fowler)
- "The Boy Scout Rule"
- "Strangler Fig Pattern"
- "Architecture Decision Records"
Tools
- Code analysis: SonarQube, CodeClimate, pylint, mypy
- Testing: pytest, coverage.py, hypothesis (property-based testing)
- Performance: cProfile, line_profiler, memory_profiler
- Visualization: Mermaid, PlantUML