Original Interpretation: The Three-Layer World of Python Memory Architecture
Why doesn't memory drop after deleting large lists? Understanding the engineering trade-offs and design logic of Python's Arena-Pool-Block three-layer memory architecture
Copyright and Disclaimer This article is an original interpretation based on Real Python’s “Memory Management in Python”. The copyright of the original article belongs to Real Python. This article does not constitute an official translation and is for learning, research, and discussion purposes only.
Attribution Statement The original article provides the technical details of CPython memory management; the three-layer framework, the large-model scenario connections, and the engineering judgments in this article are the author's own.
Original Reference Memory Management in Python — Alexander VanTol: https://realpython.com/python-memory-management/
Original Nature This article is not a paragraph-by-paragraph translation, but establishes the Arena-Pool-Block three-layer analysis framework to explain the engineering trade-offs of Python memory management.
Prologue: The Confusion After Deleting a Large List
Imagine this scenario.
You're training a large model in a Jupyter Notebook and load an 8GB embedding matrix. After training, you execute del large_matrix and stare at the system monitoring tool: memory usage drops from 12GB to 11.5GB. You just released an 8GB object, yet barely half a gigabyte came back. Where did the other 7.5GB go?
You restart the Python process, and memory instantly drops to zero. Run the same code again, and this time the memory peak is only 8.5GB, stabilizing at 4GB after training.
This is not a memory leak. This is Python’s memory pooling strategy at work.
More counter-intuitively: this phenomenon is not a bug, but an engineering trade-off made for performance. Understanding this trade-off requires rebuilding our cognitive framework of Python memory management.
The Old Framework Fails: Why “Garbage Collection” Is Not Enough
Most developers’ understanding of Python memory stops at three concepts:
- Reference Counting: Objects are recycled immediately when no one references them
- Garbage Collection: The gc module handles circular references
- Memory Release: Recycling equals releasing
This framework completely fails when explaining the scenario above.
del large_matrix does drop the reference count to zero (assuming it was the last reference), and the object is recycled, but the memory is not returned to the operating system. Why?
The answer is: Python’s memory management is a layered architecture. Recycling only handles the object layer; releasing requires traversing three layers back to the operating system.
What We Really Need to Understand
Before diving into the three-layer architecture, we need to understand the essence of Python objects at the C layer.
CPython is the Python interpreter implemented in C. Every Python object—including every integer, string, and list you create—is a C struct at the underlying layer called PyObject:
typedef struct _object {
Py_ssize_t ob_refcnt; // Reference count
struct _typeobject *ob_type; // Type pointer
} PyObject;
This struct has only two fields: reference count and type pointer. The reference count tracks how many names point to this object, and when it drops to zero, the object’s memory can be recycled.
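A quick way to watch the reference count move is sys.getrefcount, which reports one extra reference for its own argument; a minimal sketch (exact numbers may vary slightly across versions):

import sys

x = [1, 2, 3]              # one reference: the name x
print(sys.getrefcount(x))  # typically 2: x plus the temporary argument reference
y = x                      # a second name now points to the same object
print(sys.getrefcount(x))  # typically 3
del y
print(sys.getrefcount(x))  # back to 2
del x                      # the count reaches zero and the object can be recycled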
But “recycling” is not equal to “releasing”.
For small objects (512 bytes or smaller), CPython uses a dedicated memory pool system called pymalloc. Rather than requesting and releasing memory from the operating system for each object, it maintains a private memory pool and recycles memory internally.
This leads us to our three-layer framework.
Arena → Pool → Block: The Three-Layer Memory Architecture
Figure 1: CPython Arena-Pool-Block Three-Layer Memory Architecture
Layer 1: Arena (256KB)
Arena is the largest memory unit, fixed at 256KB, aligned to memory page boundaries.
When Python needs more memory, it requests an Arena from the operating system. These Arenas are organized into a doubly linked list usable_arenas, sorted by the number of free Pools they contain. CPython prioritizes using Arenas with the fewest free Pools—the goal is to allow those emptier Arenas to potentially be completely released back to the operating system.
But Arenas are rarely truly released. Only when all Pools in an Arena become empty will it be returned. In long-running services, this means Python process memory usage often “only increases, never decreases”.
This is not a memory leak. This is Arena-level pooling strategy.
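The effect is easy to observe from outside the interpreter. A minimal sketch, assuming psutil is installed for reading the process RSS (any other RSS source works just as well):

import gc
import os
import psutil  # assumed available: pip install psutil

proc = psutil.Process(os.getpid())
print(f"start:     {proc.memory_info().rss / 1e6:.0f} MB")

# Millions of small objects flow through the Arena-Pool-Block layers
data = [str(i) for i in range(3_000_000)]
print(f"allocated: {proc.memory_info().rss / 1e6:.0f} MB")

del data
gc.collect()
# RSS usually stays well above the starting value: the Blocks are free,
# but Arenas that still contain any live Pool cannot be returned to the OS
print(f"after del: {proc.memory_info().rss / 1e6:.0f} MB")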
Layer 2: Pool (4KB)
Each Arena is divided into multiple Pools, each Pool being 4KB, exactly the size of a virtual memory page.
Pools have three states:
- used: Has available Blocks that can be allocated
- full: All Blocks have been allocated
- empty: No data, can be allocated to any size class
Pools are managed through the usedpools array. The array is indexed by size class (0-63), and each entry heads a doubly linked list of partially used Pools for that size class. When 8 bytes of memory need to be allocated, CPython looks straight at the entry for size class 0, with no traversal needed.
When no available Pool exists, the system takes an empty Pool from the freepools linked list for initialization.
This design avoids the overhead of frequently requesting memory from the operating system.
Layer 3: Block (8-512 bytes)
Inside each Pool, the memory is divided into smaller Blocks, which are the actual units that store object data.
Block size is determined by size class, ranging from 8 bytes to 512 bytes, totaling 64 size classes (index 0-63). Size class mapping follows a specific algorithm:
| Requested Bytes | Allocated Block Size | Size Class Index |
|---|---|---|
| 1-8 | 8 bytes | 0 |
| 9-16 | 16 bytes | 1 |
| 17-24 | 24 bytes | 2 |
| … | … | … |
| 505-512 | 512 bytes | 63 |
This alignment strategy ensures efficient allocation of small objects, but also brings internal fragmentation. An object that only needs 9 bytes will occupy 16 bytes.
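The mapping is plain arithmetic over 8-byte alignment; a small sketch of the equivalent calculation (mirroring the rule in the table above, not the CPython source itself):

ALIGNMENT = 8  # pymalloc aligns small requests to 8 bytes

def size_class(nbytes: int) -> tuple[int, int]:
    """Return (size class index, allocated block size) for a small request."""
    index = (nbytes - 1) // ALIGNMENT
    return index, (index + 1) * ALIGNMENT

print(size_class(9))    # (1, 16): a 9-byte request occupies a 16-byte Block
print(size_class(512))  # (63, 512): the largest request pymalloc will serve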
Blocks within a Pool are managed by the freeblock pointer—a singly linked list, with freed Blocks added to the head of the list. The next allocation takes directly from the head.
This is the truth of “free”: the Block is marked as available, but memory remains in the Python process.
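This reuse is visible from Python. Free a small object and the next allocation of the same size class often lands on exactly the same Block; a minimal sketch (id() equals the memory address in CPython, and the reuse is an implementation detail, not a guarantee):

class Point:
    __slots__ = ("x", "y")  # small, fixed-size instances, all in one size class

p = Point()
addr = id(p)        # in CPython, id() is the object's address
del p               # the Block goes to the head of its Pool's freeblock list

q = Point()         # next allocation in the same size class
print(id(q) == addr)  # usually True: the freed Block was handed straight back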
How This Framework Guides Practical Judgments
Understanding the three-layer architecture, we can now answer the opening question.
Diagnosing Memory Usage
Python 3.4+ provides the tracemalloc module to track memory allocation:
import tracemalloc
tracemalloc.start()
# ... your code ...
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory: {current / 1024 / 1024:.1f} MB")
print(f"Peak memory: {peak / 1024 / 1024:.1f} MB")
tracemalloc.stop()
But this only tracks object-level allocation. To understand Arena-level behavior, more low-level tools are needed.
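One such low-level view ships with CPython itself: sys._debugmallocstats() dumps pymalloc statistics (arenas, pools, and blocks per size class) to stderr. It works in standard builds, though its output format is an implementation detail:

import sys

# Writes Arena/Pool/Block statistics to stderr, including lines such as
# "# arenas allocated total" and per-size-class pool usage
sys._debugmallocstats()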
Memory Optimization for Long-Running Services
For long-running services (such as large model inference servers):
- Pre-allocation Strategy: Pre-allocate required memory at startup to avoid allocation fragmentation at runtime
- Object Pool Reuse: Use __slots__ to reduce small object overhead (see the sketch after this list)
- Periodic Restarts: For long-running processes, consider periodic restarts rather than trying to force memory release
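A minimal sketch of the __slots__ point above, using sys.getsizeof to compare per-instance overhead (the exact byte counts vary by Python version, so treat them as illustrative):

import sys

class Plain:
    def __init__(self):
        self.x = 1.0
        self.y = 2.0

class Slotted:
    __slots__ = ("x", "y")
    def __init__(self):
        self.x = 1.0
        self.y = 2.0

p, s = Plain(), Slotted()
# The plain instance also drags along a per-instance __dict__
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # e.g. ~150 bytes combined
print(sys.getsizeof(s))                               # e.g. ~48 bytes, no __dict__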
Similarities with PyTorch CUDA Memory Pool
Interestingly, PyTorch’s CUDA memory management adopts a strategy similar to Python’s pymalloc. Understanding this similarity helps us make more informed memory decisions in deep learning engineering.
PyTorch CUDA Memory Pool Mechanism
PyTorch’s caching_allocator is the core component of GPU memory management, with a design philosophy remarkably similar to Python’s three-layer architecture:
- Segment Layer (Analogous to Arena): Large chunks of GPU memory requested from the CUDA driver (typically 2MB for small allocations, larger for big tensors), aligned to GPU page boundaries
- Block Layer (Analogous to Pool): Segments are divided into Blocks of different sizes, maintained in free lists
- Chunk Layer (Analogous to Block): The smallest units actually allocated to tensors
// PyTorch caching allocator core structure (simplified)
struct Block {
size_t size; // Block size
Block* prev; // Doubly linked list
Block* next;
bool allocated; // Whether allocated
int device; // GPU device ID
};
struct BlockPool {
std::unordered_map<size_t, std::list<Block*>> small_blocks;
std::list<Block*> large_blocks; // Blocks > 1MB
};
Caching Allocator vs pymalloc Comparison
| Feature | Python pymalloc | PyTorch CUDA Allocator |
|---|---|---|
| Top Unit | Arena (256KB) | Segment (2MB) |
| Middle Unit | Pool (4KB) | Block Pool (dynamic) |
| Bottom Unit | Block (8-512B) | Chunk (variable) |
| Allocation Granularity | 64 size classes | Power-of-2 size classes |
| Release Strategy | Lazy release, return only when Arena empty | Lazy release, explicit empty_cache |
| Fragmentation Source | Interleaving of different size classes | Inconsistent tensor lifecycles |
| Locking Mechanism | GIL protection | Per-GPU independent lock |
Special Challenges of GPU Memory Fragmentation
GPU memory fragmentation is more destructive than CPU memory fragmentation:
import torch
# Scenario: Repeatedly loading different sized model weights
for model_id in range(100):
# Load randomly sized weight matrices
size = 1024 * (model_id % 10 + 1)
weights = torch.randn(size, size, device='cuda')
# Release immediately after forward pass
output = weights @ weights
del weights, output
# Reserved (cached) GPU memory keeps growing even though the tensors are freed!
print(f"Model {model_id}: {torch.cuda.memory_allocated()/1e9:.2f}GB "
f"(reserved: {torch.cuda.memory_reserved()/1e9:.2f}GB)")
# Sample output:
# Model 0: 1.05GB (reserved: 2.10GB)
# Model 50: 3.20GB (reserved: 5.80GB)
# Model 99: 4.10GB (reserved: 8.20GB) <- Heavy fragmentation!
torch.cuda.empty_cache() vs Arena Release
Both mechanisms implement “return unused memory” functionality, but with different trigger conditions and costs:
| Dimension | Python Arena Release | torch.cuda.empty_cache() |
|---|---|---|
| Trigger Condition | All Pools in Arena empty | Explicit call or OOM |
| Time Cost | Microseconds (pure CPU) | Milliseconds (GPU sync) |
| Sync Cost | None | Forces CUDA sync, blocks all streams |
| Use Case | Long-running services | Interactive debugging, low GPU memory |
| Call Frequency | Automatic | Use cautiously, excessive calls hurt performance |
# Production recommendation: Use empty_cache() cautiously
class MemoryEfficientInference:
def __init__(self, empty_cache_threshold_gb=0.5):
self.threshold = empty_cache_threshold_gb * 1e9
self.allocated_prev = 0
def maybe_empty_cache(self):
"""Intelligently trigger empty_cache to avoid frequent synchronization"""
allocated = torch.cuda.memory_allocated()
reserved = torch.cuda.memory_reserved()
# Only trigger when fragmentation is severe
if reserved - allocated > self.threshold:
torch.cuda.empty_cache()
print(f"Empty cache triggered: freed {reserved - torch.cuda.memory_reserved():.0f}MB")
self.allocated_prev = allocated
This comparison reveals a fundamental pattern: memory pooling is a universal pattern for high-performance systems. Whether CPU memory or GPU memory, both face the trade-off between “frequent allocation/release overhead” and “high memory usage”.
Memory Management Comparison Across Python Implementations
So far, this article has discussed CPython implementation details—the official Python interpreter written in C. However, the Python language specification does not dictate memory management implementation, and different Python implementations adopt completely different strategies. Understanding these differences helps choose the most appropriate runtime environment for specific scenarios.
PyPy: Generational GC and Object Moving
PyPy is a Python interpreter written in Python (translated to C via the RPython toolchain), with fundamentally different memory management from CPython:
Generational Garbage Collection
PyPy abandons CPython’s reference counting in favor of pure generational garbage collection:
# PyPy GC behavior example
import gc
# Reference counts are not meaningful on PyPy (no CPython-style sys.getrefcount semantics)
# Object survival determined by GC, not reference count
class Node:
def __init__(self, value):
self.value = value
self.next = None
# Circular references are not a problem in PyPy
a = Node(1)
b = Node(2)
a.next = b
b.next = a
# No manual handling needed, generational GC reclaims automatically
Object Moving and Memory Compaction
PyPy’s most significant feature is support for object moving:
CPython: Objects never move after creation
-> Can pass pointers directly to C extensions
-> But memory fragmentation cannot be solved
PyPy: GC can move live objects, compacting memory
-> Effectively eliminates fragmentation
-> Requires "write barriers" to track pointer changes
-> More complex interaction with C extensions (requires pinning)
This design makes PyPy’s memory usage more stable in long-running applications, but at the cost of poorer compatibility with C extensions (such as NumPy, PyTorch).
GraalPython: Static Analysis Optimization
GraalPython is the Python implementation in the GraalVM ecosystem, leveraging advanced JIT compilation:
Static Analysis-Driven Memory Optimization
# GraalPython can perform escape analysis at compile time
def create_vector():
return [1, 2, 3, 4, 5]
# If analysis shows the list doesn't escape the function
# GraalPython can allocate it on the stack instead of heap
# Automatically released when function returns, no GC needed
Key Features:
- Partial Escape Analysis: Identifies objects that don’t escape their scope, stack allocation instead of heap
- Scalar Replacement: Decomposes object fields into independent variables, eliminating object header overhead
- Java Interoperability: Seamlessly call Java class libraries, share JVM garbage collector
Jython / IronPython: Managed Memory Model
Jython (Python on JVM) and IronPython (Python on .NET) delegate memory management entirely to the host runtime:
# Jython example - Using JVM's G1/ZGC/Shenandoah GC
from java.lang import Runtime
# Call the JVM memory management API directly
runtime = Runtime.getRuntime()
print("JVM heap memory: %.1f MB" % (runtime.totalMemory() / 1024.0 / 1024.0))
print("Free memory: %.1f MB" % (runtime.freeMemory() / 1024.0 / 1024.0))
# GC behavior controlled entirely by JVM parameters
# -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200
Advantages of Managed Model:
- Mature GC Algorithms: Can directly use G1, ZGC, Shenandoah and other advanced collectors
- Cross-Language Memory Sharing: Share heap memory with Java/C# objects, zero-copy interoperability
- Enterprise Monitoring: Integrate with VisualVM, JFR and other mature toolchains
Limitations:
- Only supports Python 2.7 (Jython 3 still in development)
- Cannot use CPython’s C extensions (need Java/.NET rewrite)
- Longer startup time (JVM warmup)
Selection Guide: Choose by Scenario
| Scenario | Recommended | Reason |
|---|---|---|
| Web Services / General | CPython 3.11+ | Richest ecosystem, fast startup, best compatibility |
| Long-running / Pure Python | PyPy 3.9 | Better GC, memory compaction, JIT acceleration |
| Java Ecosystem | GraalPython / Jython | Seamless Java library calls, shared JVM GC |
| .NET Ecosystem | IronPython | C# interoperability, uses CLR GC |
| Data Science / ML | CPython + NumPy | C extension compatibility is critical |
| GIL-free Multithreading | CPython 3.13 (nogil) | PEP 703 introduces true parallelism |
| Cloud-native / Serverless | CPython 3.11+ | Prioritize startup speed, cold-start sensitive |
Key Insight: Memory management strategy choice is a systematic trade-off. CPython’s three-layer pooling architecture wins on startup speed and C extension compatibility, but falls behind PyPy on long-running memory efficiency; Jython/IronPython rely on mature JVM/.NET GC but sacrifice Python 3 compatibility and ecosystem richness. There is no “best” implementation, only the “most suitable” choice for the scenario.
Large Object Memory Allocation Strategy
The previous sections detailed pymalloc’s fine-grained management of small objects (≤512 bytes), but large objects follow a completely different path. Understanding large object strategy is crucial for large model weight loading, scientific computing, and similar scenarios.
Large Object malloc Path Explained
When allocation requests exceed 512 bytes, CPython bypasses pymalloc and calls the system’s malloc directly:
// Simplified from CPython source: Objects/obmalloc.c
static void*
pymalloc_alloc(void *ctx, size_t nbytes) {
if (nbytes > SMALL_REQUEST_THRESHOLD) { // > 512 bytes
return PyMem_RawMalloc(nbytes); // Direct to system malloc
}
// ... Small objects go Arena-Pool-Block path
}
Characteristics of this path:
- No Pool Management: Each allocation calls malloc directly and is released directly via free
- No Pool-Level Accumulation: Freed memory goes back to the C allocator, and very large blocks (served via mmap) are returned to the OS promptly instead of being held by pymalloc
- Higher Overhead: System calls and page table updates make this path roughly 5-10x slower than a pymalloc hit
import time
import tracemalloc
# Compare small vs large object allocation performance
tracemalloc.start()
# Small objects: pymalloc path
start = time.perf_counter()
small_objects = [[] for _ in range(100000)] # Empty list ~56 bytes
t1 = time.perf_counter() - start
# Large objects: system malloc path
start = time.perf_counter()
large_objects = [bytearray(1024) for _ in range(100000)] # 1KB
t2 = time.perf_counter() - start
print(f"Small objects 100K allocations: {t1*1000:.2f}ms")
print(f"Large objects 100K allocations: {t2*1000:.2f}ms")
print(f"Performance gap: {t2/t1:.1f}x")
Memory Alignment and SIMD Optimization
Large models and scientific computing have special memory alignment requirements:
SIMD Alignment Requirements
import numpy as np
# NumPy's allocator typically returns 16-byte aligned buffers (often 64-byte aligned in practice)
# 64-byte alignment matters for the AVX-512 (512-bit = 64 bytes) instruction set
arr = np.zeros(1024, dtype=np.float32)
print(f"64-byte aligned: {arr.ctypes.data % 64 == 0}")  # often True, but not guaranteed
# Penalty for unaligned access (pseudo-code illustration)
# AVX-512 loading unaligned memory may require 2 micro-ops instead of 1
# In loop-intensive computations, this can cause 20-30% performance loss
Alignment Strategy Comparison
| Allocation Method | Alignment Guarantee | Use Case |
|---|---|---|
| System malloc | 8 or 16 bytes | General allocation |
| posix_memalign | Arbitrary alignment | Manual SIMD optimization |
| NumPy allocator | 64 bytes | Scientific computing, ML |
| PyTorch allocator | 512 bytes | GPU tensor alignment |
NumPy Array Memory Layout Analysis
NumPy’s ndarray memory layout is key to performance optimization:
import numpy as np
# C-order (row-major) vs F-order (column-major)
arr_c = np.zeros((1000, 1000), order='C')
arr_f = np.zeros((1000, 1000), order='F')
# Memory layout affects cache hit rate
# When iterating by row, C-order is contiguous
%timeit arr_c.sum(axis=1) # Fast
%timeit arr_c.sum(axis=0) # Slow (strided access)
# View vs copy memory strategy
view = arr_c[::2, ::2] # View, shared memory
print(f"View contiguous: {view.flags['C_CONTIGUOUS']}") # False
# Force copy when contiguous memory needed
contiguous = np.ascontiguousarray(view)
NumPy Memory Management Key Decisions:
# 1. Pre-allocation: Avoid repeated allocation in loops
result = np.empty((batch_size, feature_dim)) # Pre-allocate
for i, batch in enumerate(data_loader):
np.multiply(batch, weights, out=result) # Reuse memory
# 2. Memory pool: Use numpy.memmap for very large arrays
mmapped = np.memmap('large_array.dat', dtype='float32',
mode='r', shape=(1000000, 4096))
# Pages loaded on demand, not explicitly occupying physical memory
# 3. In-place operations: Reduce temporary array allocation
# Avoid: result = a + b + c # Creates 2 temporary arrays
# Prefer: result = a.copy(); np.add(result, b, out=result); np.add(result, c, out=result)
Large Model Weight Loading Memory Optimization
Large model (LLM) weight loading is an extreme scenario for memory management requiring special strategies:
1. Chunked Loading and Memory Mapping
import torch
import mmap
# Strategy: Use memory mapping instead of full loading
def load_weights_mmap(checkpoint_path):
"""Use mmap for lazy weight loading, read on demand"""
# Recent PyTorch versions (2.1+) support mmap loading of zipfile-format checkpoints
state_dict = torch.load(
checkpoint_path,
mmap=True, # Key parameter
map_location='cpu'
)
return state_dict
# Memory comparison
# Full loading: RSS = model size + framework overhead
# mmap loading: RSS ≈ currently active weights + page cache (reclaimable)
2. Quantized Loading to Reduce Memory Footprint
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 4-bit quantization loading, memory footprint reduced to 1/4
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Nested quantization for further compression
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
quantization_config=quant_config,
device_map="auto", # Auto assign layers to GPU/CPU
)
# Memory comparison (Llama-2-7B):
# FP32: ~28GB
# FP16: ~14GB
# INT8: ~7GB
# INT4: ~3.5GB
3. Layer-wise Loading and CPU Offload
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b")  # example config for the empty skeleton
# Create empty model on Meta device (no memory allocation)
with init_empty_weights():
model = AutoModelForCausalLM.from_config(config)
# Smart dispatch: Active layers on GPU, inactive on CPU/disk
model = load_checkpoint_and_dispatch(
model,
checkpoint_path,
device_map="auto",
offload_folder="offload", # Excess layers offloaded to disk
offload_state_dict=True,
)
# Memory strategy:
# - GPU: Only holds layers needed for current computation (1-2 layers)
# - CPU: Holds preloaded next batch of layers
# - Disk: Holds infrequently used layers
4. Gradient Checkpointing and Activation Recomputation
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
# Memory optimization during training: trade computation for memory
class MemoryEfficientLayer(nn.Module):
def forward(self, x):
# Drop activations during forward, recompute during backward
return checkpoint(self._forward_impl, x)
def _forward_impl(self, x):
# Actual layer computation
return self.layer(x)
# Activation memory savings:
# Standard training: O(N), where N = number of layers
# Checkpointing: down to roughly O(sqrt(N)) with segment-wise checkpointing (only boundary activations are kept)
# Cost: typically 20-30% extra compute time for recomputation
Large Model Memory Optimization Decision Matrix
| Technique | Memory Savings | Performance Impact | Applicable Stage |
|---|---|---|---|
| Memory Mapping | Medium (swappable) | Low | Inference |
| 4-bit Quantization | High (4x) | Medium | Inference/Finetune |
| CPU/Disk Offload | Very High | High (IO bottleneck) | Inference |
| Gradient Checkpointing | Very High | Medium (recompute) | Training |
| Flash Attention | High (quadratic→linear in sequence length) | Low | Training/Inference |
| ZeRO Sharding | Very High (scales with GPU count) | Low | Distributed Training |
Understanding the hierarchical relationship of these memory management strategies—from CPython’s pymalloc to system malloc, to NumPy/PyTorch specialized allocators—is the foundation of efficient engineering practice in the large model era.
Memory Leak Investigation in Practice: The Mystery of Memory Growth in a Recommendation System
Case Background: Memory Issues in an E-commerce Recommendation System
In Q3 2023, a recommendation system service at a leading e-commerce platform experienced severe memory issues in production. The service was built on Python + TensorFlow Serving architecture, responsible for real-time computation of user-product recommendation scores.
System Architecture Overview:
- Service Type: RESTful API service (FastAPI)
- Deployment: Kubernetes Pod, memory limit 8GB
- Load Characteristics: 20 million requests/day, peak QPS 150
- Model Scale: User embeddings 128-dim × 50 million users, product embeddings 128-dim × 10 million products
Problem Symptoms:
| Timeline | Memory Usage | Pod Status |
|---|---|---|
| Day 1 00:00 | 2.1 GB | Running |
| Day 1 12:00 | 3.8 GB | Running |
| Day 2 00:00 | 5.2 GB | Running |
| Day 2 08:30 | 6.9 GB | OOMKilled |
| Day 2 08:35 (restarted) | 2.0 GB | Running |
The service triggered OOM approximately every 18-24 hours. While Kubernetes automatic restart ensured availability, request failure rate increased by 0.3% during restart periods, affecting about 60,000 recommendation requests daily.
Diagnosis Path: From Python Object Layer to Arena Layer
Step 1: Confirm Whether It’s a “Real” Memory Leak
Many developers assume leakage at the first sign of continuous memory growth. But first, distinguish: Is it a real leak (objects cannot be reclaimed), or a “false leak” caused by Arena pooling strategy?
import gc
import sys
def diagnose_memory():
"""Basic diagnosis: Object count vs memory usage"""
# Force full GC
gc.collect()
gc.collect()
gc.collect()
# Count objects by generation
gen_counts = gc.get_count()
total_objects = len(gc.get_objects())
print(f"GC generation counts: {gen_counts}")
print(f"Total live objects: {total_objects}")
# Top 10 by type
type_counts = {}
for obj in gc.get_objects():
obj_type = type(obj).__name__
type_counts[obj_type] = type_counts.get(obj_type, 0) + 1
top_types = sorted(type_counts.items(), key=lambda x: -x[1])[:10]
print("\nObject type distribution (Top 10):")
for name, count in top_types:
print(f" {name}: {count}")
diagnose_memory()
Output:
GC generation counts: (245, 8, 3)
Total live objects: 184,532
Object type distribution (Top 10):
dict: 45,231
list: 23,456
str: 18,902
UserEmbedding: 12,340 # <-- Abnormal!
ProductEmbedding: 11,892 # <-- Abnormal!
float: 8,234
tuple: 7,654
function: 6,543
builtin_function_or_method: 5,432
cell: 4,321
Key finding: Abnormally high count of UserEmbedding and ProductEmbedding objects. Theoretically, the recommendation service should only cache embeddings for active users, not store large numbers of objects.
Step 2: tracemalloc Locates Allocation Hotspots
import tracemalloc
from functools import lru_cache
# Start tracing
tracemalloc.start(25) # Keep 25 stack frames
# Baseline snapshot
baseline = tracemalloc.take_snapshot()
# Simulate 1000 recommendation requests
for user_id in generate_test_users(1000):
get_recommendations(user_id)
# Compare snapshots
current = tracemalloc.take_snapshot()
diff = current.compare_to(baseline, 'lineno')
print("Memory growth hotspots (Top 10):")
for stat in diff[:10]:
print(f"\n{stat.traceback.format()[-1]}")
print(f" Size: {stat.size_diff / 1024 / 1024:.2f} MB")
print(f" Count: {stat.count_diff}")
Output:
Memory growth hotspots (Top 10):
File "recommender/cache.py", line 42
Size: 245.67 MB
Count: +12,340
File "recommender/embeddings.py", line 88
Size: 198.34 MB
Count: +11,892
File "recommender/feature_store.py", line 156
Size: 67.23 MB
Count: +3,456
Problem code located at cache.py:42:
# Problem code (cache.py)
class EmbeddingCache:
def __init__(self):
self._cache = {} # Unbounded local cache
def get(self, user_id: str) -> Optional[UserEmbedding]:
if user_id not in self._cache:
# Load from Redis and permanently cache locally
embedding = self._load_from_redis(user_id)
self._cache[user_id] = embedding # <-- Only grows!
return self._cache.get(user_id)
Root Cause Analysis: Engineers avoided repeated Redis queries by maintaining local in-memory caches in each Pod, but without eviction policies. As different user requests were randomly distributed across Pods, each Pod’s cache grew indefinitely.
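The immediate fix referenced in the next step is simply a bounded cache. A minimal sketch of that change (the class name and its Redis loader mirror the problem code above and are specific to this case):

from collections import OrderedDict
from typing import Optional

class BoundedEmbeddingCache:
    """Same get() interface as the problem code, but with LRU eviction."""
    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self._cache: "OrderedDict[str, UserEmbedding]" = OrderedDict()

    def get(self, user_id: str) -> "Optional[UserEmbedding]":
        if user_id in self._cache:
            self._cache.move_to_end(user_id)        # mark as recently used
            return self._cache[user_id]
        embedding = self._load_from_redis(user_id)  # same loader as the original class
        if embedding is not None:
            if len(self._cache) >= self.max_size:
                self._cache.popitem(last=False)     # evict the least recently used entry
            self._cache[user_id] = embedding
        return embedding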
Step 3: pympler + guppy Analyze Object Memory Usage
After fixing the cache (switching to an LRU cache capped at 10,000 entries), memory growth slowed but persisted, so a deeper analysis of actual object memory usage was needed.
from pympler import tracker, muppy, summary
from guppy import hpy
import pandas as pd
def detailed_analysis():
"""Deep analysis using pympler and guppy"""
# Method 1: pympler growth tracking
tr = tracker.SummaryTracker()
tr.print_diff()
# Method 2: guppy heap analysis
hp = hpy()
h = hp.heap()
print("=== Guppy Heap Memory Distribution ===")
print(h)
print("\n=== Memory Usage by Type ===")
by_type = h.bytype
print(by_type)
# Trace reference chains for specific objects
print("\n=== UserEmbedding Object Reference Analysis ===")
emb_heap = h.bytype[UserEmbedding]
print(f"UserEmbedding instances: {len(emb_heap)}")
print(f"Single instance size: {emb_heap[0].size if emb_heap else 'N/A'}")
# See who holds these objects
for obj in emb_heap[:5]: # Sample first 5
refs = hp.heap().referrers
print(f"\nObject {id(obj)} reference status:")
for ref in refs:
print(f" - {type(ref).__name__}")
detailed_analysis()
Output:
=== Guppy Heap Memory Distribution ===
Partition of a set of 184532 objects. Total size = 2145678900 bytes.
Index Count % Size % Cumulative % Type
0 12340 7 987200000 46 987200000 46 UserEmbedding
1 11892 6 761088000 35 1748288000 81 ProductEmbedding
2 45231 24 72369600 3 1820657600 85 dict
3 23456 13 37529600 2 1858187200 87 list
4 18902 10 30243200 1 1888430400 88 str
=== Memory Usage by Type ===
UserEmbedding occupies 46% memory, ~987MB
ProductEmbedding occupies 35% memory, ~761MB
Step 4: Arena Layer Fragmentation Analysis
Even if objects are reclaimed, memory may still be held by Arenas. Use CPython debug interface to inspect:
def arena_analysis():
"""Analyze Arena layer memory state (requires --with-pydebug compiled Python)"""
import sys
# Output malloc stats
sys._debugmallocstats()
arena_analysis()
Sample output:
# arenas allocated = 4096
# arenas reclaimed = 12
# arenas highwater = 4096
# arenas allocated current = 4084
# Memory usage calculation:
# 4084 arenas × 256KB = 1,045,504 KB ≈ 1.0 GB
# Pool usage statistics:
# size class 31 (256 bytes): 1024 pools, 32% fragmentation
# size class 39 (320 bytes): 2048 pools, 28% fragmentation
# ...
Key Findings:
- Only 12 Arenas reclaimed, 4084 still held
- Pool fragmentation 28-32%, objects of different size classes interleaved
- Even after objects reclaimed, Arenas cannot be returned to OS
Solution: Three-Layer Collaborative Governance
Solution 1: Object Pool Reuse (Application Layer)
from collections import OrderedDict
from typing import List, Optional
class PooledEmbeddingCache:
"""Use object pool to reuse Embedding objects, reduce memory allocation"""
def __init__(self, max_size: int = 10000, pool_size: int = 1000):
self.max_size = max_size
self._lru_cache: OrderedDict[str, UserEmbedding] = OrderedDict()
# Embedding object pool: Pre-allocate and reuse
self._embedding_pool: List[UserEmbedding] = [
UserEmbedding(dim=128) for _ in range(pool_size)
]
self._available: List[UserEmbedding] = self._embedding_pool.copy()
def _get_from_pool(self) -> Optional[UserEmbedding]:
"""Get Embedding object from pool"""
if self._available:
return self._available.pop()
return None
def _return_to_pool(self, emb: UserEmbedding):
"""Return Embedding to object pool"""
if len(self._available) < len(self._embedding_pool):
emb.clear() # Reset object state
self._available.append(emb)
def get(self, user_id: str) -> Optional[UserEmbedding]:
# LRU cache retrieval
if user_id in self._lru_cache:
self._lru_cache.move_to_end(user_id)
return self._lru_cache[user_id]
# Load from Redis
data = self._load_from_redis(user_id)
if data is None:
return None
# Cache eviction
if len(self._lru_cache) >= self.max_size:
oldest_id, oldest_emb = self._lru_cache.popitem(last=False)
self._return_to_pool(oldest_emb) # Return to object pool
# Allocate from pool or create new
emb = self._get_from_pool() or UserEmbedding(dim=128)
emb.load_from_bytes(data)
self._lru_cache[user_id] = emb
return emb
Solution 2: Periodic Forced Cleanup (GC Layer)
import gc
import signal
from contextlib import contextmanager
def aggressive_gc_cleanup():
"""Aggressive GC cleanup strategy"""
# Round 1: Clean generation 0
gc.collect(0)
# Round 2: Clean generation 1
gc.collect(1)
# Round 3: Full GC
freed = gc.collect(2)
# CPython exposes no public API to force Arena release; after the full
# collection above, dump pymalloc statistics so Arena-level behavior can be inspected
import sys
if hasattr(sys, "_debugmallocstats"):
    sys._debugmallocstats()  # Arena/Pool/Block statistics are written to stderr
return freed
# Use APScheduler for periodic execution
from apscheduler.schedulers.background import BackgroundScheduler
scheduler = BackgroundScheduler()
scheduler.add_job(
aggressive_gc_cleanup,
'interval',
minutes=30, # Execute every 30 minutes
id='gc_cleanup',
replace_existing=True
)
scheduler.start()
Solution 3: Graceful Restart Strategy (Arena Layer)
import gc
import os
import sys
import signal
import psutil
from fastapi import FastAPI
import asyncio
app = FastAPI()
class MemoryAwareRestarter:
"""Memory-aware service graceful restarter"""
def __init__(
self,
memory_threshold_gb: float = 6.5,
graceful_timeout: int = 60,
check_interval: int = 300
):
self.threshold = memory_threshold_gb * 1024 * 1024 * 1024
self.graceful_timeout = graceful_timeout
self.check_interval = check_interval
self.shutting_down = False
self.request_count = 0
async def start_monitoring(self):
"""Start memory monitoring loop"""
while not self.shutting_down:
await asyncio.sleep(self.check_interval)
await self._check_memory()
async def _check_memory(self):
process = psutil.Process(os.getpid())
mem_info = process.memory_info()
if mem_info.rss > self.threshold:
await self._graceful_restart()
async def _graceful_restart(self):
"""Graceful restart process"""
self.shutting_down = True
print(f"Triggering graceful restart, current memory: {self._get_memory_mb():.1f} MB")
# 1. Stop accepting new requests
app.state.accepting_requests = False
# 2. Wait for existing requests to complete
wait_start = asyncio.get_event_loop().time()
while self.request_count > 0:
if asyncio.get_event_loop().time() - wait_start > self.graceful_timeout:
print("Wait timeout, force exit")
break
await asyncio.sleep(0.5)
# 3. Final GC
gc.collect()
# 4. Signal supervisor/systemd to restart
os.kill(os.getpid(), signal.SIGTERM)
def _get_memory_mb(self) -> float:
process = psutil.Process(os.getpid())
return process.memory_info().rss / 1024 / 1024
# Global middleware to count requests
@app.middleware("http")
async def request_counter(request, call_next):
restarter.request_count += 1
try:
response = await call_next(request)
return response
finally:
restarter.request_count -= 1
# Initialize
restarter = MemoryAwareRestarter()
@app.on_event("startup")
async def startup():
asyncio.create_task(restarter.start_monitoring())
Fix Verification Results
Monitoring data after implementing three-layer solution:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Average memory usage | 5.8 GB | 3.2 GB | -45% |
| Memory growth rate | +180 MB/hr | +15 MB/hr | -92% |
| Pod restart frequency | 1 per 18 hrs | 1 per 7 days | -89% |
| OOM events | 12/week | 0/week | -100% |
| p99 latency | 45 ms | 38 ms | -16% |
Toolchain Combination Recommendations
| Scenario | Recommended Tool | Key Metrics |
|---|---|---|
| Quick hotspot location | tracemalloc | Allocation stack, growth rate |
| Object-level analysis | pympler + guppy | Object size, reference chain |
| Arena-level analysis | sys._debugmallocstats | Arena count, fragmentation |
| Real-time monitoring | psutil + prometheus | RSS, VMS, GC frequency |
| Production diagnosis | objgraph | Object reference graph, cycles |
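objgraph appears in the table above but nowhere else in this article; a short sketch of how it is typically used (a third-party package, and the image export additionally requires graphviz):

import objgraph

# Most common object types by instance count: a quick signal for runaway caches
objgraph.show_most_common_types(limit=10)

# ... run a batch of requests here ...

# Show which types grew since the previous call
objgraph.show_growth(limit=10)

# For a suspicious object, render the chain of referrers keeping it alive
suspect = objgraph.by_type('dict')[0]  # pick any instance of the suspect type
objgraph.show_backrefs([suspect], max_depth=3, filename='backrefs.png')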
Production Memory Diagnosis Checklist
□ Set tracemalloc baseline at startup, record initial memory state
□ Configure memory alert threshold (suggest 70% of limit)
□ Periodically scan gc.get_objects() for abnormal object growth
□ Check all caches/connection pools have rate limiting
□ Monitor gc.get_count() to confirm GC working normally
□ Implement periodic restart or rolling update for long-running services
□ Record malloc_stats for fragmentation analysis
□ Establish memory baseline, compare version change impact
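A minimal sketch covering the first few checklist items: sample RSS and GC counters in one place and flag them against a threshold (psutil is assumed available, and the 8GB limit and 70% threshold simply follow the case above):

import gc
import os
import psutil  # assumed available for RSS measurement

MEMORY_LIMIT_BYTES = 8 * 1024**3                 # the Pod limit from this case
ALERT_THRESHOLD = int(0.7 * MEMORY_LIMIT_BYTES)  # alert at 70% of the limit

def memory_health_snapshot() -> dict:
    """Collect the basic metrics; export them to whatever monitoring system you use."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    return {
        "rss_bytes": rss,
        "rss_over_threshold": rss > ALERT_THRESHOLD,
        "gc_counts": gc.get_count(),               # per-generation counters since last collection
        "tracked_objects": len(gc.get_objects()),  # expensive; sample sparingly in production
    }

print(memory_health_snapshot())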
Memory Diagnosis in Practice: A Production-Level LLM Service’s Memory Dilemma
Scenario Replay: Memory Explodes from 4GB to 16GB in 48 Hours
A large model inference service exhibited a strange phenomenon after going live:
- Initial State: Service occupies 4GB memory after startup (model weights)
- After 24 hours: Memory usage grows to 9GB
- After 48 hours: Memory usage reaches 16GB, triggering OOM and being killed by the system
Initial Diagnosis:
# Monitoring shows continuous memory growth, but business request volume is stable
$ ps aux | grep python
PID %MEM RSS
1234 40.2 16GB # After 48 hours
# Check Python object count
gc.get_count() # Output: (452, 12, 5)
# 452 objects tracked in generation 0 since its last collection; the other two values are collection counters for generations 1 and 2
GC appears to be working normally, with few objects. Where is the problem?
Diagnosis Process: Panoramic Analysis from Python Objects to Arena Level
Step 1: tracemalloc Snapshot Comparison
import tracemalloc
# At service startup
snapshot1 = tracemalloc.take_snapshot()
# After 24 hours
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
print(stat)
Output:
<traceback>: line 45
size=5120 KiB (+5120 KiB), count=10 (+10)
File: inference.py:45
Discovered that the cache dictionary at line 45 of inference.py continues to grow. This is the LRU cache for prompt templates.
Step 2: Check Cache Implementation
# Problem code - inference.py:45
@lru_cache(maxsize=None) # Unlimited cache!
def get_prompt_template(template_id: str) -> str:
return load_template_from_db(template_id)
Each new template_id creates a cache entry, and template_id comes from user input (UUID format), so the key space is effectively unbounded. This is the root cause of the memory leak: a cache that grows without limit.
Step 3: Arena-Level Memory Analysis
After fixing the cache issue (limiting maxsize=1000), observe after 24 hours:
# After 24 hours
$ python -c "
import sys
# Need to compile CPython with --with-pydebug
# View Arena statistics via sys._debugmallocstats()
"
Output (example):
# arenas allocated = 256
# arenas reclaimed = 0
# arenas highwater = 256
# arenas allocated current = 256
# Each Arena 256KB
# 256 * 256KB = 64MB resident memory
Key Findings:
- Arenas were never released (reclaimed=0)
- Even though objects were recycled, Arena-level memory is still held by the Python process
- This is caused by small object allocation fragmentation—objects of different size classes mixed in the same Arena
Step 4: Pool-Level Fragmentation Analysis
# The same statistics are produced by _PyObject_DebugMallocStats (Objects/obmalloc.c),
# which is exposed to Python code as sys._debugmallocstats()
# python3 -c "import sys; sys._debugmallocstats()"
# Output excerpt:
Small block requests: 1234567
Total size: 456.7 MB
# Pool fragmentation: 23.4%
Root Cause Summary:
- Application Layer: Unlimited cache causes continuous object growth
- GC Layer: Objects are recycled, but memory is not released
- Arena Layer: Fragmentation prevents memory from being returned to OS
Solution: Three-Layer Collaborative Fix
Layer 1: Application Layer (Fix Cache)
from functools import lru_cache
import weakref
# Fix 1: Limit cache size
@lru_cache(maxsize=1000)
def get_prompt_template(template_id: str) -> str:
return load_template_from_db(template_id)
# Fix 2: Use weak reference cache for non-critical objects
class WeakCache:
def __init__(self):
self._cache = weakref.WeakValueDictionary()
def get(self, key):
return self._cache.get(key)
def set(self, key, value):
self._cache[key] = value
Layer 2: GC Layer (Proactive Trigger)
import gc
# For long-running services, periodically trigger full GC
# Note: This brings STW (Stop-The-World) pauses
def scheduled_gc_cleanup():
"""Execute full GC once per hour"""
freed = gc.collect()  # Force collection of all generations; returns the number of unreachable objects found
print(f"GC completed, unreachable objects collected: {freed}, uncollectable: {len(gc.garbage)}")
# Use APScheduler or similar scheduled task framework
scheduler.add_job(scheduled_gc_cleanup, 'interval', hours=1)
Layer 3: Arena Layer (Process Restart Strategy)
# Ultimate solution: Periodic graceful restart
# This is a common industry practice for handling Arena fragmentation
import sys
import time

class GracefulRestarter:
"""
When memory exceeds threshold, gracefully stop accepting new requests,
wait for existing requests to complete, then restart the process.
"""
def __init__(self, memory_threshold_gb: float = 12.0):
self.threshold = memory_threshold_gb * 1024 * 1024 * 1024
self.shutting_down = False
def check_memory(self):
import psutil
process = psutil.Process()
mem_info = process.memory_info()
if mem_info.rss > self.threshold and not self.shutting_down:
self.initiate_graceful_restart()
def initiate_graceful_restart(self):
    self.shutting_down = True
    # Notify load balancer to stop sending traffic
    health_check.fail()
    # Wait for in-flight requests to drain (max 60 seconds)
    deadline = time.monotonic() + 60
    while not requests_queue.empty() and time.monotonic() < deadline:
        time.sleep(0.5)
    # Exit, automatically restarted by supervisor/systemd
    sys.exit(0)
Fix Results:
- After cache limit: 48-hour memory stable at 6GB
- After adding periodic GC: Memory peak drops to 5.5GB
- After adding graceful restart strategy: Zero OOM events
Production Environment Memory Diagnosis Checklist
- Use tracemalloc to track memory allocation hotspots
- Check if cache/connection pools have unlimited growth
- Analyze gc.get_objects() to find unexpectedly long-lived objects
- Monitor Arena fragmentation level (sys._debugmallocstats)
- Implement memory ceiling monitoring + graceful restart strategy
- Regular stress testing, simulating long-running scenarios
pymalloc vs mimalloc: The PEP 703 Transformation
Python 3.13 introduced the --disable-gil build option (PEP 703), with one major change being replacing pymalloc with mimalloc.
mimalloc is a modern memory allocator developed by Microsoft. It is thread-safe by design, giving each thread its own heap with sharded free lists. For multi-threaded large model workloads, this could substantially change the performance characteristics of memory management.
Performance Comparison (Theoretical):
| Feature | pymalloc (GIL) | mimalloc (nogil) |
|---|---|---|
| Thread Safety | Relies on GIL | Native thread safety |
| Allocation Strategy | size class + pool | size class + segment |
| Small Object Allocation | Fast | Close to pymalloc |
| GC Integration | Maintains object linked list | Traverses mimalloc structures |
| Fragmentation | Higher | Lower |
mimalloc's per-thread heaps and size-class-based free lists let most allocations proceed without any locking, which is key to nogil performance. Compared to pymalloc's shared pool strategy, mimalloc's segment strategy provides better thread isolation and lower fragmentation.
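Whether a given interpreter is actually a free-threaded (nogil) build can be checked at runtime; a small sketch for Python 3.13+ (both probes return None or are absent on older versions):

import sys
import sysconfig

# Build-time flag: 1 when CPython was configured with --disable-gil
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Runtime check on 3.13+: the GIL can be re-enabled even inside a nogil build
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())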
Differences Across Python Implementations
This article describes the CPython implementation. Other Python implementations adopt completely different strategies:
- PyPy: Uses generational garbage collection and object moving
- Jython: Relies on the JVM garbage collector
- IronPython: Relies on the .NET garbage collector
These differences mean: memory management behavior is an implementation detail of CPython, not a specification of the Python language.
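Code that leans on CPython-specific memory behavior should therefore check which implementation it is running on; a minimal sketch:

import platform
import sys

impl = platform.python_implementation()  # 'CPython', 'PyPy', 'Jython', 'IronPython', ...
print(impl, sys.version.split()[0])

if impl == "CPython":
    # Reference counting is a CPython detail; other implementations may not expose it
    print("refcount of a fresh list:", sys.getrefcount([]))
else:
    print("reference-count semantics are not guaranteed here")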
Conclusion: Back to the Opening Question
Why doesn’t memory drop after deleting a large list?
Now you know: it’s not that garbage collection isn’t working, not a memory leak, but an engineering trade-off of the Arena-Pool-Block three-layer architecture. Python retains this memory for subsequent small object allocation, avoiding the overhead of frequent requests/releases to the operating system.
This trade-off is reasonable for most applications—it trades memory for speed. But for large model workloads, understanding this mechanism is crucial, because an 8GB embedding matrix and an 8-byte small integer follow completely different memory paths.
In the next article, we’ll dive deep into garbage collection mechanisms—seeing how reference counting, generational GC, and cycle detection collaborate, and why “recycling” and “releasing” are two different concepts.
References and Acknowledgments
- Original: Memory Management in Python — Alexander VanTol: https://realpython.com/python-memory-management/
- CPython Source: Objects/obmalloc.c
- PyTorch CUDA Memory Management: https://pytorch.org/docs/stable/notes/cuda.html#memory-management