Original Analysis: Context Engineering, the Forgotten Core Battlefield of the AI Era
An in-depth look at the essential challenges of Agent memory systems, and why context management determines whether an AI product succeeds or fails.
📋 Copyright Statement and Disclaimer
This article is an original analysis based on the author's personal hands-on experience, inspired by the Kaggle white paper "Context Engineering: Sessions & Memory".
Attribution of views:
- All specific cases, operational data, and pitfall stories in this article come from the author's own project experience
- The core methodology and framework are the author's own reconstruction
- The white paper is referenced only for the academic phrasing of a few concept definitions
Original reference:
- Title: “Context Engineering: Sessions & Memory”
Nature of this article: an independently written practice summary, not a translation or adaptation. The views expressed here represent only the author's personal understanding and may differ from the original authors' positions.
Introduction: The demo that made the customer walk out
It was the spring of 2024, and the intelligent sales assistant my team had spent three months carefully building was about to face its most important test: a product demonstration for a manufacturing group with annual revenue of more than 5 billion yuan. Winning this customer would mean not just a seven-figure annual contract, but a flagship case for our entry into the enterprise market.
The first half of the demo went smoothly. The Agent answered questions about product features fluently, called the CRM system to display customer profiles accurately, and even proactively recommended several relevant success stories based on the customer's industry. The client's IT director was nodding along, and the head of procurement began asking about implementation timelines and payment terms.
The turning point came at the 47th minute.
The customer's VP of production suddenly asked a key question: "Can your system integrate with our existing MES? The data synchronization requirements we just discussed…"
The Agent paused for a moment, then said something that embarrassed the whole room: "Sorry, I'm not sure what MES system or data synchronization requirements you are referring to. Could you describe your business scenario in more detail?"
The VP of production was stunned: "We've been discussing this for a long time, from the order management system to the MES data interface, and you said the integration plan was feasible…"
I looked at the oblivious Agent on the screen and immediately realized what had happened: the context window had overflowed.
Over those 47 minutes, the conversation had consumed roughly 120,000 tokens of context. When the VP asked his question, the system had silently truncated the earliest records to make room for new input, and the truncated content included the entire half hour of discussion about MES integration.
In the end, the customer broke off negotiations, citing "insufficient technical maturity." Three months of work and 47 minutes of hope evaporated in that moment.
At that moment I understood: **context management is not a technical detail; it is the core capability that decides whether an AI product lives or dies.**
That lesson pushed me to dig into every aspect of context engineering. Over the next two years, my team and I stumbled through countless similar pitfalls, trials, and iterations, and gradually built a fairly complete context engineering methodology. This article is my systematic summary of that process.
Chapter 1: The Neglected Iceberg: Why Context Engineering Is the Agent's Achilles' Heel
1.1 Misconception 1: Treating context as a "technical parameter" rather than a "product capability"
My team and I made this mistake in our early AI product work. We treated the context window size as a purely technical parameter ("GPT-4 supports 128K, that must be enough for us") and memory management as a back-end implementation detail ("just have an engineer stand up a Redis instance").
The cost of that mindset was enormous.
In our first formally delivered project, a legal consultation Agent, we received a flood of complaints in the first week after launch. The user feedback was remarkably consistent: "The chat starts out fine, but partway through the AI loses its memory."
It took us three full days of troubleshooting to find the root cause: once a conversation exceeded 15 turns, the system discarded the earliest turns to preserve response speed. But in legal consultation, users describe the case background, the parties involved, and the key dates in the opening turns, and all subsequent legal analysis builds on that foundation.
Once that information was "optimized" away, the Agent became a lawyer with sudden amnesia, forced to keep asking the client, "Who did you say the defendant was?"
That case taught me: **context engineering is not back-end optimization; it is a decisive factor in the product's core experience.**
When we treat context management as a technical parameter, we are treating the user's core experience as an expendable variable. In legal consultation, users pay on the premise that the Agent understands every detail of the case like a professional lawyer. When the Agent "loses its memory," what users perceive is not a technical limitation but a lack of professional competence.
This shift in perception changed how our team works. We now treat context management as a core consideration in the product design phase, instead of something to think about during performance optimization.
1.2 Misconception 2: Underestimating how common long conversations are
Many people assume the main use case for an AI Agent is short, one-off Q&A. Our actual operating data says otherwise.
In one of our customer service Agent projects, we analyzed 100,000 real conversation logs from the previous three months:
- Only 31% of conversations lasted fewer than 5 turns
- 42% ran between 5 and 15 turns
- 27% lasted more than 15 turns
In other words, nearly 30% of sessions enter "long conversation" territory, exactly the range where context management is most likely to break down.
More importantly, high-value users are the ones having the long conversations. Our data showed:
- Users spending more than 10,000 yuan on average held conversations averaging 18.7 turns
- Users spending less than 1,000 yuan averaged only 6.2 turns
**Context management failures disproportionately hurt high-value users.**
This discovery upended our product priorities. Until then we had focused on the onboarding experience for new users, since that is where churn is worst. But the data told us that context failures in long conversations were driving away our most valuable users.
One concrete example: an enterprise customer paying more than 200,000 yuan a year. Their procurement manager told us, "Twenty minutes into the chat, your AI started asking the same questions again, which made me doubt whether it really understood our needs." From an ordinary user that might be a minor bad review; from a customer of that size, it is a renewal risk.
We also found that users in long conversations hold Agents to higher standards. They have invested more time and effort, so they tolerate fewer flaws. An Agent that "loses its memory" mid-conversation feels like a betrayal: "I spent all this time talking to you, and you can't even remember my basic information."
The emotional damage cuts deeper than the functional one. Users will forgive an Agent for occasional mistakes, but if they feel the Agent "disrespects" their input, they are likely to leave for good.
1.3 Misconception 3: Confusing "remembering" with "remembering well"
Even teams that recognize the importance of context often understand "remembering" only as "not losing information." The real challenge goes further.
In one A/B test we compared two memory strategies:
- Control group: a simple sliding window retaining the last 20 full turns
- Experimental group: an intelligent layering strategy that identified key information and retained summaries
The results surprised us:
- The experimental group's "information retention rate" (the share of cases where the Agent answered correctly when asked about previously discussed content) was 47% higher than the control group's
- Yet the experimental group's user satisfaction was 12% lower
Digging in, we found the reason: the experimental group "remembered" more key facts but lost too much detail and context. When users asked "What happened to the solution we talked about last time?", the experimental Agent could say "it was a database query optimization plan" but had completely forgotten the specific technical details and the reasoning behind the decision.
This case points to a key insight: **the best context management does not maximize the amount of information retained; it maximizes the relevance and usability of that information.**
Users do not care how many tokens the Agent remembers. They care whether the Agent surfaces the right information at the right time. An Agent that remembers a lot but uses it poorly is worse than one that remembers little but uses it well.
1.4 Misconception 4: Ignoring the coupling between context and business logic
This is one of the biggest pitfalls I have hit in practice.
In an early e-commerce shopping-guide Agent project, we abstracted context management into a fully generic module. The module stored and retrieved conversation history; business logic simply called into it. It looked like good layered design, but in practice it caused serious problems.
The problem: different business scenarios define "important information" differently. In a shopping-guide scenario, the user's budget range, brand preferences, and size requirements are the core information; in a medical consultation scenario, it is the symptom description, medication history, and allergies.
Our generic module could not tell these apart and treated all information equally. In some scenarios, critical information drowned in noise; in others, irrelevant information hogged precious context space.
**Key insight: context management must be deeply coupled with business logic; it cannot be abstracted into a one-size-fits-all module.**
This lesson led us to a "domain-driven" context management strategy, in which each business domain defines its own information priorities, summarization strategy, and retention rules. It makes the system more complex, but it markedly improved the experience in every scenario.
Chapter 2: The Three Levels of Agent Memory
After more than two years of hands-on exploration, I arrived at a layered view of Agent memory. This is not a textbook taxonomy; it is an engineering framework built on countless pitfalls.
2.1 Level one: working memory, dancing with the model
Working memory corresponds to the large language model's context window. It is the most intuitive layer and the easiest to understand, but also the easiest to misuse.
Core contradiction: finite capacity vs. unbounded demand
No matter how far model vendors stretch the context window, from 4K to 128K to 200K, they are chasing a moving target, because users' conversational needs are unbounded.
Some numbers from one of our enterprise knowledge management projects:
- A typical corporate policy consultation: roughly 8,000 tokens
- A project planning conversation with detailed technical discussion: roughly 25,000-40,000 tokens
- A complex problem requiring review of multiple historical documents: easily over 50,000 tokens
Even with the largest window available, you cannot escape the "doesn't fit" problem.
Our approach: dynamic priority management
Instead of agonizing over window size, think about how to fit the most valuable information into a limited space.
In practice we built a dynamic priority system:
- P0 (permanent): system instructions, user profile, key constraints
- P1 (recent): the last 5-8 turns verbatim
- P2 (summary): intelligent summaries of earlier conversation
- P3 (on demand): relevant history retrieved dynamically via RAG
This layered management lets our Agent effectively "perceive" the equivalent of 300K+ tokens of raw conversation within a 128K context limit. The cost is that some details are replaced by summaries, but the core information survives.
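To make this concrete, here is a minimal sketch of priority-tiered context assembly. The `ContextItem` structure, the token counts, and the example strings are all illustrative, not our production code:

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    priority: int  # 0 = permanent, 1 = recent, 2 = summary, 3 = on demand
    text: str
    tokens: int    # pre-computed token count

def assemble_context(items, budget):
    """Fill the window tier by tier; lower tiers only get leftover space."""
    selected, used = [], 0
    for tier in (0, 1, 2, 3):
        for item in (i for i in items if i.priority == tier):
            if used + item.tokens <= budget:
                selected.append(item)
                used += item.tokens
    return selected

items = [
    ContextItem(0, "SYSTEM: you are a sales assistant", 12),
    ContextItem(0, "CONSTRAINT: budget cap 500,000 yuan", 10),
    ContextItem(1, "USER (turn 18): can you integrate with our MES?", 14),
    ContextItem(2, "SUMMARY (turns 1-12): discussed order-to-MES data sync", 15),
    ContextItem(3, "RAG: MES integration case study excerpt", 12),
]
# Even with a tight budget, P0 items are guaranteed a seat.
for item in assemble_context(items, budget=40):
    print(item.priority, item.text)
```

With a budget of 40 tokens, the system prompt, the hard constraint, and the latest turn all survive, while the summary and RAG material wait for a larger window.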
Pitfall log: summarization is not a cure-all
When we first implemented smart summarization we were too aggressive, compressing all early conversation into one-sentence summaries. The result was a strange phenomenon: the Agent sounded well-informed but always missed a few key details.
A typical example: in turn 3 a user said, "The budget must stay within 500,000; that is a hard requirement." Our summarizer condensed this to "the user has budget considerations." In turn 20, when the Agent proposed an 800,000-yuan plan, the user was furious: "I told you from the start the ceiling was 500,000!"
**Lesson: summaries must preserve key constraints and values; they cannot be over-compressed.**
This led us to establish a "constraint retention list": during summarization, all numeric constraints, time constraints, boolean constraints, and similar key facts must be identified and preserved verbatim.
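As an illustration, here is a minimal regex-based sketch of constraint extraction. The patterns and units are invented for the example; our production list was built per business domain:

```python
import re

# Illustrative patterns only; build the real list per domain.
CONSTRAINT_PATTERNS = [
    r"(?:within|under|at most|no more than) [\d,.]+\s*(?:yuan|days|%)",
    r"(?:at least|no less than|minimum of) [\d,.]+\s*(?:yuan|days|%)",
    r"(?:by|before|no later than) \w+ \d{1,2}",
    r"hard requirement",
]

def extract_constraints(turn):
    """Pull numeric/time/boolean constraints out of a turn before summarizing it."""
    found = []
    for pattern in CONSTRAINT_PATTERNS:
        found += re.findall(pattern, turn, flags=re.IGNORECASE)
    return found

turn = "The budget must stay within 500,000 yuan, that is a hard requirement."
print(extract_constraints(turn))
# ['within 500,000 yuan', 'hard requirement']; these ride alongside the
# summary verbatim instead of being compressed into it.
```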
2.2 Level two: long-term memory, cognitive continuity across time
If working memory solves "don't forget the current conversation," long-term memory solves "stay coherent across sessions."
Real case: from "stranger" to "old acquaintance"
Before our third major iteration, the most common user feedback was "I have to reintroduce myself every time we chat." A typical scenario:
Monday's conversation:
- User: "I run an e-commerce business, mainly selling outdoor products."
- Agent: "Understood, you're an e-commerce seller of outdoor products. How can I help?"
- User: "I want to optimize inventory management."
- The Agent gives targeted suggestions
Wednesday's conversation (same user):
- User: "I tried the inventory plan we discussed last time."
- Agent: "Sorry, I'm not sure which plan you're referring to. Could you describe your business scenario in detail?"
- User: "…We just talked on Monday. I'm the outdoor products e-commerce seller."
This experience makes users feel they are talking to a "different AI" each time, with no continuity at all.
Solution: user profile + memory graph
We later built a fairly complete long-term memory system (a data-structure sketch follows the list):
Fact layer: the user's basic attributes
- Industry, company size, role
- Technology stack, tool preferences
- Communication style (prefers detailed explanations vs. direct answers)
Experience layer: the history of interactions with the user
- Topics of previous conversations
- Suggestions given and the user's feedback
- Successes and failed attempts
Relationship layer: associations between the user and information
- Knowledge areas the user follows
- Questions the user asks frequently
- The user's knowledge blind spots
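A minimal sketch of what such a three-layer record can look like. The field names (`facts`, `episodes`, `interests`) are illustrative, not our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class UserMemory:
    # Fact layer: stable attributes
    facts: dict = field(default_factory=dict)      # industry, role, stack...
    # Experience layer: interaction history
    episodes: list = field(default_factory=list)   # topic, advice, feedback
    # Relationship layer: what the user attends to
    interests: dict = field(default_factory=dict)  # topic -> attention weight

    def record_episode(self, topic, advice, feedback):
        self.episodes.append({"topic": topic, "advice": advice,
                              "feedback": feedback})
        self.interests[topic] = self.interests.get(topic, 0) + 1

mem = UserMemory(facts={"industry": "outdoor e-commerce", "goal": "inventory"})
mem.record_episode("inventory", "use ABC classification", "trying it")
# On Wednesday the Agent can open with Monday's context instead of a blank slate.
print(mem.facts["industry"], "->", mem.episodes[-1]["advice"])
```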
After rollout, user satisfaction rose 34% and the repeat inquiry rate fell 67%. More tellingly, users started calling the Agent "you" instead of "it"; that subtle shift in language means they had begun treating the Agent as a continuous conversation partner rather than a disposable tool.
Deeper question: where are the boundaries of long-term memory?
Working on long-term memory, we faced a fundamental question: how long should an Agent remember a user, and how much?
In one project we tried "permanent memory," keeping every interaction from the user's first session onward. It not only drove storage costs through the roof, it had unexpected negative effects.
Six months in, one user told us: "Your AI keeps bringing up questions I asked half a year ago that I solved long ago. It makes me feel it can't keep up with my growth."
That feedback made us realize: **long-term memory needs a forgetting mechanism too.** More is not better; what matters is remembering the information most valuable to the user's current state.
We later introduced "memory decay," where the influence of old memories fades over time, and "memory conflict detection," where new information takes precedence when it contradicts old memories.
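A minimal sketch of both mechanisms. The half-life value and record format are illustrative assumptions:

```python
import math

HALF_LIFE_DAYS = 30  # illustrative; tuned per business domain in practice

def decayed_weight(base_weight, age_days):
    """Exponential decay: a memory's influence halves every HALF_LIFE_DAYS."""
    return base_weight * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

def merge_memories(old, new):
    """Conflict detection: on the same key, newer info wins; old value kept for audit."""
    merged = dict(old)
    for key, record in new.items():
        if key in merged and merged[key]["value"] != record["value"]:
            record = {**record, "superseded": merged[key]["value"]}
        merged[key] = record
    return merged

print(round(decayed_weight(1.0, 7), 2), round(decayed_weight(1.0, 180), 2))  # 0.85 0.02
old = {"budget_cap": {"value": "500k"}}
new = {"budget_cap": {"value": "800k"}}
print(merge_memories(old, new)["budget_cap"])  # new value wins, old kept for audit
```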
2.3 Level three: external memory, knowledge access beyond the model's boundaries
RAG (Retrieval-Augmented Generation) moved Agent context engineering into a new stage. But RAG is no silver bullet; it brings complexities of its own.
Our RAG evolution path
V1: plain vector retrieval
The first implementation was straightforward: chunk all documents, embed them, store them in Pinecone, and retrieve the most relevant chunks when users ask questions.
The problems surfaced quickly:
- Poor retrieval precision: a user asked about the "return policy" and got the "recruitment policy" and "privacy policy" too, because they all contain the word "policy"
- Context fragmentation: information scattered across multiple chunks, so the Agent saw fragments but never the whole picture
- Staleness: the latest document updates were not retrievable
V2: hybrid retrieval + reranking
We added keyword matching as a complement and introduced a cross-encoder for reranking, which significantly improved retrieval precision.
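A minimal self-contained sketch of the two-stage retrieve-then-rerank flow. The scoring functions here are toy stand-ins; in production the first stage fused keyword and embedding scores, and the second used a cross-encoder:

```python
def keyword_score(query, doc):
    """Stage-1a: lexical overlap (stand-in for BM25)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def vector_score(query, doc):
    """Stage-1b: character-trigram Jaccard (stand-in for embedding cosine)."""
    grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 0))}
    q, d = grams(query.lower()), grams(doc.lower())
    return len(q & d) / max(len(q | d), 1)

def hybrid_search(query, docs, k=1):
    # Stage 1: cheap fused recall over the whole corpus.
    fused = sorted(docs, key=lambda d: 0.5 * keyword_score(query, d)
                                     + 0.5 * vector_score(query, d), reverse=True)
    candidates = fused[:10]
    # Stage 2: rerank the short list. In production a cross-encoder scores
    # each (query, doc) pair jointly; the toy score stands in here.
    return sorted(candidates,
                  key=lambda d: keyword_score(query, d), reverse=True)[:k]

docs = ["return policy for online orders", "recruitment policy", "privacy policy"]
print(hybrid_search("how do I return an order", docs))
```

The design point is the two stages: a cheap recall pass over everything, then an expensive, precise pass over a short list.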
But a new problem appeared: retrieving well does not mean using well.
A typical case: a user asked "How do I apply for the R&D expense super deduction?" Our RAG system retrieved the right policy document, but the Agent pasted the provisions back verbatim, with no interpretation and no analysis of applicability.
The user's feedback: "I can look these up myself. I want you to tell me what to do."
This case exposes a key issue: RAG solves "information acquisition" but not "information processing." The Agent needs not raw document fragments but knowledge that has been understood, integrated, and contextualized.
V3: post-retrieval processing + knowledge graph
The current version adds:
- Post-processing of retrieval results (summarization, association, conflict detection)
- Knowledge graph support (understanding relationships between concepts)
- Query-intent prediction (preloading knowledge likely to be relevant)
This version performs far better, but maintenance costs rose sharply: it takes a dedicated knowledge management team to keep document quality and graph accuracy up.
**Key insight: RAG's effectiveness depends on knowledge quality. Without good knowledge management, no retrieval technology, however advanced, will save you.**
That realization pushed us to build a dedicated knowledge operations team responsible for document quality control, updates, and structural optimization. This would be unthinkable in traditional software development, where we rarely assign dedicated operational resources to knowledge content. In the Agent era, knowledge itself is a core component of product competitiveness.
Chapter 3: Five Practical Problems in Context Engineering
3.1 Problem 1: A tiering strategy for hot and cold data
We took a long detour on long-term memory storage.
The first approach was simple: store every memory in PostgreSQL and scan the table on every query. Once we passed 10,000 users, query latency ballooned from 50ms to over 800ms.
We tried a pure Redis solution; latency dropped to 5ms, but memory costs exploded. Each user's memories averaged 2MB, so 100,000 users meant 200GB of RAM.
Final design: a hot-warm-cold three-tier architecture
| Tier | Storage | Data range | Query latency | Cost |
|---|---|---|---|---|
| Hot | Redis | Last 7 days | <5ms | High |
| Warm | Pinecone | 7 days to 1 year | 20-50ms | Medium |
| Cold | PostgreSQL | Over 1 year | 100-500ms | Low |
In production:
- 85% of queries hit the hot tier
- 12% hit the warm tier
- Only 3% reach cold data
Average query latency stays under 15ms, and storage costs are about 70% lower than the all-in-memory solution.
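A minimal sketch of the read path, with plain dicts standing in for Redis, Pinecone, and PostgreSQL:

```python
hot_tier  = {"user:42:recent": "discussed MES integration yesterday"}
warm_tier = {"user:42:2024-03": "evaluated order-management rollout"}
cold_tier = {"user:42:2023": "initial onboarding notes"}

def fetch_memory(key):
    """Try tiers in latency order; promote warm/cold hits into the hot tier."""
    for name, tier in (("hot", hot_tier), ("warm", warm_tier), ("cold", cold_tier)):
        if key in tier:
            if name != "hot":
                hot_tier[key] = tier[key]  # promote on access
            return tier[key], name
    return None, "miss"

print(fetch_memory("user:42:recent"))   # served from the hot tier
print(fetch_memory("user:42:2024-03"))  # warm hit, promoted to hot for next time
```

Promotion on access is what keeps the 85% hot-tier hit rate honest: whatever the user is actually asking about migrates to the fast path.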
Trade-offs in practice
The three-tier architecture sounds clean, but implementing it meant a series of hard trade-offs.
First, the definition of "hot data." We initially treated "the last 7 days" as hot, but in some businesses a user may visit only once a week yet decide whether to buy within a single conversation. There, a seven-day window can be both too long and too short.
We later introduced "dynamic hot data": instead of a fixed window, tiering adjusts to user activity and business value. High-value users' data stays in the hot tier longer, while low-frequency users' data sinks to the warm tier faster.
Second, data consistency. When a memory migrates from the hot tier to the warm tier, how do you guarantee the Agent sees a consistent view across queries? If a user has just asked about a memory that happens to be mid-migration, the Agent's answers can contradict each other.
Our answer was copy-on-write migration: the hot-tier copy is marked "migrating" rather than deleted, and is removed only after the warm-tier copy is confirmed available. It costs extra storage but guarantees a consistent experience.
3.2 Problem 2: Cross-device state synchronization
A user may run the Agent on a phone, a computer, and a tablet at the same time. How do you keep the experience consistent?
A weird bug we hit
A user discussed a project plan with the Agent on their phone for half an hour, then switched to their computer to continue. On the computer the Agent acted "strangely": it knew nothing about the phone discussion.
Investigation showed that the phone session and the computer session were two independent sessions, each maintaining its own working memory, with no connection between them.
Solution: a user-level state hub
We built a unified user-level state hub:
- Working memory is still maintained per device (to preserve response speed)
- Key memories sync to user-level storage in real time
- Switching devices automatically pulls the user-level memory down
After rollout, our cross-device experience score rose from 3.2 to 4.6 (out of 5).
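A minimal sketch of the hub, with illustrative names; the real store was a durable service keyed by user ID:

```python
user_state = {}  # stand-in for a shared, durable store keyed by user id

def end_turn(user_id, key_facts):
    """After each turn, push durable facts up from the device session."""
    user_state.setdefault(user_id, {}).update(key_facts)

def start_session(user_id, device):
    """On a new device, seed working memory from the shared hub."""
    seed = dict(user_state.get(user_id, {}))
    seed["_device"] = device  # lets the Agent adapt verbosity per device
    return seed

end_turn("u42", {"project": "warehouse plan", "status": "option B agreed"})
print(start_session("u42", "laptop"))
# The laptop session opens already knowing what was agreed on the phone.
```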
The deeper challenge: device differences
Cross-device sync is not just a technical problem; it touches product design.
Different devices come with different usage patterns and constraints. Phone conversations are usually short and fragmented; desktop conversations are usually long and deep. Users may accept terse answers on a phone but expect fuller explanations on a computer.
Naively syncing every memory to every device can create a mismatched experience. In one project, a user quickly confirmed a plan on their phone, and on the computer the Agent then expanded that same memory into a lengthy explanation, which the user found long-winded.
We later introduced a "device-aware" memory system that stores not just the memory content but the context in which it was created (device type, conversation length, user behavior pattern). When recalling on a different device, the Agent factors these in and adjusts the style and level of detail of its answer.
3.3 Problem 3: Balancing privacy and personalization
This is one of the hardest problems, because it sits at the intersection of ethics, law, and product strategy.
A real dilemma: should a financial advisory Agent remember a user's portfolio details?
Benefits of remembering:
- More precise recommendations
- A more continuous user experience
- More efficient conversations
Risks:
- Leakage of sensitive financial data
- Compliance exposure (GDPR, financial data protection regulations)
- Users may simply not want to be "remembered"
What we did
We created a graded memory strategy (a classification sketch appears after the lists below):
- Explicitly sensitive information (account balances, position details): never memorized; asked afresh each conversation
- Implicit preferences (risk tolerance, investment style): memorized after desensitization
- General knowledge (investment philosophy, market views): memorized normally
We also shipped a user control panel:
- See what the system remembers
- Delete specific memories
- Set a "forget" horizon (e.g., "automatically forget this conversation after 1 month")
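A minimal sketch of the classification step. The keyword rules are illustrative; production combined rules like these with an NER model:

```python
import re

# Illustrative category rules, checked in order of severity.
RULES = [
    ("never_store",  re.compile(r"account|balance|position|holdings", re.I)),
    ("store_masked", re.compile(r"risk (?:tolerance|preference)|investment style", re.I)),
]

def memory_policy(fact):
    for action, pattern in RULES:
        if pattern.search(fact):
            if action == "store_masked":
                return action, re.sub(r"\d[\d,.]*", "<masked>", fact)
            return action, ""          # explicit sensitive data is never persisted
    return "store_plain", fact         # general knowledge is stored as-is

for fact in ["My account balance is 1,200,000",
             "My risk tolerance is moderate",
             "I believe in long-term index investing"]:
    print(memory_policy(fact))
```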
Ethical reflection: memory is power
While working through the privacy issues, I gradually saw a deeper truth: **in the AI era, memory is power.**
Whoever controls the Agent's memory controls the rules of engagement between user and Agent. When we, the product builders, unilaterally decide what gets remembered and what gets forgotten, we are exercising a hidden form of power.
This understanding drove us to push "memory transparency" throughout the product: users should know not only what we remember, but why it is remembered and how those memories shape the Agent's behavior. We believe only transparent power is legitimate.
We also set up a memory ethics committee that regularly reviews our memory strategies against ethical standards. That may sound excessive, but for products handling user data this caution is necessary. The committee reviews the privacy implications of new features, handles users' data deletion requests, sets memory retention policies, and more.
A cross-cultural note: attitudes toward memory differ
Deploying globally, we found that cultures differ in their attitude toward "being remembered." European and American users are generally more privacy-conscious and tend to choose "no memory" or "short-term memory"; Asian users more often expect personalized service and are willing to let the Agent remember more in exchange for a better experience.
This difference demands flexible configuration rather than a single blanket memory strategy. We ship different default configurations per region while letting individual users adjust to taste.
3.4 Problem 4: The difficulty of evaluation
A persistent dilemma: how do you know whether your context engineering is good or bad?
Traditional software has clear test criteria: input A, expected output B, actual output C, compare.
Context engineering evaluation is inherently fuzzy:
- How much "remembering" is enough?
- How much "forgetting" is too much?
- What should be remembered, and what should not?
The evaluation framework we built
Quantitative indicators (a computation sketch appears after these lists):
- Memory hit rate: how often the Agent recalls correctly when the user asks about history
- Repeat inquiry rate: how often the Agent asks for information it already has
- Contextual relevance: how relevant the context the Agent uses is to the user's current question
Qualitative assessment:
- User satisfaction surveys
- Expert manual review (spot checks on conversation quality)
- Comparative testing (A/B tests of different memory strategies)
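A minimal sketch of how the first two indicators can be computed from labeled conversation logs. The event fields are illustrative:

```python
def memory_hit_rate(events):
    """Share of history-referencing questions the Agent answered correctly."""
    probes = [e for e in events if e["type"] == "history_probe"]
    hits = sum(1 for e in probes if e["recalled_correctly"])
    return hits / len(probes) if probes else 0.0

def repeat_inquiry_rate(events):
    """Share of Agent questions that asked for already-known information."""
    questions = [e for e in events if e["type"] == "agent_question"]
    repeats = sum(1 for e in questions if e["asked_before"])
    return repeats / len(questions) if questions else 0.0

log = [
    {"type": "history_probe", "recalled_correctly": True},
    {"type": "history_probe", "recalled_correctly": False},
    {"type": "agent_question", "asked_before": False},
    {"type": "agent_question", "asked_before": True},
]
print(memory_hit_rate(log), repeat_inquiry_rate(log))  # 0.5 0.5
```

The hard part is not the arithmetic but the labeling: deciding which turns count as history probes and which Agent questions count as repeats is itself a human judgment.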
Actual results
On our customer service Agent:
- Memory hit rate: up from 61% to 89%
- Repeat inquiry rate: down from 23% to 4%
- User satisfaction: up from 3.4 to 4.5
But behind these numbers there remain plenty of cases that just "don't feel right" and still need continuous manual tuning.
The essential dilemma of evaluation
After two-plus years of evaluation practice, I believe context engineering evaluation faces an essential dilemma: **we cannot measure an inherently subjective experience with simple metrics.**
Whether a user feels the Agent "remembers well" depends not only on objective recall accuracy but on the timing, manner, and context of recall. A technically perfect memory system will still annoy users if it keeps surfacing old information at the wrong moments.
Our eventual answer was human-in-the-loop evaluation: automated metrics screen quickly for obvious problems, but final judgment rests with human assessors. It is expensive, but at the current state of the art it is the only reliable approach.
Combining quantitative and qualitative
In practice, quantitative metrics and qualitative assessment each have strengths and weaknesses. Metrics scale and surface problem trends fast, but miss the nuances of subjective experience. Qualitative assessment yields deep insight, but is costly and hard to scale.
Our best practice is tiered assessment:
- Tier 1 (automated): monitor core metrics and flag anomalies
- Tier 2 (manual sampling): human analysis of anomalous samples
- Tier 3 (in-depth interviews): sit down with key users for qualitative insight
This layering lets us keep overall quality under control while still understanding how users really feel.
3.5 Problem 5: Context coordination in multi-Agent systems
As systems grow more complex, we increasingly face multi-Agent collaboration scenarios, which bring new context management challenges.
Scenario: a single user inquiry involves three specialist Agents working together: order lookup, logistics tracking, and after-sales handling.
Problem: each Agent has its own memory, but the user expects one unified experience; nobody wants to repeat the same problem to three Agents.
Our solution
We built a "shared context layer": a unified context coordinator sitting above the individual Agents:
- The user's basic information (identity, preferences) is visible to all Agents
- The core intent of the current conversation is visible to all Agents
- Each Agent's professional judgments stay private to its own domain
This architecture is still evolving, but it has already shown that context management in multi-Agent systems is far more complex than the single-Agent case.
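A minimal sketch of the layering, with illustrative class and field names:

```python
class SharedContext:
    """What every Agent may read: identity, preferences, current intent."""
    def __init__(self, user_profile, intent):
        self.user_profile = user_profile
        self.intent = intent

class SpecialistAgent:
    """Each specialist keeps private notes on top of the shared layer."""
    def __init__(self, domain):
        self.domain = domain
        self.private_notes = []  # professional judgments stay in-domain

    def handle(self, shared, question):
        self.private_notes.append(question)
        return (f"[{self.domain}] For {shared.user_profile['name']} "
                f"(intent: {shared.intent}): handling '{question}'")

shared = SharedContext({"name": "Ms. Li", "tier": "VIP"}, intent="delayed order")
for agent in (SpecialistAgent("orders"), SpecialistAgent("logistics")):
    print(agent.handle(shared, "Where is order #1024?"))
# Neither agent re-asks who the user is or what she wants.
```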
Chapter 4: Deep Integration of Context Engineering and Product Design
4.1 Context is the product interface
In AI product design we usually think of the interface as visual UI elements: buttons, input boxes, cards. In an Agent product, the context itself is part of the interface.
When the Agent brings up something the user asked last week, it is telling the user, "I remember you." When it proactively confirms a previously discussed detail, it is saying, "I am listening carefully." These are not traditional UI elements, but they are core components of the user experience.
Design principle: control how context is exposed
We must consciously design how context is exposed to users. Not all memories should be made explicit, and not all forgetting should be hidden.
The strategies we use:
- Active exposure: the Agent volunteers relevant history, signaling "I am connecting the threads"
- Passive recall: the Agent recalls precisely when asked, but does not volunteer
- Implicit fusion: historical information shapes the current answer without being explicitly attributed
Different scenarios call for different strategies. In customer service we usually use active exposure so users feel valued; in advisory scenarios we use passive recall to avoid a feeling of being "surveilled."
A delicate balance: when to bring up the past
In practice, "when to bring up the past" is a delicate balance. Mention it too often and users feel the Agent is dredging up old business; too rarely and users feel the Agent ignores history.
A/B testing showed the best strategy is relevance triggering: the Agent volunteers history only when it is highly relevant to the current topic. If the user asks about "the last question," the Agent should naturally bring it up; if the user is just chatting, it should not drag in the past unprompted.
Phrasing matters too. "You mentioned before…" lands better than "I remember you said…". The former reads as objective; the latter can make users feel watched.
4.2 Forgetting as a product design tool
Traditional software design rarely considers a "forgetting" feature. In Agent products, forgetting is an important design tool.
Scenario 1: forgetting embarrassing memories
Users sometimes share overly personal information in a conversation and regret it afterward. If the Agent remembers it forever, users feel uneasy. We need "selective forgetting": let users mark certain content as not to be remembered.
Scenario 2: forgetting outdated information
User preferences change; yesterday's truth can be today's mistake. The Agent must recognize the shelf life of information and forget or update stale facts in time.
Scenario 3: cognitive load management
Sometimes remembering too much is itself a burden. An Agent can "overfit" to historical information and lose flexibility. Moderate forgetting keeps it open and adaptable.
4.3 The organizational capabilities context engineering demands
Context engineering is not only a technical problem; it requires building team capability.
Cross-functional collaboration
Context engineering needs product, engineering, and operations working closely together:
- Product defines the business rules for "what information matters"
- Engineering implements efficient memory storage and retrieval
- Operations maintains the quality and timeliness of the knowledge base
The traditional division of labor tends to toss context engineering to engineers as an "engineering problem," disconnecting it from business needs. We eventually formed a dedicated context engineering group, staffed from all three functions, with end-to-end responsibility for the context experience.
Continuous operations
Context engineering is not a one-off development task but an ongoing operation. Knowledge goes stale, user preferences shift, business rules change; all of it demands continuous maintenance of the memory system.
We established a "memory audit" mechanism that regularly reviews the Agent's memory contents for stale information, incorrect associations, and privacy risks. This is rare in traditional software development, but in the Agent era it is a necessary operational task.
Chapter 5: Advice for Practitioners
5.1 A checklist for getting started
If you are building (or about to build) an Agent memory system, here is the checklist I distilled from our pitfalls:
Must-have basics
- Working memory management (even a simple sliding window)
- User identification and persistence
- Basic long-term memory storage (at minimum, who the user is and what they prefer)
- Explicit pinning of key information (let users say "please remember…")
Strongly recommended
- Intelligent summarization (extract key information instead of truncating outright)
- Tiered memory (hot, warm, and cold data)
- Cross-session memory (so the user feels they are talking to the same Agent)
- Privacy controls (let users see and delete their own data)
Advanced (depending on the scenario)
- RAG integration
- Multi-device synchronization
- Self-learning and optimization of memory
- A full user profile system
5.2 Common over-engineering traps
Trap 1: Optimizing the memory system prematurely
Symptom: designing a complex memory architecture for 1 million users when you have 100.
Consequence: long development cycles, complex and unmaintainable code, features nobody uses.
Advice: start simple and evolve from real data and user feedback. Our first version was nothing but a simple Redis cache; we introduced a tiered architecture only after passing 10,000 users.
Trap 2: Chasing "remember everything"
Symptom: persisting every word the user says forever.
Consequence: exploding storage costs, noisy retrieval, and elevated privacy risk.
Advice: establish a memory-value assessment and regularly purge low-value memories. Our rule of "archive anything not accessed for 90 days" cut storage costs dramatically.
Trap 3: Ignoring the cold-start problem
Symptom: for a first-time user, the Agent behaves like a stranger who knows nothing about them.
Consequence: a poor first experience and high churn.
Advice: design a new-user onboarding flow and sensible default profiles. We also prepared "industry templates" that preload common knowledge based on the industry the user selects.
Trap 4: Technology-driven rather than need-driven
Symptom: vector databases are hot, so we must use a vector database; knowledge graphs are cutting-edge, so we must build one.
Consequence: a complicated stack with unclear business value.
Advice: always start from the user's problem and pick the simplest workable solution. In many of our scenarios plain keyword matching was enough; vector retrieval was unnecessary.
5.3 A thinking framework for technology selection
When choosing a stack for context engineering, we used the evaluation framework below; we hope it helps other teams decide more deliberately.
Dimension 1: Latency requirements
Different scenarios tolerate different query latencies:
- Real-time conversation: <100ms required; usually needs in-memory storage (Redis)
- Asynchronous tasks: 1-5 seconds acceptable; disk storage or remote retrieval works
- Offline analysis: minutes acceptable; batch processing and a data warehouse suffice
Dimension 2: Data scale
- Small (<10,000 users): single-node storage is fine; PostgreSQL + Redis is enough
- Medium (10,000 to 1 million users): needs a distributed design; consider sharding, replication, and tiering
- Large (>1 million users): needs a dedicated vector database, distributed caching, and a data lake
Dimension 3: Query patterns
- Mostly point lookups: key-value storage (Redis) is most efficient
- Mostly similarity search: vector databases (Pinecone, Milvus) fit better
- Complex conditional queries: a relational database (PostgreSQL) or search engine (Elasticsearch)
Dimension 4: Team capability
Selection must account for the team's operational capacity. A solution that needs a dedicated SRE team to run is a poor choice for a small team. We initially chose Pinecone over self-hosted Milvus precisely because the team had no vector-database operations experience.
5.4 Practical cost control
The cost of context engineering is routinely underestimated. In our practice, storage, compute, and operations costs all needed active management.
Storage cost optimization (a small sketch follows this list):
- Data compression: compress archived conversations; text usually compresses to 30-50% of its original size
- Tiered storage: expensive memory for hot data, cheap object storage for cold data
- Automatic cleanup: set TTLs (time to live) and delete expired data automatically
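A minimal sketch of two of the storage levers, compression and TTL, using plain dicts; with Redis, the TTL would simply be set on the key itself:

```python
import time
import zlib

def store(archive, key, text, ttl_days):
    archive[key] = {
        "blob": zlib.compress(text.encode("utf-8")),
        "expires_at": time.time() + ttl_days * 86400,
    }

def sweep(archive):
    """Drop expired entries; in production this ran as a periodic cleanup job."""
    now = time.time()
    for key in [k for k, v in archive.items() if v["expires_at"] < now]:
        del archive[key]

archive = {}
transcript = "user: the budget must stay within 500k, hard requirement. " * 50
store(archive, "conv:20240301", transcript, ttl_days=90)
ratio = len(archive["conv:20240301"]["blob"]) / len(transcript.encode("utf-8"))
# Repetitive text compresses far below the 30-50% typical of real transcripts.
print(f"compressed to {ratio:.0%} of original size")
```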
Compute cost optimization
- Lazy loading: load memories on demand, not all at once
- Caching: cache frequently accessed memories
- Batching: run vector embedding and similar operations in batches wherever possible
Operations cost optimization
- Prefer managed services: unless you have a strong reason to self-host, use managed offerings
- Automate operations: build automated monitoring, alerting, and scaling
- Monitor costs: keep a cost dashboard so anomalous spend is caught early
5.5 An implementation path from 0 to 1
Based on our experience, a typical context engineering rollout looks like this:
Phase 1: Foundations (1-2 weeks)
- Implement basic working memory management
- Build a user identity system
- Stand up simple long-term memory storage
Phase 2: Experience optimization (2-4 weeks)
- Implement smart summarization
- Optimize query performance
- Add cross-session memory
Phase 3: Scaling (4-8 weeks)
- Introduce a tiered storage architecture
- Implement data synchronization and backup
- Set up monitoring and alerting
Phase 4: Advanced features (8+ weeks)
- Integrate RAG
- Implement multi-device synchronization
- Build an automated evaluation system
This timeline assumes a team of five; actual timing will vary with team size and technical foundation.
5.6 Pitfall guide: mistakes we made
Mistake 1: Ignoring edge cases
We once hit a weird bug: when a conversation landed exactly on the context window boundary, the Agent generated meaningless repetitive output. The cause was a flaw in our truncation logic's handling of the boundary case.
Lesson: test edge cases thoroughly, including empty input, over-long input, and boundary values.
Mistake 2: Over-relying on automation
We tried fully automated memory summarization to reduce manual intervention. In some scenarios the summaries dropped key information, and the Agent made bad calls as a result.
Lesson: in critical business scenarios, human review is still necessary. Automation raises efficiency but cannot fully replace human judgment.
Mistake 3: Underestimating data migration
When upgrading from the V1 to the V2 architecture, we underestimated the migration. The old data format was incompatible with the new system and needed extensive cleaning and conversion; the process caused several days of service disruption.
Lesson: design for backward-compatible data formats early, and keep a standard operating procedure for migrations.
Mistake 4: No rollback mechanism
In one update, a bug in the new memory algorithm corrupted many users' memory data. With no effective rollback in place, it took us a full week to restore service.
Lesson: any change touching the memory system must ship with a rollback plan, rehearsed regularly.
5.7 Building team capability
Context engineering makes specific demands on team capability and needs investment in:
Technical skills
- Distributed storage and caching
- Information retrieval and NLP fundamentals
- Data modeling and architecture design
- Vector databases and similarity computation
Business skills
- Domain knowledge (the specific business the Agent serves)
- User experience design
- Data analysis and experiment design
Operational skills
- Knowledge base maintenance
- Memory quality monitoring
- User feedback handling
These capabilities are usually scattered across teams, so cross-functional collaboration mechanisms are essential.
5.8 Aligning context engineering with product evolution
Context engineering does not live in isolation; it must move in step with the product. Some lessons we distilled:
Aligning release rhythm with context engineering
- MVP stage: focus on working memory; get the basic conversation experience right
- Growth stage: introduce long-term memory; improve retention
- Maturity stage: refine external memory and RAG; deliver depth
- Expansion stage: build context sharing for multi-Agent collaboration
Data-driven iteration
We established a "memory-behavior-value" analysis chain:
- Memory hit rate drives conversation fluency
- Conversation fluency drives user satisfaction
- User satisfaction drives paid conversion
Through this chain we can quantify the business value of context engineering and justify the resource investment with data.
Chapter 6: Future Outlook: Where Context Engineering Is Heading
6.1 From explicit memory to implicit understanding
Today's context engineering relies mainly on explicit memory storage and retrieval. The future likely lies in more implicit "understanding": the Agent does not memorize information mechanically, but genuinely grasps the user's intent and situation.
Several technical directions feed into this:
- Continuous learning: the Agent keeps learning user preferences and patterns from conversations
- Conceptual abstraction: the Agent understands abstract concepts instead of merely recalling literal facts
- Situation awareness: the Agent perceives the current situation and adapts its behavior dynamically
6.2 Fusing multimodal context
As multimodal models mature, context is no longer just text. Images, audio, and video all become part of it.
That brings new challenges:
- Multimodal storage: how to store and retrieve multimodal information efficiently
- Cross-modal association: how to link information across modalities
- Unified representation: how to represent and process multimodal information uniformly
6.3 Combining privacy computing with context engineering
Privacy protection will become a first-class concern in context engineering. Techniques such as federated learning, differential privacy, and homomorphic encryption will be brought in:
- Federated memory: a user's sensitive memories stay on-device; only desensitized patterns are shared
- Differential privacy: noise added during memory retrieval protects individual privacy
- Secure computation: memory retrieval and processing over encrypted data
6.4 Standardization and interoperability of context
As the multi-Agent ecosystem develops, context standardization will matter:
- Standard formats: a common context exchange format
- Interoperability protocols: different Agents can share and understand each other's context
- A context marketplace: specialized providers may emerge, supplying Agents with prebuilt contextual knowledge
6.5 My take: the essence of context engineering
After more than two years of practice, I believe the essence of context engineering is not a technical question but this one: how do you maximize the value of information under limited resources?
That involves several core trade-offs:
- Storing vs. forgetting: remembering more costs more, though it may buy a better experience
- Precision vs. fuzziness: precise memory is expensive; fuzzy memory may drop key information
- Personalization vs. privacy: more personalization needs more data, and more data means more privacy risk
- Real-time vs. offline: real-time memory responds fast but costs more; offline processing is cheap but slow
There are no standard answers to these trade-offs; the choices depend on the specific scenario and business goals.
Core insight: excellent context engineering is not about chasing advanced technology, but about making the best trade-off decisions under constraints.
Appendix: Three Real Pitfall Cases, Analyzed in Depth
To make the complexity of context engineering more concrete, I picked three real cases for deeper analysis. They cover the main classes of problems we encountered; I hope they serve as reference for other teams.
Case 1: The medical consultation Agent that "suddenly lost its memory"
Background: we built a health consultation Agent for an online medical platform. Users consult it about symptoms, medication, and medical advice.
Symptom: in the first week after launch we received a flood of complaints, all amounting to "the AI doctor suddenly lost its memory." Concretely: the user describes symptoms and medical history at the start, the Agent gives preliminary advice, and then, when the user asks about a specific medication dosage, the Agent asks, "What symptoms did you say you had?"
Root cause:
Three days of troubleshooting traced it to an overly aggressive context truncation strategy.
In medical consultations, users usually open with a long description that can run to hundreds of words of medical history, symptom detail, and allergy information. To control token consumption, we had set a truncation threshold of 15 turns; beyond that, the system discarded the oldest records.
The trouble is that medical consultations rarely finish within 15 turns. The average was 23 turns, meaning nearly every conversation triggered truncation, and the earliest turns were precisely the ones holding the user's key medical information.
Fixes:
We implemented several improvements (a sketch of the first follows the list):
1. Special protection for medical information: mark symptom descriptions, medical history, and allergy history as high priority and exempt them from ordinary truncation
2. Smart summarization instead of truncation: rather than discarding early turns, extract the key medical information into a structured medical summary
3. Active confirmation: before giving critical answers such as medication advice, the Agent restates the user's symptoms to confirm its understanding
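A minimal sketch of priority-protected truncation, with illustrative tags:

```python
# Turns tagged as medically critical are exempt from the sliding window.
PROTECTED_TAGS = {"symptoms", "history", "allergies"}

def truncate(turns, max_turns):
    protected = [t for t in turns if t.get("tag") in PROTECTED_TAGS]
    ordinary = [t for t in turns if t.get("tag") not in PROTECTED_TAGS]
    budget = max(max_turns - len(protected), 0)
    kept = ordinary[-budget:] if budget else []  # keep only the newest ordinary turns
    return sorted(protected + kept, key=lambda t: t["n"])

turns = [{"n": 1, "tag": "symptoms", "text": "persistent cough, 3 weeks"},
         {"n": 2, "tag": "allergies", "text": "allergic to penicillin"}] + \
        [{"n": i, "text": f"follow-up {i}"} for i in range(3, 25)]
kept = truncate(turns, max_turns=15)
print([t["n"] for t in kept])  # turns 1 and 2 survive; the oldest follow-ups drop
```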
Outcome: complaints about "sudden amnesia" fell 92%, and user satisfaction rose from 3.2 to 4.4.
Lessons:
- Context requirements differ enormously across scenarios; one uniform strategy will not do
- Critical information needs a dedicated protection mechanism
- Proactive confirmation at key junctures significantly builds user trust
Case 2: The customer service Agent that "got slower the longer we chatted"
Background: we built a customer service Agent for an e-commerce platform, handling pre-sales questions, order lookups, and after-sales issues.
Symptom: some time after launch, users reported that the Agent "got slower the longer the chat went on." Responses were snappy at the start but stretched out as the conversation progressed, sometimes past 10 seconds.
Root cause:
Performance monitoring showed context retrieval latency growing linearly with conversation turns. Past 20 turns, average retrieval latency rose from 50ms to 800ms.
Digging in, the problem was our RAG implementation. On every user question, the system retrieved relevant historical conversation as context. As turns accumulated, there was ever more history to search, so retrieval time grew linearly.
The deeper problem: we had no effective context deduplication. Much of the retrieved history was duplicated or redundant, yet it was all loaded into context anyway.
Fixes:
1. A tiered retrieval strategy:
- Search the last 5 turns first (hot data, latency <10ms)
- Reach into older history (warm/cold data) only when necessary
- Cluster historical conversations and merge similar questions
2. Smart preloading:
- Predict likely-needed context from the conversation topic
- Load it asynchronously in the background to relieve real-time retrieval pressure
3. Context deduplication (a sketch follows this list):
- Semantic deduplication keeps only one copy of near-identical content
- Redundant history is cleaned up on a schedule
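A minimal sketch of the deduplication step. Word-set Jaccard similarity stands in for the embedding similarity we used in production, and the threshold is illustrative:

```python
def similarity(a, b):
    """Jaccard over word sets; a stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def dedupe(snippets, threshold=0.8):
    kept = []
    for snippet in snippets:
        if all(similarity(snippet, k) < threshold for k in kept):
            kept.append(snippet)
    return kept

history = [
    "user asked how to change the delivery address",
    "user asked how to change the delivery address",         # verbatim repeat
    "user asked how to change the delivery address please",  # near-duplicate
    "user asked about the refund timeline",
]
print(dedupe(history))
# Only one address question enters the context; retrieval stays lean.
```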
Outcome: after optimization, response latency past 20 turns fell from 800ms to 120ms, and "too slow" complaints dropped 85%.
Lessons:
- Context retrieval performance degrades as data grows; design for it up front
- Not all history needs real-time retrieval; a tiered strategy matters
- Deduplication and compression deliver large performance wins
Case 3: The internal assistant that "leaked private data"
Background: we built an internal assistant Agent for a company. Employees use it to look up internal policies, file requests, and get technical support.
Symptom: shortly after launch, a serious privacy leak surfaced. An employee asking about the "overtime pay policy" got an answer containing other employees' overtime pay claims, names and amounts included.
Root cause:
The investigation pointed to our RAG data isolation, or rather the lack of it.
To enable cross-department policy sharing, we had stored all documents in one vector database without strict permission isolation. When the Agent retrieved relevant information, it searched everything reachable, including other employees' sensitive requests.
The deeper problem: our Agent had no "permission awareness." It did not know who the current user was or what data they were entitled to see; it just retrieved and generated mechanically.
Fixes (a retrieval-filtering sketch follows the list):
1. Tiered data isolation:
- Company-wide policies: accessible to all employees
- Department policies: accessible only within the department
- Personal data: accessible only to the individual and authorized personnel
2. Permission-aware retrieval:
- Every retrieval call carries the current user's permission information
- Only data the user is entitled to see is retrieved
- Permissions are checked again at answer generation time
3. Desensitization of sensitive information:
- Automatically detect names, amounts, and other sensitive fields
- Mask them during both retrieval and generation
- Maintain a sensitive-terms dictionary, updated regularly
4. Audit logging:
- Log every data access
- Audit regularly to detect abnormal access
- Alert and respond promptly
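A minimal sketch of permission-aware retrieval with a masking pass. The document fields and role rules are illustrative:

```python
import re

DOCS = [
    {"text": "Overtime pay policy: 1.5x on weekdays", "scope": "public"},
    {"text": "Zhang San's overtime claim: 2,400 yuan", "scope": "personal",
     "owner": "zhang.san"},
    {"text": "R&D dept on-call rota", "scope": "dept", "dept": "rnd"},
]

def can_read(user, doc):
    if doc["scope"] == "public":
        return True
    if doc["scope"] == "dept":
        return user["dept"] == doc["dept"]
    return user["id"] == doc.get("owner")  # personal data: owner only

def retrieve(user, query):
    hits = [d for d in DOCS
            if query.lower() in d["text"].lower() and can_read(user, d)]
    # Second line of defense: mask amounts even in permitted text.
    return [re.sub(r"[\d,]+\s*yuan", "<amount>", d["text"]) for d in hits]

li_si = {"id": "li.si", "dept": "sales"}
print(retrieve(li_si, "overtime"))  # policy only; Zhang San's claim is filtered out
```

The key design point is that the permission check happens inside retrieval, not as a post-hoc filter on the generated answer.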
Outcome: since implementing these measures, no similar leak has recurred. Employee trust in the Agent rose markedly, and usage grew 40%.
Lessons:
- Privacy protection must be an architectural concern; it cannot be patched in afterward
- Agents need "permission awareness": an understanding of data access boundaries
- Auditing and monitoring are key to detecting and preventing privacy problems
What the three cases have in common
Looking back, the three cases share one revelation: **context engineering problems are usually not problems with the technology itself, but with an insufficient understanding of the scenario.**
- The medical case stemmed from not understanding the particularities of medical consultations
- The performance case stemmed from not anticipating data growth
- The privacy case stemmed from underestimating the complexity of the permission model
The reminder: context engineering is not just technical implementation. It demands a deep understanding of business scenarios, user behavior, and data characteristics; the technical solution must rest on that understanding.
Return to the failed demo at the start of this article. With a genuinely good context management system in place:
- It would have identified MES integration as the core topic of the conversation and marked it as high-priority memory
- Under context pressure, it would have kept the information relevant to the live issue instead of simply discarding the oldest turns
- It would have proactively confirmed key facts when needed: "Do I understand the MES integration requirements you just described correctly?"
- When the VP of production asked his question, it would have recalled every relevant prior discussion instantly
The outcome might have been completely different.
**Context engineering matters because it determines whether an Agent can show "emotional intelligence": remembering what should be remembered, forgetting what should be forgotten, and raising the right things at the right time.**
An Agent without memory is a goldfish, capable only of reflexive responses. An Agent with memory but chaotic management is a rambler who cannot find the point, and it drives people mad. **Only an Agent with excellent context engineering can become an intelligent partner that truly understands and accompanies its users.**
That is my biggest takeaway from two-plus years of practice, and the core message of this article.
Looking ahead, with multimodal interaction, multi-Agent collaboration, and long-term companion Agents all developing, context engineering will only grow in importance. It is not merely an implementation question; it is a product design question, a user experience question, even an ethical question.
As builders of Agent products, we need to elevate context engineering to the same rank as model selection and prompt engineering, and fund it accordingly. What ultimately decides an Agent product's fate may not be how "smart" it is, but how well it "understands you."
And that "understanding" is the core mission of context engineering.
Reference resources
Original white paper: "Context Engineering: Sessions & Memory" (Kaggle)
Related reading:
- MemGPT: Towards LLMs as Operating Systems
- Augmenting Language Models with Long-Term Memory
- Large Language Model Conversations: A Framework for Memory Management
Tool recommendations:
- MemGPT: https://github.com/memgpt/memgpt
- LangChain Memory: https://python.langchain.com/docs/modules/memory/
- Pinecone: https://www.pinecone.io/
- Milvus: https://milvus.io/
*This article is an original practical summary, written from personal project experience.*
Last updated: 2026-03-12