{"id":79144,"date":"2026-04-20T00:29:38","date_gmt":"2026-04-19T18:59:38","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=79144"},"modified":"2026-04-22T11:23:12","modified_gmt":"2026-04-22T05:53:12","slug":"memory-in-ai-why-your-agent-forgets-everything-and-how-to-fix-it","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/memory-in-ai-why-your-agent-forgets-everything-and-how-to-fix-it\/","title":{"rendered":"Memory in AI: Why Your Agent Forgets Everything \u2014 And How to Fix It"},"content":{"rendered":"<p>Three months into building our DevOps AI agent, I gave a demo of it to the team. Checked pods, read logs, suggested fixes. Everyone was impressed. Then one engineer asked it: &#8220;Remember that ingress issue we sorted on Tuesday?&#8221;<\/p>\n<p>The agent had no idea what she was talking about.<\/p>\n<p>I had spent weeks on tool integration, prompt engineering, safety guardrails. I had not spent a single hour thinking about memory. And that gap made the whole thing feel like a toy instead of a tool.<\/p>\n<p>This post is what I figured out over the month that followed. Not a framework tutorial. Not a list of concepts. Just the actual problems I ran into, what I tried, and what ended up working.<\/p>\n<h2>Why LLMs Forget \u2014 And Why It Actually Matters<\/h2>\n<p>The short version: LLMs have no state. Zero. Every time you call the API, the model starts completely blank \u2014 no knowledge of what you discussed a minute ago, no awareness of your infrastructure, nothing. The only &#8220;memory&#8221; it has is whatever you put in the prompt you send right now.<\/p>\n<p>That sounds abstract, so here is what it looks like in real life.<\/p>\n<p>You ask the agent to check your pods. It finds one crashing \u2014 api-deployment-xyz, OOMKilled. 
You then ask &#8220;what is causing this?&#8221; Without memory, the agent says: &#8220;I&#8217;d need more context \u2014 which pod are you referring to?&#8221;<\/p>\n<p>The exact conversation that triggered this post:<\/p>\n<p><strong>\u00a0 \u00a0 \u00a0 Me<\/strong>: Check all pods in the default namespace.<\/p>\n<p><strong>\u00a0 \u00a0 \u00a0 Agent<\/strong>: [runs kubectl] Found api-deployment-xyz in CrashLoopBackOff.<\/p>\n<p><strong>\u00a0 \u00a0 \u00a0 Me<\/strong>: Why is it crashing?<\/p>\n<p><strong>\u00a0 \u00a0 \u00a0 Agent<\/strong>: Could you clarify which pod you are referring to?<\/p>\n<p>That second response is when I realised we had a problem.<\/p>\n<p>The frustrating part is that this is not the model being bad at its job. It is doing exactly what it is designed to do. Each API call is isolated. The question &#8220;why is it crashing?&#8221; arrives with no context whatsoever, so the model asks for clarification. Makes sense from its perspective. From yours, it is maddening.<\/p>\n<p>And it gets worse the more you rely on the agent. Every session you re-explain your setup. Every incident you re-describe the symptoms. Every fix you re-establish the context. The agent never gets better at knowing your system because it starts from scratch every single time.<\/p>\n<p>The smarter the model, the more you notice how much context it is missing. A dumber model you expect to be useless. A smart one feels like it should know better.<\/p>\n<p>I spent a week with a colleague trying to figure out why our agent kept asking basic questions about our infrastructure \u2014 questions it had already &#8220;answered&#8221; in previous sessions. The problem was not intelligence. It was memory. Two very different things.<\/p>\n<h2>The Different Kinds of Memory \u2014 How I Think About Them<\/h2>\n<p>When I started digging into this, I kept finding articles that listed memory types in a clean structured way: short-term, long-term, semantic, entity. 
Four types, four bullets, move on. That never quite clicked for me until I stopped thinking about memory types and started thinking about questions.<\/p>\n<p>Each type of memory answers a different question your agent might need to ask itself:<\/p>\n<p>&nbsp;<\/p>\n<ul>\n<li><strong>Short-term (In-context) &#8211;<\/strong> What did we just talk about? Lives in the prompt as a growing message list. Works instantly, zero setup. Vanishes the moment the session closes. Good enough for a demo, nowhere near good enough for production.<\/li>\n<li><strong>Long-term (Persistent) &#8211;<\/strong> What did we talk about last week? Stored in a real database \u2014 SQLite, Postgres, Redis. Survives restarts. Same interface as short-term, just one line different. The first upgrade any serious agent needs.<\/li>\n<li><strong>Semantic (Vector store) &#8211;<\/strong> What past conversations are relevant to this question? Messages stored as vector embeddings. Instead of dumping the entire history into every prompt, only the 5 most relevant past exchanges get retrieved. Handles months of history without exploding the context window.<\/li>\n<li><strong>Entity \/ Knowledge graph &#8211;<\/strong> What are the fixed facts about our environment? Prod cluster IP, deployment schedule, service owners. Structured data that does not change turn-to-turn \u2014 stored in JSON or a knowledge graph, queried when needed.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p>We went through all four stages over about six weeks. The jump from short-term to SQLite persistence was the biggest single improvement in agent usefulness we saw. Took maybe twenty minutes to implement.<\/p>\n<h2>How the Plumbing Actually Works<\/h2>\n<p>Here is the thing about AI memory that took me embarrassingly long to understand: the model is not remembering anything. It physically cannot. 
What you are doing is feeding it the past conversation as text at the start of every new request.<\/p>\n<p>The architecture, stripped back to its bones:<\/p>\n<div id=\"attachment_79243\" style=\"width: 635px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79243\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79243 size-large\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-generated-1024x683.png\" alt=\"The memory cycle on every .invoke() call\" width=\"625\" height=\"417\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-generated-1024x683.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-generated-300x200.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-generated-768x512.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-generated-624x416.png 624w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-generated.png 1536w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><p id=\"caption-attachment-79243\" class=\"wp-caption-text\">The memory cycle on every .invoke() call<\/p><\/div>\n<p>&nbsp;<\/p>\n<p>That slot in the middle \u2014 where history gets injected into the prompt \u2014 is the MessagesPlaceholder in your LangChain prompt template. Every single LLM call, the full conversation history is dropped in there. The model reads it fresh, as if it just happened.<\/p>\n<p>This is also why context window limits are not just a theoretical concern. Llama 3.1 supports around 128k tokens. A long conversation generates a lot of history. If you never trim or summarise it, you will eventually hit that ceiling, and older messages get quietly cut off. 
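<p>Stripped of any framework, the whole cycle is just a loop. Here is a stdlib-only sketch of it \u2014 the <code>fake_llm<\/code> stub and the character budget are stand-ins of my own invention for the real model call and the real token limit:<\/p>

```python
# Every turn rebuilds the full prompt from stored history -- the model
# itself keeps no state between calls.
history: list[tuple[str, str]] = []  # (role, text) pairs


def build_prompt(history, user_msg, max_chars=4000):
    # Crude stand-in for a token budget: drop the oldest turns until we fit.
    kept = list(history)
    while kept and sum(len(text) for _, text in kept) > max_chars:
        kept.pop(0)  # this is how older messages get quietly cut off
    lines = [f"{role}: {text}" for role, text in kept]
    lines.append(f"user: {user_msg}")
    return "\n".join(lines)


def fake_llm(prompt: str) -> str:
    # Placeholder for the real API call -- it only "knows" what is in `prompt`.
    n_lines = prompt.count("\n") + 1
    return f"(answer based on {n_lines} prompt lines)"


def invoke(user_msg: str) -> str:
    prompt = build_prompt(history, user_msg)
    reply = fake_llm(prompt)
    history.append(("user", user_msg))
    history.append(("assistant", reply))
    return reply


invoke("Check all pods in the default namespace.")
invoke("Why is it crashing?")  # the prompt now carries the first exchange
```

<p>The second call only makes sense to the model because <code>build_prompt<\/code> replays the first exchange in front of it \u2014 swap the list for a database and you have persistence, swap the trim for a retriever and you have semantic memory.<\/p>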
The semantic and summary memory types exist specifically because of this.<\/p>\n<p>&nbsp;<\/p>\n<p>The string that has to match \u2014 and nobody warns you about it:<\/p>\n<p>MessagesPlaceholder(variable_name=&quot;chat_history&quot;) \u2190 in your prompt<\/p>\n<p>history_messages_key=&quot;chat_history&quot; \u2190 in RunnableWithMessageHistory<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h2>The Code \u2014 Three Stages of Memory<\/h2>\n<p>Here is what the progression looks like in actual LangChain code. I am using Ollama because running models locally is free and I am not made of GPU budget.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Stage 1 \u2014 Short-term (what you probably have)<\/strong><\/p>\n<p>If you have followed any LangChain getting-started guide, this is likely where you are. It works within a session, and that is it.<\/p>\n<p><!--more--><\/p>\n<div id=\"attachment_79179\" style=\"width: 851px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79179\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79179 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/03\/ShortTermMemory-1.png\" alt=\"ShortTermMemory\" width=\"841\" height=\"404\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/03\/ShortTermMemory-1.png 841w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/ShortTermMemory-1-300x144.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/ShortTermMemory-1-768x369.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/ShortTermMemory-1-624x300.png 624w\" sizes=\"(max-width: 841px) 100vw, 841px\" \/><p id=\"caption-attachment-79179\" class=\"wp-caption-text\">Short Term Memory<\/p><\/div>\n<p>&nbsp;<\/p>\n<p><strong>Stage 2 \u2014 Persistent (the upgrade I should have done week one)<\/strong><\/p>\n<p>One import. One changed function. 
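<p>In spirit, the change amounts to this stdlib-only sketch of a SQLite-backed history \u2014 the class name, table name, and schema here are illustrative stand-ins, not LangChain&#8217;s actual layout:<\/p>

```python
import sqlite3


class SqliteHistory:
    """Minimal stand-in for a chat history persisted in SQLite."""

    def __init__(self, session_id: str, db_path: str = "memory.db"):
        self.session_id = session_id
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages "
            "(session_id TEXT, role TEXT, content TEXT)"
        )

    def add_message(self, role: str, content: str) -> None:
        self.conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?)",
            (self.session_id, role, content),
        )
        self.conn.commit()

    def messages(self) -> list[tuple[str, str]]:
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ?",
            (self.session_id,),
        )
        return list(rows)


# Anything written here lands on disk, not in RAM.
h = SqliteHistory("devops-agent")
h.add_message("user", "Check all pods in the default namespace.")
```

<p>A new process that reconnects with the same session_id reads back everything written before \u2014 that is the whole trick.<\/p>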
Your agent now remembers across restarts.<\/p>\n<div id=\"attachment_79180\" style=\"width: 843px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79180\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79180 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/03\/LongTermMemory.png\" alt=\"LongTermMemory\" width=\"833\" height=\"260\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/03\/LongTermMemory.png 833w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/LongTermMemory-300x94.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/LongTermMemory-768x240.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/LongTermMemory-624x195.png 624w\" sizes=\"(max-width: 833px) 100vw, 833px\" \/><p id=\"caption-attachment-79180\" class=\"wp-caption-text\">Long Term Memory<\/p><\/div>\n<p>&nbsp;<\/p>\n<p><strong>Stage 3 \u2014 Semantic retrieval (when history gets long)<\/strong><\/p>\n<p>After a few weeks of real usage, injecting the entire history into every prompt started causing problems. Prompts got long, responses got slower, context windows started straining. 
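<p>What retrieval means here, reduced to a framework-agnostic toy: score every past exchange against the current question, keep only the best few. A real setup scores by embedding similarity in a vector store such as Chroma; the word-overlap scorer and sample history below are deliberately crude stand-ins:<\/p>

```python
def relevant_history(history: list[str], query: str, k: int = 5) -> list[str]:
    """Return the k past exchanges sharing the most words with the query."""
    q_words = set(query.lower().split())

    def score(entry: str) -> int:
        return len(q_words & set(entry.lower().split()))

    # Keep only entries with any overlap at all, best matches first.
    ranked = sorted((e for e in history if score(e) > 0), key=score, reverse=True)
    return ranked[:k]


past = [
    "user: the ingress controller returned 502s after the cert rotation",
    "user: schedule the friday deploy for the billing service",
    "user: api-deployment-xyz was OOMKilled, we raised the memory limit",
]
relevant_history(past, "why is api-deployment-xyz crashing with OOMKilled?")
```

<p>Only the matching exchange goes into the prompt; the other months of history stay in the store, costing nothing.<\/p>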
The fix: stop injecting everything, start retrieving only what is relevant.<\/p>\n<div id=\"attachment_79181\" style=\"width: 833px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79181\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-79181\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/03\/SemanticMemory-1.png\" alt=\"SemanticMemory\" width=\"823\" height=\"341\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/03\/SemanticMemory-1.png 823w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/SemanticMemory-1-300x124.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/SemanticMemory-1-768x318.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/SemanticMemory-1-624x259.png 624w\" sizes=\"(max-width: 823px) 100vw, 823px\" \/><p id=\"caption-attachment-79181\" class=\"wp-caption-text\">Semantic Memory<\/p><\/div>\n<p>&nbsp;<\/p>\n<h2>What This Looks Like in a Real Conversation<\/h2>\n<p>The demo that finally sold my team on memory was a simple three-turn conversation. 
No fancy setup, no long explanation \u2014 just this:<\/p>\n<div id=\"attachment_79247\" style=\"width: 635px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79247\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79247 size-large\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-conversation-1024x683.png\" alt=\"Three turns, no re-explaining\" width=\"625\" height=\"417\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-conversation-1024x683.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-conversation-300x200.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-conversation-768x512.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-conversation-624x416.png 624w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-conversation.png 1536w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><p id=\"caption-attachment-79247\" class=\"wp-caption-text\">Three turns, no re-explaining<\/p><\/div>\n<p>Turn 3 is two words. The agent patches the right pod with the right fix because it has read the full context from the previous two turns. No re-explaining, no re-specifying, no copy-pasting the pod name again.<\/p>\n<p>Run that same conversation without memory and Turn 2 falls apart. The agent has no idea which pod you mean. You have to re-explain everything. The whole thing stops feeling like a conversation and starts feeling like filling out a form.<\/p>\n<p>&nbsp;<\/p>\n<h2>Which Memory Strategy Should You Actually Use<\/h2>\n<p>Honest answer: start simple and add complexity only when you feel the pain of not having it. 
Here is how I would sequence it:<\/p>\n<div id=\"attachment_79249\" style=\"width: 635px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79249\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79249 size-large\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-memoryUse-1024x683.png\" alt=\"Memory strategy comparison\" width=\"625\" height=\"417\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-memoryUse-1024x683.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-memoryUse-300x200.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-memoryUse-768x512.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-memoryUse-624x416.png 624w, \/blog\/wp-ttn-blog\/uploads\/2026\/03\/gpt-memoryUse.png 1536w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><p id=\"caption-attachment-79249\" class=\"wp-caption-text\">Memory strategy comparison<\/p><\/div>\n<p>&nbsp;<\/p>\n<p><strong>What I would do if starting over today<\/strong><\/p>\n<p>Day 1: SQLite. One line of code, zero thought required.<\/p>\n<p>Week 3: ConversationSummaryBufferMemory when prompts get too long.<\/p>\n<p>Month 2: Chroma vector store when you have real incident history to search.<\/p>\n<p>&nbsp;<\/p>\n<p>What worked for us: start with SQLite, deal with context length when it actually becomes a problem, and only touch vector stores once you have real history worth searching. We skipped straight to Chroma on one project and spent a week setting it up before we had enough data to make it useful.<\/p>\n<p>&nbsp;<\/p>\n<h2>Where This Is All Going \u2014 My Honest Take<\/h2>\n<p>Right now, AI memory is mostly a clever workaround. We are taking models with no persistent state and faking continuity by stuffing history into prompts. 
It works, but it is a bit like duct-taping a notepad to someone&#8217;s forehead every time they walk into a meeting.<\/p>\n<p>&nbsp;<\/p>\n<p>Where we landed: SQLite for persistence, summary buffer for context control, Chroma for semantic recall. Layer them in that order. Get the basics solid first.<\/p>\n<p>The biggest shift in how the agent feels is not about which memory system you use. It is just about having any persistence at all. The jump from zero to SQLite is enormous. Everything after that is refinement.<\/p>\n<p>&nbsp;<\/p>\n<h2>One Change, Right Now<\/h2>\n<p>If your LangChain agent is using ChatMessageHistory() \u2014 and it probably is if you started from any tutorial \u2014 swap it out:<\/p>\n<p><strong># Before<\/strong><\/p>\n<p>session_history[session_id] = ChatMessageHistory()<\/p>\n<p><strong># After<\/strong><\/p>\n<p>session_history[session_id] = SQLChatMessageHistory(session_id=session_id, connection=&quot;sqlite:\/\/\/memory.db&quot;)<\/p>\n<p>&nbsp;<\/p>\n<p>That is it. Ninety seconds of work. Your agent wakes up tomorrow knowing what happened today.<\/p>\n<p>I genuinely wish someone had told me to do this on day one. Would have saved a fair amount of frustration and one particularly embarrassing demo.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Three months into building our DevOps AI agent, I gave a demo of it to the team. Checked pods, read logs, suggested fixes. Everyone was impressed. Then one engineer asked it: &#8220;Remember that ingress issue we sorted on Tuesday?&#8221; The agent had no idea what she was talking about. 
I had spent weeks on tool [&hellip;]<\/p>\n","protected":false},"author":1833,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":8},"categories":[2348],"tags":[8540,1892,6263,8541,6408],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/79144"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1833"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=79144"}],"version-history":[{"count":18,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/79144\/revisions"}],"predecessor-version":[{"id":79665,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/79144\/revisions\/79665"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=79144"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=79144"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=79144"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}