📝 Context Summary

This document provides a technical walkthrough for creating a custom memory layer for LLMs, addressing their stateless nature. It covers four key stages: extracting atomic memories from conversations using DSPy, embedding them into a vector DB (Qdrant), retrieving relevant memories via tool calling, and maintaining memory state (add, update, delete) with a ReAct agent. The architecture treats memory as a context engineering problem, enabling personalized, stateful AI agent interactions.

How to Build a Custom LLM Memory Layer from Scratch

Every LLM call is a fresh start. Unless you explicitly supply information from previous sessions, the model has no built‑in sense of continuity. This stateless design is a major challenge for applications requiring personalization. This guide provides a step-by-step process for building a simple, persistent memory system from scratch, inspired by the Mem0 architecture.

1. Memory as a Context Engineering Problem

Context Engineering is the technique of filling an LLM’s context with the relevant information it needs to complete a task. Memory is one of the most important and challenging context engineering problems, as it requires several key techniques:

  1. Extracting structured information from raw text.
  2. Summarization.
  3. Using vector databases.
  4. Query generation and similarity search.
  5. Agentic tool calling.

Diagram showing that LLMs do not come with memory

2. High‑Level Architecture

A robust memory system must perform four functions: extract, embed, retrieve, and maintain.

  • Extraction: Extracts atomic memories from user-assistant messages.
  • Vector DB: Embeds the extracted factoids and stores them in a vector database.
  • Retrieval: Generates a query to retrieve similar memories when needed.
  • Maintenance: Uses a ReAct (Reasoning and Acting) loop for an agent to decide whether to add, update, or delete memories based on new information.

Diagram of the Mem0 architecture

Crucially, each step is optional. The agent should only access memory when it determines it’s necessary to answer a query.
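
The four functions can be sketched as one loop per conversation turn. Every function below is a hypothetical placeholder stub; the real versions (an LLM extractor, a vector DB, and ReAct agents) are built in the sections that follow:

```python
# Minimal sketch of the extract -> embed -> retrieve -> maintain loop.
# The stubs are hypothetical placeholders for the DSPy- and vector-DB-backed
# components built later in this guide.

STORE: dict[str, list[str]] = {}  # user_id -> memories (stand-in for the vector DB)

def extract_memories(transcript: str) -> list[str]:
    # Placeholder: the real version is an LLM call (Section 3).
    return [line for line in transcript.splitlines() if line.startswith("fact:")]

def fetch_similar_memories(user_id: str, query: str) -> list[str]:
    # Placeholder: the real version runs a vector similarity search (Section 5).
    return STORE.get(user_id, [])

def maintain_memory(user_id: str, fact: str) -> None:
    # Placeholder: the real version embeds the fact and lets a ReAct agent
    # decide whether to add, update, or delete (Sections 4 and 6).
    STORE.setdefault(user_id, []).append(fact)

def handle_turn(user_id: str, transcript: str, needs_memory: bool) -> list[str]:
    """Process one conversation turn; retrieval only happens when needed."""
    retrieved = fetch_similar_memories(user_id, transcript) if needs_memory else []
    for fact in extract_memories(transcript):
        maintain_memory(user_id, fact)
    return retrieved
```

Note that retrieval is gated on `needs_memory`, mirroring the point above: no step runs unconditionally.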

3. Memory Extraction with DSPy

The first step is to convert conversation transcripts into atomic factoids. A “good” memory is a short, self-contained fact that can be embedded and retrieved with high precision. Using DSPy, we can define a signature to extract a list of string factoids from a transcript.

import dspy

class MemoryExtract(dspy.Signature):
    """
    Extract relevant information from the conversation. 
    Memories are atomic independent factoids that we must learn about the user.
    If transcript does not contain any information worth extracting, return empty list.
    """
    transcript: str = dspy.InputField()
    memories: list[str] = dspy.OutputField()

memory_extractor = dspy.Predict(MemoryExtract)

The signature’s docstring acts as the system prompt. We can then call this predictor with the conversation history to extract the memories.
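
For example, a message history can be flattened into the transcript string the signature expects. The format_transcript helper below is a hypothetical convention, and since the actual predictor call needs a configured LM, it is shown commented out:

```python
def format_transcript(messages: list[dict]) -> str:
    """Flatten a list of {"role", "content"} messages into one transcript string."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

history = [
    {"role": "user", "content": "I'm vegetarian, so no meat dishes please."},
    {"role": "assistant", "content": "Noted! I'll suggest vegetarian recipes."},
]

transcript = format_transcript(history)
# With an LM configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")):
# result = memory_extractor(transcript=transcript)
# result.memories might then contain something like ["User is vegetarian"]
```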

4. Embedding and Storing Memories

Once memories are extracted, they are embedded and stored in a vector database such as Qdrant. We use an efficient embedding model such as OpenAI's text-embedding-3-small and create helper functions to insert, update, delete, and search for memories, filtering by user_id to ensure data isolation.
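
The shape of that helper layer can be sketched without a running Qdrant instance. Below, toy_embed is a deterministic bag-of-words stand-in for text-embedding-3-small, and the helpers mirror the insert/search/delete pattern (with user_id filtering) that the real qdrant-client calls would implement; all names here are illustrative, not Qdrant's API:

```python
import math
import zlib
from collections import Counter

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic bag-of-words hash embedding; stand-in for a real model."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

_DB: list[dict] = []  # in-memory stand-in for the Qdrant collection

def insert_memory(user_id: str, memory_id: int, text: str) -> None:
    _DB.append({"id": memory_id, "user_id": user_id,
                "text": text, "vector": toy_embed(text)})

def delete_memories(user_id: str, ids: list[int]) -> None:
    _DB[:] = [p for p in _DB if not (p["user_id"] == user_id and p["id"] in ids)]

def update_memory(user_id: str, memory_id: int, text: str) -> None:
    delete_memories(user_id, [memory_id])
    insert_memory(user_id, memory_id, text)

def search_memories(user_id: str, query: str, top_k: int = 3) -> list[str]:
    """Cosine-similarity search, filtered by user_id for data isolation."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, p["vector"])), p["text"])
              for p in _DB if p["user_id"] == user_id]
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]
```

The user_id filter runs before any similarity scoring, so one user's memories can never leak into another user's results.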

Diagram showing memories being uploaded to a vector database

5. Memory Retrieval via Tool Calling

Instead of always searching for memories, we create a tool-calling agent that decides when a search is necessary. We give the agent a fetch_similar_memories tool, which it can invoke when it lacks the context to answer a user’s question.

We then wrap our logic in a dspy.ReAct agent. The ReAct (Reasoning and Acting) agent will observe the conversation, reason about the next step, and then act—either by generating an answer directly or by calling the tool to retrieve memories first.

Diagram of the memory retrieval process

The agent also determines if the latest interaction contains new information that should be saved to the memory layer.

6. Memory Maintenance

Memory is not a static log; it must evolve. When the agent decides to save a new memory, a separate maintenance agent determines how to integrate it.

Diagram showing the memory update decision process

This agent has four possible actions, implemented as tools:

  • add_memory(text): Inserts a new fact.
  • update_memory(id, updated_text): Corrects or refines an existing memory.
  • delete_memories(ids): Removes obsolete or contradictory memories.
  • no_op(): Does nothing if the new information is irrelevant or already captured.
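
These actions can be sketched as plain Python functions over an in-memory store, a stand-in for the Qdrant-backed helpers; each would then be handed as a tool to a dspy.ReAct maintenance agent (the signature string in the comment is an assumption):

```python
import itertools

MEMORIES: dict[int, str] = {}   # id -> memory text (stand-in for the vector DB)
_ids = itertools.count(1)

def add_memory(text: str) -> int:
    """Insert a new fact and return its id."""
    memory_id = next(_ids)
    MEMORIES[memory_id] = text
    return memory_id

def update_memory(memory_id: int, updated_text: str) -> None:
    """Correct or refine an existing memory."""
    MEMORIES[memory_id] = updated_text

def delete_memories(ids: list[int]) -> None:
    """Remove obsolete or contradictory memories."""
    for memory_id in ids:
        MEMORIES.pop(memory_id, None)

def no_op() -> None:
    """Do nothing; the new information is irrelevant or already captured."""

# maintainer = dspy.ReAct(
#     "existing_memories, new_fact -> action_taken",
#     tools=[add_memory, update_memory, delete_memories, no_op],
# )
```

Passing existing, similar memories alongside the new fact is what lets the agent choose update or delete instead of blindly adding duplicates.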

This agentic loop ensures the memory remains accurate and relevant over time, allowing the primary agent to deliver increasingly personalized and context-aware responses.

7. Further Expansion

This guide covers the building blocks of a memory system. The concept can be expanded with more advanced techniques:

  • Graph Memory System: Store memories as triplets in a graph database.
  • Metadata Filtering: Add category tags (e.g., “food,” “hobbies”) to allow for more targeted queries.
  • System Prompt Injection: Automatically inject critical user information from memory directly into the system prompt for every session.
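
As a taste of the metadata-filtering idea, each memory can carry a category tag that narrows the candidate set before any similarity scoring; the tag names and helper below are illustrative:

```python
TAGGED_MEMORIES = [
    {"text": "User is vegetarian", "category": "food"},
    {"text": "User plays tennis on Sundays", "category": "hobbies"},
    {"text": "User dislikes cilantro", "category": "food"},
]

def memories_in_category(category: str) -> list[str]:
    """Pre-filter by category tag before (hypothetically) running vector search."""
    return [m["text"] for m in TAGGED_MEMORIES if m["category"] == category]
```
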

Key Concepts: LLM Memory · Context Engineering · DSPy · Qdrant · Vector Database · ReAct Agent · Tool Calling · Stateful AI

About the Author: Adam Bernard

Adam Bernard is a digital marketing strategist and SEO specialist building AI-powered business intelligence systems. He's the creator of the Strategic Intelligence Engine (SIE), a multi-agent framework that transforms business knowledge into autonomous, AI-driven competitive advantages.

Let’s Connect

Ready to Build Your Own Intelligence Engine?

If you’re ready to move from theory to implementation and build a Knowledge Core for your own business, I can help you design the engine to power it. Let’s discuss how these principles can be applied to your unique challenges and goals.