Memory Systems for AI Agents: Short-Term, Long-Term, and Episodic
A deep dive into how AI agents implement memory, covering short-term buffers, long-term vector storage, and episodic recall, with practical patterns and code examples
When you have a conversation with someone, you rely on multiple types of memory simultaneously. You remember what was just said (short-term), draw on knowledge you’ve accumulated over years (long-term), and recall specific past experiences (episodic). AI agents face the same challenge—but with fundamentally different constraints and mechanisms.
Memory is what separates a stateless language model from a true agent. Without memory, every interaction starts from zero. With well-designed memory systems, agents can learn, adapt, and maintain coherent behavior across extended interactions. This deep dive explores how modern AI agents implement memory, the tradeoffs involved, and practical patterns for building memory-aware systems.
Language models like GPT-4 or Claude have a fundamental limitation: they’re stateless. Each API call is independent. The model doesn’t inherently remember previous conversations or accumulate knowledge over time. Everything it knows must fit in the context window—the limited amount of text it can process in a single call.
This creates several problems:

- Conversations cannot persist across sessions; the agent forgets the user entirely between calls.
- Knowledge never accumulates, so the agent repeats questions and relearns the same facts.
- Long-running tasks overflow the context window, forcing older information to be dropped.
Agent memory systems solve these problems by selectively storing, retrieving, and managing information outside the model’s context window.
Short-term memory in AI agents mirrors human working memory: it holds the immediately relevant information needed for the current task. This typically includes:

- The recent turns of the current conversation
- Intermediate results and tool outputs from the task in progress
- The agent's working notes or scratchpad reasoning
The simplest short-term memory is raw conversation history:
class SimpleShortTermMemory:
    def __init__(self, max_messages: int = 20):
        self.messages = []
        self.max_messages = max_messages

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Sliding window: keep only recent messages
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list:
        return self.messages
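A quick usage sketch showing the sliding window in action:

memory = SimpleShortTermMemory(max_messages=4)
for i in range(6):
    memory.add_message("user", f"Message {i}")

# Only the four most recent messages survive the window
print(len(memory.get_context()))           # 4
print(memory.get_context()[0]["content"])  # "Message 2"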
This approach has an obvious limitation: it treats all messages equally. A more sophisticated approach uses summarization to compress older context:
class SummarizingMemory:
    def __init__(self, llm, summary_threshold: int = 10):
        self.llm = llm
        self.summary = ""
        self.recent_messages = []
        self.threshold = summary_threshold

    def add_message(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > self.threshold:
            # Summarize everything except the five most recent messages
            to_summarize = self.recent_messages[:-5]
            self.summary = self._summarize(self.summary, to_summarize)
            self.recent_messages = self.recent_messages[-5:]

    def get_context(self) -> list:
        """Return the running summary (if any) followed by the recent messages."""
        context = []
        if self.summary:
            context.append({"role": "system", "content": f"Summary of earlier conversation: {self.summary}"})
        return context + self.recent_messages

    def _summarize(self, existing_summary: str, messages: list) -> str:
        prompt = f"""Existing summary: {existing_summary}
New messages to incorporate: {messages}
Provide an updated summary that captures key information, decisions made, and current context."""
        # Assumes llm.invoke takes a prompt string and returns a string;
        # chat clients that return message objects need .content here instead.
        return self.llm.invoke(prompt)
This pattern—often called conversation compaction—preserves semantic content while reducing token usage. The tradeoff is that summaries lose detail and require additional LLM calls.
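Here is a minimal usage sketch, assuming an LLM client whose invoke method takes a prompt string and returns a string (the my_llm and conversation_turns names below are placeholders):

memory = SummarizingMemory(llm=my_llm, summary_threshold=10)
for role, content in conversation_turns:  # hypothetical list of (role, content) pairs
    memory.add_message(role, content)

# Older turns are now folded into memory.summary; the last five stay verbatim
context = memory.get_context()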
Different use cases call for different buffering strategies:

- Full buffer: keep the entire conversation; simple, but token usage grows without bound.
- Sliding window: keep only the last N messages; cheap, but older context disappears abruptly.
- Summary: compress older messages into a running summary; compact, at the cost of detail and extra LLM calls.

LangChain provides built-in implementations through ConversationBufferMemory, ConversationSummaryMemory, and ConversationBufferWindowMemory, as in the sketch below.
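These classes live in the classic langchain.memory module; recent LangChain releases deprecate them in favor of LangGraph persistence, so treat this as illustrative:

from langchain.memory import ConversationBufferWindowMemory

# Keep only the last 5 exchanges, returned as message objects
memory = ConversationBufferWindowMemory(k=5, return_messages=True)
memory.save_context({"input": "Hi, I'm Alex"}, {"output": "Hello Alex! How can I help?"})
print(memory.load_memory_variables({}))  # {'history': [HumanMessage(...), AIMessage(...)]}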
Long-term memory stores information that should persist across sessions and be retrievable when relevant. Unlike short-term memory, which is always in context, long-term memory requires explicit retrieval.
The most common pattern uses vector databases for long-term storage:
class VectorLongTermMemory:
    def __init__(self, embeddings, vectorstore):
        self.embeddings = embeddings
        self.vectorstore = vectorstore

    def store(self, text: str, metadata: dict = None):
        """Store information for later retrieval."""
        self.vectorstore.add_texts(
            texts=[text],
            metadatas=[metadata] if metadata else None
        )

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        """Retrieve relevant memories based on semantic similarity."""
        docs = self.vectorstore.similarity_search(query, k=k)
        return [doc.page_content for doc in docs]
This approach excels at finding semantically related information even when the query uses different terminology. The agent can store facts, user preferences, past interactions, and domain knowledge, then retrieve relevant pieces when needed.
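Wiring this up with concrete components might look like the following sketch; it assumes OpenAI embeddings and a local FAISS index, but any LangChain-compatible vector store slots in the same way:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
# Seed the index with one memory (FAISS needs at least one text to initialize)
vectorstore = FAISS.from_texts(["User prefers concise answers"], embeddings)

memory = VectorLongTermMemory(embeddings, vectorstore)
memory.store("The user's main project is written in Rust", metadata={"type": "fact"})
print(memory.retrieve("What language does the user code in?", k=2))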
Sometimes you need more than semantic search. Structured storage enables precise queries:
class StructuredMemory:
    def __init__(self):
        self.entities = {}       # entity_name -> attributes
        self.relationships = []  # (entity1, relation, entity2)

    def add_entity(self, name: str, entity_type: str, attributes: dict):
        self.entities[name] = {
            "type": entity_type,
            "attributes": attributes
        }

    def add_relationship(self, entity1: str, relation: str, entity2: str):
        self.relationships.append((entity1, relation, entity2))

    def query_entity(self, name: str) -> dict:
        return self.entities.get(name)

    def query_relationships(self, entity: str) -> list:
        return [r for r in self.relationships if entity in (r[0], r[2])]
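Usage is straightforward; the entities below are hypothetical:

memory = StructuredMemory()
memory.add_entity("Alice", "person", {"role": "data engineer"})
memory.add_entity("Project Phoenix", "project", {"status": "active"})
memory.add_relationship("Alice", "works_on", "Project Phoenix")

print(memory.query_entity("Alice"))         # {'type': 'person', 'attributes': {...}}
print(memory.query_relationships("Alice"))  # [('Alice', 'works_on', 'Project Phoenix')]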
Knowledge graphs combine semantic retrieval with structured queries. Tools like Neo4j and frameworks like LangChain's GraphCypherQAChain enable agents to reason over complex relationship networks.
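A rough sketch of the LangChain-plus-Neo4j route; the connection details are placeholders, and the exact from_llm parameters vary across LangChain versions:

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="<password>")
chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(model="gpt-4o-mini"),
    graph=graph,
    allow_dangerous_requests=True,  # required by recent versions: the chain executes generated Cypher
)
print(chain.invoke({"query": "Which projects is Alice working on?"}))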
Episodic memory stores specific experiences (complete interactions, task executions, or problem-solving sessions) that can be recalled and learned from. This is particularly valuable for:

- Learning from past successes and failures
- Supplying few-shot examples drawn from the agent's own history
- Debugging and auditing what the agent did and why
- Keeping behavior consistent across similar situations
class EpisodicMemory:
    def __init__(self, embeddings, vectorstore):
        self.embeddings = embeddings
        self.vectorstore = vectorstore

    def store_episode(self, episode: dict):
        """Store a complete episode with full context."""
        # Episode structure:
        # - trigger: what initiated the episode
        # - actions: what the agent did
        # - outcome: what happened (success/failure)
        # - lessons: what was learned
        episode_text = f"""
Situation: {episode['trigger']}
Actions taken: {episode['actions']}
Outcome: {episode['outcome']}
Key learnings: {episode.get('lessons', 'None recorded')}
"""
        self.vectorstore.add_texts(
            texts=[episode_text],
            metadatas=[{
                "type": "episode",
                "timestamp": episode.get("timestamp"),
                "success": episode.get("success", True)
            }]
        )

    def recall_similar_episodes(self, situation: str, k: int = 3) -> list:
        """Find past episodes similar to the current situation."""
        # The filter argument requires a vector store with metadata filtering
        return self.vectorstore.similarity_search(
            situation,
            k=k,
            filter={"type": "episode"}
        )
The key distinction from long-term memory is that episodes are complete narratives with context, actions, and outcomes—not just facts. This enables agents to reason by analogy: “Last time I encountered a similar situation, I did X and it worked/failed.”
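A usage sketch, reusing the embeddings and vectorstore objects from the long-term memory example (the episode contents are hypothetical, and the metadata filter assumes a store that supports filtering, such as Chroma):

episodic = EpisodicMemory(embeddings, vectorstore)
episodic.store_episode({
    "trigger": "User asked to deploy the app, but the build failed",
    "actions": "Inspected logs, pinned the Node version, retried the build",
    "outcome": "Deployment succeeded on the second attempt",
    "lessons": "Check toolchain versions before rebuilding",
    "timestamp": "2024-06-01T12:00:00Z",
    "success": True,
})

# Later, facing a similar situation:
for doc in episodic.recall_similar_episodes("Deployment fails after a dependency update"):
    print(doc.page_content)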
Real-world agents typically combine multiple memory types. Here’s a unified architecture:
class AgentMemorySystem:
    def __init__(self, llm, embeddings, vectorstore):
        # Note: long_term and episodic share one vector store here;
        # episodes are distinguished only by their metadata
        self.short_term = SummarizingMemory(llm)
        self.long_term = VectorLongTermMemory(embeddings, vectorstore)
        self.episodic = EpisodicMemory(embeddings, vectorstore)

    def build_context(self, current_input: str) -> str:
        """Assemble context from all memory systems."""
        # Always include recent conversation
        recent = self.short_term.get_context()
        # Retrieve relevant long-term memories
        relevant_facts = self.long_term.retrieve(current_input, k=3)
        # Find similar past episodes
        past_episodes = self.episodic.recall_similar_episodes(current_input, k=2)
        context = f"""
## Conversation History
{recent}
## Relevant Knowledge
{relevant_facts}
## Similar Past Situations
{past_episodes}
"""
        return context
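Dropped into an agent loop, it might be used like this sketch (again assuming an llm whose invoke returns a string):

memory = AgentMemorySystem(llm, embeddings, vectorstore)

def respond(user_input: str) -> str:
    memory.short_term.add_message("user", user_input)
    context = memory.build_context(user_input)
    reply = llm.invoke(f"{context}\n\nUser: {user_input}\nAssistant:")
    memory.short_term.add_message("assistant", reply)
    return reply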
Just as humans consolidate memories during sleep, agents benefit from periodic memory maintenance:

- Deduplicating near-identical memories
- Distilling recurring episodes into durable long-term facts (as sketched below)
- Pruning stale or low-value entries to keep retrieval sharp
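A minimal consolidation sketch, assuming the VectorLongTermMemory class from earlier and an LLM client whose invoke returns a string; run it on a schedule rather than per request:

def consolidate_episodes(episode_texts: list[str], llm, long_term: VectorLongTermMemory):
    """Distill a batch of old episodes into durable facts for long-term storage."""
    prompt = (
        "Extract durable, reusable facts and lessons from these past episodes:\n\n"
        + "\n---\n".join(episode_texts)
    )
    distilled = llm.invoke(prompt)
    long_term.store(distilled, metadata={"type": "consolidated", "source": "episodes"})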
Memory is only useful if the right information is retrieved at the right time. Common issues include:

- Semantic search missing exact-keyword matches (names, IDs, error codes)
- Old memories outranking newer, more relevant ones
- Outdated or contradicted facts resurfacing after they should have been retired

Solutions include hybrid search (combining semantic and keyword matching), recency weighting, and explicit memory invalidation. The sketch below shows a simple recency-weighting approach.
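This sketch assumes each stored document carries a created_at Unix timestamp in its metadata and that the vector store implements LangChain's similarity_search_with_relevance_scores:

import time

def recency_weighted_search(vectorstore, query: str, k: int = 5, half_life_days: float = 30.0):
    """Over-fetch by similarity, then re-rank with exponential recency decay."""
    hits = vectorstore.similarity_search_with_relevance_scores(query, k=k * 3)
    now = time.time()

    def score(pair):
        doc, relevance = pair
        age_days = (now - doc.metadata.get("created_at", now)) / 86400
        return relevance * 0.5 ** (age_days / half_life_days)  # halve weight every half-life

    return [doc for doc, _ in sorted(hits, key=score, reverse=True)[:k]]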
Memory operations add latency and cost:

- Every retrieval adds an embedding call and a vector search to the request path
- Summarization and consolidation consume extra LLM calls
- Storage grows without bound unless something is pruned

Design memory systems with these costs in mind. Not every interaction needs full memory retrieval; use heuristics to decide when memory lookup is worthwhile, as in the sketch below.
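The heuristic can be as simple as this sketch, which skips retrieval for greetings and short acknowledgements:

def needs_memory_lookup(user_input: str) -> bool:
    """Cheap gate: skip vector search for greetings and one-word acknowledgements."""
    trivial = {"ok", "okay", "thanks", "thank you", "yes", "no", "hi", "hello"}
    text = user_input.strip().lower()
    return text not in trivial and len(text.split()) > 2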
Stored memories may contain sensitive information. Consider:

- Obtaining consent before persisting personal details
- Retention policies and automatic expiry for aging memories
- Encryption at rest and access controls on memory stores
- Honoring deletion requests across every memory layer
Current memory systems are relatively primitive compared to human cognition. Emerging research explores:

- Consolidation processes that distill experience into knowledge, loosely inspired by sleep
- Agents that learn what to remember and what to forget rather than storing everything
- Hierarchical memory that pages information between the context window and external storage
- Shared memory across fleets of cooperating agents
Memory is a foundational capability for truly autonomous agents. As models grow more capable, sophisticated memory architectures will enable agents that learn from experience, maintain consistent personalities, and build genuine expertise over time.
Understanding memory systems is essential for building agents that can maintain context, learn from experience, and operate coherently over extended interactions. The patterns described here provide a foundation—adapt them to your specific use case and constraints.
This post concludes our Week 2 deep dive series. For hands-on practice with memory systems, check out our RAG tutorial, explore our Complete Guide to AI Agent Frameworks, or reference our AI Agents Glossary for memory-related terminology.