Skip to content

Architecting Brain's Memory To Solve AI Context Persistence

A security engineer's approach to tackling AI context persistence and hardware constraints by modeling how the human brain stores and retrieves memory.

· Original source
engram ai-memory compression post-quantum kv-cache lookup-tables semantic-indexing deepseek parquet

Architecting Brain’s Memory To Solve AI Context Persistence

A Semantic Indexing, KV Caching, and Lookup Tables approach for AI Memory.

AI context windows have a hard limit. Fill it up and the oldest memories fall off. Your assistant forgets what you told it last week.

Engram solves this the way the brain does. Your brain doesn’t hold everything in working memory. It tiers: recent experiences stay vivid (hot), recent days consolidate (warm), older memories compress into patterns (cold), deep memories take effort to surface (frozen). Each tier trades retrieval speed for storage efficiency.

This isn’t a new idea in AI. DeepSeek-V2 (arXiv:2405.04434) applied the same principle to attention itself: instead of recalculating full key/value tensors for every token, they precompute and cache compressed latent vectors — reducing KV cache by 93.3% and achieving 5.76x throughput. Same insight, different layer of the stack: don’t recompute what you can store compressed and recall on demand. DeepSeek compressed the attention cache. Engram compresses the context memory.

Engram applies this to your AI’s memory. Four compression tiers, a searchable semantic index, and optional post-quantum encryption — so months of context fit in the same token budget that used to hold a few sessions. Save hardware resources at scale while protecting what matters.

The brain already solved this

Your brain doesn’t hold everything in working memory. It tiers. Recent experiences stay vivid and fast. Older memories compress into patterns. Deep memories take the right cue to surface.

I applied the same architecture to AI memory:

TierBrain EquivalentCompressionRetrieval
HotWorking memory1x (raw)Instant
WarmRecent memory4–5x~10ms
ColdLong-term memory8–12x~500ms
FrozenDeep memory20–50x~5 seconds

Those ratios aren’t from cranking compression levels. That gets you 3.2x to 3.8x, which is barely noticeable across four tiers. Each tier applies different data transformations before the compressor even runs.

Warm strips whitespace from JSON. That’s 30–40% of most pretty-printed session logs gone before compression starts.

Cold strips boilerplate. Every AI session repeats the same 2,000–5,000 token system prompt verbatim. Across 4,500 sessions, that’s the same block of text repeated 4,500 times. Engram replaces these with 64-byte hash references and stores the original once. Then a dictionary trained on your actual session logs teaches the compressor the shared schema. It only compresses what’s actually unique.

Frozen converts JSONL to columnar Parquet. The string “role” appears on every single line of a conversation log. In a 10,000-turn session, that’s 10,000 redundant copies of every key name. Parquet transposes this into columns. The role column has two values. Run-length encoded, it compresses to almost nothing. Timestamps are monotonically increasing integers. Delta encoded, they compress to almost nothing. ClickHouse achieves 170x on logs with this approach.

Only the actual content carries real entropy. Everything else compresses away.

Lookup tables all the way down

The architecture is nested lookup tables — the same pattern DeepSeek uses when they absorb projection matrices into precomputed operations, and the same pattern MemoryFormer (arXiv:2411.12992) uses when it replaces linear layers with hash table lookups.

Every retrieval step in Engram is a lookup, not a recomputation:

LayerLookup TableWhat It Replaces
Keyword indexkeyword → matching artifactsScanning every file for a string
Compression dictionarybyte pattern → short codeRelearning compression patterns per file
Boilerplate storehash → full prompt textStoring 4,500 copies of the same system prompt
HNSW vector graphquery embedding → nearest neighborsLinear scan over all embeddings
PQ codebookcentroid ID → approximate vectorStoring full 3,072-byte embeddings
Binary embeddings96 bytes → Hamming distanceFull float32 cosine similarity

The compression dictionary alone cuts cold-tier ratio from 3.5x to 8–12x. The boilerplate store eliminates 40–70% of total content before compression even starts. Product quantization reduces embedding storage by 384x for frozen artifacts.

Semantic indexing: the hippocampus

Compression without search is a write-only archive. If you can’t find a memory, it doesn’t matter how efficiently it’s stored.

Every artifact gets indexed before compression. Keywords extracted. Summary generated. The index is under 1 MB for thousands of artifacts. Always loaded. Never compressed. It’s the hippocampus of the system: a small structure that knows where everything is stored.

When your AI starts a session, Engram feeds it a budget-optimized block of relevant summaries — not the full files. Summaries cost 10–20% of the tokens. If a summary isn’t enough, the assistant explicitly recalls the full artifact. The AI never decompresses everything hoping to find something.

The retrieval stack combines keyword lookup (BM25-equivalent) with vector similarity (HNSW) and reciprocal rank fusion — the hybrid approach that Anthropic’s contextual retrieval research showed reduces retrieval failure by 67% compared to embeddings alone.

Your sessions are a target

NIST proposed deprecating RSA-2048 by 2030 (IR 8547). An adversary who captures your plaintext session files today can wait for quantum computers. That’s harvest-now-decrypt-later, and it’s a published federal timeline, not a theoretical concern.

Engram uses ML-KEM-768 (NIST FIPS 203), the post-quantum algorithm that OpenSSH 10.0 made the default for all key exchange in April 2025. Private keys are handled by a compiled Rust sidecar with memory locking and deterministic zeroing. They never enter Python’s memory. Never touch disk. Never appear in process arguments.

Your keys live in Keychain or Vault. Never as files. If you lose the key, data is gone forever. That’s the point of strong encryption.

What makes this different

FeatureOther pluginsEngram
CompressionNone or ~3x4–5x / 8–12x / 20–50x per tier
EncryptionNonePost-quantum (ML-KEM-768), per-artifact keys
SearchDecompress everythingSemantic index + vector search, no decompression
Key handlingKey file on diskRust sidecar, Keychain, keys never in Python
RetrievalKeyword onlyHybrid: BM25 + HNSW + reciprocal rank fusion
AI platformsOneClaude, Codex, ChatGPT, Cursor, Copilot, any
TelemetrySometimesZero. Nothing leaves your machine.

See what you’re wasting in 30 seconds

Install Engram, run the guided setup, and preview what would be compressed. The dry run scans your disk and shows you the file count, total size, and what would move to each tier. No files are modified until you explicitly choose to run it.

116 tests. 8 rounds of security review.

Dig deeper:

Open source. MIT license. Works with any AI assistant that writes files.

github.com/qinnovates/engram


Written with AI assistance (Claude). All claims verified against primary sources. The author takes full responsibility for all content.