Architecting Brain's Memory To Solve AI Context Persistence

Kevin L. Qi

Architecting Brain's Memory To Solve AI Context Persistence

A security engineer's approach to tackling AI context persistence and hardware constraints by modeling how the human brain stores and retrieves memory.

March 18, 2026 · Original source

engram ai-memory compression post-quantum kv-cache lookup-tables semantic-indexing deepseek parquet

Architecting Brain’s Memory To Solve AI Context Persistence

A Semantic Indexing, KV Caching, and Lookup Tables approach for AI Memory.

AI context windows have a hard limit. Fill it up and the oldest memories fall off. Your assistant forgets what you told it last week.

Engram solves this the way the brain does. Your brain doesn’t hold everything in working memory. It tiers: recent experiences stay vivid (hot), recent days consolidate (warm), older memories compress into patterns (cold), deep memories take effort to surface (frozen). Each tier trades retrieval speed for storage efficiency.

This isn’t a new idea in AI. DeepSeek-V2 (arXiv:2405.04434) applied the same principle to attention itself: instead of recalculating full key/value tensors for every token, they precompute and cache compressed latent vectors — reducing KV cache by 93.3% and achieving 5.76x throughput. Same insight, different layer of the stack: don’t recompute what you can store compressed and recall on demand. DeepSeek compressed the attention cache. Engram compresses the context memory.

Engram applies this to your AI’s memory. Four compression tiers, a searchable semantic index, and optional post-quantum encryption — so months of context fit in the same token budget that used to hold a few sessions. Save hardware resources at scale while protecting what matters.

The brain already solved this

Your brain doesn’t hold everything in working memory. It tiers. Recent experiences stay vivid and fast. Older memories compress into patterns. Deep memories take the right cue to surface.

I applied the same architecture to AI memory:

Tier	Brain Equivalent	Compression	Retrieval
Hot	Working memory	1x (raw)	Instant
Warm	Recent memory	4–5x	~10ms
Cold	Long-term memory	8–12x	~500ms
Frozen	Deep memory	20–50x	~5 seconds

Those ratios aren’t from cranking compression levels. That gets you 3.2x to 3.8x, which is barely noticeable across four tiers. Each tier applies different data transformations before the compressor even runs.

Warm strips whitespace from JSON. That’s 30–40% of most pretty-printed session logs gone before compression starts.

Cold strips boilerplate. Every AI session repeats the same 2,000–5,000 token system prompt verbatim. Across 4,500 sessions, that’s the same block of text repeated 4,500 times. Engram replaces these with 64-byte hash references and stores the original once. Then a dictionary trained on your actual session logs teaches the compressor the shared schema. It only compresses what’s actually unique.

Frozen converts JSONL to columnar Parquet. The string “role” appears on every single line of a conversation log. In a 10,000-turn session, that’s 10,000 redundant copies of every key name. Parquet transposes this into columns. The role column has two values. Run-length encoded, it compresses to almost nothing. Timestamps are monotonically increasing integers. Delta encoded, they compress to almost nothing. ClickHouse achieves 170x on logs with this approach.

Only the actual content carries real entropy. Everything else compresses away.

Lookup tables all the way down

The architecture is nested lookup tables — the same pattern DeepSeek uses when they absorb projection matrices into precomputed operations, and the same pattern MemoryFormer (arXiv:2411.12992) uses when it replaces linear layers with hash table lookups.

Every retrieval step in Engram is a lookup, not a recomputation:

Layer	Lookup Table	What It Replaces
Keyword index	keyword → matching artifacts	Scanning every file for a string
Compression dictionary	byte pattern → short code	Relearning compression patterns per file
Boilerplate store	hash → full prompt text	Storing 4,500 copies of the same system prompt
HNSW vector graph	query embedding → nearest neighbors	Linear scan over all embeddings
PQ codebook	centroid ID → approximate vector	Storing full 3,072-byte embeddings
Binary embeddings	96 bytes → Hamming distance	Full float32 cosine similarity

The compression dictionary alone cuts cold-tier ratio from 3.5x to 8–12x. The boilerplate store eliminates 40–70% of total content before compression even starts. Product quantization reduces embedding storage by 384x for frozen artifacts.

Semantic indexing: the hippocampus

Compression without search is a write-only archive. If you can’t find a memory, it doesn’t matter how efficiently it’s stored.

Every artifact gets indexed before compression. Keywords extracted. Summary generated. The index is under 1 MB for thousands of artifacts. Always loaded. Never compressed. It’s the hippocampus of the system: a small structure that knows where everything is stored.

When your AI starts a session, Engram feeds it a budget-optimized block of relevant summaries — not the full files. Summaries cost 10–20% of the tokens. If a summary isn’t enough, the assistant explicitly recalls the full artifact. The AI never decompresses everything hoping to find something.

The retrieval stack combines keyword lookup (BM25-equivalent) with vector similarity (HNSW) and reciprocal rank fusion — the hybrid approach that Anthropic’s contextual retrieval research showed reduces retrieval failure by 67% compared to embeddings alone.

Your sessions are a target

NIST proposed deprecating RSA-2048 by 2030 (IR 8547). An adversary who captures your plaintext session files today can wait for quantum computers. That’s harvest-now-decrypt-later, and it’s a published federal timeline, not a theoretical concern.

Engram uses ML-KEM-768 (NIST FIPS 203), the post-quantum algorithm that OpenSSH 10.0 made the default for all key exchange in April 2025. Private keys are handled by a compiled Rust sidecar with memory locking and deterministic zeroing. They never enter Python’s memory. Never touch disk. Never appear in process arguments.

Your keys live in Keychain or Vault. Never as files. If you lose the key, data is gone forever. That’s the point of strong encryption.

What makes this different

Feature	Other plugins	Engram
Compression	None or ~3x	4–5x / 8–12x / 20–50x per tier
Encryption	None	Post-quantum (ML-KEM-768), per-artifact keys
Search	Decompress everything	Semantic index + vector search, no decompression
Key handling	Key file on disk	Rust sidecar, Keychain, keys never in Python
Retrieval	Keyword only	Hybrid: BM25 + HNSW + reciprocal rank fusion
AI platforms	One	Claude, Codex, ChatGPT, Cursor, Copilot, any
Telemetry	Sometimes	Zero. Nothing leaves your machine.

See what you’re wasting in 30 seconds

Install Engram, run the guided setup, and preview what would be compressed. The dry run scans your disk and shows you the file count, total size, and what would move to each tier. No files are modified until you explicitly choose to run it.

116 tests. 8 rounds of security review.

Dig deeper:

Open source. MIT license. Works with any AI assistant that writes files.

github.com/qinnovates/engram

Written with AI assistance (Claude). All claims verified against primary sources. The author takes full responsibility for all content.