Architecting Brain's Memory To Solve AI Context Persistence
A security engineer's approach to tackling AI context persistence and hardware constraints by modeling how the human brain stores and retrieves memory.
Architecting Brain’s Memory To Solve AI Context Persistence
A Semantic Indexing, KV Caching, and Lookup Tables approach for AI Memory.
AI context windows have a hard limit. Fill it up and the oldest memories fall off. Your assistant forgets what you told it last week.
Engram solves this the way the brain does. Your brain doesn’t hold everything in working memory. It tiers: recent experiences stay vivid (hot), recent days consolidate (warm), older memories compress into patterns (cold), deep memories take effort to surface (frozen). Each tier trades retrieval speed for storage efficiency.
This isn’t a new idea in AI. DeepSeek-V2 (arXiv:2405.04434) applied the same principle to attention itself: instead of recalculating full key/value tensors for every token, they precompute and cache compressed latent vectors — reducing KV cache by 93.3% and achieving 5.76x throughput. Same insight, different layer of the stack: don’t recompute what you can store compressed and recall on demand. DeepSeek compressed the attention cache. Engram compresses the context memory.
Engram applies this to your AI’s memory. Four compression tiers, a searchable semantic index, and optional post-quantum encryption — so months of context fit in the same token budget that used to hold a few sessions. Save hardware resources at scale while protecting what matters.
The brain already solved this
Your brain doesn’t hold everything in working memory. It tiers. Recent experiences stay vivid and fast. Older memories compress into patterns. Deep memories take the right cue to surface.
I applied the same architecture to AI memory:
| Tier | Brain Equivalent | Compression | Retrieval |
|---|---|---|---|
| Hot | Working memory | 1x (raw) | Instant |
| Warm | Recent memory | 4–5x | ~10ms |
| Cold | Long-term memory | 8–12x | ~500ms |
| Frozen | Deep memory | 20–50x | ~5 seconds |
Those ratios aren’t from cranking compression levels. That gets you 3.2x to 3.8x, which is barely noticeable across four tiers. Each tier applies different data transformations before the compressor even runs.
Warm strips whitespace from JSON. That’s 30–40% of most pretty-printed session logs gone before compression starts.
Cold strips boilerplate. Every AI session repeats the same 2,000–5,000 token system prompt verbatim. Across 4,500 sessions, that’s the same block of text repeated 4,500 times. Engram replaces these with 64-byte hash references and stores the original once. Then a dictionary trained on your actual session logs teaches the compressor the shared schema. It only compresses what’s actually unique.
Frozen converts JSONL to columnar Parquet. The string “role” appears on every single line of a conversation log. In a 10,000-turn session, that’s 10,000 redundant copies of every key name. Parquet transposes this into columns. The role column has two values. Run-length encoded, it compresses to almost nothing. Timestamps are monotonically increasing integers. Delta encoded, they compress to almost nothing. ClickHouse achieves 170x on logs with this approach.
Only the actual content carries real entropy. Everything else compresses away.
Lookup tables all the way down
The architecture is nested lookup tables — the same pattern DeepSeek uses when they absorb projection matrices into precomputed operations, and the same pattern MemoryFormer (arXiv:2411.12992) uses when it replaces linear layers with hash table lookups.
Every retrieval step in Engram is a lookup, not a recomputation:
| Layer | Lookup Table | What It Replaces |
|---|---|---|
| Keyword index | keyword → matching artifacts | Scanning every file for a string |
| Compression dictionary | byte pattern → short code | Relearning compression patterns per file |
| Boilerplate store | hash → full prompt text | Storing 4,500 copies of the same system prompt |
| HNSW vector graph | query embedding → nearest neighbors | Linear scan over all embeddings |
| PQ codebook | centroid ID → approximate vector | Storing full 3,072-byte embeddings |
| Binary embeddings | 96 bytes → Hamming distance | Full float32 cosine similarity |
The compression dictionary alone cuts cold-tier ratio from 3.5x to 8–12x. The boilerplate store eliminates 40–70% of total content before compression even starts. Product quantization reduces embedding storage by 384x for frozen artifacts.
Semantic indexing: the hippocampus
Compression without search is a write-only archive. If you can’t find a memory, it doesn’t matter how efficiently it’s stored.
Every artifact gets indexed before compression. Keywords extracted. Summary generated. The index is under 1 MB for thousands of artifacts. Always loaded. Never compressed. It’s the hippocampus of the system: a small structure that knows where everything is stored.
When your AI starts a session, Engram feeds it a budget-optimized block of relevant summaries — not the full files. Summaries cost 10–20% of the tokens. If a summary isn’t enough, the assistant explicitly recalls the full artifact. The AI never decompresses everything hoping to find something.
The retrieval stack combines keyword lookup (BM25-equivalent) with vector similarity (HNSW) and reciprocal rank fusion — the hybrid approach that Anthropic’s contextual retrieval research showed reduces retrieval failure by 67% compared to embeddings alone.
Your sessions are a target
NIST proposed deprecating RSA-2048 by 2030 (IR 8547). An adversary who captures your plaintext session files today can wait for quantum computers. That’s harvest-now-decrypt-later, and it’s a published federal timeline, not a theoretical concern.
Engram uses ML-KEM-768 (NIST FIPS 203), the post-quantum algorithm that OpenSSH 10.0 made the default for all key exchange in April 2025. Private keys are handled by a compiled Rust sidecar with memory locking and deterministic zeroing. They never enter Python’s memory. Never touch disk. Never appear in process arguments.
Your keys live in Keychain or Vault. Never as files. If you lose the key, data is gone forever. That’s the point of strong encryption.
What makes this different
| Feature | Other plugins | Engram |
|---|---|---|
| Compression | None or ~3x | 4–5x / 8–12x / 20–50x per tier |
| Encryption | None | Post-quantum (ML-KEM-768), per-artifact keys |
| Search | Decompress everything | Semantic index + vector search, no decompression |
| Key handling | Key file on disk | Rust sidecar, Keychain, keys never in Python |
| Retrieval | Keyword only | Hybrid: BM25 + HNSW + reciprocal rank fusion |
| AI platforms | One | Claude, Codex, ChatGPT, Cursor, Copilot, any |
| Telemetry | Sometimes | Zero. Nothing leaves your machine. |
See what you’re wasting in 30 seconds
Install Engram, run the guided setup, and preview what would be compressed. The dry run scans your disk and shows you the file count, total size, and what would move to each tier. No files are modified until you explicitly choose to run it.
116 tests. 8 rounds of security review.
Dig deeper:
- How the compression pipeline works
- Encryption architecture (simple vs envelope mode)
- Threat model
- Full CLI reference
- Use cases and examples
Open source. MIT license. Works with any AI assistant that writes files.
Written with AI assistance (Claude). All claims verified against primary sources. The author takes full responsibility for all content.