A three-granularity memory architecture · verbatim dialogue + atomic facts + synthesized profiles
To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning.
Such fact-centric designs inevitably discard fine-grained details in the original dialogue and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles.
We propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient retrieval, and synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization to iteratively refine the extraction and profiling prompts via response-quality feedback, achieving lifelong evolution without any parameter updates. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines.
Existing memory systems treat extracted facts as the atomic unit for all three stages — storage, retrieval, and reasoning. We analyze them along these axes and uncover three concrete failure modes that have been overlooked.
Fact extraction is irreversible compression. Modifiers and contextual details (e.g., "with trans people") are dropped, so even when retrieval is correct the answer is incomplete. Extracted facts lose 14.5% more reference-answer tokens than the raw dialogue.
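One way to make this loss concrete is to measure how many reference-answer tokens are missing from a memory text. The sketch below is only an illustration: the source does not define how the 14.5% figure is computed, so the tokenizer, the function name `reference_token_loss`, and the example sentences are all assumptions.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def reference_token_loss(reference, memory_text):
    """Fraction of reference-answer tokens that do not appear in memory_text."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    available = set(tokens(memory_text))
    missing = sum(1 for tok in ref if tok not in available)
    return missing / len(ref)

# Made-up example: the extracted fact drops the modifier "with trans people".
reference = "a support group with trans people"
raw_dialogue = "I joined a support group with trans people last week"
extracted_fact = "Speaker joined a support group"

raw_loss = reference_token_loss(reference, raw_dialogue)
fact_loss = reference_token_loss(reference, extracted_fact)
```

Under this assumed metric, the raw dialogue covers every reference token while the extracted fact misses half of them, which is the kind of gap the paragraph above describes.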
Reasoning quality collapses on multi-evidence questions (F1 35.8 vs. 55.3 for single-evidence). Isolated facts cannot support emotional inference, behavioural modeling, or holistic semantic portraits across dispersed evidence.
Fixed hand-written extraction prompts cannot adapt to heterogeneous dialogue styles — the Pomodoro technique is sometimes named explicitly, sometimes described as "25 minutes on, 5 off". Performance fluctuates wildly across speaker groups.
TriMem keeps the efficiency of fact-based retrieval, but anchors each fact to its source dialogue for fidelity and aggregates facts into entity profiles for deep reasoning. Extraction and profile prompts evolve over time via TextGrad — no model weights are updated.
Each extracted entry stores a source identifier e_i.src that points back to the original turns. Whenever a fact is retrieved, its verbatim context can be recovered, preserving every contextual detail and modifier.
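The anchoring idea can be sketched with a small data structure. This is a minimal illustration rather than the authors' implementation: the `Fact` class, the choice of turn indices as the source identifier, and the toy dialogue are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str             # atomic fact used for retrieval
    src: tuple[int, ...]  # assumed form of the source identifier: turn indices

def recover_context(fact, turns):
    """Return the verbatim turns a fact points back to, in dialogue order."""
    return [turns[i] for i in sorted(fact.src)]

# Made-up dialogue for illustration only.
turns = [
    "A: I went to a support group last week.",
    "B: Which one?",
    "A: One for working with trans people.",
]
fact = Fact(text="A attended a support group.", src=(0, 2))
context = recover_context(fact, turns)
```

Here the fact text alone has lost the modifier, but following `src` brings back the turn that contains it.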
A multi-dimensional schema (restatement, time, person, location, entities, …) produces structured tuples per sliding window. The agent retrieves top-K relevant facts via dense similarity, enabling precise semantic matching.
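Top-K retrieval by dense similarity can be sketched as follows. The sketch assumes cosine similarity over precomputed embedding vectors; the source does not name the embedding model or the similarity function, and the two-dimensional vectors are toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, fact_vecs, k):
    """Indices of the k fact vectors most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(fact_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy embeddings standing in for real fact embeddings.
fact_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
ranked = top_k([1.0, 0.1], fact_vecs, k=2)
```

In a real system the linear scan would typically be replaced by a vector index, but the ranking logic is the same.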
Facts are grouped by person and synthesized into entity profiles (identity, personality, career, interests, behavioural tendencies). Profiles pre-integrate knowledge so the agent can reason holistically without re-aggregating scattered facts.
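The grouping step can be sketched as below. The synthesis itself is an LLM call in the described system; here it is replaced by a hypothetical stand-in (`join_synthesizer`) that simply concatenates facts, and the example facts are made up.

```python
from collections import defaultdict

def group_facts_by_person(facts):
    """facts: iterable of (person, fact_text) pairs -> {person: [fact_text, ...]}."""
    grouped = defaultdict(list)
    for person, text in facts:
        grouped[person].append(text)
    return dict(grouped)

def build_profiles(facts, synthesize):
    """synthesize: callable(person, fact_texts) -> profile text."""
    grouped = group_facts_by_person(facts)
    return {person: synthesize(person, texts) for person, texts in grouped.items()}

def join_synthesizer(person, fact_texts):
    # Stand-in for the LLM profiling prompt: concatenates the person's facts.
    return f"{person}: " + " ".join(fact_texts)

facts = [
    ("Alice", "Works as a nurse."),
    ("Bob", "Likes hiking."),
    ("Alice", "Plays the cello."),
]
profiles = build_profiles(facts, join_synthesizer)
```

Passing the synthesizer as a callable keeps the grouping logic testable without a model and lets the profiling prompt change independently.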
Each window w_i is processed by an agent driven by a structured prompt with dimensions for restatement, timestamps, persons, locations, and entities, and crucially a src dimension that links the fact back to its raw dialogue.
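Splitting a dialogue into overlapping windows can be sketched as follows. The defaults mirror the l = 40, s = 38 setting reported in the ablation, under the assumption that l is the window length and s is the stride; the source does not spell out what s denotes.

```python
def sliding_windows(turns, length=40, stride=38):
    """Split turns into windows of `length` turns that advance by `stride` turns."""
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    windows = []
    start = 0
    while start < len(turns):
        windows.append(turns[start:start + length])
        if start + length >= len(turns):
            break  # the last window already reaches the end of the dialogue
        start += stride
    return windows

# 100 placeholder turns, represented by their indices.
windows = sliding_windows(list(range(100)))
```

With these values, consecutive windows share two turns, so facts that straddle a window boundary still appear together in at least one window.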
Instead of using the raw question, an agent first analyses the required information and key entities. The resulting structured query enables more accurate matching against the fact bank, with raw dialogues and profiles fetched via predefined indices.
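A rough sketch of this retrieval path is shown below. The `StructuredQuery` fields, the dictionary layouts, and the `assemble_context` helper are hypothetical; the source only states that the query is structured and that raw dialogues and profiles are fetched via predefined indices.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    information_need: str                               # what the question asks for
    entities: list[str] = field(default_factory=list)   # key entities in the question

    def to_retrieval_text(self):
        """Text that would be embedded and matched against the fact bank."""
        if not self.entities:
            return self.information_need
        return f"{self.information_need} (entities: {', '.join(self.entities)})"

def assemble_context(fact_ids, facts, dialogues, profiles):
    """Follow each retrieved fact's src and person keys to the other two granularities."""
    picked = [facts[i] for i in fact_ids]
    src_ids = list(dict.fromkeys(f["src"] for f in picked))    # de-duplicate, keep order
    people = list(dict.fromkeys(f["person"] for f in picked))
    return {
        "facts": [f["text"] for f in picked],
        "dialogues": [dialogues[s] for s in src_ids],
        "profiles": [profiles[p] for p in people if p in profiles],
    }

# Made-up data for illustration.
query = StructuredQuery("study technique used by the speaker", ["Alice"])
facts = {
    0: {"text": "Alice studies in 25-minute blocks.", "person": "Alice", "src": "d1"},
    1: {"text": "Alice takes 5-minute breaks.", "person": "Alice", "src": "d1"},
}
dialogues = {"d1": "Alice: I do 25 minutes on, 5 off."}
profiles = {"Alice": "Alice is a disciplined student."}
context = assemble_context([0, 1], facts, dialogues, profiles)
```

Because dialogues and profiles are reached by key lookup rather than a second similarity search, only the fact bank needs to be searched per question.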
Failure cases are scored by an LLM judge; an LLM "gradient" agent emits natural-language rewriting instructions that update the extraction and profile prompts. No parameter updates — only prompts evolve, so the system stays compatible with API-only models.
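The evolution loop can be sketched in plain Python. This does not use the TextGrad library's API; `evolve_prompt` and the four toy callables are hypothetical stand-ins for the extractor, the LLM judge, the gradient agent, and the prompt rewriter. The default of four steps follows the ablation's observation that evolution helps up to about four steps.

```python
def evolve_prompt(prompt, examples, answer, judge, gradient, rewrite, steps=4):
    """Refine a prompt from failure feedback; no model weights are touched."""
    for _ in range(steps):
        failures = [ex for ex in examples if not judge(ex, answer(prompt, ex))]
        if not failures:
            break  # nothing left to learn from
        feedback = gradient(prompt, failures)  # natural-language rewriting instruction
        prompt = rewrite(prompt, feedback)
    return prompt

def toy_answer(prompt, example):
    # Pretend the extractor keeps modifiers only when the prompt asks for them.
    return example if "keep modifiers" in prompt else example.split(" with ")[0]

def toy_judge(example, output):
    return output == example

def toy_gradient(prompt, failures):
    return "keep modifiers"

def toy_rewrite(prompt, feedback):
    return f"{prompt} Also, {feedback}."

final_prompt = evolve_prompt(
    "Extract atomic facts.",
    ["support group with trans people"],
    toy_answer, toy_judge, toy_gradient, toy_rewrite,
)
```

Since only the prompt string changes between iterations, the same loop works with models that are reachable solely through an API.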
We compare TriMem against Naïve RAG and six competitive memory systems (Mem0, MemoryOS, A-Mem, LightMem, SimpleMem, xMemory) on LoCoMo and PerLTQA across high-capability and efficient LLM backbones. TriMem consistently delivers the best average performance while keeping retrieval tokens between roughly 1.2k and 1.4k.
| Method | MultiHop BLEU | MultiHop F1 | Temporal BLEU | Temporal F1 | OpenDomain BLEU | OpenDomain F1 | SingleHop BLEU | SingleHop F1 | Average BLEU | Average F1 | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1-mini | |||||||||||
| LoCoMo | 8.00 | 17.26 | 10.17 | 14.89 | 8.29 | 16.28 | 17.43 | 19.36 | 13.62 | 17.85 | 16,863 |
| Naïve RAG | 11.49 | 13.24 | 20.52 | 28.80 | 11.79 | 10.75 | 22.85 | 30.29 | 19.59 | 25.64 | 1,119 |
| Mem0 | 28.81 | 31.44 | 35.41 | 46.24 | 18.51 | 17.93 | 31.25 | 35.34 | 30.88 | 35.81 | 1,153 |
| MemoryOS | 16.46 | 24.02 | 34.78 | 46.52 | 14.89 | 19.58 | 36.18 | 43.92 | 30.95 | 39.30 | 936 |
| A-Mem | 15.11 | 20.66 | 41.57 | 50.94 | 11.18 | 13.20 | 38.25 | 43.72 | 33.01 | 39.10 | 1,276 |
| LightMem | 32.93 | 40.33 | 47.53 | 55.23 | 18.31 | 21.91 | 37.68 | 48.39 | 37.66 | 46.69 | 695 |
| SimpleMem | 32.40 | 39.33 | 43.69 | 58.01 | 19.56 | 24.50 | 43.41 | 53.99 | 39.97 | 50.30 | 587 |
| TriMem (Ours) | 35.20 | 42.59 | 49.56 | 64.72 | 36.86 | 43.88 | 45.25 | 55.36 | 43.79 | 54.26 | 1,217 |
| GPT-4o | |||||||||||
| LoCoMo | 19.64 | 19.20 | 9.50 | 13.95 | 11.87 | 16.60 | 13.81 | 16.12 | 13.86 | 16.26 | 16,863 |
| Naïve RAG | 14.36 | 15.35 | 11.48 | 16.17 | 9.03 | 9.09 | 26.67 | 35.03 | 20.15 | 25.88 | 1,119 |
| Mem0 | 25.52 | 32.36 | 32.48 | 42.70 | 14.50 | 18.50 | 30.02 | 39.84 | 28.74 | 37.74 | 1,195 |
| MemoryOS | 22.52 | 31.76 | 38.31 | 47.08 | 12.91 | 18.06 | 38.26 | 43.67 | 33.81 | 40.60 | 944 |
| A-Mem | 20.90 | 26.12 | 35.39 | 48.64 | 10.74 | 12.33 | 37.11 | 42.08 | 32.14 | 38.67 | 1,152 |
| LightMem | 35.30 | 45.16 | 43.60 | 58.57 | 10.56 | 23.20 | 36.72 | 46.60 | 36.26 | 47.37 | 677 |
| SimpleMem | 31.34 | 35.58 | 35.78 | 46.96 | 18.96 | 17.01 | 37.11 | 43.94 | 34.64 | 41.36 | 627 |
| TriMem (Ours) | 40.36 | 46.00 | 51.39 | 60.41 | 39.27 | 50.15 | 40.61 | 47.78 | 42.73 | 50.23 | 1,272 |
| GPT-5-nano | |||||||||||
| LoCoMo | 20.45 | 19.04 | 12.69 | 16.56 | 13.83 | 20.85 | 13.50 | 15.23 | 14.62 | 16.56 | 16,863 |
| Naïve RAG | 10.13 | 13.29 | 8.78 | 13.09 | 9.25 | 12.24 | 20.29 | 28.44 | 15.34 | 21.46 | 1,119 |
| Mem0 | 22.55 | 28.58 | 35.52 | 48.82 | 18.33 | 16.75 | 28.99 | 35.65 | 28.51 | 35.92 | 1,074 |
| MemoryOS | 10.74 | 23.50 | 32.50 | 39.71 | 10.02 | 20.30 | 34.28 | 40.34 | 28.09 | 35.88 | 952 |
| A-Mem | 15.54 | 20.11 | 27.23 | 32.43 | 10.86 | 12.55 | 27.26 | 31.91 | 24.09 | 28.65 | 1,175 |
| LightMem | 28.63 | 38.21 | 39.72 | 55.51 | 18.79 | 22.74 | 31.19 | 42.01 | 31.73 | 42.93 | 723 |
| SimpleMem | 25.42 | 33.28 | 32.15 | 45.75 | 20.77 | 24.31 | 39.65 | 46.71 | 34.30 | 42.65 | 655 |
| TriMem (Ours) | 34.86 | 45.25 | 42.45 | 57.05 | 33.55 | 40.52 | 54.26 | 62.88 | 46.96 | 57.04 | 1,256 |
| Method | MultiHop BLEU | MultiHop F1 | Temporal BLEU | Temporal F1 | OpenDomain BLEU | OpenDomain F1 | SingleHop BLEU | SingleHop F1 | Average BLEU | Average F1 | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | |||||||||||
| LoCoMo | 12.67 | 20.54 | 12.32 | 18.55 | 10.59 | 14.39 | 19.76 | 23.78 | 16.34 | 21.51 | 16,863 |
| Mem0 | 28.32 | 30.07 | 23.15 | 26.15 | 11.79 | 15.15 | 30.75 | 34.97 | 27.54 | 30.10 | 1,140 |
| MemoryOS | 14.38 | 22.72 | 18.67 | 22.79 | 11.06 | 13.52 | 25.65 | 33.52 | 21.22 | 28.06 | 911 |
| A-Mem | 16.02 | 21.08 | 28.10 | 37.51 | 14.01 | 14.19 | 33.60 | 40.77 | 28.01 | 34.83 | 1,180 |
| LightMem | 22.84 | 32.54 | 37.62 | 48.37 | 18.05 | 19.02 | 23.03 | 31.37 | 25.73 | 34.36 | 740 |
| SimpleMem | 23.39 | 30.39 | 24.66 | 34.51 | 14.04 | 15.39 | 35.73 | 41.26 | 29.81 | 36.25 | 608 |
| xMemory | 28.44 | 39.13 | 28.65 | 35.41 | 17.76 | 21.57 | 40.66 | 50.57 | 34.49 | 43.51 | 2,230 |
| TriMem (Ours) | 33.09 | 41.22 | 38.71 | 53.13 | 30.59 | 37.64 | 45.10 | 52.52 | 40.66 | 49.65 | 1,339 |
| Llama-3.1-8B-Instruct | |||||||||||
| LoCoMo | 13.73 | 23.36 | 13.15 | 20.30 | 11.54 | 19.42 | 18.64 | 25.86 | 16.15 | 23.84 | 16,863 |
| Mem0 | 13.27 | 16.40 | 8.26 | 12.62 | 7.45 | 8.45 | 21.75 | 31.28 | 16.49 | 23.24 | 1,085 |
| MemoryOS | 13.57 | 22.63 | 19.18 | 23.31 | 10.59 | 13.01 | 23.46 | 31.05 | 19.95 | 26.77 | 964 |
| A-Mem | 15.80 | 22.84 | 23.79 | 36.19 | 11.19 | 12.51 | 31.19 | 37.86 | 25.58 | 33.18 | 1,340 |
| LightMem | 13.19 | 19.64 | 16.93 | 28.06 | 17.39 | 20.68 | 27.62 | 41.06 | 22.11 | 33.16 | 758 |
| SimpleMem | 18.81 | 26.22 | 21.15 | 30.44 | 15.81 | 18.77 | 26.79 | 31.23 | 23.47 | 29.37 | 674 |
| xMemory | 21.89 | 31.24 | 21.78 | 26.84 | 12.37 | 16.62 | 27.75 | 41.36 | 24.47 | 34.94 | 2,375 |
| TriMem (Ours) | 25.42 | 34.56 | 25.98 | 32.36 | 28.40 | 32.71 | 35.76 | 43.20 | 31.37 | 38.70 | 1,388 |
| Method | Qwen3-8B: Profile | Qwen3-8B: Social Rel. | Qwen3-8B: Events | Qwen3-8B: Dialogues | Llama-3.1-8B-Instruct: Profile | Llama-3.1-8B-Instruct: Social Rel. | Llama-3.1-8B-Instruct: Events | Llama-3.1-8B-Instruct: Dialogues |
|---|---|---|---|---|---|---|---|---|
| Full-Context | 65.80 | 56.72 | 52.75 | 18.51 | 52.46 | 54.58 | 47.54 | 17.27 |
| Mem0 | 89.56 | 76.46 | 66.48 | 27.59 | 73.04 | 72.29 | 57.14 | 26.31 |
| LightMem | 64.93 | 78.00 | 73.03 | 47.01 | 53.85 | 74.08 | 69.21 | 44.72 |
| SimpleMem | 88.12 | 82.40 | 79.87 | 42.09 | 84.64 | 76.46 | 70.39 | 37.90 |
| TriMem (Ours) | 92.46 | 83.23 | 85.72 | 55.79 | 92.17 | 82.28 | 78.17 | 45.01 |
We isolate each design choice to understand why TriMem works: removing either the profile or the raw-dialogue branch hurts; prompt evolution helps up to ~4 steps; retrieval saturates around K = 25; and a window size of 40 balances quality against construction time. We therefore use l = 40, s = 38 as the default.
If you find TriMem useful, please cite our paper.
@article{sun2026trimem,
title = {Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory},
author = {Jingwei Sun and Jianing Zhu and Jiangchao Yao and Tongliang Liu and Bo Han},
journal = {arXiv preprint arXiv:2605.19952},
year = {2026}
}