arXiv Preprint · LLM Agent Memory

TriMem: Rethinking How to Remember
Beyond Atomic Facts in Lifelong LLM Agent Memory

A three-granularity memory architecture · verbatim dialogue + atomic facts + synthesized profiles

Jingwei Sun¹* · Jianing Zhu²* · Jiangchao Yao³ · Tongliang Liu⁴ · Bo Han¹†
¹TMLR Group, Hong Kong Baptist University · ²The University of Texas at Austin · ³Shanghai Jiao Tong University · ⁴Sydney AI Center, The University of Sydney
*Equal contribution · †Corresponding author: Bo Han <bhanml@comp.hkbu.edu.hk>
3 Memory Granularities · Raw + Fact + Profile
0 Parameter Updates · TextGrad Prompt Evolution
~14× Token Compression · 1.2k vs 16.8k full context
+14.39% Best F1 Gain · over prior SOTA (GPT-5-nano)
Figure 1. Comparison with previous systems. TriMem establishes a three-level architecture that leverages raw dialogue to guarantee information fidelity in storage, relies on atomic facts for efficient retrieval, and provides integrated profiles for in-depth reasoning. Construction prompts are continuously optimized from answer feedback via TextGrad.

Abstract

To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning.

Such fact-centric designs inevitably discard fine-grained details in the original dialogue and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles.

We propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient retrieval, and synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization to iteratively refine extraction and profiling prompts via response-quality feedback, achieving lifelong evolution without any parameter updating. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines.

Three Limitations of Fact-Centric Memory

Existing memory systems treat extracted facts as the atomic unit for all three stages — storage, retrieval, and reasoning. We analyze them along these axes and uncover three concrete failure modes that have been overlooked.

Obs. 1

Lossy Storage

Fact extraction is irreversible compression. Modifiers and contextual details (e.g., "with trans people") are dropped, so even when retrieval is correct the answer is incomplete. Extracted facts lose 14.5% more reference-answer tokens than the raw dialogue.
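
One way to make the loss measurable is to check how many reference-answer tokens survive in each representation. The sketch below is illustrative only: the tokenizer, the function names, and the example sentences are assumptions rather than the paper's evaluation code, and only the dropped modifier "with trans people" comes from the text above.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer_coverage(reference: str, memory: str) -> float:
    """Fraction of distinct reference-answer tokens that appear in the memory text."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    return len(ref & tokens(memory)) / len(ref)


# Hypothetical example: the raw turn keeps the modifier, the extracted fact drops it.
raw_turn = "Speaker A: I went to a support group with trans people last week."
extracted_fact = "Speaker A attended a support group."
reference_answer = "a support group with trans people"

raw_coverage = answer_coverage(reference_answer, raw_turn)
fact_coverage = answer_coverage(reference_answer, extracted_fact)
print(raw_coverage, fact_coverage)
```

In this toy case the raw turn covers every reference token while the extracted fact covers only half of them, so a correct retrieval of the fact would still give an incomplete answer.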

Obs. 2

Shallow Reasoning

Reasoning quality collapses on multi-evidence questions (F1 35.8 vs. 55.3 for single-evidence). Isolated facts cannot support emotional inference, behavioural modeling, or holistic semantic portraits across dispersed evidence.

Obs. 3

Suboptimal Prompts

Fixed hand-written extraction prompts cannot adapt to heterogeneous dialogue styles: the Pomodoro technique is sometimes named explicitly and sometimes described only as "25 minutes on, 5 off". As a result, performance varies substantially across speaker groups.

Figure 2. Analysis of existing memory systems. Although fact-only pipelines enable efficient retrieval, they suffer from lossy storage and shallow reasoning; fixed extraction prompts further destabilize performance across conversational styles.

TriMem: Three Coexisting Granularities

TriMem keeps the efficiency of fact-based retrieval, but anchors each fact to its source dialogue for fidelity and aggregates facts into entity profiles for deep reasoning. Extraction and profile prompts evolve over time via TextGrad — no model weights are updated.

Granularity 1

Raw Dialogue Segments

Each extracted entry e_i stores a source identifier e_i.src that points back to the original turns. Whenever a fact is retrieved, its verbatim context can be recovered, preserving every contextual detail and modifier.

Stage: Storage — fidelity
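
A minimal sketch of source anchoring, assuming the source identifier is an inclusive range of turn indices. The `Fact` class, the `recover_context` helper, and the toy dialogue are hypothetical; the text above does not specify how src is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str
    src: tuple[int, int]  # assumed encoding: inclusive (first_turn, last_turn)


def recover_context(fact: Fact, dialogue: list[str]) -> list[str]:
    """Return the verbatim turns that the fact was extracted from."""
    first, last = fact.src
    return dialogue[first:last + 1]


# Toy dialogue, invented for illustration.
dialogue = [
    "A: How was your weekend?",
    "B: Great, I went to a support group with trans people.",
    "A: That sounds meaningful.",
]
fact = Fact(text="B attended a support group.", src=(1, 1))
context = recover_context(fact, dialogue)
print(context)
```

Retrieval still matches against the short fact text; the raw turns are fetched only after a fact has been selected, so the modifier lost during extraction is available again at answer time.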
Granularity 2

Extracted Atomic Facts

A multi-dimensional schema (restatement, time, person, location, entities, …) produces structured tuples per sliding window. The agent retrieves top-K relevant facts via dense similarity, enabling precise semantic matching.

Stage: Retrieval — efficiency
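
Top-K retrieval by dense similarity can be sketched as follows. Cosine similarity and the toy three-dimensional vectors are assumptions for illustration; the embedding model and similarity function are not specified in the text above.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: list[float], fact_vecs: list[list[float]], k: int) -> list[int]:
    """Indices of the k fact vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(fact_vecs)),
        key=lambda i: cosine(query_vec, fact_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for the output of a real encoder.
fact_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]
best = top_k(query_vec, fact_vecs, k=2)
print(best)
```

A production system would normally delegate this ranking to a vector index; the linear scan here only shows what is being computed.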
Granularity 3

Synthesized Entity Profiles

Facts are grouped by person and synthesized into entity profiles (identity, personality, career, interests, behavioural tendencies). Profiles pre-integrate knowledge so the agent can reason holistically without re-aggregating scattered facts.

Stage: Reasoning — depth
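
The grouping step that precedes profile synthesis can be sketched like this. The fact dictionary layout and the `synthesize` callable are assumptions: in the actual system the synthesis is done by an LLM, which is replaced here by a plain function so the example runs offline.

```python
from collections import defaultdict
from typing import Callable


def group_facts_by_person(facts: list[dict]) -> dict[str, list[str]]:
    """Collect fact texts under every person they mention."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for fact in facts:
        for person in fact["persons"]:
            grouped[person].append(fact["text"])
    return dict(grouped)


def build_profiles(
    facts: list[dict], synthesize: Callable[[str, list[str]], str]
) -> dict[str, str]:
    """Produce one profile per person from that person's grouped facts."""
    grouped = group_facts_by_person(facts)
    return {person: synthesize(person, texts) for person, texts in grouped.items()}


# Invented facts; `join_facts` is a stand-in for the LLM synthesis call.
facts = [
    {"text": "Ana works as a nurse.", "persons": ["Ana"]},
    {"text": "Ana and Ben hike on weekends.", "persons": ["Ana", "Ben"]},
]


def join_facts(person: str, texts: list[str]) -> str:
    return f"{person}: " + " ".join(texts)


profiles = build_profiles(facts, join_facts)
print(profiles)
```

Because profiles are built once at construction time, the answering model does not have to re-aggregate the same scattered facts for every question.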
Construction

Multi-dimensional Extraction

Each window w_i is processed by an agent driven by a structured prompt with dimensions for restatement, timestamps, persons, locations, and entities, plus a src dimension that links each fact back to its raw dialogue.
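
The windowing that feeds extraction can be sketched as below. The default values follow the l = 40, s = 38 setting reported in the ablation; reading l as the window length and s as the stride (giving a two-turn overlap) is an assumption, as is the function name.

```python
def sliding_windows(
    turns: list[str], length: int = 40, stride: int = 38
) -> list[tuple[int, list[str]]]:
    """Split a dialogue into overlapping windows.

    Returns (start_index, window_turns) pairs so that each extracted fact
    can later be mapped back to absolute turn positions.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    windows = []
    start = 0
    while start < len(turns):
        windows.append((start, turns[start:start + length]))
        if start + length >= len(turns):
            break  # the last window already reaches the end of the dialogue
        start += stride
    return windows


turns = [f"turn {i}" for i in range(100)]
windows = sliding_windows(turns)
print([start for start, _ in windows])
```

Keeping the start index with each window is what makes a traceable src field cheap to produce: a position inside the window plus the start index gives the absolute turn number.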

Retrieval

Search-Query Reformulation

Instead of using the raw question, an agent first analyses the required information and key entities. The resulting structured query enables more accurate matching against the fact bank, with raw dialogues and profiles fetched via predefined indices.

Lifelong Evolution

TextGrad Prompt Optimization

Failure cases are scored by an LLM judge; an LLM "gradient" agent emits natural-language rewriting instructions that update the extraction and profile prompts. No parameter updates — only prompts evolve, so the system stays compatible with API-only models.
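
The feedback loop can be sketched in a library-agnostic way. This is not the TextGrad API: every callable below (`answer`, `judge`, `gradient`, `rewrite`) is a hypothetical stand-in for an LLM call, and the threshold and default step count are illustrative choices.

```python
from typing import Callable, Sequence


def evolve_prompt(
    prompt: str,
    cases: Sequence[dict],
    answer: Callable[[str, dict], str],
    judge: Callable[[dict, str], float],
    gradient: Callable[[str, dict, str], str],
    rewrite: Callable[[str, list[str]], str],
    steps: int = 4,
    threshold: float = 0.5,
) -> str:
    """Refine a prompt from natural-language critiques of failed cases."""
    for _ in range(steps):
        critiques = []
        for case in cases:
            predicted = answer(prompt, case)
            if judge(case, predicted) < threshold:
                critiques.append(gradient(prompt, case, predicted))
        if not critiques:
            break  # no remaining failures, so stop early
        prompt = rewrite(prompt, critiques)
    return prompt


# Toy stand-ins so the loop runs without any model access.
cases = [{"gold": "support group with trans people", "lossy": "support group"}]


def toy_answer(prompt: str, case: dict) -> str:
    return case["gold"] if "modifiers" in prompt else case["lossy"]


def toy_judge(case: dict, predicted: str) -> float:
    return 1.0 if predicted == case["gold"] else 0.0


def toy_gradient(prompt: str, case: dict, predicted: str) -> str:
    return "Keep modifiers."


def toy_rewrite(prompt: str, critiques: list[str]) -> str:
    return prompt + " " + " ".join(sorted(set(critiques)))


evolved = evolve_prompt(
    "Extract atomic facts.", cases, toy_answer, toy_judge, toy_gradient, toy_rewrite
)
print(evolved)
```

Only the prompt string changes between iterations, which is why this style of optimization works with models that are reachable solely through an API.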

Figure 3. Overview of TriMem. Historical dialogue is segmented into windows, multi-dimensional facts are extracted with traceable source indices, and entity profiles are constructed per person. Queries trigger structured search; failure cases feed back to refine extraction and profile prompts via TextGrad.

Main Results

We compare TriMem against Naïve RAG and six competitive memory systems (Mem0, MemoryOS, A-Mem, LightMem, SimpleMem, xMemory) on LoCoMo and PerLTQA across high-capability and efficient LLM backbones. TriMem consistently delivers the best average performance while keeping retrieval at roughly 1.2k tokens.

54.26% (+3.96) · LoCoMo Avg. F1 (GPT-4.1-mini)
57.04% (+14.39) · LoCoMo Avg. F1 (GPT-5-nano)
~1.2k · Tokens per Retrieval (vs. 16.8k full-context)
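
The headline figures can be re-derived from the Table 1 entries. The sketch below assumes the gains are measured against SimpleMem's average F1, since that is the baseline whose differences match the reported values; variable names are illustrative.

```python
# Token counts from Table 1 (GPT-4.1-mini rows).
full_context_tokens = 16_863
trimem_tokens = 1_217
compression = full_context_tokens / trimem_tokens

# Average F1 from Table 1; the SimpleMem baseline is an assumption (see lead-in).
gain_gpt41_mini = round(54.26 - 50.30, 2)
gain_gpt5_nano = round(57.04 - 42.65, 2)

print(f"compression ~{compression:.1f}x")
print(f"F1 gain (GPT-4.1-mini): +{gain_gpt41_mini}")
print(f"F1 gain (GPT-5-nano):   +{gain_gpt5_nano}")
```

Both gains are differences in F1 points. On GPT-5-nano, LightMem's average F1 in Table 1 (42.93) is marginally higher than SimpleMem's, which would put the margin over the strongest baseline at 14.11 points.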
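| Backbone | Method | MultiHop BLEU | MultiHop F1 | Temporal BLEU | Temporal F1 | OpenDomain BLEU | OpenDomain F1 | SingleHop BLEU | SingleHop F1 | Average BLEU | Average F1 | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1-mini | LoCoMo | 8.00 | 17.26 | 10.17 | 14.89 | 8.29 | 16.28 | 17.43 | 19.36 | 13.62 | 17.85 | 16,863 |
| GPT-4.1-mini | Naïve RAG | 11.49 | 13.24 | 20.52 | 28.80 | 11.79 | 10.75 | 22.85 | 30.29 | 19.59 | 25.64 | 1,119 |
| GPT-4.1-mini | Mem0 | 28.81 | 31.44 | 35.41 | 46.24 | 18.51 | 17.93 | 31.25 | 35.34 | 30.88 | 35.81 | 1,153 |
| GPT-4.1-mini | MemoryOS | 16.46 | 24.02 | 34.78 | 46.52 | 14.89 | 19.58 | 36.18 | 43.92 | 30.95 | 39.30 | 936 |
| GPT-4.1-mini | A-Mem | 15.11 | 20.66 | 41.57 | 50.94 | 11.18 | 13.20 | 38.25 | 43.72 | 33.01 | 39.10 | 1,276 |
| GPT-4.1-mini | LightMem | 32.93 | 40.33 | 47.53 | 55.23 | 18.31 | 21.91 | 37.68 | 48.39 | 37.66 | 46.69 | 695 |
| GPT-4.1-mini | SimpleMem | 32.40 | 39.33 | 43.69 | 58.01 | 19.56 | 24.50 | 43.41 | 53.99 | 39.97 | 50.30 | 587 |
| GPT-4.1-mini | TriMem (Ours) | 35.20 | 42.59 | 49.56 | 64.72 | 36.86 | 43.88 | 45.25 | 55.36 | 43.79 | 54.26 | 1,217 |
| GPT-4o | LoCoMo | 19.64 | 19.20 | 9.50 | 13.95 | 11.87 | 16.60 | 13.81 | 16.12 | 13.86 | 16.26 | 16,863 |
| GPT-4o | Naïve RAG | 14.36 | 15.35 | 11.48 | 16.17 | 9.03 | 9.09 | 26.67 | 35.03 | 20.15 | 25.88 | 1,119 |
| GPT-4o | Mem0 | 25.52 | 32.36 | 32.48 | 42.70 | 14.50 | 18.50 | 30.02 | 39.84 | 28.74 | 37.74 | 1,195 |
| GPT-4o | MemoryOS | 22.52 | 31.76 | 38.31 | 47.08 | 12.91 | 18.06 | 38.26 | 43.67 | 33.81 | 40.60 | 944 |
| GPT-4o | A-Mem | 20.90 | 26.12 | 35.39 | 48.64 | 10.74 | 12.33 | 37.11 | 42.08 | 32.14 | 38.67 | 1,152 |
| GPT-4o | LightMem | 35.30 | 45.16 | 43.60 | 58.57 | 10.56 | 23.20 | 36.72 | 46.60 | 36.26 | 47.37 | 677 |
| GPT-4o | SimpleMem | 31.34 | 35.58 | 35.78 | 46.96 | 18.96 | 17.01 | 37.11 | 43.94 | 34.64 | 41.36 | 627 |
| GPT-4o | TriMem (Ours) | 40.36 | 46.00 | 51.39 | 60.41 | 39.27 | 50.15 | 40.61 | 47.78 | 42.73 | 50.23 | 1,272 |
| GPT-5-nano | LoCoMo | 20.45 | 19.04 | 12.69 | 16.56 | 13.83 | 20.85 | 13.50 | 15.23 | 14.62 | 16.56 | 16,863 |
| GPT-5-nano | Naïve RAG | 10.13 | 13.29 | 8.78 | 13.09 | 9.25 | 12.24 | 20.29 | 28.44 | 15.34 | 21.46 | 1,119 |
| GPT-5-nano | Mem0 | 22.55 | 28.58 | 35.52 | 48.82 | 18.33 | 16.75 | 28.99 | 35.65 | 28.51 | 35.92 | 1,074 |
| GPT-5-nano | MemoryOS | 10.74 | 23.50 | 32.50 | 39.71 | 10.02 | 20.30 | 34.28 | 40.34 | 28.09 | 35.88 | 952 |
| GPT-5-nano | A-Mem | 15.54 | 20.11 | 27.23 | 32.43 | 10.86 | 12.55 | 27.26 | 31.91 | 24.09 | 28.65 | 1,175 |
| GPT-5-nano | LightMem | 28.63 | 38.21 | 39.72 | 55.51 | 18.79 | 22.74 | 31.19 | 42.01 | 31.73 | 42.93 | 723 |
| GPT-5-nano | SimpleMem | 25.42 | 33.28 | 32.15 | 45.75 | 20.77 | 24.31 | 39.65 | 46.71 | 34.30 | 42.65 | 655 |
| GPT-5-nano | TriMem (Ours) | 34.86 | 45.25 | 42.45 | 57.05 | 33.55 | 40.52 | 54.26 | 62.88 | 46.96 | 57.04 | 1,256 |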
Table 1. Performance on LoCoMo with high-capability backbones. TriMem achieves the best average BLEU and F1 on every backbone, with especially large gains on the OpenDomain split.
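| Backbone | Method | MultiHop BLEU | MultiHop F1 | Temporal BLEU | Temporal F1 | OpenDomain BLEU | OpenDomain F1 | SingleHop BLEU | SingleHop F1 | Average BLEU | Average F1 | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | LoCoMo | 12.67 | 20.54 | 12.32 | 18.55 | 10.59 | 14.39 | 19.76 | 23.78 | 16.34 | 21.51 | 16,863 |
| Qwen3-8B | Mem0 | 28.32 | 30.07 | 23.15 | 26.15 | 11.79 | 15.15 | 30.75 | 34.97 | 27.54 | 30.10 | 1,140 |
| Qwen3-8B | MemoryOS | 14.38 | 22.72 | 18.67 | 22.79 | 11.06 | 13.52 | 25.65 | 33.52 | 21.22 | 28.06 | 911 |
| Qwen3-8B | A-Mem | 16.02 | 21.08 | 28.10 | 37.51 | 14.01 | 14.19 | 33.60 | 40.77 | 28.01 | 34.83 | 1,180 |
| Qwen3-8B | LightMem | 22.84 | 32.54 | 37.62 | 48.37 | 18.05 | 19.02 | 23.03 | 31.37 | 25.73 | 34.36 | 740 |
| Qwen3-8B | SimpleMem | 23.39 | 30.39 | 24.66 | 34.51 | 14.04 | 15.39 | 35.73 | 41.26 | 29.81 | 36.25 | 608 |
| Qwen3-8B | xMemory | 28.44 | 39.13 | 28.65 | 35.41 | 17.76 | 21.57 | 40.66 | 50.57 | 34.49 | 43.51 | 2,230 |
| Qwen3-8B | TriMem (Ours) | 33.09 | 41.22 | 38.71 | 53.13 | 30.59 | 37.64 | 45.10 | 52.52 | 40.66 | 49.65 | 1,339 |
| Llama-3.1-8B-Instruct | LoCoMo | 13.73 | 23.36 | 13.15 | 20.30 | 11.54 | 19.42 | 18.64 | 25.86 | 16.15 | 23.84 | 16,863 |
| Llama-3.1-8B-Instruct | Mem0 | 13.27 | 16.40 | 8.26 | 12.62 | 7.45 | 8.45 | 21.75 | 31.28 | 16.49 | 23.24 | 1,085 |
| Llama-3.1-8B-Instruct | MemoryOS | 13.57 | 22.63 | 19.18 | 23.31 | 10.59 | 13.01 | 23.46 | 31.05 | 19.95 | 26.77 | 964 |
| Llama-3.1-8B-Instruct | A-Mem | 15.80 | 22.84 | 23.79 | 36.19 | 11.19 | 12.51 | 31.19 | 37.86 | 25.58 | 33.18 | 1,340 |
| Llama-3.1-8B-Instruct | LightMem | 13.19 | 19.64 | 16.93 | 28.06 | 17.39 | 20.68 | 27.62 | 41.06 | 22.11 | 33.16 | 758 |
| Llama-3.1-8B-Instruct | SimpleMem | 18.81 | 26.22 | 21.15 | 30.44 | 15.81 | 18.77 | 26.79 | 31.23 | 23.47 | 29.37 | 674 |
| Llama-3.1-8B-Instruct | xMemory | 21.89 | 31.24 | 21.78 | 26.84 | 12.37 | 16.62 | 27.75 | 41.36 | 24.47 | 34.94 | 2,375 |
| Llama-3.1-8B-Instruct | TriMem (Ours) | 25.42 | 34.56 | 25.98 | 32.36 | 28.40 | 32.71 | 35.76 | 43.20 | 31.37 | 38.70 | 1,388 |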
Table 2. Performance on LoCoMo with efficient open-source backbones. TriMem keeps the performance/efficiency lead even on 8B models, and unlike xMemory it does not require access to model output logits.
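| Method | Qwen3-8B Profile | Qwen3-8B Social Rel. | Qwen3-8B Events | Qwen3-8B Dialogues | Llama-3.1-8B-Instruct Profile | Llama-3.1-8B-Instruct Social Rel. | Llama-3.1-8B-Instruct Events | Llama-3.1-8B-Instruct Dialogues |
|---|---|---|---|---|---|---|---|---|
| Full-Context | 65.80 | 56.72 | 52.75 | 18.51 | 52.46 | 54.58 | 47.54 | 17.27 |
| Mem0 | 89.56 | 76.46 | 66.48 | 27.59 | 73.04 | 72.29 | 57.14 | 26.31 |
| LightMem | 64.93 | 78.00 | 73.03 | 47.01 | 53.85 | 74.08 | 69.21 | 44.72 |
| SimpleMem | 88.12 | 82.40 | 79.87 | 42.09 | 84.64 | 76.46 | 70.39 | 37.90 |
| TriMem (Ours) | 92.46 | 83.23 | 85.72 | 55.79 | 92.17 | 82.28 | 78.17 | 45.01 |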
Table 3. Generalization on PerLTQA. LLM-judged correctness (%) across personal profiles, social relationships, historical events, and dialogue memories. TriMem achieves the best score on every sub-task.

Ablation & Analysis

We isolate each design choice to understand why TriMem works: removing either the profile or the raw-dialogue branch hurts; prompt evolution helps for up to about four steps; retrieval quality peaks around K = 25; and window size 40 balances quality with construction time.

Figure 4. Ablation of profile and raw dialogue branches. Removing either component causes a noticeable performance drop, confirming that both depth (profiles) and fidelity (raw dialogue) matter.
Figure 5. Impact of TextGrad evolution steps. Performance improves up to four refinement rounds; further updates over-refine prompts and hurt performance.
Figure 6. Sensitivity to the number of retrieved memory entries. Too few entries miss key facts; too many entries introduce noise. Top‑K = 25 strikes the best balance.
Figure 7. Necessity of the structured search query and efficiency analysis. Search-query reformulation costs a small amount of retrieval time but yields substantial accuracy gains; window size 40 keeps memory construction time competitive.
Figure 8. Ablation over window size. Smaller windows also work well, but construction time grows quickly with the number of windows. We settle on l = 40, s = 38 as the default.

Citation

If you find TriMem useful, please cite our paper.

@article{sun2026trimem,
  title   = {Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory},
  author  = {Jingwei Sun and Jianing Zhu and Jiangchao Yao and Tongliang Liu and Bo Han},
  journal = {arXiv preprint arXiv:2605.19952},
  year    = {2026}
}