TriMem: Rethinking How to Remember — Beyond Atomic Facts in Lifelong LLM Agent Memory

Abstract

To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning.

Such fact-centric designs inevitably discard fine-grained details in the original dialogue and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles.

We propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient retrieval, and synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization to iteratively refine the extraction and profiling prompts from response-quality feedback, achieving lifelong evolution without any parameter updates. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines.

Three Limitations of Fact-Centric Memory

Existing memory systems treat extracted facts as the atomic unit for all three stages: storage, retrieval, and reasoning. Analyzing these systems along the same three axes reveals three concrete failure modes that have been overlooked.

Obs. 1

Lossy Storage

Fact extraction is irreversible compression. Modifiers and contextual details (e.g., "with trans people") are dropped, so even when retrieval is correct the answer is incomplete. Extracted facts lose 14.5% more reference-answer tokens than the raw dialogue.

Obs. 2

Shallow Reasoning

Reasoning quality collapses on multi-evidence questions (F1 35.8 vs. 55.3 for single-evidence). Isolated facts cannot support emotional inference, behavioural modeling, or holistic semantic portraits across dispersed evidence.

Obs. 3

Suboptimal Prompts

Fixed hand-written extraction prompts cannot adapt to heterogeneous dialogue styles. For example, the Pomodoro technique is sometimes named explicitly and sometimes described as "25 minutes on, 5 off". As a result, performance varies widely across speaker groups.

Figure 2. Analysis of existing memory systems (case studies of failure modes). Although fact-only pipelines enable efficient retrieval, they suffer from lossy storage and shallow reasoning; fixed extraction prompts further destabilize performance across conversational styles.

TriMem: Three Coexisting Granularities

TriMem keeps the efficiency of fact-based retrieval, but anchors each fact to its source dialogue for fidelity and aggregates facts into entity profiles for deep reasoning. Extraction and profile prompts evolve over time via TextGrad — no model weights are updated.

Granularity 1

Raw Dialogue Segments

Each extracted entry stores a source identifier e_i.src that points back to the original turns. Whenever a fact is retrieved, its verbatim context can be recovered, preserving every contextual detail and modifier.

Stage: Storage — fidelity
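A minimal sketch of source anchoring, assuming that e_i.src can be modelled as a tuple of turn indices. The `Fact` class, the `recover_context` helper, and the three example turns are hypothetical illustrations, not the authors' data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str              # restated atomic fact
    src: tuple[int, ...]   # assumed form of e_i.src: indices of the source turns


def recover_context(fact: Fact, turns: list[str]) -> list[str]:
    """Return the verbatim dialogue turns that a fact was extracted from."""
    return [turns[i] for i in fact.src]


# Made-up dialogue; the extracted fact drops the modifier "with trans people".
turns = [
    "A: I volunteered last weekend.",
    "B: Where?",
    "A: At a support group, working with trans people.",
]
fact = Fact(text="A volunteered at a support group.", src=(0, 2))
context = recover_context(fact, turns)
```

Because the fact carries its own pointer, the dropped modifier is still reachable at answer time without a second retrieval pass.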

Granularity 2

Extracted Atomic Facts

A multi-dimensional schema (restatement, time, person, location, entities, …) produces structured tuples per sliding window. The agent retrieves top-K relevant facts via dense similarity, enabling precise semantic matching.

Stage: Retrieval — efficiency
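The top-K step can be illustrated with plain cosine similarity. This is a sketch under stated assumptions: the source says only "dense similarity", so cosine is an assumed choice, and the three-dimensional vectors stand in for real embeddings.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec: list[float], fact_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k facts most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(fact_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings (hypothetical); a real system would use an embedding model.
fact_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([1.0, 0.1, 0.0], fact_vecs, k=2)
```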

Granularity 3

Synthesized Entity Profiles

Facts are grouped by person and synthesized into entity profiles (identity, personality, career, interests, behavioural tendencies). Profiles pre-integrate knowledge so the agent can reason holistically without re-aggregating scattered facts.

Stage: Reasoning — depth
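The grouping step that precedes profile synthesis might look like the sketch below. The dictionary keys, the `build_profile_prompt` wording, and the example facts are assumptions; only the profile aspects (identity, personality, career, interests, behavioural tendencies) come from the description above.

```python
from collections import defaultdict

PROFILE_ASPECTS = ["identity", "personality", "career", "interests",
                   "behavioural tendencies"]


def group_by_person(facts: list[dict]) -> dict[str, list[str]]:
    """Collect fact restatements under every person they mention."""
    groups: dict[str, list[str]] = defaultdict(list)
    for fact in facts:
        for person in fact["person"]:
            groups[person].append(fact["restatement"])
    return dict(groups)


def build_profile_prompt(person: str, statements: list[str]) -> str:
    """Assemble a (hypothetical) synthesis prompt for one entity profile."""
    bullet_list = "\n".join(f"- {s}" for s in statements)
    aspects = ", ".join(PROFILE_ASPECTS)
    return f"Summarize {person} in terms of {aspects}.\nFacts:\n{bullet_list}"


# Made-up facts for illustration.
facts = [
    {"restatement": "Alice uses the Pomodoro technique.", "person": ["Alice"]},
    {"restatement": "Alice and Bob met at a conference.", "person": ["Alice", "Bob"]},
]
groups = group_by_person(facts)
prompt = build_profile_prompt("Alice", groups["Alice"])
```

The prompt would then be sent to an LLM; that call is omitted here because the source does not specify it.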

Construction

Multi-dimensional Extraction

Each window w_i is processed by an agent driven by a structured prompt with dimensions for restatement, timestamps, persons, locations, and entities — and crucially a src dimension that links the fact back to its raw dialogue.
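The windowing can be sketched as follows, assuming fixed-size windows over dialogue turns. The `stride` parameter and the `(index, turn)` pairing are assumptions; the pairing is what would let an extractor fill the src dimension with absolute turn indices.

```python
def sliding_windows(turns: list[str], size: int = 40,
                    stride: int = 40) -> list[list[tuple[int, str]]]:
    """Split a dialogue into windows of (absolute_turn_index, turn) pairs."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    for start in range(0, len(turns), stride):
        chunk = turns[start:start + size]
        windows.append([(start + offset, turn) for offset, turn in enumerate(chunk)])
        if start + size >= len(turns):
            break  # the last window already reaches the end of the dialogue
    return windows


# 100 placeholder turns yield windows starting at turns 0, 40, and 80.
dialogue = [f"turn {i}" for i in range(100)]
windows = sliding_windows(dialogue)
```

Setting `stride` below `size` gives overlapping windows; whether TriMem overlaps its windows is not stated in the source.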

Retrieval

Search-Query Reformulation

Instead of using the raw question, an agent first analyses the required information and key entities. The resulting structured query enables more accurate matching against the fact bank, with raw dialogues and profiles fetched via predefined indices.
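One way to picture the reformulation step is shown below. The JSON contract, the `StructuredQuery` fields, and the `fake_llm` stub are hypothetical; the source says only that an agent analyses the required information and key entities.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StructuredQuery:
    required_info: str
    entities: list[str] = field(default_factory=list)

    def to_search_text(self) -> str:
        """Flatten the query into text that can be embedded for matching."""
        return " ".join([self.required_info, *self.entities]).strip()


def reformulate(question: str, llm: Callable[[str], str]) -> StructuredQuery:
    """Ask an LLM for a structured query; fall back to the raw question."""
    prompt = ('Return JSON with keys "required_info" and "entities" '
              f"for this question: {question}")
    try:
        data = json.loads(llm(prompt))
        return StructuredQuery(
            required_info=str(data.get("required_info", question)),
            entities=[str(e) for e in data.get("entities", [])],
        )
    except (json.JSONDecodeError, AttributeError, TypeError):
        return StructuredQuery(required_info=question)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return '{"required_info": "time-management method", "entities": ["Alice"]}'


query = reformulate("How does Alice stay focused?", fake_llm)
```

The fallback path keeps retrieval working when the model returns malformed output, which is a design choice of this sketch rather than something the source describes.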

Lifelong Evolution

TextGrad Prompt Optimization

Failure cases are scored by an LLM judge; an LLM "gradient" agent emits natural-language rewriting instructions that update the extraction and profile prompts. No parameter updates — only prompts evolve, so the system stays compatible with API-only models.
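The feedback loop can be sketched with plain callables standing in for the LLM judge, the "gradient" agent, and the prompt rewriter. This does not use the textgrad package; the function names, the pass threshold, and the toy stubs are assumptions, and the default of four steps simply mirrors the ablation finding reported later.

```python
from typing import Callable, Sequence


def evolve_prompt(
    prompt: str,
    failures: Sequence[str],
    judge: Callable[[str, str], float],
    gradient: Callable[[str, Sequence[str]], str],
    apply_update: Callable[[str, str], str],
    max_steps: int = 4,
    pass_score: float = 1.0,
) -> str:
    """Iteratively rewrite a prompt from natural-language feedback."""
    for _ in range(max_steps):
        still_failing = [c for c in failures if judge(prompt, c) < pass_score]
        if not still_failing:
            break  # every tracked case now passes
        instruction = gradient(prompt, still_failing)  # textual "gradient"
        prompt = apply_update(prompt, instruction)     # only the prompt changes
    return prompt


# Toy stubs; in practice each would be an LLM call.
def toy_judge(prompt: str, case: str) -> float:
    return 1.0 if "modifiers" in prompt else 0.0


def toy_gradient(prompt: str, cases: Sequence[str]) -> str:
    return "Preserve modifiers and qualifiers from the source turns."


def toy_apply(prompt: str, instruction: str) -> str:
    return f"{prompt} {instruction}"


evolved = evolve_prompt("Extract atomic facts.", ["case-1"],
                        toy_judge, toy_gradient, toy_apply)
```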

TriMem pipeline — **Figure 3.** Overview of TriMem. Historical dialogue is segmented into windows, multi-dimensional facts are extracted with traceable source indices, and entity profiles are constructed per person. Queries trigger structured search; failure cases feed back to refine extraction and profile prompts via TextGrad.


Main Results

We compare TriMem against Naïve RAG and six competitive memory systems (Mem0, MemoryOS, A-Mem, LightMem, SimpleMem, xMemory) on LoCoMo and PerLTQA across high-capability and efficient LLM backbones. TriMem consistently delivers the best average performance while keeping retrieval cost at roughly 1.2k to 1.4k tokens per query.

54.26% (+3.96)

LoCoMo Avg. F1 (GPT-4.1-mini)

57.04% (+14.39)

LoCoMo Avg. F1 (GPT-5-nano)

~1.2k

Tokens per Retrieval (vs. 16.8k full-context)
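The headline figures can be cross-checked against Table 1 with a few lines of arithmetic. All inputs are copied from the table; reading the two deltas as the gap to SimpleMem's average F1 is an inference from the numbers, not something the source states.

```python
# Average F1 values copied from Table 1.
trimem_f1 = {"GPT-4.1-mini": 54.26, "GPT-5-nano": 57.04}
simplemem_f1 = {"GPT-4.1-mini": 50.30, "GPT-5-nano": 42.65}

# Gap between TriMem and SimpleMem on each backbone.
gains = {model: round(trimem_f1[model] - simplemem_f1[model], 2)
         for model in trimem_f1}

# Retrieval cost relative to the full-context row (GPT-4.1-mini block).
full_context_tokens = 16_863
trimem_tokens = 1_217
reduction_factor = full_context_tokens / trimem_tokens

print(gains)
print(f"{reduction_factor:.1f}x fewer tokens than full context")
```

On GPT-5-nano, LightMem's average F1 (42.93) is slightly higher than SimpleMem's, so the gap to the strongest baseline there is marginally smaller than the headline delta.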

Method	MultiHop BLEU	MultiHop F1	Temporal BLEU	Temporal F1	OpenDomain BLEU	OpenDomain F1	SingleHop BLEU	SingleHop F1	Average BLEU	Average F1	Tokens
GPT-4.1-mini
LoCoMo	8.00	17.26	10.17	14.89	8.29	16.28	17.43	19.36	13.62	17.85	16,863
Naïve RAG	11.49	13.24	20.52	28.80	11.79	10.75	22.85	30.29	19.59	25.64	1,119
Mem0	28.81	31.44	35.41	46.24	18.51	17.93	31.25	35.34	30.88	35.81	1,153
MemoryOS	16.46	24.02	34.78	46.52	14.89	19.58	36.18	43.92	30.95	39.30	936
A-Mem	15.11	20.66	41.57	50.94	11.18	13.20	38.25	43.72	33.01	39.10	1,276
LightMem	32.93	40.33	47.53	55.23	18.31	21.91	37.68	48.39	37.66	46.69	695
SimpleMem	32.40	39.33	43.69	58.01	19.56	24.50	43.41	53.99	39.97	50.30	587
TriMem (Ours)	35.20	42.59	49.56	64.72	36.86	43.88	45.25	55.36	43.79	54.26	1,217
GPT-4o
LoCoMo	19.64	19.20	9.50	13.95	11.87	16.60	13.81	16.12	13.86	16.26	16,863
Naïve RAG	14.36	15.35	11.48	16.17	9.03	9.09	26.67	35.03	20.15	25.88	1,119
Mem0	25.52	32.36	32.48	42.70	14.50	18.50	30.02	39.84	28.74	37.74	1,195
MemoryOS	22.52	31.76	38.31	47.08	12.91	18.06	38.26	43.67	33.81	40.60	944
A-Mem	20.90	26.12	35.39	48.64	10.74	12.33	37.11	42.08	32.14	38.67	1,152
LightMem	35.30	45.16	43.60	58.57	10.56	23.20	36.72	46.60	36.26	47.37	677
SimpleMem	31.34	35.58	35.78	46.96	18.96	17.01	37.11	43.94	34.64	41.36	627
TriMem (Ours)	40.36	46.00	51.39	60.41	39.27	50.15	40.61	47.78	42.73	50.23	1,272
GPT-5-nano
LoCoMo	20.45	19.04	12.69	16.56	13.83	20.85	13.50	15.23	14.62	16.56	16,863
Naïve RAG	10.13	13.29	8.78	13.09	9.25	12.24	20.29	28.44	15.34	21.46	1,119
Mem0	22.55	28.58	35.52	48.82	18.33	16.75	28.99	35.65	28.51	35.92	1,074
MemoryOS	10.74	23.50	32.50	39.71	10.02	20.30	34.28	40.34	28.09	35.88	952
A-Mem	15.54	20.11	27.23	32.43	10.86	12.55	27.26	31.91	24.09	28.65	1,175
LightMem	28.63	38.21	39.72	55.51	18.79	22.74	31.19	42.01	31.73	42.93	723
SimpleMem	25.42	33.28	32.15	45.75	20.77	24.31	39.65	46.71	34.30	42.65	655
TriMem (Ours)	34.86	45.25	42.45	57.05	33.55	40.52	54.26	62.88	46.96	57.04	1,256

Table 1. Performance on LoCoMo with high-capability backbones. TriMem achieves the highest average BLEU and F1 on every backbone, with especially large gains on the OpenDomain split.
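For readers unfamiliar with the F1 column, the sketch below shows token-level F1 as it is commonly computed for QA benchmarks. Whether the reported scores use exactly this tokenization and normalization is an assumption; the example strings are made up.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# An answer that drops a modifier is fully precise but loses recall.
score = token_f1("a support group",
                 "a support group with trans people")
```

This illustrates why lossy storage hurts even when retrieval is correct: every dropped reference token lowers recall and therefore F1.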

Method	MultiHop BLEU	MultiHop F1	Temporal BLEU	Temporal F1	OpenDomain BLEU	OpenDomain F1	SingleHop BLEU	SingleHop F1	Average BLEU	Average F1	Tokens
Qwen3-8B
LoCoMo	12.67	20.54	12.32	18.55	10.59	14.39	19.76	23.78	16.34	21.51	16,863
Mem0	28.32	30.07	23.15	26.15	11.79	15.15	30.75	34.97	27.54	30.10	1,140
MemoryOS	14.38	22.72	18.67	22.79	11.06	13.52	25.65	33.52	21.22	28.06	911
A-Mem	16.02	21.08	28.10	37.51	14.01	14.19	33.60	40.77	28.01	34.83	1,180
LightMem	22.84	32.54	37.62	48.37	18.05	19.02	23.03	31.37	25.73	34.36	740
SimpleMem	23.39	30.39	24.66	34.51	14.04	15.39	35.73	41.26	29.81	36.25	608
xMemory	28.44	39.13	28.65	35.41	17.76	21.57	40.66	50.57	34.49	43.51	2,230
TriMem (Ours)	33.09	41.22	38.71	53.13	30.59	37.64	45.10	52.52	40.66	49.65	1,339
Llama-3.1-8B-Instruct
LoCoMo	13.73	23.36	13.15	20.30	11.54	19.42	18.64	25.86	16.15	23.84	16,863
Mem0	13.27	16.40	8.26	12.62	7.45	8.45	21.75	31.28	16.49	23.24	1,085
MemoryOS	13.57	22.63	19.18	23.31	10.59	13.01	23.46	31.05	19.95	26.77	964
A-Mem	15.80	22.84	23.79	36.19	11.19	12.51	31.19	37.86	25.58	33.18	1,340
LightMem	13.19	19.64	16.93	28.06	17.39	20.68	27.62	41.06	22.11	33.16	758
SimpleMem	18.81	26.22	21.15	30.44	15.81	18.77	26.79	31.23	23.47	29.37	674
xMemory	21.89	31.24	21.78	26.84	12.37	16.62	27.75	41.36	24.47	34.94	2,375
TriMem (Ours)	25.42	34.56	25.98	32.36	28.40	32.71	35.76	43.20	31.37	38.70	1,388

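Table 2. Performance on LoCoMo with efficient open-source backbones. TriMem retains the best average BLEU and F1 on 8B models while using fewer retrieval tokens than xMemory (1,339 vs. 2,230 on Qwen3-8B; 1,388 vs. 2,375 on Llama-3.1-8B-Instruct), and unlike xMemory it does not require access to model output logits.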

Method	Qwen3-8B Profile	Qwen3-8B Social Rel.	Qwen3-8B Events	Qwen3-8B Dialogues	Llama-3.1-8B-Instruct Profile	Llama-3.1-8B-Instruct Social Rel.	Llama-3.1-8B-Instruct Events	Llama-3.1-8B-Instruct Dialogues
Full-Context	65.80	56.72	52.75	18.51	52.46	54.58	47.54	17.27
Mem0	89.56	76.46	66.48	27.59	73.04	72.29	57.14	26.31
LightMem	64.93	78.00	73.03	47.01	53.85	74.08	69.21	44.72
SimpleMem	88.12	82.40	79.87	42.09	84.64	76.46	70.39	37.90
TriMem (Ours)	92.46	83.23	85.72	55.79	92.17	82.28	78.17	45.01

Table 3. Generalization on PerLTQA. LLM-judged correctness (%) across personal profiles, social relationships, historical events, and dialogue memories. TriMem achieves the best score on every sub-task.

Ablation & Analysis

We isolate each design choice to understand why TriMem works: removing either the profile or the raw-dialogue branch hurts; prompt evolution helps up to ~4 steps; retrieval saturates around K = 25; and window size 40 balances quality with construction time.

Figure 4. Ablation of profile and raw dialogue branches. Removing either component causes a noticeable performance drop, confirming that both depth (profiles) and fidelity (raw dialogue) matter.