Benchmarks and methodology - Tex | Memory API for agents

Tex scores 93.3% overall on LoCoMo with the full system. Tex scores 92.2% on LongMemEval_S with active retrieval only. This page shows the category tables, latency, tokens, and methodology behind those numbers.

LoCoMo · 93.3%

Full Tex system vs published baselines. EverMemOS held the previous top published score at 92.3%.

LongMemEval_S · 92.2%

Active retrieval track vs other retrieval-first systems. Emergence AI posted 86.0% on comparable reporting.

We generated answers with gpt-4o-mini and graded them with gpt-4o in an LLM-as-judge setup. Each evaluation ran on a single machine. The exact setup is in Methodology below.

LoCoMo

LoCoMo evaluates long-conversation memory across 10 multi-session conversations with 1,984 questions spanning 5 categories: single-hop retrieval, multi-hop reasoning, temporal reasoning, open-domain inference, and adversarial abstention.

Per-category results

Category	Tex Full	Tex Active	MemMachine v0.2	Hindsight (Gemini-3)
Single-Hop	96.08%	88.23%	94.41%	86.17%
Multi-Hop	92.14%	78.21%	89.72%	70.83%
Temporal	91.90%	87.23%	89.10%	83.80%
Open-Domain	94.79%	67.71%	75.00%	95.12%
Adversarial	99.33%	97.31%	—	—
Overall	93.3%	87.7%	91.69%	89.6%

Competitive landscape

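Tex scores 99.33% on adversarial items: it declines to answer when the transcript does not support an answer. MemMachine v0.2 and Hindsight have no score for this category in the table above. When a published result skips the adversarial bucket, its overall number covers a different question set, so compare headline numbers carefully.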

LongMemEval_S

LongMemEval_S evaluates memory over 500 questions across ~48 sessions each (~115K tokens), testing information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Per-ability results

Ability	Tex Active	Mastra OM (gpt-4o)	Emergence AI	Supermemory	Oracle GPT-4o	Zep
Abstention	100%	—	—	—	—	—
Knowledge Updates	95.8%	85.9%	83.3%	88.5%	85.9%	83.3%
Info Extraction	93.3%	82.1%	100%	97.1%	100%	92.9%
Temporal	91.3%	85.7%	85.7%	76.7%	76.7%	62.4%
Multi-Session	87.6%	79.7%	81.2%	71.4%	80.5%	57.9%
Overall	92.2%	84.2%	86.0%	81.6%	82.4%	71.2%

Competitive landscape

At 92.2%, Tex Active is on par with Oracle GPT-o3 (92.0%) and well above Oracle GPT-4o (82.4%). This suggests that retrieval is finding evidence close to what you would hand-pick for each question.

Latency

Configuration	p50	p90	End-to-end (including the model that reads context)
Tex (Active)	~120ms	~200ms	~0.6s
Tex (Full System)	~350ms	~500ms	~1.0s

Token efficiency

Per-question breakdown (Tex)

Component	Tokens
System prompt	~138
User template	~28
Retrieved context (avg ~24 fragments × ~44 tokens each)	~1,047
Question	~20
Total input per question	~1,233
Output (answer)	~30
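
As a rough check, the per-question figures above can be summed and scaled to the 1,984 LoCoMo questions. This is a minimal sketch that uses only the approximate values from the table. The variable names are illustrative and are not part of any Tex API.

```python
# Approximate per-question input tokens, copied from the table above.
INPUT_COMPONENTS = {
    "system_prompt": 138,
    "user_template": 28,
    "retrieved_context": 1047,
    "question": 20,
}
OUTPUT_TOKENS = 30    # approximate answer length
NUM_QUESTIONS = 1984  # LoCoMo question count

input_per_question = sum(INPUT_COMPONENTS.values())
total_input = input_per_question * NUM_QUESTIONS
total_output = OUTPUT_TOKENS * NUM_QUESTIONS
context_share = INPUT_COMPONENTS["retrieved_context"] / input_per_question

print(input_per_question)                           # 1233
print(f"{total_input / 1e6:.2f}M input tokens")     # 2.45M input tokens
print(f"{total_output / 1e6:.2f}M output tokens")   # 0.06M output tokens
print(f"{context_share:.0%} of input is retrieved context")  # 85% of input is retrieved context
```

The scaled input total lands near the 2.4M reader tokens reported for the LoCoMo run, and retrieved context accounts for most of each prompt.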

LLM tokens consumed on LoCoMo (1,984 questions)

System	Ingestion LLM	Reader	Total	Accuracy
Tex	0	2.4M	2.4M	93.3%
MemMachine Memory	not disclosed	4.2M	4.2M+	87.5%
MemMachine Agent	not disclosed	8.6M	8.6M+	91.7%
Mem0	not disclosed	19.2M	19.2M+	~66%

Tex ingestion cost zero LLM tokens in this run because it used offline embeddings only. If a system runs an LLM per message during ingest, that cost should show up in token usage, so the totals marked "+" above are lower bounds.

Headline efficiency claims

Compared with published Mem0 numbers, Tex is about 27 percentage points more accurate (93.3% vs ~66%), uses roughly 87% fewer tokens (2.4M vs 19.2M+), and runs at about 95% lower latency. Against MemMachine’s memory-only configuration, Tex is about 5.8 percentage points more accurate (93.3% vs 87.5%) on 43% fewer tokens (2.4M vs 4.2M+). On LoCoMo, Tex also has the lowest tokens-per-correct-answer in this comparison: about 1,296.
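
The token and accuracy comparisons can be reproduced from the LoCoMo token table. The sketch below assumes tokens-per-correct-answer means total reader tokens divided by the number of correctly answered questions. The page does not spell out that formula, and the `Run` class and helper names are hypothetical. Competitor totals are lower bounds because their ingestion tokens are not disclosed.

```python
from dataclasses import dataclass

NUM_QUESTIONS = 1984  # LoCoMo question count


@dataclass(frozen=True)
class Run:
    name: str
    reader_tokens: float  # total reader tokens on LoCoMo (from the table)
    accuracy: float       # fraction of questions answered correctly


def tokens_per_correct(run: Run) -> float:
    """Assumed definition: reader tokens / number of correct answers."""
    return run.reader_tokens / (run.accuracy * NUM_QUESTIONS)


def token_reduction(ours: Run, theirs: Run) -> float:
    """Relative token saving of `ours` versus `theirs`, as a fraction."""
    return 1 - ours.reader_tokens / theirs.reader_tokens


def accuracy_gap_points(ours: Run, theirs: Run) -> float:
    """Accuracy difference in percentage points, not relative percent."""
    return (ours.accuracy - theirs.accuracy) * 100


tex = Run("Tex", 2.4e6, 0.933)
mem0 = Run("Mem0", 19.2e6, 0.66)
memmachine_memory = Run("MemMachine Memory", 4.2e6, 0.875)

for other in (mem0, memmachine_memory):
    print(
        f"vs {other.name}: "
        f"{accuracy_gap_points(tex, other):.1f} points, "
        f"{token_reduction(tex, other):.0%} fewer tokens"
    )
print(f"Tex tokens per correct answer: {tokens_per_correct(tex):.1f}")
```

Keeping the accuracy gap in percentage points avoids confusing it with a relative improvement, which would be a much larger number against a ~66% baseline.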

Ablation: what each part of the pipeline adds

On LongMemEval_S, each capability adds the following to accuracy, in percentage points. The gains sum to the 15.6-point gap between the base configuration and the full system.

Capability	What it does	Δ accuracy (pts)
Adaptive retrieval	Adjusts retrieval depth + strategy per query type	+3.6
Query understanding	LLM-free semantic expansion	+1.9
Multi-stage relevance scoring	Progressive filter + rerank	+2.4
Structured reasoning	Enumerate / count across fragments	+3.1
Temporal awareness	Resolve relative dates, build chronology	+3.0
Combined pipeline	Everything working together	+1.6
Total	Base → Full System	76.6% → 92.2% (+15.6)
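
The ablation deltas can be checked against the base and full-system accuracy. This short sketch treats the deltas as additive percentage points, which is an assumption that happens to match the reported totals. The dictionary keys are just labels copied from the table.

```python
# Accuracy gains in percentage points, copied from the ablation table.
ABLATION_DELTAS = {
    "adaptive_retrieval": 3.6,
    "query_understanding": 1.9,
    "multi_stage_relevance_scoring": 2.4,
    "structured_reasoning": 3.1,
    "temporal_awareness": 3.0,
    "combined_pipeline": 1.6,
}
BASE_ACCURACY = 76.6  # percent, base configuration

total_gain = sum(ABLATION_DELTAS.values())
full_accuracy = BASE_ACCURACY + total_gain

print(round(total_gain, 1))     # 15.6
print(round(full_accuracy, 1))  # 92.2
```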

Methodology

Answers came from gpt-4o-mini. We graded them with an LLM-as-judge setup built on gpt-4o, using category-specific prompts, binary pass/fail per item, and a straight average inside each category. LoCoMo used all ten provided conversations (1,984 question-answer pairs) against the full Tex system (not retrieval-only). LongMemEval_S used 500 questions with about 48 sessions each (~115K tokens per trace) against Tex Active retrieval only. The full pipeline ran on a single machine. There was no multi-node orchestration.
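
The per-category scoring described above (binary pass/fail per item, straight average inside each category) can be sketched as follows. This is an illustration of the aggregation step only, not the evaluation harness itself. The function name and the toy verdicts are made up, and the judge call that produces each verdict is out of scope.

```python
from collections import defaultdict
from typing import Iterable


def per_category_accuracy(verdicts: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Average binary judge verdicts inside each category.

    `verdicts` yields (category, passed) pairs, where `passed` is the
    pass/fail decision the judge model returned for one question.
    """
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in verdicts:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Toy verdicts for illustration only; these are not benchmark data.
toy_verdicts = [
    ("single-hop", True),
    ("single-hop", True),
    ("single-hop", False),
    ("temporal", True),
]
scores = per_category_accuracy(toy_verdicts)
for category, accuracy in scores.items():
    print(f"{category}: {accuracy:.1%}")
```

How the overall score is aggregated from the categories is not stated here, so the sketch stops at per-category averages.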

On our roadmap for evals

We plan to add MemoryAgentBench (ICLR 2026), stronger multi-step reasoning for questions that need counting or arithmetic over evidence, and LongMemEval_M for questions that span hundreds of sessions.

References

Maharana et al. LoCoMo: Long-Context Conversations for Evaluating Conversational Memory. EMNLP 2024.
Wu et al. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. ICLR 2025.
Hu et al. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. ICLR 2026.
MemMachine. MemMachine v0.2 Delivers Top Scores on LoCoMo. Dec 2025.
Emergence AI. SOTA on LongMemEval with RAG. 2025.
Supermemory. State-of-the-Art Agent Memory Research. 2026.
Mastra. Observational Memory: 95% on LongMemEval. 2026.
