Tex scores 93.3% overall on LoCoMo with the full system. Tex scores 92.2% on LongMemEval_S with active retrieval only. This page shows the category tables, latency, tokens, and methodology behind those numbers.

LoCoMo · 93.3%

Full Tex system vs published baselines. EverMemOS was the prior headline number at 92.3%.

LongMemEval_S · 92.2%

Active retrieval track vs other retrieval-first systems. Emergence AI posted 86.0% on comparable reporting.

We generated answers with gpt-4o-mini and graded them with gpt-4o in an LLM-as-judge setup. Each evaluation ran on a single machine. The exact setup is in Methodology below.

LoCoMo

LoCoMo evaluates long-conversation memory across 10 multi-session conversations with 1,984 questions spanning 5 categories: single-hop retrieval, multi-hop reasoning, temporal reasoning, open-domain inference, and adversarial abstention.

Per-category results

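| Category | Tex Full | Tex Active | MemMachine v0.2 | Hindsight (Gemini-3) |
| --- | --- | --- | --- | --- |
| Single-Hop | 96.08% | 88.23% | 94.41% | 86.17% |
| Multi-Hop | 92.14% | 78.21% | 89.72% | 70.83% |
| Temporal | 91.90% | 87.23% | 89.10% | 83.80% |
| Open-Domain | 94.79% | 67.71% | 75.00% | 95.12% |
| Adversarial | 99.33% | 97.31% | | |
| Overall | 93.3% | 87.7% | 91.69% | 89.6% |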

Competitive landscape

Tex scores 99.33% on adversarial items: it declines to answer when the transcript does not support an answer. If a published result skips that category, its overall score covers a different question set, so compare headline numbers carefully.

LongMemEval_S

LongMemEval_S evaluates memory over 500 questions across ~48 sessions each (~115K tokens), testing information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Per-ability results

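| Ability | Tex Active | Mastra OM (gpt-4o) | Emergence AI | Supermemory | Oracle GPT-4o | Zep |
| --- | --- | --- | --- | --- | --- | --- |
| Abstention | 100% | | | | | |
| Knowledge Updates | 95.8% | 85.9% | 83.3% | 88.5% | 85.9% | 83.3% |
| Info Extraction | 93.3% | 82.1% | 100% | 97.1% | 100% | 92.9% |
| Temporal | 91.3% | 85.7% | 85.7% | 76.7% | 76.7% | 62.4% |
| Multi-Session | 87.6% | 79.7% | 81.2% | 71.4% | 80.5% | 57.9% |
| Overall | 92.2% | 84.2% | 86.0% | 81.6% | 82.4% | 71.2% |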

Competitive landscape

Tex Active's 92.2% is on par with Oracle GPT-o3 (92.0%) and well above Oracle GPT-4o (82.4%). In the oracle setting the reader sees only the evidence sessions, so this suggests retrieval is finding evidence close to what you would hand-pick for each question.

Latency

| Configuration | p50 | p90 | End-to-end (including the model that reads context) |
| --- | --- | --- | --- |
| Tex (Active) | ~120ms | ~200ms | ~0.6s |
| Tex (Full System) | ~350ms | ~500ms | ~1.0s |

Token efficiency

Per-question breakdown (Tex)

| Component | Tokens |
| --- | --- |
| System prompt | ~138 |
| User template | ~28 |
| Retrieved context (avg 24 fragments × 44 tokens) | ~1,047 |
| Question | ~20 |
| Total input per question | ~1,233 |
| Output (answer) | ~30 |
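The per-question budget can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are the approximate values from the table, and the variable names are made up for this example. It sums the input components and scales them to the 1,984 LoCoMo questions, which lands close to the ~2.4M reader tokens reported for Tex in the LoCoMo comparison.

```python
# Approximate per-question token counts, copied from the table above.
# The dictionary keys are illustrative names, not part of any Tex API.
input_components = {
    "system_prompt": 138,
    "user_template": 28,
    "retrieved_context": 1047,
    "question": 20,
}
num_questions = 1984  # LoCoMo question count reported on this page

# Sum the input side of a single question.
input_per_question = sum(input_components.values())

# Scale to the whole LoCoMo run (input tokens only).
total_input = input_per_question * num_questions

# Share of the input budget spent on retrieved context.
context_share = input_components["retrieved_context"] / input_per_question

print(f"input tokens per question: {input_per_question}")
print(f"input tokens over the run: {total_input:,} (~{total_input / 1e6:.1f}M)")
print(f"retrieved context share:   {context_share:.0%}")
```

Most of the input budget goes to retrieved context, so fragment count and fragment size are the main levers on cost per question.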

LLM tokens consumed on LoCoMo (1,984 questions)

| System | Ingestion LLM | Reader | Total | Accuracy |
| --- | --- | --- | --- | --- |
| Tex | 0 | 2.4M | 2.4M | 93.3% |
| MemMachine Memory | not disclosed | 4.2M | 4.2M+ | 87.5% |
| MemMachine Agent | not disclosed | 8.6M | 8.6M+ | 91.7% |
| Mem0 | not disclosed | 19.2M | 19.2M+ | ~66% |

Ingest here costs zero LLM tokens. This run used offline embeddings only. If a system runs an LLM per message during ingest, that cost should show up in token usage.

Headline efficiency claims

Compared with published Mem0 numbers, Tex is about 27 percentage points higher on accuracy (93.3% vs ~66%), uses roughly 87% fewer tokens (2.4M vs 19.2M), and runs at about 95% lower latency. Against MemMachine’s memory-only configuration, Tex is about 5.8 percentage points more accurate on 43% fewer tokens (2.4M vs 4.2M). On LoCoMo, Tex also has the lowest tokens-per-correct-answer in this comparison: about 1,296.
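The accuracy and token comparisons follow from the LoCoMo token table. The sketch below shows one way to reproduce them from the rounded figures on this page. The helper names are hypothetical, Mem0's accuracy is taken as 66.0 because the table lists it as "~66%", and competitor totals are lower bounds because their ingestion cost is not disclosed. The latency claim is not reproduced, since competitor latency does not appear on this page.

```python
# Rounded figures from the LoCoMo token table.
# Competitor totals are lower bounds (ingestion tokens are "not disclosed").
systems = {
    "Tex": {"tokens": 2.4e6, "accuracy": 93.3},
    "MemMachine Memory": {"tokens": 4.2e6, "accuracy": 87.5},
    "Mem0": {"tokens": 19.2e6, "accuracy": 66.0},  # listed as "~66%"
}
NUM_QUESTIONS = 1984


def compare(ours, theirs):
    """Return (accuracy gap in percentage points, fraction of tokens saved)."""
    gap_pp = ours["accuracy"] - theirs["accuracy"]
    tokens_saved = 1 - ours["tokens"] / theirs["tokens"]
    return gap_pp, tokens_saved


def tokens_per_correct(system):
    """Total tokens divided by the number of correctly answered questions."""
    correct = NUM_QUESTIONS * system["accuracy"] / 100
    return system["tokens"] / correct


tex = systems["Tex"]
for name in ("Mem0", "MemMachine Memory"):
    gap_pp, saved = compare(tex, systems[name])
    print(f"Tex vs {name}: +{gap_pp:.1f} pp accuracy, {saved:.1%} fewer tokens")

for name, system in systems.items():
    print(f"{name}: ~{tokens_per_correct(system):,.0f} tokens per correct answer")
```

Because the inputs are rounded (2.4M rather than an exact count), the tokens-per-correct figure for Tex comes out within a token or two of the 1,296 quoted above.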

Ablation: what each part of the pipeline adds

On LongMemEval_S, the ablation attributes the following accuracy gains to each part of the retrieval pipeline. Deltas are in percentage points and sum to the 15.6-point gap between the base and the full system.

| Capability | What it does | Δ accuracy |
| --- | --- | --- |
| Adaptive retrieval | Adjusts retrieval depth + strategy per query type | +3.6% |
| Query understanding | LLM-free semantic expansion | +1.9% |
| Multi-stage relevance scoring | Progressive filter + rerank | +2.4% |
| Structured reasoning | Enumerate / count across fragments | +3.1% |
| Temporal awareness | Resolve relative dates, build chronology | +3.0% |
| Combined pipeline | Everything working together | +1.6% |
| Base → Full System | | 76.6% → 92.2% |
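As a consistency check on the ablation, the per-capability deltas can be added to the base score. This is a small sketch using only the numbers in the table. It treats the deltas as additive percentage points, which is an assumption about how the attribution was done.

```python
# Base accuracy and per-capability deltas from the ablation table,
# treated here as additive percentage points (an assumption).
base_accuracy = 76.6
deltas_pp = {
    "Adaptive retrieval": 3.6,
    "Query understanding": 1.9,
    "Multi-stage relevance scoring": 2.4,
    "Structured reasoning": 3.1,
    "Temporal awareness": 3.0,
    "Combined pipeline": 1.6,
}

total_gain = sum(deltas_pp.values())
full_accuracy = base_accuracy + total_gain

print(f"total gain:  +{total_gain:.1f} pp")
print(f"base + gain: {full_accuracy:.1f}%")
```

The sum matches the reported 92.2%, so the "Combined pipeline" row reads as the residual gain once the individual capabilities are accounted for.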

Methodology

Answers came from gpt-4o-mini. We graded them with an LLM-as-judge setup built on gpt-4o, using category-specific prompts, binary pass/fail per item, and a straight average inside each category. LoCoMo used all ten provided conversations (1,984 question-answer pairs) against the full Tex system (not retrieval-only). LongMemEval_S used 500 questions with about 48 sessions each (~115K tokens per trace) against Tex Active retrieval only. The full pipeline ran on a single machine. There was no multi-node orchestration.
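The scoring rule described here (binary pass/fail per item, straight average inside each category) can be sketched as follows. The function name and the sample verdicts are hypothetical and are not taken from the Tex evaluation harness. The sketch also leaves out the overall score, because the page does not say whether it pools all items or averages the categories.

```python
from collections import defaultdict


def per_category_accuracy(verdicts):
    """Average binary judge verdicts within each category.

    verdicts: iterable of (category, passed) pairs, one per graded item.
    Returns a dict mapping category -> fraction of items that passed.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, ok in verdicts:
        total[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / total[category] for category in total}


# Hypothetical judge verdicts, for illustration only.
verdicts = [
    ("single-hop", True), ("single-hop", True),
    ("single-hop", False), ("single-hop", True),
    ("temporal", True), ("temporal", False),
    ("adversarial", True),
]

scores = per_category_accuracy(verdicts)
for category, accuracy in scores.items():
    print(f"{category}: {accuracy:.1%}")
```

With binary verdicts, each category score is simply the pass rate, so categories with few items (such as the single adversarial item in this toy input) move in large steps.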

On our roadmap for evals

We plan to add MemoryAgentBench (ICLR 2026), stronger multi-step reasoning for questions that need counting or arithmetic over evidence, and LongMemEval_M for questions that span hundreds of sessions.

References

  1. Maharana et al. LoCoMo: Long-Context Conversations for Evaluating Conversational Memory. EMNLP 2024.
  2. Wu et al. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. ICLR 2025.
  3. Hu et al. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. ICLR 2026.
  4. MemMachine. MemMachine v0.2 Delivers Top Scores on LoCoMo. Dec 2025.
  5. Emergence AI. SOTA on LongMemEval with RAG. 2025.
  6. Supermemory. State-of-the-Art Agent Memory Research. 2026.
  7. Mastra. Observational Memory: 95% on LongMemEval. 2026.

Run the quickstart

Install the SDK, store one turn, and print a recall result.