LoCoMo and LongMemEval_S scores, with category tables, latency, tokens, and methodology.
Tex scores 93.3% overall on LoCoMo with the full system. Tex scores 92.2% on LongMemEval_S with active retrieval only. This page shows the category tables, latency, tokens, and methodology behind those numbers.
LoCoMo · 93.3%
Full Tex system vs published baselines. EverMemOS was the prior headline number at 92.3%.
LongMemEval_S · 92.2%
Active retrieval track vs other retrieval-first systems. Emergence AI posted 86.0% on comparable reporting.
We generated answers with gpt-4o-mini and graded them with gpt-4o in an LLM-as-judge setup. Each evaluation ran on a single machine. The exact setup is in Methodology below.
On adversarial items, Tex scores 99.33%: it declines to answer when the transcript does not support an answer. If a published result skips that bucket, compare headline numbers carefully.
LongMemEval_S evaluates memory over 500 questions across ~48 sessions each (~115K tokens), testing information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
92.2% is close to Oracle GPT-o3 (92.0%) and well above Oracle GPT-4o (82.4%). That means retrieval is finding evidence close to what you would hand-pick for each question.
Ingest here costs zero LLM tokens. This run used offline embeddings only. If a system runs an LLM per message during ingest, that cost should show up in token usage.
Compared with published Mem0 numbers, Tex is about 27% higher on accuracy, uses roughly 87% fewer tokens, and runs at about 95% lower latency. Against MemMachine’s memory-only configuration, Tex is about 5.8% more accurate on 43% fewer tokens. On LoCoMo, Tex also has the lowest tokens-per-correct-answer in this comparison: about 1,296.
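As a rough illustration of how these efficiency figures can be derived, the sketch below computes tokens-per-correct-answer and a relative change against a baseline. It assumes tokens-per-correct-answer means total tokens consumed over a run divided by the number of correctly answered questions; this page does not spell out the exact definition. The function names and all input numbers are hypothetical, not Tex or Mem0 measurements.

```python
def tokens_per_correct(total_tokens: int, num_correct: int) -> float:
    """Total tokens spent in a run divided by the number of correct answers.

    Assumed definition; the page does not state its exact formula.
    """
    if num_correct <= 0:
        raise ValueError("num_correct must be positive")
    return total_tokens / num_correct


def relative_change(value: float, baseline: float) -> float:
    """Fractional change of value relative to baseline (negative means lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Hypothetical run: 1,200,000 tokens spent, 900 questions answered correctly.
cost = tokens_per_correct(1_200_000, 900)

# Hypothetical comparison: 13 units against a baseline of 100 units.
print(f"{relative_change(13, 100):.0%}")  # -87%
```

Dividing by correct answers rather than by all questions penalizes a system that is cheap per query but often wrong, which is why the metric is useful alongside raw token counts.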
Answers came from gpt-4o-mini. We graded them with an LLM-as-judge setup built on gpt-4o, using category-specific prompts, binary pass/fail per item, and a straight average inside each category.
LoCoMo used all ten provided conversations (1,984 question-answer pairs) against the full Tex system (not retrieval-only). LongMemEval_S used 500 questions with about 48 sessions each (~115K tokens per trace) against Tex Active retrieval only.
The full pipeline ran on a single machine. There was no multi-node orchestration.
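The scoring rule in the methodology (binary pass/fail per item, then a straight average inside each category) can be sketched as follows. This is a minimal illustration, not the actual evaluation harness: the `category_accuracy` function, the `(category, passed)` input shape, and the sample category names are assumptions, and the judge call that produces each pass/fail verdict is out of scope here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def category_accuracy(judgments: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Average binary pass/fail verdicts within each category.

    judgments: (category, passed) pairs, one per graded item.
    Returns a mapping from category to the fraction of items that passed.
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in judgments:
        totals[category] += 1
        passes[category] += 1 if passed else 0
    return {category: passes[category] / totals[category] for category in totals}


# Hypothetical verdicts, for illustration only.
sample = [
    ("temporal", True),
    ("temporal", False),
    ("multi-session", True),
    ("multi-session", True),
]
scores = category_accuracy(sample)
# scores["temporal"] is 0.5 and scores["multi-session"] is 1.0
```

The sketch reports per-category scores only. Whether a headline number weights every item equally or every category equally changes the result when categories differ in size, so that choice should be checked before comparing against other published figures.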
We plan to add MemoryAgentBench (ICLR 2026), stronger multi-step reasoning for questions that need counting or arithmetic over evidence, and LongMemEval_M for questions that span hundreds of sessions.