RAG on Azure OpenAI - Tex | Memory API for agents

This recipe follows the same flow as the FastAPI recipe: recall memory, answer with Azure OpenAI, then store the new turn.

Install

pip install tex-sdk openai

Set environment variables

TEX_API_KEY=tex_live_...
TEX_BASE_URL=https://api.getmetacognition.com

AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_DEPLOYMENT=gpt-4o
AZURE_OPENAI_API_VERSION=2025-04-01-preview
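A missing variable otherwise surfaces as a bare KeyError when the clients are constructed. A minimal sketch of a fail-fast check, assuming the six variable names listed above; `missing_env` and `REQUIRED_VARS` are hypothetical names, not part of either SDK.

```python
import os

# Variable names copied from the environment block above.
REQUIRED_VARS = (
    "TEX_API_KEY",
    "TEX_BASE_URL",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env(env=None, required=REQUIRED_VARS) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Call it once at startup, for example `if missing_env(): raise SystemExit(f"Missing: {missing_env()}")`, so a misconfigured deployment stops before the first request.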

Construct clients

import os
from openai import AzureOpenAI
from tex import Tex

tex = Tex(
    api_key=os.environ["TEX_API_KEY"],
    base_url=os.environ["TEX_BASE_URL"],
    timeout=10,
)
gpt = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

Implement answer(query, sid)

from datetime import datetime, timezone
from tex import RateLimitError, APITimeoutError

def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

def answer(query: str, sid: str) -> dict:
    memory: list[str] = []
    confidence = 0.0
    try:
        hits = tex.recall(q=query, session_id=sid, top_k=5)
        memory = [h.text for h in hits.hits.turns]
        confidence = hits.confidence
    except (RateLimitError, APITimeoutError):
        # Degrade gracefully: answer without memory if recall is throttled or times out.
        pass

    sys_msg = (
        "You are a helpful assistant. "
        + (f"Relevant memory:\n{chr(10).join('- ' + m for m in memory)}" if memory else "")
    )
    chat = gpt.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        messages=[
            {"role": "system", "content": sys_msg},
            {"role": "user",   "content": query},
        ],
        temperature=0.4,
    )
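    # message.content can be None (for example, on a filtered response); store an empty string instead.
    reply = chat.choices[0].message.content or ""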

    tex.conversations.remember(
        session_id=sid,
        turns=[
            {"role": "user", "text": query, "timestamp": now_iso()},
            {"role": "assistant", "text": reply, "timestamp": now_iso()},
        ],
    )
    return {"answer": reply, "confidence": confidence, "memory_used": len(memory)}
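The system message in answer() is built with a compact one-line expression. The sketch below produces the same string in a more readable form, which may help when you want to unit-test the prompt; `build_system_message` is a hypothetical helper name, not part of the recipe.

```python
def build_system_message(memory: list[str]) -> str:
    """Build the system prompt, appending recalled memory as a bullet list."""
    base = "You are a helpful assistant. "
    if not memory:
        # No memory recalled: send the base instruction only.
        return base
    bullets = "\n".join(f"- {m}" for m in memory)
    return f"{base}Relevant memory:\n{bullets}"


print(build_system_message(["User is vegetarian", "Allergic to peanuts"]))
# You are a helpful assistant. Relevant memory:
# - User is vegetarian
# - Allergic to peanuts
```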

Smoke-test from `__main__`

if __name__ == "__main__":
    import json
    print(json.dumps(answer("any food restrictions?", "demo-session"), indent=2))

Production notes

Citations: pass hit ids into the prompt and ask the model to quote [mem:<id>]. Use that id to link answers back to memory.
Weak recall: if confidence < 0.4, call tex.recall(..., mode="deep") once before answering.
Streaming: set stream=True, then enqueue remember in a background worker so the stream can start quickly.
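The first two notes can be sketched without touching either SDK. The code below is an illustration under assumptions: `Recall` is a hypothetical stand-in for a recall result (the real response shape may differ), `recall` is any callable that accepts the same keyword arguments as tex.recall, and the 0.4 threshold and mode="deep" come from the notes above.

```python
import re
from typing import Callable, NamedTuple


class Recall(NamedTuple):
    """Hypothetical stand-in for a recall result: (id, text) pairs plus a confidence."""
    turns: list[tuple[str, str]]
    confidence: float


def recall_with_fallback(
    recall: Callable[..., Recall], query: str, sid: str, threshold: float = 0.4
) -> Recall:
    """Run a normal recall; retry once in deep mode if confidence is below threshold."""
    result = recall(q=query, session_id=sid, top_k=5)
    if result.confidence < threshold:
        result = recall(q=query, session_id=sid, top_k=5, mode="deep")
    return result


def memory_block(turns: list[tuple[str, str]]) -> str:
    """Render each memory with its id so the model can cite it as [mem:<id>]."""
    return "\n".join(f"- [mem:{mid}] {text}" for mid, text in turns)


# Matches tags such as [mem:abc123] and captures the id.
CITATION = re.compile(r"\[mem:([^\]\s]+)\]")


def cited_ids(reply: str) -> list[str]:
    """Return the memory ids cited in a reply, in order, without duplicates."""
    return list(dict.fromkeys(CITATION.findall(reply)))
```

In answer(), you would pass `tex.recall` as the `recall` argument, place `memory_block(...)` in the system message with an instruction to cite ids, and use `cited_ids(reply)` to link the answer back to stored turns.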
