- Recall memory for the incoming message.
- Call your model.
- Store the user and assistant turns.
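
The three steps above can be sketched as one function. This is a minimal illustration that runs without any SDK: `FakeMemory` is a hypothetical in-memory stand-in for a Tex client, and the `recall` and `remember` signatures shown here are assumptions, not the real Tex API.

```python
from typing import Callable, List, Tuple


class FakeMemory:
    """Hypothetical in-memory stand-in for a Tex client (not the real API)."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str, str]] = []  # (session_id, role, text)

    def recall(self, session_id: str, query: str, top_k: int = 3) -> List[str]:
        # Naive recall: the most recent texts stored for this session.
        texts = [text for sid, _role, text in self.turns if sid == session_id]
        return texts[-top_k:]

    def remember(self, session_id: str, role: str, text: str) -> None:
        self.turns.append((session_id, role, text))


def handle_message(
    memory: FakeMemory,
    llm: Callable[[str, List[str]], str],
    session_id: str,
    message: str,
) -> str:
    context = memory.recall(session_id, message)     # 1. recall memory
    reply = llm(message, context)                    # 2. call your model
    memory.remember(session_id, "user", message)     # 3. store both turns
    memory.remember(session_id, "assistant", reply)
    return reply
```

Recall happens before the new turns are stored, so the model never sees the current message echoed back as a memory.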

Use one Tex client per process and reuse it across requests.

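
One common way to get a single instance per process is to memoize a factory function with `functools.lru_cache`. The sketch below assumes a stand-in class `StubTex` and an environment variable named `TEX_API_KEY`; both are placeholders for illustration, and the real Tex constructor may take different arguments.

```python
import os
from functools import lru_cache


class StubTex:
    """Hypothetical stand-in for the Tex client; the real constructor may differ."""

    def __init__(self, api_key: str, timeout: float = 2.0) -> None:
        self.api_key = api_key
        self.timeout = timeout


@lru_cache(maxsize=1)
def tex_client() -> StubTex:
    # The body runs on the first successful call; later calls return the
    # cached instance, so every caller shares one client.
    api_key = os.environ.get("TEX_API_KEY", "")  # variable name is an assumption
    if not api_key:
        # lru_cache does not cache exceptions, so a later call retries.
        raise RuntimeError("TEX_API_KEY is not set")
    return StubTex(api_key=api_key)
```

In tests, `tex_client.cache_clear()` resets the cached instance so a different key or stub can be picked up.
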
Layout

| File | Job |
|---|---|
| deps.py | Cached Tex |
| chat.py | /chat |
| main.py | App entry |

Lay out the package

Create an app/ package that contains the three files above: deps.py, chat.py, and main.py. Rename app/ if you want. Keep the import path you pass to uvicorn consistent with the package name.

Cache your Tex client

Read secrets from the environment and construct Tex once, in deps.py. Use tex_client() inside FastAPI Depends(...) so every route shares the same connection pool.

Build the chat route

In chat.py, derive session_id from the user and the chat. Recall with a small top_k. If Tex times out or quota is exhausted, answer without memory. Then store both sides of the turn. Replace your_llm.complete(...) with your model call.

Full files

If you prefer one copy block, paste these files:

Production tweaks

Run remember in the background

Do not make the user wait for remember. Enqueue it as background work so the response returns first.
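
A minimal sketch of the idea, using only the standard library so it runs without FastAPI or Tex. `remember_later` is a hypothetical helper, and `remember` is any callable you pass in. Inside a FastAPI route, the framework's `BackgroundTasks.add_task(...)` would likely be the more natural way to do the same thing.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

logger = logging.getLogger("memory")

# Small shared pool so memory writes never run on the request thread.
_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="remember")


def _log_failure(future: Future) -> None:
    # Background failures would otherwise disappear silently.
    if future.cancelled():
        return
    error = future.exception()
    if error is not None:
        logger.warning("remember failed: %s", error)


def remember_later(remember, *args, **kwargs) -> Future:
    """Schedule remember(*args, **kwargs) on the pool and return immediately."""
    future = _executor.submit(remember, *args, **kwargs)
    future.add_done_callback(_log_failure)
    return future
```

The caller gets a `Future` back but does not need to wait on it. A failed write is logged and does not affect the reply that was already sent.
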

Bound recall latency

Set Tex(timeout=2.0) and catch APITimeoutError. If recall is slow, answer without memory instead of blocking the whole chat request.
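
The fallback can be sketched as a small wrapper around the recall call. To keep the example runnable without the SDK, `APITimeoutError` below is a local stand-in class with the same name. `recall_or_empty` and `build_prompt` are hypothetical helpers, not part of Tex.

```python
from typing import Callable, List


class APITimeoutError(Exception):
    """Local stand-in for the SDK's timeout error, so this sketch runs alone."""


def recall_or_empty(recall: Callable[[], List[str]]) -> List[str]:
    """Run a recall callable; treat a timeout as 'no memory available'."""
    try:
        return recall()
    except APITimeoutError:
        return []


def build_prompt(message: str, memories: List[str]) -> str:
    # With no memories, the model simply sees the raw message.
    if not memories:
        return message
    context = "\n".join(f"- {m}" for m in memories)
    return f"Relevant memory:\n{context}\n\nUser: {message}"
```

Only the timeout is swallowed here. Other errors still propagate, so misconfiguration is not hidden behind an empty memory list.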

