Fieldstone.
A support agent that knows when not to answer. A demonstration build — with the eval harness to prove it.
Try the live demo ↗Every action runs through a typed tool, not a prompt the model improvises around. Retrieval is grounded in the store's own data. Confidence comes from two independent signals, so you can see when the model and the retrieval disagree. Escalation is a first-class tool, not a fallback. Fieldstone is a demonstration build of that pattern for a representative store — and an 18-case golden set judged by Opus puts it at a 98.1% pass rate.
Answer what it's sure of. Escalate the rest.
Every action goes through a typed tool, not a prompt the model can improvise around:
An autonomous support agent's most valuable move is knowing the edge of its own knowledge. That's why escalation is a tool, not a footnote.
An Opus-judged golden set.
18-case RAG golden set, three runs per case, judged by Opus on four rubric dimensions — tool routing, grounding, escalation, response quality. Stability is verdict agreement across runs.
One residual miss is documented in the postmortem and deliberately not patched. Tuning the agent's language to clear a judge rubric is the start of Goodhart drift. Saying plainly what a metric does and doesn't prove is the whole discipline.
The disagreement is the signal.
Every answer gets two independent reads: the model's self-rating of its own reply, and a retrieval-derived score from the cosine similarity of the chunks it used. When they disagree — a confident retrieval under a model that rates its own reply thin — that gap tells you more than either number alone. It's instrumented and visible in the /debug view. It doesn't gate replies yet: measure calibration before you let a number block one.
SDK direct, no framework.
No LangChain, no LlamaIndex, no agent framework. Full visibility into what the model sees and does at each step — the point of the build.