AI support agent · evals · Live demo

Fieldstone.

A support agent that knows when not to answer. A demonstration build — with the eval harness to prove it.

Try the live demo ↗
One agent turn — route, ground, or escalate
routing
cust
Where is my order #4021?
Typed tool selected
lookup_order check_return_eligibility create_return search_help_center escalate_to_human
model self-rating95%
retrieval grounding92%
answered · grounded grounded in the store’s order data
Every action runs through a typed tool, not an improvised prompt. When the two confidence signals disagree — or a request is out of scope — escalation is the answer, not a guess.

Every action runs through a typed tool, not a prompt the model improvises around. Retrieval is grounded in the store's own data. Confidence comes from two independent signals, so you can see when the model and the retrieval disagree. Escalation is a first-class tool, not a fallback. Fieldstone is a demonstration build of that pattern for a representative store — and an 18-case golden set judged by Opus puts it at a 98.1% pass rate.

98.1% Opus-judged pass rate (18 cases, 54 runs)
5 typed tools — including escalation
11 typed escalation reason codes
0 agent frameworks — SDK direct
01The design

Answer what it's sure of. Escalate the rest.

Every action goes through a typed tool, not a prompt the model can improvise around:

lookup_order · check_return_eligibility · create_return
order actions as typed tools, so answers come from the store’s data, never inference.
search_help_center
retrieval from a 50-article help center built from the store’s own data, for anything the order tools can’t answer.
escalate_to_human
a first-class tool with an eleven-value typed reason_code enum (payment_dispute, policy_exception, account_access, out_of_scope…) a contact-center platform would actually route on.

An autonomous support agent's most valuable move is knowing the edge of its own knowledge. That's why escalation is a tool, not a footnote.

02How I know it works

An Opus-judged golden set.

18-case RAG golden set, three runs per case, judged by Opus on four rubric dimensions — tool routing, grounding, escalation, response quality. Stability is verdict agreement across runs.

Pass
cases · runs
pass rate
stable
Before three targeted fixes
18 · 36
78%
16/18
After fixes
18 · 54
98.1%
17/18

One residual miss is documented in the postmortem and deliberately not patched. Tuning the agent's language to clear a judge rubric is the start of Goodhart drift. Saying plainly what a metric does and doesn't prove is the whole discipline.

03Dual-signal confidence

The disagreement is the signal.

Every answer gets two independent reads: the model's self-rating of its own reply, and a retrieval-derived score from the cosine similarity of the chunks it used. When they disagree — a confident retrieval under a model that rates its own reply thin — that gap tells you more than either number alone. It's instrumented and visible in the /debug view. It doesn't gate replies yet: measure calibration before you let a number block one.

04Architecture

SDK direct, no framework.

Runtime
Next.js App Router, Anthropic SDK direct — full visibility into every step.
Retrieval
50-article corpus generated offline against store.json, embedded once on voyage-4-lite. Runtime cosine ~2 ms.
State
In-memory everything — session state and corpus from JSON. No database, no vector store. On Vercel.
Model tiers
Sonnet for the agent turn, Haiku for the confidence side-call, Opus for the eval judge.

No LangChain, no LlamaIndex, no agent framework. Full visibility into what the model sees and does at each step — the point of the build.

← All work