| Workhorse

When the Answer Isn't the Whole Job

‍

Our support team spends its day answering customer questions, and for a while now they've had help. We run a RAG pipeline over our internal documentation — the configuration notes, the support history, the accumulated knowledge an analyst would otherwise have to go digging for — and it surfaces the answer, or something close to it, in seconds rather than minutes. The analyst reads what it found and sends the response. It saves realtime, and we trust it enough to lean on it when we're triaging.

We've made it quite a bit better since the spring. Two things mattered. The first is that it can see more than it used to: alongside the documentation, it now pulls in a client's own configuration, so the answer is about how their setup actually behaves rather than how the product behaves in the abstract. The second is less visible from the outside — we've spent a lot of time on the pipeline itself, on how the question gets put to the model, how the calls are structured, how the retrieved material is handed over. Between them the answers got sharper and more specific, the time saved went past sixty percent, and escalations to the tech team fell away.

But the change that did the most for us isn't on that list, and it went the other way.

We taught it to stop answering.

That sounds like a step backwards, and it took us a while to see it as anything else. Nearly everything written about getting these systems to work is about getting them to answer — better answers, more of them, fewer gaps, fewer "I don't knows". The received wisdom is that a model saying "I'm not sure" has failed, something to be trained out until it disappears. We were after the reverse: a system that says "I don't know," or "I'm not sure," when that's the truth.

The reason was practical, not principled. The answers that cost us were never the ones where the system threw up its hands. They were the ones where the documentation was thin on something and, rather than say so, the system put together a confident, reasonable answer out of whatever was nearest. Those answers didn't look wrong. They looked exactly as solid as the good ones — same tone, same fluency — and an analyst leaning on the system at speed had no way to tell them apart. A confident wrong answer in support doesn't just sit there being wrong, either. It goes out under our name, and someone acts on it.

So we built a check. Before the system answers, it looks at what it actually retrieved — the docs, the client's config — and asks whether the answer is genuinely supported by it. If it is, it answers, and answers fully. If it isn't, it says as much, and the question goes to a person rather than coming back dressed up as something certain.

The effect was bigger than the feature looks. The point of it isn't the questions it declines — it's what declining does to the ones it doesn't. Once it's willing to stay quiet when it's out of road, what it does say is standing on something rather than spun to fill a gap. The value turned out not to be in the answers it gives — it's in the ones it's prepared to withhold. A system that always has something to say forces you to check all of it; one that knows the edge of what it knows lets you trust what it does give. It's not flawless — now and then it goes quiet on something it could have handled — but that's the side to err on. It's a long way from where we started, and not where most of the attention goes.

And having got it working, we more or less convinced ourselves the harder problem was nearly behind us. Get the system honest about what it knows, and the rest follows.

Then a case went through that the check was perfectly happy with, and the rest hadn't followed at all.

A client asked one of our analysts whether they could override a customer's credit limit on a single order — let one larger order through without permanently raising the customer's limit. With their configuration in front of it, the RAG answered precisely and correctly: no, there's no per-order override; the credit limit lives on the customer account, so you raise it there before the order can go through. The check had nothing to flag, because there was nothing to flag. The evidence was right there.

The analyst didn't send it.

Not because it was wrong. Because "no, you can't — go and change the limit on the account" was, for this client, going to landbadly: they'd asked for exactly the thing the answer says doesn't exist, and the bare version reads like a shrug. What went back instead was the same fact with some room around it: there's no per-order override, but you can raise the account limit for this one and set it back afterwards — and a proper per-order option is a fair ask, which we'll look at. The honest answer, with a bit of give.

Some of that the system could have managed itself. The raise-it-and-set-it-back workaround is sitting in the docs and the config; a better-tuned pipeline could have surfaced it and answered more fully on its own. That part is just completeness. The rest isn't, and no amount of retrieval gets you there. Whether a flat no needed softening for this client, and whether to float a per-order option as something we'd build — those don't turn on a missing document. They turn on which client this is and what we'll stand behind for them: "not yet, but it's coming" for one; just the workaround, no promises, for another; a specific commitment for a third, because of where things stand. We'd fed the system more context than ever and it knew the setupcold, but what no extra source was going to give it was any read on what the answer would cost the relationship, or what we'd promise to keep it. That was never in a document, and it was never in the config.

The check is good at one job — spotting when the system doesn't actually know — and that's most of what we've gained. But this wasn't that. The system knew perfectly well; the answer was correct, supported, complete, and still not ours to send. "Is this backed by what I retrieved" and "is this the right thing to put in front of this client" are different questions, and only the first one has anything to look at. The answer and its evidence sit side by side, there to be compared — which is the whole reason the check works. What's safe to send has nothing of the sort to point at.

When the Answer Isn't the Whole Job

Ready to TransformYour Customer Management?

Ready to Transform
Your Customer Management?