A controller’s morning
It is 06:47. The night shift is nearly over. A controller sits in front of two monitors with half a dozen tabs open: scheduling, attendance, the check-call log, incident reports, customer SLA notes, and a standby-staff channel that, for historical reasons nobody can quite explain, still lives partly outside the main system.
She has a few minutes before the day-shift handover. Three things have happened in the last hour. An officer at a regulated site missed the 06:30 check-call. A backup officer at another site booked off forty minutes early with no reason logged. A customer complaint came in at 03:14 and nobody has acknowledged it. Every one of these is visible somewhere on her screens. None of them is flagged as more urgent than the others. By the time she has worked out which one actually matters, the day shift is already on the road.
That gap is what I set out to close. Notice what was not the problem. The information was there. The dashboards were there. What was missing was the layer in between, the thing that helps a tired human answer one question quickly: which of these needs me first? A lot of business intelligence work stops just short of that question. It shows you everything and helps you decide nothing.
This is an account of how my team and I built that decision layer for a multi-site operations platform, and the main thing we got wrong before we got it right. The short version: for Agentic BI, the language model should not be the brain. The rules engine should be.
By Agentic BI I mean something unglamorous: a layer that watches operational data, works out which changes matter, and hands a person a recommendation with the evidence attached, then records what they did and learns from it. Not a system that runs the business on its own, and not a swarm of agents acting with nobody watching. Traditional BI tells you what happened. This is meant to help with a harder question: what matters now, and what should I do about it?
The first version, which failed
Our first attempt did what nearly everyone was doing at the time. We put the language model in the middle. It would read recent events, pull context from the database, reason about the situation, and write a recommendation. The rules layer would be thin. The model would be clever enough to carry it.
The demo was excellent. Real use was not.
The model would produce recommendations that read beautifully and were operationally wrong. Once it cited an officer who was not even on the rota. Another time, two refreshes of the same situation gave two different answers. And the deeper problem was not the wrongness itself. It was that I could not reproduce the wrongness. If a bad recommendation cannot be reproduced, it cannot be tested. If it cannot be tested, I cannot put it anywhere near a live operational decision and still sleep at night. So we tore it down and rebuilt. The model stayed in the system, but it lost the keys to the decision.
The 60/40 split
What we settled on is roughly sixty per cent deterministic rules engine and forty per cent language model. The rules engine decides. The model explains. It is a less thrilling architecture than “the AI figures it out,” and it is far safer and far easier to test. Think of the rules engine as the load-bearing wall and the model as the plaster.
Three reasons drove the choice, and they stack on top of each other.
Rules can be tested. Each one has a name, a fixture, an expected result, and a regression test. When operations decide a threshold should move from twelve minutes to ten, we change it, review it, test it, and ship it, and we know exactly what we changed.
Rules can also be explained. When a controller asks why the system told her to do something, she gets the actual trace: this rule fired, these records were checked, this threshold was crossed. A black box can be right and still be useless, because under pressure people do not act on advice they cannot interrogate.
The third reason is about fit. Language models are good at summarising and rephrasing, and weak as the final authority on an operational judgement. A clumsy sentence from the model costs nothing. A wrong escalation decision can cost a contract, or worse. So the principle I work to now is simple. If you can write it as a rule, write it as a rule, and let the model do the talking.
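To make the first reason concrete, a rule with a name, a fixture, and a regression test could be as small as the Python sketch below. The `Rule` class, the rule name `overdue_check_call`, and the fixture values are assumptions made for illustration, not the platform's actual rulebook. Only the twelve-minute threshold is taken from the example above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    threshold_minutes: int

    def fires(self, minutes_overdue: int) -> bool:
        # The rule fires once the call is overdue by at least the threshold.
        return minutes_overdue >= self.threshold_minutes


# Hypothetical rule. Moving 12 to 10 is a one-line, reviewable change.
OVERDUE_CHECK_CALL = Rule(name="overdue_check_call", threshold_minutes=12)

# Fixtures: (minutes overdue, expected result) pairs agreed with operations.
FIXTURES: List[Tuple[int, bool]] = [(5, False), (11, False), (12, True), (14, True)]


def run_regression(rule: Rule, fixtures: List[Tuple[int, bool]]) -> List[Tuple[int, bool]]:
    """Return every fixture whose actual result differs from the expected one."""
    return [(minutes, expected) for minutes, expected in fixtures
            if rule.fires(minutes) != expected]


failures = run_regression(OVERDUE_CHECK_CALL, FIXTURES)
print(failures)  # [] means every fixture still behaves as expected
```

If the threshold is moved to ten without touching the fixtures, the 11-minute case fails. That is the desired behaviour: the change shows up in review, and the expected results have to be updated deliberately rather than by accident.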
One related choice tends to surprise people. The system runs against the live operational database, not a separate analytics warehouse. Warehouses are fine for end-of-month reporting. For real-time decision support they fall over the moment the data sync is late or nobody trusts it.
What the system does with one missed check-call
Go back to that missed check-call. On its own it is a fact, not a decision. Is it a minor delay, a welfare concern, a compliance breach, or an SLA problem? You cannot tell yet.
The rules engine walks through it. Which site, and is it regulated? What SLA applies? How overdue is the call? Has anyone attempted a welfare check? Is standby cover available? Has this officer been late before? On this morning the answer resolves cleanly: regulated site, missed call, no welfare attempt yet, past the threshold, standby available. That maps to escalation tier two.
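A walk-through of this kind can be sketched as a plain function that checks each fact in a fixed order and keeps a trace of what it looked at. Everything named here (`MissedCall`, `Decision`, `decide`, the twelve-minute threshold, and every tier other than tier two) is a hypothetical stand-in. The only mapping taken from the text is the one for this morning's case, and the sketch covers only some of the questions listed above.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed value for illustration; the real threshold lives in the rulebook.
OVERDUE_THRESHOLD_MINUTES = 12


@dataclass(frozen=True)
class MissedCall:
    site_id: int
    regulated: bool
    minutes_overdue: int
    welfare_attempted: bool
    standby_available: bool


@dataclass
class Decision:
    tier: int
    trace: List[str] = field(default_factory=list)


def decide(event: MissedCall) -> Decision:
    """Check the facts in a fixed order and record each one in the trace."""
    trace = [
        f"regulated_site={event.regulated}",
        f"minutes_overdue={event.minutes_overdue} "
        f"(threshold {OVERDUE_THRESHOLD_MINUTES})",
        f"welfare_attempted={event.welfare_attempted}",
        f"standby_available={event.standby_available}",
    ]
    past_threshold = event.minutes_overdue >= OVERDUE_THRESHOLD_MINUTES
    if not past_threshold:
        tier = 0  # hypothetical: keep watching, nothing to escalate yet
    elif event.regulated and not event.welfare_attempted:
        # The case from the text: tier two when standby can be dispatched.
        # Tier three without standby is an assumption made for this sketch.
        tier = 2 if event.standby_available else 3
    else:
        tier = 1  # hypothetical: overdue, but unregulated or welfare already tried
    return Decision(tier=tier, trace=trace)


site_14 = MissedCall(site_id=14, regulated=True, minutes_overdue=14,
                     welfare_attempted=False, standby_available=True)
decision = decide(site_14)
print(decision.tier)
for line in decision.trace:
    print(line)
```

Because `decide` is a pure function of the event, the same inputs always produce the same tier and the same trace, which is what makes a bad recommendation reproducible and therefore testable.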
Only then does the model get involved, turning that structured decision into something a controller can read in five seconds:
The officer at Site 14 missed the 06:30 check-call and is now 14 minutes overdue. This is a regulated site with a 15-minute response window. No welfare call has been logged, and standby cover is available. Suggested action: log a welfare call now, dispatch standby, and notify the duty manager. If there is no contact within five minutes, escalate to the client liaison.
Clear, short, and pointed at an action. But the part that matters is the part you cannot see in the wording. The model did not choose to escalate. The rules engine did. The model just said it well.
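One way to enforce that separation in code is to hand the model a finished decision and copy the tier and actions from that decision rather than from anything the model writes. The sketch below assumes a generic `generate` callable standing in for whatever model client is in use. The function names, the prompt wording, and the action identifiers are all illustrative assumptions.

```python
import json
from typing import Callable, Dict


def build_prompt(decision: Dict) -> str:
    """Serialise the finished decision; the model is asked only to reword it."""
    return (
        "Rewrite this decision as a short message for a controller. "
        "Do not add, remove, or change any fact, tier, or action.\n"
        + json.dumps(decision, sort_keys=True)
    )


def explain(decision: Dict, generate: Callable[[str], str]) -> Dict:
    """Attach model-written wording to a decision the rules engine already made."""
    text = generate(build_prompt(decision))
    # Tier and actions are copied from the decision, never parsed from the text,
    # so nothing the model writes can alter what gets escalated.
    return {
        "tier": decision["tier"],
        "actions": list(decision["actions"]),
        "text": text,
    }


decision = {
    "site": 14,
    "tier": 2,
    "minutes_overdue": 14,
    "actions": ["log_welfare_call", "dispatch_standby", "notify_duty_manager"],
}


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; deliberately wrong about the tier.
    return "Site 14 is overdue. Escalate to tier 5 immediately."


result = explain(decision, stub_model)
print(result["tier"])  # 2
```

Even when the stub contradicts the rules engine, the tier that reaches the controller's workflow is still the one the rules produced. A real deployment would also want to check the wording against the decision before showing it, which this sketch does not attempt.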
The trust mistake I made
Early on I tried to hide the system’s mistakes. If a recommendation was obviously wrong, we quietly suppressed it. If the wording was slightly off, we corrected it before anyone saw. I told myself I was protecting people’s trust in the system. I was destroying it. Controllers started ignoring the recommendations, and when I finally asked one of the senior ones why, she put it better than I could have: how am I supposed to know when it is wrong if I never see it being wrong?
So we did the opposite. Every recommendation now shows a confidence score, and low-confidence ones are marked as such. Every recommendation has a “why this was suggested” view with the rules and records behind it. When someone rejects a recommendation, the reason gets logged in the open, treated as useful information rather than a black mark. And once a week we sit down and go through the worst calls the system made. That last habit changed the culture more than any feature. The system stopped being an outsider to be distrusted and became something the team felt responsible for improving.
The lesson has stuck with me. Trust does not come from a system that is always right. It comes from a system you can see being wrong. Underpinning all of this is a feedback log that records every recommendation and what happened to it: accepted, rejected, edited, or later proved wrong. I was stubborn about keeping it, because without it there is no way for the system to improve and no honest record of how often it failed.
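A feedback log along these lines does not need to be elaborate. The following is a minimal in-memory sketch, assuming four outcome labels that mirror the ones listed above. The class and field names are hypothetical, and two choices are assumptions of the sketch rather than anything stated in the text: counting only unedited acceptances towards the acceptance rate, and ordering the review list by confidence.

```python
from dataclasses import dataclass
from typing import List

OUTCOMES = ("accepted", "rejected", "edited", "proved_wrong")


@dataclass(frozen=True)
class FeedbackEntry:
    recommendation_id: str
    confidence: float   # 0.0 to 1.0, as shown to the controller
    outcome: str        # one of OUTCOMES
    reason: str = ""    # free-text reason, mainly for rejections


class FeedbackLog:
    def __init__(self) -> None:
        self._entries: List[FeedbackEntry] = []

    def record(self, entry: FeedbackEntry) -> None:
        if entry.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {entry.outcome}")
        self._entries.append(entry)  # append-only: nothing is edited or hidden

    def acceptance_rate(self) -> float:
        """Share of recommendations accepted as-is (edited ones are not counted)."""
        if not self._entries:
            return 0.0
        accepted = sum(1 for e in self._entries if e.outcome == "accepted")
        return accepted / len(self._entries)

    def weekly_review(self) -> List[FeedbackEntry]:
        """Rejected or later-disproved calls, most confident first."""
        bad = [e for e in self._entries
               if e.outcome in ("rejected", "proved_wrong")]
        return sorted(bad, key=lambda e: e.confidence, reverse=True)


log = FeedbackLog()
log.record(FeedbackEntry("rec-001", 0.9, "accepted"))
log.record(FeedbackEntry("rec-002", 0.4, "rejected", "officer already on site"))
log.record(FeedbackEntry("rec-003", 0.8, "proved_wrong", "wrong site flagged"))
log.record(FeedbackEntry("rec-004", 0.7, "accepted"))
print(log.acceptance_rate())
print([e.recommendation_id for e in log.weekly_review()])
```

Sorting the review list by confidence puts the confidently wrong calls at the top, which is one plausible reading of "the worst calls the system made".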
What the pilot actually showed
These numbers come from one pilot, so please do not read them as benchmarks. Operations are messy and results depend on data quality, process maturity, and the kind of decisions involved. With that said, the direction was clear.
Weekly rota preparation for one mid-sized client used to eat most of a working day. With conflict detection and coverage logic encoded as rules, it came down to about ninety minutes. That saving had nothing to do with the model writing nice sentences and everything to do with logic we could test and trust. On catching actionable events such as missed calls and welfare risks, controllers had been acting on roughly four in ten inside the response window; with ranked priorities in front of them, that moved to about six in ten. And recommendation acceptance, the figure I watch most, went from around 41 per cent early on to roughly 78 per cent once we added confidence scores, visible reasoning, and those weekly reviews. The system did not get dramatically more accurate in that time. It got more legible, and people trusted it because they could see how it worked.
The quieter win was harder to put a number on. Decisions that used to happen informally, in someone’s head, became traceable, which made the whole operation easier to govern.
If I were starting again
I would not kick off a big Agentic BI programme. I would run a six-week pilot on a single decision.
Week one, pick that decision and write it in one sentence: “Should we dispatch standby when a check-call is missed?” If it does not fit on one line, it is too big.
Week two, write the rulebook with the operations people, not the engineers: the trigger, the evidence needed, the action at each tier, the owner, and the things the system must never do. Fifteen to twenty-five rules is plenty.
Week three, build the rules engine and the feedback log, and leave the model out entirely.
Week four, add the model, but only to explain what the rules decided.
Week five, run it in shadow mode against what people actually do.
Week six, go live with one friendly team and measure acceptance, time to action, false alarms, and missed risks.
If that one decision does not improve, do not scale. Fix the decision, the rules, or the workflow first.
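The week-five shadow run comes down to comparing what the rules would have recommended with what controllers actually did. A rough sketch of that comparison follows. It assumes each event has been reduced to a pair of booleans and treats the controller's action as the reference point, which is itself an assumption: a controller can be wrong too. The function name and the sample data are invented for the example.

```python
from typing import Dict, List, Tuple


def shadow_report(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each pair is (system_would_escalate, controller_did_escalate)."""
    total = len(pairs)
    if total == 0:
        return {"agreement": 0.0, "false_alarms": 0, "missed_risks": 0}
    agree = sum(1 for system, human in pairs if system == human)
    # System wanted to escalate but the controller did not.
    false_alarms = sum(1 for system, human in pairs if system and not human)
    # Controller escalated but the system would have stayed quiet.
    missed_risks = sum(1 for system, human in pairs if human and not system)
    return {
        "agreement": agree / total,
        "false_alarms": false_alarms,
        "missed_risks": missed_risks,
    }


# Invented sample data, not pilot results.
week_five = [(True, True), (True, False), (False, False),
             (False, True), (True, True)]
report = shadow_report(week_five)
print(report)
```

Each disagreement is worth reading individually before go-live, since a "false alarm" may turn out to be a case the controller missed rather than one the rules got wrong.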
This is not abstract caution. Gartner has predicted that more than 40 per cent of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Proving one decision before you scale is how you stay on the right side of that statistic.
What it comes down to
Most Agentic BI that actually works in production is less exciting than the demos promise. No magic dashboard, no autonomous manager. Just clean operational data, well-written rules, reasoning you can see, a human signing off, and a language model used carefully for the part it is genuinely good at. It sounds boring. Boring tends to be what survives contact with real operations.
If the goal is helping a controller at 06:47 work out which of three problems to deal with before the handover, a bigger model is almost never the answer. A clearer rulebook, better context, visible reasoning, and a feedback loop nobody hides from will get you further. That is the actual work.





