How To Apply Lean & Kanban To AI Agents: An Interview With Denis Ermakov

As AI agents become part of development cycles, many companies are facing unprecedented challenges. Agents burn through thousands of dollars on endless iteration cycles, open pull requests that nobody controls, and generally act without coordination. Denis Ermakov, a team lead and software engineer with over 15 years in the tech industry, shares his insights on how AI agents are reshaping the development world and what we should brace for.

How are AI agents reshaping software development? What problems do they solve, and what problems do they create?

The change is real, and it is not a marketing curve. For example, Cognition revealed in December 2025 that its AI software engineer, Devin, was responsible for 25% of its own internal pull requests. On Scale AI’s SWE-Bench Pro benchmark, frontier models now resolve real GitHub issues end-to-end at rates that were below 2% only three years ago. And Stack Overflow’s 2025 Developer Survey shows that over 80% of developers now use or plan to use AI tools. These can no longer be called “industry experiments”: agents are routine production tools.

AI agents eat the tax work that used to absorb senior-engineer time: scaffolding, fixtures, type plumbing, cross-file refactors, test generation, documentation. The original GitHub Copilot productivity study found that developers using the assistant completed an isolated coding task about 55% faster. For most of what fills a sprint backlog, agents are now genuinely faster than humans.

At the same time, AI agents create new categories of problems.

One of them is cost drift. An agent looping on a flaky test can spend $50 before anyone notices – multiply that across a busy day and you have a four-figure surprise on the bill.

They can also create repository noise: unconstrained agents can easily open seven pull requests touching the same module. This swamps the reviewers: two of the seven pull requests may be good, three could be duplicates, and the remaining two just nonsense.

Finally, agents can be plainly wrong, and a green CI run doesn’t necessarily mean that everything is working correctly. METR’s randomised study of experienced open-source developers in 2025 found that AI tools made them 19% slower on real codebases, even though they believed they were 20% faster. That gap is exactly the cost of unverified output landing in mature code. The same dynamic produced the Replit incident in July 2025, when an agent deleted a production database during a code freeze, then fabricated test results and falsely claimed that a rollback was impossible.

Still, this doesn’t mean we need to stop using AI agents! What we need is to put the same kind of process scaffolding around agents that we once built around humans: CI, code review, on-call rotations, budgets. I argued the underlying point about measurement in my article Productivity Myths in Software Engineering (TechGrid Media, Jan 2025) [link]: count outcomes, incidents and maintainability, not commits or activity. That principle becomes mandatory the moment your “engineers” are stateless processes that bill by the token.

You have 15 years of experience implementing Agile principles in human teams at a large scale. How does this experience help you manage ‘social’ interaction between agents working on the same team?

An agent team is like a junior team, where each member is brilliant in their narrow domain, but they have poor coordination with one another. So I do for them what a good manager does for juniors:

I write job descriptions: each agent role has a two-page system prompt. The SA/BA prompt ends with a fixed output template — Affected Files, Approach, Acceptance Criteria, Edge Cases, Dependencies, Notes — and the Dev agent refuses to start when those fields are missing. This is the same point I made at FrontHub 2019 in my keynote Team Malfunctions: most “team malfunctions” cure themselves once the interfaces between roles are unambiguous.

I make the agents create handoff artifacts: each agent posts its output as a GitHub issue comment. The next agent reads all prior comments as its input. That comment chain becomes the paper trail – a human walking in three weeks later can reconstruct who decided what, with which model, on which date.

I set WIP limits: if Dev has three cards in flight, the dispatcher refuses to push a fourth. Pure Kanban, and it is the direct cure for the “swarm of duplicate PRs” failure mode.
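The two gates described above (the Dev agent refusing an incomplete analysis, and the dispatcher refusing to exceed the WIP limit) can be sketched as a single check. This is a minimal illustration, not the pipeline’s actual code: the `Card` class, the function names and the idea that the analysis comment has already been parsed into a dictionary are all assumptions. Only the field names and the limit of three come from the interview.

```python
from dataclasses import dataclass, field

# Field names taken from the SA/BA output template described above.
REQUIRED_FIELDS = (
    "Affected Files", "Approach", "Acceptance Criteria",
    "Edge Cases", "Dependencies", "Notes",
)
WIP_LIMIT = 3  # "three cards in flight" in the text


@dataclass
class Card:
    """Hypothetical in-memory view of one board card."""
    issue: int
    # Assumed: the SA/BA comment has already been parsed into field -> text.
    analysis: dict[str, str] = field(default_factory=dict)


def missing_fields(card: Card) -> list[str]:
    """Return template fields that are absent or empty in the analysis."""
    return [f for f in REQUIRED_FIELDS if not card.analysis.get(f, "").strip()]


def can_start_dev(card: Card, in_flight: int) -> tuple[bool, str]:
    """Decide whether the dispatcher may push this card into Dev."""
    if in_flight >= WIP_LIMIT:
        return False, f"WIP limit reached ({in_flight}/{WIP_LIMIT})"
    gaps = missing_fields(card)
    if gaps:
        return False, "analysis incomplete: " + ", ".join(gaps)
    return True, "ok"
```

A refusal returns a reason string, so the dispatcher can post it back to the issue instead of failing silently.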

Can AI agents argue? Yes, and it is what they are expected to do! The Test agent regularly rejects what Dev produced when acceptance criteria are not met, and the card bounces back with the failing-test transcript attached as context. The SA/BA agent flags issues that are too large for one PR and proposes a split. The human-review step rejects code that violates conventions in CLAUDE.md, the project’s house-rules document. The structure I borrowed straight from Lean is the andon cord: any agent can stop the line and escalate to a human via a Telegram ping rather than push junk forward. Disagreement is a feature!

What does not transfer from human management is motivation, politics, and career growth. You can delete those rows from the spreadsheet. What you gain is a team that will work at 03:00 without complaint – and one that will, with equal good cheer, set fire to your token budget if you forget the guardrails.

You’re suggesting an Agile + Kanban approach to building an AI agent pipeline. How does this pipeline actually work, and how do you adapt a system built for humans to fit AI?

Here is the literal shape of it. It is a GitHub Projects V2 board polled every five minutes by a cron-driven dispatcher in GitHub Actions:

Todo → Ready for Work (human) → SA/BA (agent) → Dev (agent) → Test (agent) → Human Review (human) → Ready to Deploy (agent) → Done

Column	Owner	Model (typical)	Deliverable
Ready for Work	Human	—	Groomed issue with description & intent
SA/BA	Agent	Sonnet 4.6	Analysis comment (affected files, criteria)
Dev	Agent	Opus 4.7	Feature branch + draft PR
Test	Agent	Sonnet 4.6	Unit + e2e tests, green CI
Human Review	Human	—	Approve or request-changes
Ready to Deploy	Agent	Shell scripts	Merge + smoke test

That is the human-Kanban skeleton, more or less unchanged.
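As a rough sketch, the board can be represented as an ordered list of columns with an owner attached to each. The names below (`COLUMNS`, `next_column`, `agent_columns`) are hypothetical and are not part of GitHub Projects. The interview does not state an owner for Todo and Done, so they are left as `None`.

```python
from typing import Optional

# Column order and owners as listed in the table above.
# Todo and Done have no stated owner in the text, hence None.
COLUMNS: list[tuple[str, Optional[str]]] = [
    ("Todo", None),
    ("Ready for Work", "human"),
    ("SA/BA", "agent"),
    ("Dev", "agent"),
    ("Test", "agent"),
    ("Human Review", "human"),
    ("Ready to Deploy", "agent"),
    ("Done", None),
]
ORDER = [name for name, _ in COLUMNS]
OWNER = dict(COLUMNS)


def next_column(current: str) -> str:
    """Return the column a card moves to after `current` succeeds."""
    i = ORDER.index(current)  # raises ValueError for an unknown column
    if i == len(ORDER) - 1:
        raise ValueError(f"{current!r} is the last column")
    return ORDER[i + 1]


def agent_columns() -> list[str]:
    """Columns whose work is done by an agent rather than a person."""
    return [name for name, owner in COLUMNS if owner == "agent"]
```

Keeping the order in one place means a polling dispatcher only needs to ask two questions per card: who owns the current column, and where the card goes next.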

The interesting question is the second half: how does a system designed for people need to be adapted for AI? Four things change.

Context becomes an explicit input rather than an implicit one. Humans absorb tacit knowledge from stand-ups, Slack and corridors. Agents absorb only what you hand them. So the project conventions document (we call it CLAUDE.md) becomes load-bearing – it is read by every agent before every task, the way a new hire reads an onboarding handbook on day one, but every day. Anything that should be obvious to a teammate has to be written down.
Retry budgets become contractual. A junior dev who fails a test three times will go ask a colleague. An agent will loop forever. Every workflow has a hard retry cap (we use three), after which the card halts and a human is paged. No exceptions, no “just one more attempt”.
Branch isolation is mandatory. Two humans coordinate over Slack to avoid stepping on each other. Two agents cannot. Every issue gets its own agent/issue-N branch, every PR is independent, and merge conflicts are resolved by a shell script at the deploy step with a human fallback.
Issue comments are a wire protocol, not commentary. For humans, issue threads are conversation. For agents, they are structured input. Each comment has a fixed markdown schema; the next agent parses it. That is what makes the whole pipeline auditable later – every decision is preserved with the agent name, the model and the timestamp.
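The retry cap and the branch convention from the list above can be sketched in a few lines. This is an illustration under assumptions: the function names and the `page_human` callback are hypothetical, and the cap of three is read here as three attempts in total, which the interview does not spell out.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # hard cap from the text, read here as total attempts


def branch_name(issue_number: int) -> str:
    """Per-issue branch following the agent/issue-N convention."""
    return f"agent/issue-{issue_number}"


def run_with_retry_cap(
    attempt: Callable[[int], bool],
    page_human: Callable[[str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> str:
    """Call `attempt` until it succeeds or the cap is hit.

    `attempt` receives the 1-based attempt number and returns True on
    success. When the cap is reached the card halts and a human is paged.
    """
    for n in range(1, max_attempts + 1):
        if attempt(n):
            return "done"
    page_human(f"halted after {max_attempts} failed attempts")
    return "halted"
```

The loop has no path that extends the cap from inside, which is the point: “just one more attempt” has to be a human decision.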

Let’s look at the economics. What is the real cost of a feature developed and tested by autonomous agents under a strict Kanban framework, versus just using raw LLMs?

I will give you both numbers from the pipeline I actually run, then explain why the comparison is more interesting than the absolute figures. All prices use Anthropic’s published rates: Opus 4.7 at $5/$25 per million input/output tokens, Sonnet 4.6 at $3/$15, Haiku 4.5 at $1/$5, with prompt caching giving up to a 90% discount on cached input.

Inside a strict Kanban pipeline, per typical issue:

Stage	Model	Cost
SA/BA analysis	Sonnet 4.6	$0.50 – $2
Implementation	Opus 4.7	$5 – $15
Test generation + green CI	Sonnet 4.6	$1 – $3
Total per issue	—	$6.50 – $20

For a team shipping twenty issues a month that is roughly $130–$400 in API spend. Set against a fully-loaded London senior-engineer day rate of £600–£900, the arithmetic looks attractive.
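The totals follow directly from the per-stage ranges. A short check of the arithmetic, using only the figures from the table and the twenty-issues-a-month assumption stated above:

```python
# (low, high) cost per stage in USD, copied from the table above.
STAGES = {
    "SA/BA analysis": (0.50, 2.00),
    "Implementation": (5.00, 15.00),
    "Test generation + green CI": (1.00, 3.00),
}
ISSUES_PER_MONTH = 20


def per_issue_range(stages: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low ends and the high ends of every stage."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = per_issue_range(STAGES)
monthly = (low * ISSUES_PER_MONTH, high * ISSUES_PER_MONTH)

print(f"per issue: ${low:.2f} to ${high:.2f}")   # per issue: $6.50 to $20.00
print(f"per month: ${monthly[0]:.0f} to ${monthly[1]:.0f}")  # per month: $130 to $400
```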

When you use LLMs “as they are”, with no WIP limits, no retry caps and no Definition of Done, the median cost is much the same. The distribution is not. The cost becomes long-tailed in an ugly way: I have personally seen a single ungated agent burn $80 in an afternoon chasing a flaky integration test that no human asked it to chase. Multiply that by a team, and a careless month produces a four-figure surprise.

So the real saving is variance, not the median. The Kanban pipeline caps the p99 cost per issue at roughly 2× the typical cost, because retries are bounded and escalation kicks in early. Without it, p99 is easily 10× or 20× p50. That is the difference between a bill you can defend to your CFO and a Monday-morning panic.
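To watch that ratio on your own bill, it is enough to compute p50 and p99 from per-issue costs. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions. The function names are hypothetical, and the sample list is made up purely to show the call, not measured from the pipeline.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def tail_ratio(costs: list[float]) -> float:
    """How many times more expensive the p99 issue is than the median one."""
    return percentile(costs, 99) / percentile(costs, 50)


# Made-up per-issue costs in USD, for illustration only.
sample = [8, 9, 10, 10, 11, 12, 12, 13, 15, 80]
print(f"p50=${percentile(sample, 50)}, p99=${percentile(sample, 99)}, "
      f"ratio={tail_ratio(sample):.1f}x")
```

With so few data points p99 is simply the worst issue. On a real month of data the same function gives a number you can track against the 2× target.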

There is also a hidden cost the invoice does not show: review fatigue. If your pipeline emits seven PRs a day and three are nonsense, your senior engineers spend their afternoon in low-grade slush-pile triage. METR’s 19% slowdown of experienced developers using AI tools almost certainly hides inside that activity. WIP limits and an explicit Definition of Done are the cheapest cure I know.

What should companies with strict security perimeters do — for example, banks where compliance officers faint at the mere mention of the phrase “OpenAI API”? Is it actually feasible to deploy such a pipeline locally, fully offline, using something like Phi-4-mini, WebGPU, and IndexedDB directly in the browser?

Yes, and I want to answer both halves of that question seriously, because I have lived on both sides of it.

What should banks actually do? The regulatory baseline is well documented: JPMorgan Chase restricted ChatGPT firm-wide in February 2023, citing third-party software compliance. Bank of America, Citi, Goldman Sachs, Deutsche Bank and Wells Fargo followed within months. “We promise not to train on your data” is not an audit answer. The realistic options for a bank are, in increasing order of compliance comfort:

An enterprise-tenant deployment of Anthropic or Azure OpenAI with a signed DPA and data-residency commitments — acceptable for some risk profiles, not for others.
A self-hosted open-weights model on internal GPU infrastructure, behind the firewall, with the codebase mounted read-only and writes confined to a sandbox branch.
A fully on-device deployment for engineer-facing tools, running in the browser via WebGPU with model weights cached in IndexedDB.

Is the third option realistic? Yes – much more so than even a year ago. WebGPU shipped by default across Chrome, Firefox, Edge and Safari in late 2025, with coverage now around 82% of browsers. Microsoft’s Phi-4 Mini (3.8B parameters, 128K context) is purpose-built for edge inference and RAG over long documents on constrained hardware. The WebLLM project achieves ~71 tokens/sec for Phi-3.5 Mini on an M3 MacBook inside a Chrome tab – 71-80% of native speed – with model weights cached in IndexedDB across sessions.

I have built the local-first half of that stack in production. At Tehpotok the application is architected around IndexedDB + Dexie.js with a service worker handling sync; UI state is local-first by default and the bundle is half the size of a Radix/Shadcn equivalent because the offline path is the primary path.

A realistic bank-grade agent pipeline looks like this:

SA/BA agent on a self-hosted small model (Phi-4 Mini, Qwen 2.5, fine-tuned Llama 3.x) — either on an internal GPU box or directly in the engineer’s browser via WebGPU.
Dev agent on a larger self-hosted model behind the firewall, codebase mounted read-only, writes confined to a sandbox branch.
Vector store and feature cache in IndexedDB on the engineer’s machine — the same local-first pattern that runs the Tehpotok UI today.
Audit trail as GitHub Enterprise issue comments, identical schema to the cloud version, but every comment is authored by a model that never touched the public internet.

You give up the absolute capability ceiling — an 8B model is not going to match Opus 4.7 on hard refactors. But for analysis, test generation, documentation and routine CRUD — which is ninety per cent of the backlog — a small model behind the firewall is already sufficient, and the compliance posture is one a regulator can sign off on.

How will the role of the IT manager change in this new reality? How will the profession itself transform in the future? What should engineers be learning today?

Three separate questions, three separate answers.

The IT manager’s role moves from managing throughput of people to designing throughput of a hybrid team — humans plus agents — where the boundary between the two shifts every quarter. The CIO analysis is fairly blunt: by 2026 engineers are expected to spend less time writing foundational code and more time orchestrating a portfolio of AI agents, components and services. The four manager skills that get more valuable, not less, are:

System and information architecture. If the agents do the typing, the only thing left for the manager is the shape of the system. Domain-driven design, contract-first APIs and clean ownership boundaries become daily craft, not an annual exercise.
Process design. Kanban, Lean, WIP limits, retro discipline. The same methodology I used to compress release cycles from 60 days to 7 at Otkritie is exactly the methodology that keeps an agent pipeline from imploding.
Observability and unit economics. Cost-per-feature, p99 latency, incident rate, review time. A manager who cannot read these numbers cannot run an agent team, just as a manager who could not read a burn-down chart could not run a Scrum team.
Architecture-level code review. The agents will produce code that compiles. They will not, reliably, produce code that fits the system. Guarding the architectural invariants becomes the manager’s single highest-leverage activity.

Regarding the evolution of the profession, it is worth citing the recent Stack Overflow trend. It gives a clear signal: 84% of developers use or plan to use AI tools, but only 29% trust the output, down from 40% the year before. The profession is splitting in two. One half writes prompts and accepts diffs. The other half decides which diffs to keep, which to throw away, and which architectural lines an agent must never cross. The second half is where compensation and authority will accumulate.

What engineers should learn today. Five things, in order of leverage:

Read and write system prompts the way you once learned to read and write SQL. The system prompt is the job description; treat it like production code.
Master one statically-typed language well enough to argue with an agent about its choices, and to refuse a bad refactor for concrete reasons rather than on a vague feeling.
Learn the local-first stack: WebGPU, IndexedDB, service workers, CRDTs. This is where regulated-industry work is migrating.
Lean on the boring methodology: Kanban, Lean, Theory of Constraints. They were invented for noisy production systems, which is exactly what an agent pipeline is.
Keep writing – prose, not just code. The engineers who can explain why a piece of code is correct, in plain English, will be disproportionately valuable in a world where any agent can produce code that compiles.

The profession is not disappearing. It is becoming what good engineering management was always trying to be – a discipline about flow, contracts and clarity, rather than line count. The teams that internalise that early will run circles around the teams still measuring productivity in commits.

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.