AI agents are starting to move beyond small coding tasks into real engineering workflows. However, delegating production work requires more than trusting generated code. In this interview, Aleksandr Nikolenko, a Search & API Engineer at Perplexity, explains how a Python-to-Rust search migration was made safer through a clear goal, real-traffic verification, and a shadow workspace where failures could be caught before users saw them.
Many companies are testing AI agents in software development, but production migrations remain high risk. What made this Python-to-Rust search migration suitable for agent-assisted work?
What made it suitable was that the answer already existed. Most teams point agents at open-ended work, such as drafting a design or suggesting an implementation, where the model has to invent the solution and the success criteria at the same time. Here, the existing Python search path was a working specification of correct behavior.
That changed the shape of the task. The new Rust path could run against the same inputs as the Python one, so progress was measurable from the start rather than asserted. The agent was not chasing a prompt; it was closing a gap against a baseline that production had already validated.
Closed-ended did not mean trivial. Search involves noisy ranking and snippet decisions, where some differences are fine and others are genuine regressions, and separating the two still requires human judgment. But a task with an inherited definition of done is far safer to delegate than one where the agent defines done for itself.
You describe the framework as “goal, verifier, workspace.” Why are these three elements essential when delegating engineering work to AI agents?
Before asking whether an agent can write the code, I ask whether the work can be framed as something it can attempt, check, and retry on its own. Goal, verifier, and workspace are the minimum structures needed to make that possible.
The goal states the outcome that matters rather than the file to edit; in this case, preserving the product’s observable behavior while the implementation moved to Rust. The verifier turns that goal into a signal through comparisons and tests, so the agent is steered toward correct behavior instead of merely convincing code. The workspace is where it can investigate and run changes without every attempt touching production.
The three only work together. A goal with no verifier is an opinion, a verifier with no workspace cannot drive iteration, and a workspace with no goal just produces unfocused edits. The engineer designs that structure; the agent operates inside it.
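As an illustration only, the attempt, check, and retry loop described above might be sketched as follows. Every name here (`Verdict`, `run_task`, `propose`, `verify`) is hypothetical and not taken from the actual migration; `propose` stands in for the agent acting inside the workspace, and `verify` stands in for the comparisons and tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Result of one verification run (hypothetical structure)."""
    passed: bool
    evidence: str


def run_task(
    propose: Callable[[List[str]], str],
    verify: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> Optional[str]:
    """Let the agent attempt, check, and retry; return None to escalate."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        change = propose(feedback)   # agent works inside the workspace
        verdict = verify(change)     # verifier turns the goal into a signal
        if verdict.passed:
            return change
        feedback.append(verdict.evidence)  # evidence drives the next attempt
    return None  # out of attempts: hand the case to a human
```

Returning `None` rather than the last attempt is a deliberate choice in this sketch: an unverified change is escalated, not accepted.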
The migration involved removing roughly 50,000 lines of Python and moving search traffic to a new Rust path. What were the main risks, and how did the agent-assisted process reduce them?
The real danger was not a crash but a silent drift in quality. Models are good enough now that the code usually looks right; the open question is whether the product still behaves right. The failure mode that worried me was a common query losing its expected top result, or ranking and snippets shifting in ways users would feel.
Obvious breaks are easy to catch. A changed schema or a clearly wrong top result shows up immediately in a comparison. The hard cases are quality cases where a local change looks perfectly reasonable in isolation, yet degrades overall results.
We narrowed this by giving the agent comparison and evaluation signals alongside the code, so it could ask whether quality had moved, in what direction, and whether the change was isolated or broad. It resolved the clear-cut gaps itself and escalated the ambiguous ones with supporting data, leaving the quality and rollout calls to people. The investigation got faster and far better grounded.
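One way to picture such a comparison signal is a small classifier over ranked result lists, aggregated across queries to show whether a change is isolated or broad. This is a minimal sketch under assumptions: the labels, the `classify` and `summarize` helpers, and the idea of comparing lists of result IDs are all illustrative, not the signals actually used.

```python
from collections import Counter
from typing import Dict, List, Tuple


def classify(baseline: List[str], candidate: List[str]) -> str:
    """Label how a candidate ranking differs from the baseline ranking."""
    if baseline == candidate:
        return "match"
    if baseline[:1] != candidate[:1]:
        return "top_result_changed"   # likely regression, easy to spot
    if sorted(baseline) == sorted(candidate):
        return "reordered"            # same results, different order
    return "set_changed"              # different results below the top


def summarize(
    pairs: Dict[str, Tuple[List[str], List[str]]]
) -> Dict[str, float]:
    """Share of queries per label: shows if a change is isolated or broad."""
    counts = Counter(classify(b, c) for b, c in pairs.values())
    total = len(pairs) or 1
    return {label: n / total for label, n in counts.items()}
```

In this sketch a label such as `reordered` is not a verdict. It only routes the case: clear-cut labels can be fixed directly, while ambiguous ones are escalated with the data attached.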
Why is output verification more important than simply reviewing the generated code? What did real-traffic testing reveal that normal code review might miss?
Code review answers a narrow question: whether a change looks reasonable on the page. That is necessary but insufficient here, because the product contract lives in how the system behaves, not in how the diff reads. Several Rust changes looked clean and passed standard checks, yet still diverged from the Python path on real queries.
One case was a parsing and serialization mismatch that dropped an expected field for a class of inputs and broke downstream validation. Nothing in the code or the unit tests flagged it; only comparing live behavior against the baseline exposed it. Standard tests ask whether a function returns the expected value, not whether the whole system still behaves correctly in real-world use.
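A dropped field of this kind is the sort of thing a structural comparison against the baseline can surface. The sketch below is hypothetical: `missing_fields` and the response shape are invented for illustration, and the actual mismatch and its cause are not described in detail here.

```python
from typing import Any, List


def missing_fields(baseline: Any, candidate: Any, path: str = "") -> List[str]:
    """Return paths that exist in the baseline but not in the candidate."""
    missing: List[str] = []
    if isinstance(baseline, dict):
        if not isinstance(candidate, dict):
            return [path or "<root>"]
        for key, value in baseline.items():
            child = f"{path}.{key}" if path else key
            if key not in candidate:
                missing.append(child)
            else:
                missing.extend(missing_fields(value, candidate[key], child))
    elif isinstance(baseline, list) and isinstance(candidate, list):
        # Compare items position by position; length differences are
        # left to a separate check in this sketch.
        for i, (b, c) in enumerate(zip(baseline, candidate)):
            missing.extend(missing_fields(b, c, f"{path}[{i}]"))
    return missing


# Hypothetical example: the candidate omits a field the baseline always emits.
baseline_response = {"results": [{"url": "u", "snippet": "s", "date": None}]}
candidate_response = {"results": [{"url": "u", "snippet": "s"}]}
print(missing_fields(baseline_response, candidate_response))
# prints ['results[0].date']
```

A unit test on the serializer alone can pass while this check fails, because the check compares whole responses rather than single return values.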
This matters more as models improve and out-produce any reviewer reading line by line. The review question shifts from how the code is written to what is tested and what evidence shows the behavior is still right. With a strong verification loop, that evidence tells you more than the shape of the implementation does.
What role did the shadow workspace play in making the migration safer before real users were affected?
The shadow workspace let the Rust path rehearse on production-like inputs while the Python path kept serving every user. A comparison layer captured where the two diverged and held the new output back, so nothing reached anyone until the difference was understood.
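A minimal sketch of such a comparison layer, assuming each path can be called as a function from query to response. `ShadowComparator`, `primary`, and `candidate` are hypothetical names, and a real system would most likely run the shadow call asynchronously rather than inline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SearchFn = Callable[[str], Dict]


@dataclass
class ShadowComparator:
    primary: SearchFn     # existing path: its output is what users see
    candidate: SearchFn   # new path: runs in shadow, output is held back
    divergences: List[Dict] = field(default_factory=list)

    def handle(self, query: str) -> Dict:
        served = self.primary(query)
        try:
            shadow = self.candidate(query)
        except Exception as exc:
            # A shadow failure is recorded, never surfaced to the user.
            self.divergences.append({"query": query, "error": repr(exc)})
            return served
        if shadow != served:
            self.divergences.append(
                {"query": query, "primary": served, "candidate": shadow}
            )
        return served  # users always get the primary path's answer
```

The property that matters in this sketch is that `handle` returns the primary response on every branch, so a divergence or a crash in the candidate becomes a logged record to investigate.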
Inside that lane, failure was cheap. A disagreement between the paths became something to investigate rather than a regression users had already hit, and the agent could trace it and propose a fix without causing production consequences. Its job there was to separate noise from real regressions and return a fix with evidence, not to make any final decisions.
It also reframed the rollout decision. Instead of debating whether the code felt good, we could ask whether we had seen enough real behavior to move more traffic. Once the comparison went quiet and the remaining differences were explained, traffic shifted gradually and with far more confidence.
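The "have we seen enough" question can be pictured as a simple gate on the comparison data. The sketch below is an assumption throughout: the function name, the thresholds, and the doubling step are invented for illustration, and in the process described here the rollout call stayed with people, so a gate like this would only advise.

```python
def next_traffic_share(
    current: float,
    unexplained_rate: float,
    samples: int,
    max_unexplained: float = 0.001,  # hypothetical threshold
    min_samples: int = 10_000,       # hypothetical threshold
    step: float = 2.0,               # hypothetical growth factor
) -> float:
    """Suggest the next share of traffic for the new path (0.0 to 1.0)."""
    # Hold steady until enough real behavior has been observed and the
    # remaining differences are explained.
    if samples < min_samples or unexplained_rate > max_unexplained:
        return current
    # Otherwise grow gradually, starting from at least 1% and capping at 100%.
    return min(1.0, max(current * step, 0.01))
```

Holding at the current share, rather than rolling back, is a simplification; a real gate would probably also define when to reduce traffic.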
What does this case tell us about how software engineering may change as AI agents become more common?
The center of the job moves from writing each change to designing the system that lets work be delegated safely. The agent was useful here because it was not handed a ticket in isolation; it was placed inside a structure with a clear target, a way to check itself, and a safe place to act.
I expect that structure to become a first-class part of engineering. As agents grow more capable, the infrastructure around them (verification, guardrails, safe environments, gradual rollout, and acceptance criteria) becomes the thing that decides how much you can hand off without turning production into an experiment.
That makes strong engineers more valuable, not less. When code itself is cheap to produce, the scarce skill is defining what correctness means, where an agent is allowed to act, how its work is checked, and when a change is safe to ship. The differentiator is no longer writing every hard change by hand; it is building the architecture in which agents can be useful without being blindly trusted.





