Supply chain execution is being reshaped by tighter delivery expectations, sustainability constraints, and the shift from batch planning to real-time decision-making. For companies operating at scale, the difference between a resilient platform and a fragile one shows up quickly — in latency, cost, and how well teams can adapt when conditions change in the field. In this conversation, Sachidananda Singh, Senior Engineering Manager at Wayfair, explains how modern logistics systems are built and operated as real-time distributed control systems. Read on to explore what breaks at scale, why human-in-the-loop design still matters, and where applied AI can deliver measurable impact.
The WEF predicts a major transformation of urban delivery toward more sustainable and efficient last-mile logistics. In the context of these global trends, what changes do you already see in logistics, and how are they reshaping the way distributed systems need to be designed?
I’d start with this: the biggest change is that logistics has to run in real time; it can no longer be a batch process that runs overnight. Urban delivery faces three constraints: cities are getting more congested, emissions regulations are tightening globally, and customers expect two-day delivery by default. The World Economic Forum has been tracking this, and the externalities stack up fast if the industry doesn’t adapt.
From a systems perspective, the delivery platform has to function as a real-time control system. You’re pulling live data from carriers, warehouses, and vehicles, then reacting to constraints like curb access, delivery windows, and route capacity. At Wayfair, my team moved in that direction by unifying routing and scheduling platforms across regions and replacing legacy tools with real-time systems that could adapt as constraints changed in the field. That sped up the platform, reduced fragmentation, and let us iterate on new requirements much more quickly.
You’ve built systems that serve hundreds of thousands of users and process tens of thousands of events per minute. Are there engineering constraints that only reveal themselves at this scale — things you wouldn’t encounter in smaller systems?
Definitely — a few come to mind. Time synchronization is a good example. At AWS, I led the Chronos team, which served time to a large fleet of EC2 instances. Even a small clock drift across distributed nodes can corrupt database transactions and break encryption. We took the accuracy from 650 to 500 microseconds, a 23% improvement. That might sound incremental, but distributed databases and security protocols depend on that precision being rock-solid.
Cascading failures are another one. A single slow service can create backpressure that takes down systems that seem completely unrelated. We built redundancy patterns and failover mechanisms targeting a 99.999% SLA, which meant coordinating with several teams. The coordination was as hard as the engineering itself.
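To put the 99.999% target in perspective, the arithmetic behind an availability figure can be sketched in a few lines. This is an illustration only; the function name `downtime_budget_seconds` and the 365-day period are assumptions, not details from the interview.

```python
def downtime_budget_seconds(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the downtime, in seconds, that an availability target allows over a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Each extra "nine" cuts the allowed downtime by a factor of ten.
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

At five nines the yearly budget is a little over five minutes, which is why a single slow dependency cannot be allowed to stall its callers.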
Then there’s tail latency. Your average response time might look fine, but the 95th percentile — the slowest few percent of requests — is what users actually feel during peak load. At Wayfair, the platform unification work increased deliveries per route by 50%. Getting there meant obsessing over worst-case performance, not the average.
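The gap between the average and the tail is easy to show with made-up numbers. The sketch below uses a nearest-rank percentile and an invented latency sample; the values are illustrative and are not measurements from any system mentioned here.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented sample: 92 fast requests and 8 slow ones, in milliseconds.
latencies_ms = [20] * 92 + [600] * 8

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

if __name__ == "__main__":
    print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms")
```

In this sample the mean and the median both look healthy, while the 95th percentile sits at 600 ms. A dashboard that tracks only the average would hide the requests users notice most.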
In the Fleet Insights and SPS projects, you developed a real-time data infrastructure for tracking vehicle locations and schedules. Which architectural approaches enabled low latency and high data reliability when working with heterogeneous telematics sources?
The core challenge was that telematics data from different sources (GPS units in trucks, mobile apps, and third-party integrations) arrives in different formats and at unpredictable intervals due to connectivity challenges. So the first thing we did was decouple the ingestion layer. We used Kafka as a message queue to buffer and normalize incoming events before any processing happened. That absorbed traffic spikes without dropping data.
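The normalization step can be sketched without any broker in the picture. The payload shapes, field names, and the `VehiclePosition` type below are hypothetical, invented for illustration; the interview does not describe the actual schemas. The idea is that every source is mapped to one shape with UTC timestamps before downstream processing sees it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VehiclePosition:
    """Common event shape (hypothetical) that every source is mapped into."""
    vehicle_id: str
    lat: float
    lon: float
    recorded_at: datetime  # always timezone-aware, in UTC
    source: str


def from_gps_unit(raw: dict) -> VehiclePosition:
    # Assumed hardware payload: short keys, string coordinates, epoch seconds.
    return VehiclePosition(
        vehicle_id=str(raw["dev"]),
        lat=float(raw["lat"]),
        lon=float(raw["lng"]),
        recorded_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        source="gps_unit",
    )


def from_mobile_app(raw: dict) -> VehiclePosition:
    # Assumed app payload: nested coordinates, ISO 8601 timestamp with a UTC offset.
    recorded = datetime.fromisoformat(raw["timestamp"]).astimezone(timezone.utc)
    return VehiclePosition(
        vehicle_id=str(raw["vehicleId"]),
        lat=float(raw["position"]["latitude"]),
        lon=float(raw["position"]["longitude"]),
        recorded_at=recorded,
        source="mobile_app",
    )


NORMALIZERS = {"gps_unit": from_gps_unit, "mobile_app": from_mobile_app}


def normalize(source: str, raw: dict) -> VehiclePosition:
    """Map a raw payload to the common shape, rejecting unknown sources explicitly."""
    try:
        normalizer = NORMALIZERS[source]
    except KeyError:
        raise ValueError(f"unknown telematics source: {source}") from None
    return normalizer(raw)
```

Keeping one normalizer per source means a new integration adds a function and a dictionary entry, and nothing downstream has to change.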
On the compute side, we designed a distributed fleet of servers with load-balanced endpoints and automatic failover. That let us scale to high event volumes while keeping latency consistent. The third decision, and probably the most important one, was separating read and write paths.
Real-time dashboards need different optimization than transactional writes. Pulling those apart gave us low-latency reads for operational visibility without compromising data integrity on the scheduling side. The database migration I led, moving to PostgreSQL in the cloud with event-driven patterns, was what made that separation work.
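The interview does not say how the split was implemented. One common way to separate the paths is a CQRS-style design, where writes are validated and recorded as events and a separate read model is kept up to date from those events. The in-memory sketch below illustrates that pattern only; the class names, fields, and validation rule are all hypothetical.

```python
class ScheduleWriteStore:
    """Write path: validates a change, appends it to a log, then notifies subscribers."""

    def __init__(self):
        self._log = []
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def assign_stop(self, route_id, stop_id, eta_minutes):
        # Integrity checks live on the write side only.
        if eta_minutes < 0:
            raise ValueError("eta_minutes must be non-negative")
        event = {"route_id": route_id, "stop_id": stop_id, "eta_minutes": eta_minutes}
        self._log.append(event)
        for handler in self._subscribers:
            handler(event)
        return event


class RouteDashboardView:
    """Read path: a denormalized per-route summary that is cheap to query."""

    def __init__(self):
        self._summary = {}

    def apply(self, event):
        entry = self._summary.setdefault(
            event["route_id"], {"stops": 0, "last_eta_minutes": 0}
        )
        entry["stops"] += 1
        entry["last_eta_minutes"] = max(entry["last_eta_minutes"], event["eta_minutes"])

    def route_summary(self, route_id):
        # Return a copy so callers cannot mutate the view.
        return dict(self._summary.get(route_id, {"stops": 0, "last_eta_minutes": 0}))


if __name__ == "__main__":
    store = ScheduleWriteStore()
    view = RouteDashboardView()
    store.subscribe(view.apply)
    store.assign_stop("route-1", "stop-a", 15)
    store.assign_stop("route-1", "stop-b", 45)
    print(view.route_summary("route-1"))
```

The dashboard never touches the write store, so a burst of read traffic cannot slow down or corrupt scheduling writes. In a real deployment the subscriber call would be replaced by an event stream, and the view would be eventually consistent.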
At the same time, ML-driven optimization of routes and schedules is not just about algorithms — it’s also about the people on the ground. How did you account for driver behavior and operational realities when implementing intelligent systems, and what unexpected effects did you observe?
The best algorithm is useless if it ignores what actually happens on the ground.
When we built the carrier selection platform across my teams, we learned that optimizing purely for theoretical efficiency often fights driver experience. Routes that minimize total distance can maximize cognitive load if they include hard-to-navigate roads or unfamiliar areas. Drivers who feel overwhelmed make more errors and leave faster.
In practice, how did you bring driver behavior into the model?
We built feedback loops, tracking which routes drivers actually followed versus what the system prescribed, then adjusting our models based on the gap. We also found that delivery windows need to account for real scenarios like parking availability near a building, building access, building type, and whether the customer is actually home. Travel time alone doesn’t cut it.
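A simple version of the prescribed-versus-actual comparison can be sketched as follows. It assumes each route is an ordered list of stop IDs and measures which planned legs (consecutive stop pairs) the driver did not follow. The function names and the leg-based metric are illustrative choices, not the method described in the interview.

```python
from collections import Counter


def legs(stop_sequence):
    """Consecutive (from_stop, to_stop) pairs in a route."""
    return list(zip(stop_sequence, stop_sequence[1:]))


def skipped_legs(prescribed, actual):
    """Planned legs that the driver did not drive in that order."""
    driven = set(legs(actual))
    return [leg for leg in legs(prescribed) if leg not in driven]


def adherence(prescribed, actual):
    """Fraction of planned legs that were followed (1.0 means fully followed)."""
    planned = legs(prescribed)
    if not planned:
        return 1.0
    return 1.0 - len(skipped_legs(prescribed, actual)) / len(planned)


def deviation_hotspots(route_pairs):
    """Count how often each planned leg is skipped across many (prescribed, actual) pairs."""
    counts = Counter()
    for prescribed, actual in route_pairs:
        counts.update(skipped_legs(prescribed, actual))
    return counts.most_common()


if __name__ == "__main__":
    history = [
        (["A", "B", "C", "D"], ["A", "B", "D", "C"]),
        (["A", "B", "C"], ["B", "A", "C"]),
    ]
    print(deviation_hotspots(history))
```

Legs that are skipped again and again are candidates for a closer look: the planner may be missing something the drivers know, such as a difficult turn or a loading dock that opens late.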
The surprise was that giving drivers slightly more flexibility in sequencing their stops, while keeping the time windows in place, actually improved on-time delivery rates. The human in the loop turned out to be a source of intelligence that pure algorithms couldn’t replicate.
And is it possible to quantify the impact of engineering architecture on day-to-day operations?
Yes, and I think quantifying it is underrated.
On the supplier side, the friction of finding policy information was driving up frustration and support tickets. We built an AI-powered Supplier Experience Platform using Gemini and vector embeddings to surface exact answers and actionable next steps for suppliers. That moved the Supplier Net Promoter Score up 10 points.
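The retrieval step behind embedding-based search can be sketched with plain cosine similarity. The three-dimensional vectors and document IDs below are hand-written stand-ins for illustration; in a real system an embedding model would produce the vectors, and nothing here reflects the actual platform or any Gemini API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_docs, k=1):
    """Return the IDs of the k documents whose vectors are closest to the query."""
    ranked = sorted(
        indexed_docs,
        key=lambda doc: cosine_similarity(query_vec, doc[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy index: (document ID, stand-in embedding). All values are invented.
policy_index = [
    ("returns-policy", [0.9, 0.1, 0.0]),
    ("invoice-disputes", [0.1, 0.9, 0.1]),
    ("shipping-labels", [0.0, 0.2, 0.9]),
]

if __name__ == "__main__":
    query_vec = [0.8, 0.2, 0.1]  # stand-in for an embedded supplier question
    print(top_k(query_vec, policy_index, k=2))
```

Matching on vector similarity rather than keywords lets a question find the right policy even when it shares no exact wording with the document.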
What did you see on the delivery side?
On the delivery side, platform unification and routing optimization increased deliveries per route by 50%. In practice, that means drivers spend less time driving empty and more time completing deliveries. The efficiency gains across these initiatives were equivalent to roughly $50M. That’s the connection between architecture decisions and business outcomes — and it’s measurable.
I know you managed four engineering and product teams at the same time. Do you have your own approaches to communication and delegation that help maintain speed, quality, and motivation when tackling complex engineering challenges?
Running four engineering teams (two direct-report managers and 25-plus engineers) requires a lot of structure that you don’t need with one team.
The biggest lever was standardizing planning cycles. I implemented six-month and 12-month planning with dependency mapping and milestone tracking. That took our project success rate from 58% to 90%. When everyone knows the cadence, the expectations, and how their work connects to other teams, a lot of coordination friction disappears.
I also delegate outcomes, not tasks. Each team owns its domain end to end, with explicit protocols for cross-team blockers. That keeps the speed up without routing everything through me.
Investing in people matters as much as the process. I mentored two individual contributors into management roles. Building leadership capacity inside the team creates resilience and frees me up for strategic work instead of tactical firefighting.
Retention tells you whether any of this is working. We maintained 95% team retention through mentorship programs and career development. When people feel invested in, they bring effort to hard problems that you can’t get by asking for it. That compounds over time.
Finally, considering the global trends in digitalization, AI, and sustainability, which emerging competencies and mindsets will be most critical for engineering leaders working with distributed systems and modern logistics?
A few things come to mind, both in terms of skills and how you think about the work.
On skills: Leaders need real AI fluency now, beyond knowing that it exists. At Wayfair, I implemented semantic search using an LLM (Gemini) and embeddings to solve a concrete supplier experience problem. The leaders who do well will be the ones who can tell the difference between when AI adds value and when a simpler solution works better.
Sustainability is also becoming a systems design concern, not a slogan. Route optimization now includes emissions per delivery alongside cost and time. That means leaders need to design systems that measure and report on sustainability KPIs as naturally as they report on latency or uptime.
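Treating emissions as one more term in the routing objective might look like the sketch below. The weights, route options, and every number are invented for illustration, and a weighted sum is only one possible way to combine the terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteOption:
    name: str
    cost_usd: float
    duration_min: float
    co2_kg: float
    deliveries: int


def co2_per_delivery(route: RouteOption) -> float:
    """Sustainability KPI: kilograms of CO2 per completed delivery."""
    return route.co2_kg / route.deliveries


def score(route: RouteOption, w_cost=1.0, w_time=0.5, w_co2=0.0) -> float:
    """Weighted cost per delivery; lower is better. The weights are hypothetical."""
    total = w_cost * route.cost_usd + w_time * route.duration_min + w_co2 * route.co2_kg
    return total / route.deliveries


def best_route(options, **weights) -> RouteOption:
    return min(options, key=lambda route: score(route, **weights))


# Invented options: the second route is slightly slower and costlier but emits far less.
options = [
    RouteOption("diesel-van", cost_usd=180, duration_min=240, co2_kg=60, deliveries=20),
    RouteOption("electric-van", cost_usd=190, duration_min=250, co2_kg=15, deliveries=20),
]

if __name__ == "__main__":
    print(best_route(options).name)             # emissions ignored
    print(best_route(options, w_co2=0.5).name)  # emissions weighted
    for route in options:
        print(route.name, co2_per_delivery(route), "kg CO2 per delivery")
```

With the emissions weight at zero the cheaper, faster route wins; once emissions carry weight, the choice flips. Reporting `co2_per_delivery` alongside cost and time makes that trade-off visible instead of implicit.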
Cross-functional orchestration is unavoidable at scale. The carrier selection platform spanning nine teams, the 99.999% SLA initiative across several AWS teams — these don’t happen with technical depth alone. You need to translate between engineering, operations, and business in a way that each group trusts.
Mindset-wise, the tools I used five years ago are already outdated, and whatever I’m using now will be too. Investing in adaptable architectures and teams that can evolve matters more than optimizing for the current stack.
The other thing I’d say is that how fast you learn matters more than how fast you ship. At Wayfair, taking project success from 58% to 90% wasn’t about working harder. It was about building processes that made learning from failures faster and cheaper. Treating failures as data rather than setbacks is what keeps you moving when the ground shifts. That’s what resilience looks like in practice — not just uptime, but the ability to adapt quickly when reality changes.





