From Super Bowl traffic to market-open spikes, the challenge is the same: process events in real time without compromising data integrity. Khrystyna Terletska — who built latency-critical pipelines at DraftKings — explains how modern platforms safeguard data under stress, what differs across industries, and what’s common everywhere. For Khrystyna, reliability isn’t just a metric — it’s a commitment she learned to honor when the stakes are highest.
Khrystyna, when people hear “fault-tolerant systems,” many think of backups or duplicated servers. In reality, what does the architecture of a platform look like when it’s truly designed to never lose or corrupt critical data?
When people think about fault-tolerant systems, they often picture backups or maybe a few extra servers sitting around. In reality, it’s much more involved. A platform that’s truly fault-tolerant has to guarantee that even if a node crashes, a network partition occurs, or hardware fails completely, critical data is never lost or corrupted.
At DraftKings, this is especially important because we are dealing with odds updates in real time, and a single corrupted event could directly impact financial outcomes.
To achieve that level of reliability, replication is key — every update is written to multiple nodes, often across availability zones, before it is acknowledged as committed. That way, no single machine failure could cause data loss. We also rely heavily on durable event logs through Kafka. Every odds change is appended to the log, and if a consumer fails or a service goes down, we can restore its state simply by replaying the log from the correct offset. That gives us both strong durability and a deterministic way to rebuild state.
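As a rough illustration of the replay pattern Khrystyna describes, here is a minimal sketch using the confluent-kafka Python client: a consumer is pinned to an explicit offset and rebuilds its state by folding over the log. The topic name, partition, offset, and payload fields are illustrative assumptions, not the actual DraftKings schema.

```python
import json
from confluent_kafka import Consumer, TopicPartition

# Consumer with manual offset control: we choose where to start, not the broker.
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'odds-state-rebuilder',
    'enable.auto.commit': False,
})

# Rewind partition 0 of the (hypothetical) topic to a known-good offset.
consumer.assign([TopicPartition('odds-updates', 0, 0)])

state = {}  # market_id -> latest price seen in the log
while True:
    msg = consumer.poll(1.0)
    if msg is None:        # no more messages within the timeout: caught up
        break
    if msg.error():
        continue
    update = json.loads(msg.value())              # assumed JSON payload
    state[msg.key().decode()] = update['price']   # assumed field names

consumer.close()
```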
On top of that, we regularly test recovery scenarios, including node crashes, failovers, and consumer restarts. The system wasn’t just designed to survive them — it was validated under production-like load. For me, the real measure of fault tolerance is whether the system continues to deliver correct results during those failures, not just afterward when things are back to normal.
At DraftKings, you work with millions of real-time updates during major sporting events. What key architectural elements ensure data safety and integrity under such extreme loads?
I would say that at DraftKings the pressure comes from scale as much as from correctness. During big sporting events, the system has to ingest and process millions of updates per second, and it isn’t enough to just store them durably — they have to be delivered in order, without duplication, and fast enough for users to make decisions in real time.
The architecture is designed around partitioning and horizontal scale. Odds updates are sharded by keys such as event or market, which allows Kafka to distribute the load evenly across brokers and consumer groups. That partitioning is critical because it guarantees both throughput and ordering for each market. In addition, consumers are deliberately stateless, relying only on offsets from the event log. That means they can scale out dynamically, and if one crashes, another can immediately pick up processing without risk of losing state.
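To make the keyed-partitioning idea concrete, here is a minimal producer-side sketch with the confluent-kafka Python client. Keying every update by its market identifier is what pins a market's updates to one partition, and with it their ordering; the topic name, key format, and payload are assumptions for illustration.

```python
import json
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'acks': 'all',                 # wait for in-sync replicas before acknowledging
    'enable.idempotence': True,    # broker-side protection against duplicated retries
})

def publish_odds_update(market_id: str, update: dict) -> None:
    # All updates with the same key hash to the same partition, so each
    # market's updates are consumed in the order they were produced.
    producer.produce('odds-updates',
                     key=market_id,
                     value=json.dumps(update).encode('utf-8'))

publish_odds_update('market-123', {'selection': 'home', 'price': 1.85})
producer.flush()
```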
Another architectural decision is to make every consumer idempotent. In high-load systems, retries are inevitable — networks spike, brokers rebalance — but idempotency means the same update can be applied twice without introducing inconsistencies. Combined with strict replication policies and cross-zone distribution, this gives us confidence that updates will always be correct, even under heavy churn.
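Here is a minimal, in-memory sketch of what idempotent application can look like, assuming each update carries a unique event identifier (an illustrative assumption, not the real schema): applying the same update twice leaves the state unchanged.

```python
applied_ids: set = set()   # in production: durable storage, or versioned upserts instead
odds_state: dict = {}      # market_id -> current price

def apply_update(update: dict) -> None:
    if update['event_id'] in applied_ids:
        return                      # duplicate delivery (retry, rebalance): a no-op
    odds_state[update['market_id']] = update['price']
    applied_ids.add(update['event_id'])

update = {'event_id': 'evt-1', 'market_id': 'market-123', 'price': 1.85}
apply_update(update)
apply_update(update)                # delivered twice, applied once
assert odds_state == {'market-123': 1.85}
```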
Finally, we close the loop with real-time monitoring and backpressure. The architecture isn't just built to move data fast; it is also built to sense when downstream systems are at risk and automatically slow ingestion to protect integrity. That's what keeps the system safe under load. To me, that's the essence of designing for high-throughput reliability — it's not just about surviving failures, but about ensuring correctness while operating at peak scale.
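One simple way to implement that kind of backpressure is to pause consumption when a bounded buffer in front of a slower downstream fills up, and resume once it drains. The sketch below uses the confluent-kafka client's pause/resume calls; the thresholds, topic, and buffer are illustrative.

```python
import queue
from confluent_kafka import Consumer

downstream = queue.Queue(maxsize=10_000)   # bounded buffer feeding a slower sink
consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                     'group.id': 'odds-processor'})
consumer.subscribe(['odds-updates'])

paused = False
while True:
    msg = consumer.poll(0.1)

    # Sense pressure and slow ingestion before the downstream is overwhelmed.
    if not paused and downstream.qsize() > 8_000:
        consumer.pause(consumer.assignment())
        paused = True
    elif paused and downstream.qsize() < 2_000:
        consumer.resume(consumer.assignment())
        paused = False

    if msg is None or msg.error():
        continue
    downstream.put(msg.value())   # blocks rather than drops if the buffer is full
```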
In industries like banking, e-commerce, and media, data integrity is equally mission-critical. Which principles of fault-tolerant system design are truly universal across industries?
Across all these industries the same technical principles show up again and again in fault-tolerant design. One of the most important is strong durability through append-only logs. Whether it’s a transaction log in finance, an order log in retail, or a Kafka topic for event processing, having an immutable history of state changes allows you to rebuild or reconcile the system deterministically after a failure.
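A toy example of that determinism, assuming a simple ledger-style log: replaying the same immutable history always reconstructs the same state, which is what makes post-failure rebuilds and reconciliation possible.

```python
from typing import Iterable

def rebuild_balances(log: Iterable) -> dict:
    """Fold an append-only log of deltas into current balances."""
    balances: dict = {}
    for entry in log:               # entries are only ever appended, never mutated
        account = entry['account']
        balances[account] = balances.get(account, 0) + entry['delta']
    return balances

log = [
    {'account': 'A', 'delta': +100},
    {'account': 'A', 'delta': -30},
    {'account': 'B', 'delta': +50},
]
# Re-running the fold after a crash yields exactly the same state.
assert rebuild_balances(log) == {'A': 70, 'B': 50}
```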
Another universal principle is isolation and containment of faults. In other words, architectures are built with partitioning, sharding, and circuit breakers so that one slow service or corrupted component doesn't cascade across the entire platform. In betting, that means isolating markets; in banking, it means isolating accounts or ledgers; in media, it means partitioning streams. The concept is the same — scope failures to the smallest blast radius possible.
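The circuit-breaker part of that containment can be sketched in a few lines: after a run of consecutive failures the breaker opens and callers fail fast instead of piling up behind an unhealthy dependency. This is an illustrative sketch, not a production library.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # monotonic timestamp when the breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError('circuit open: failing fast')
            self.opened_at = None      # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # open the breaker
            raise
        self.failures = 0              # any success closes the breaker again
        return result
```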
Finally, consensus and ordering guarantees are a constant. In distributed systems, you can't assume events will arrive in order or only once. Mechanisms like two-phase commit, Raft consensus, or quorum writes are used across domains to ensure consistency. At DraftKings, ordering is critical because odds updates have to be applied sequentially; in banking, you can't credit an account before debiting it; in e-commerce, you can't ship before payment clears.
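A toy quorum-write sketch makes the idea concrete: a write is acknowledged only after a majority of replicas accept it, so a minority of failed nodes cannot lose acknowledged data. The in-memory replicas and the simulated failure below are stand-ins for real networked nodes.

```python
class FailingReplica(dict):
    """Simulates a replica that is currently unreachable."""
    def __setitem__(self, key, value):
        raise ConnectionError('replica unreachable')

replicas = [dict(), dict(), FailingReplica()]    # one of three replicas is down

def quorum_write(key: str, value: str) -> bool:
    acks = 0
    for store in replicas:
        try:
            store[key] = value                   # in production: a network call
            acks += 1
        except ConnectionError:
            pass                                 # a failed replica simply doesn't ack
    majority = len(replicas) // 2 + 1            # 2 out of 3
    return acks >= majority

# The write still succeeds with 2 of 3 acks; the data survives the failed node.
assert quorum_write('match-42:home-odds', '1.85')
```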
Those building blocks — immutable logs, fault isolation, and consensus for consistency — are the universal backbone of fault-tolerant systems. The implementation details change, but the architectural principles remain the same no matter the industry.
Tell us how you first became interested in this field — and, more broadly, how you became an engineer?
I’ve always been curious about what happens when systems fail or face unexpected pressure. Early in my career, I was excited not just about building features, but about asking “what happens if this service crashes, or if traffic suddenly triples?” That curiosity naturally led me into distributed systems and reliability engineering.
At DraftKings, that curiosity turned into real responsibility. Supporting major sporting events meant dealing with millions of real-time updates and extremely high traffic surges. I still remember my first big production event — traffic spiked within seconds, and every update had to be processed correctly, with zero room for error. Most importantly, that experience made it very clear to me that reliability isn't just a technical goal; it's what keeps a business running. It also pushed me to focus on fault tolerance, replication strategies, and recovery mechanisms, because when you're under that kind of load, the smallest inconsistency can have serious consequences.
More broadly, becoming an engineer felt natural because I've always enjoyed breaking down complex problems and making systems stronger. What excites me most is building platforms that don't just work when everything is perfect, but that stay resilient during failures, under heavy load, and in unpredictable conditions. That's what keeps me passionate about this field to this day.
In your work on fault-tolerant systems, what has been the bigger challenge — the technical complexity, or the responsibility of maintaining mission-critical services?
For me, the bigger challenge has often been the technical complexity, especially when it comes to performance optimization in fault-tolerant systems. At DraftKings, we aren't just focused on making the platform resilient to failures — we also have to make sure it can process millions of real-time updates per second with consistently low latency. That means going deep into the architecture and fine-tuning critical paths — for example, reducing memory allocations and improving how services handle high-throughput data streams.
The tricky part is balancing two goals that don't always align — making the system faster while still guaranteeing clean recovery from failures. Every change, even something as small as reworking memory management or optimizing how data is batched and processed, has to be validated not only for performance but also for resilience before shipping.
The responsibility of running mission-critical services is always there — during a major sporting event, the system cannot afford even a moment of inconsistency. That said, what makes the work most challenging is pushing for peak performance while ensuring the system can survive and recover from real-world failures. That tension has been difficult, but it's also what makes the work most rewarding.
How has your understanding of “system reliability” evolved as you’ve moved through different companies and industries?
Early in my career, I thought of reliability in fairly simple terms: if a service was running and responding, it was reliable. As I gained more experience, especially as I moved into high-load environments, I learned that availability alone doesn’t define reliability.
At first, reliability was about stability: avoiding crashes, adding monitoring, and making sure we could detect incidents quickly. Later, working with microservices at GR8 Tech, I saw how reliability also depends on resiliency patterns — for example, autoscaling, circuit breakers, and safe rollouts. It wasn’t just “Is the service alive?” but “Can the system adapt when something unexpected happens?”
At DraftKings, that view expanded further. Reliability here means designing for determinism under massive scale. It isn't enough to be up — every event has to be processed exactly once, in order, with guarantees that we can recover cleanly after a failure. That requires durable logs, replication across availability zones, idempotent processing, and continuous validation under load. I also started to think about latency as part of reliability. If the system is technically available but response times slow from 20 ms to 200 ms under load, users will see that as unreliable.
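That latency point can be expressed as a budget check rather than a feeling. Below is a small sketch with illustrative thresholds: availability alone would pass, but a p99 drifting from roughly 20 ms toward 200 ms breaches the budget and counts as a reliability failure.

```python
import statistics

def p99_ms(samples_ms: list) -> float:
    return statistics.quantiles(samples_ms, n=100)[98]   # 99th percentile

def within_latency_budget(samples_ms: list, budget_ms: float = 50.0) -> bool:
    return p99_ms(samples_ms) <= budget_ms

healthy = [20.0 + i * 0.01 for i in range(1_000)]    # roughly 20-30 ms responses
degraded = healthy[:-10] + [200.0] * 10              # 1% of requests at 200 ms

assert within_latency_budget(healthy)
assert not within_latency_budget(degraded)   # still "available", no longer reliable
```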
So my understanding has evolved from “uptime” to a much broader set of guarantees: correctness, durability, predictable latency, and graceful recovery. Today, reliability means designing the entire system so that users can continue to trust it, even when failures or spikes are happening behind the scenes.
In recent years, expectations for data availability have grown so high that even “five nines” (99.999%) uptime is no longer enough for many companies. Why is that?
I’m currently working with large-scale, real-time distributed systems, and what I’ve seen is that five nines often isn’t enough anymore. The number looks good on paper — less than six minutes of downtime per year — but availability today is about more than just being “up” or “down.”
During major sporting events, traffic can spike tenfold in seconds. Even if the system is technically online, if replication lags or data delivery slows, users will still see stale or inconsistent updates. From a traditional SLA perspective, you might still meet five nines, but from a reliability perspective you’ve already failed.
Another factor is scale. Modern platforms aren’t running in a single data center — they’re distributed across availability zones and regions. That brings challenges like consensus, quorum writes, and partition handling. If you only measure uptime, you miss the other dimensions of reliability: durability, ordering guarantees, and latency budgets.
So, for me, five nines is no longer the benchmark. Real reliability today means delivering correct, consistent, and fresh data at scale, even under partial failures or extreme load.
Looking ahead, do you think automation will significantly reduce the human role in ensuring fault tolerance, or will critical decisions always remain in the hands of engineers?
I think automation will continue to play a bigger role in fault-tolerant systems, especially for predictable patterns. In the systems I work with, we already rely on automation for scaling infrastructure, triggering failovers, reallocating resources, and detecting anomalies. These are tasks that follow clear rules, and automation can execute them faster and more consistently than humans. For example, when traffic spikes during a major sporting event, autoscaling and partition rebalancing happen automatically without manual intervention.
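A rule like that can be as simple as scaling the consumer count in proportion to observed lag. The sketch below is illustrative, not DraftKings' actual autoscaling policy, but it shows why this kind of decision is safe to automate.

```python
def desired_consumer_count(current: int, lag_per_consumer: float,
                           target_lag: float = 1_000.0,
                           min_consumers: int = 2,
                           max_consumers: int = 64) -> int:
    """Scale the consumer group in proportion to how far lag is from the target."""
    scaled = round(current * lag_per_consumer / target_lag)
    return max(min_consumers, min(max_consumers, scaled))

# Lag spikes to five times the target during a big event: scale out fivefold,
# within the configured bounds, with no human in the loop.
assert desired_consumer_count(current=8, lag_per_consumer=5_000) == 40
```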
That said, there are limits. Fault tolerance is not only about reacting to events — it’s also about making trade-offs. During a partition, for instance, you may face the CAP theorem head-on: do you prioritize consistency or availability? Automation can enforce whichever policy you’ve predefined, but it can’t decide in the moment whether stale data is acceptable for a few seconds or whether strict consistency is more critical to protect the business. Those decisions depend on context, and that’s where engineers are still needed.
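That division of labor can be seen in a short sketch: automation mechanically enforces whatever policy was chosen ahead of time, while choosing the policy, and deciding when to override it, stays with engineers. The policy names and store layout here are purely illustrative.

```python
from enum import Enum

class PartitionPolicy(Enum):
    PREFER_CONSISTENCY = 'consistency'     # refuse to answer without a quorum
    PREFER_AVAILABILITY = 'availability'   # serve local, possibly stale, data

def read(key, local_store: dict, quorum_reachable: bool, policy: PartitionPolicy):
    if quorum_reachable:
        return local_store.get(key)        # normal path: no trade-off to make
    if policy is PartitionPolicy.PREFER_AVAILABILITY:
        return local_store.get(key)        # stale-but-available answer
    raise RuntimeError('partition detected: rejecting read to preserve consistency')

# Automation applies the predefined rule; deciding which rule fits the business
# context is the part that stays with engineers.
store = {'market-123': 1.85}
assert read('market-123', store, quorum_reachable=False,
            policy=PartitionPolicy.PREFER_AVAILABILITY) == 1.85
```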
I see automation as taking over routine execution — scaling, failover orchestration, and replaying logs for recovery — while engineers remain responsible for the higher-level strategies and the exceptions where business impact and technical guarantees intersect. Put simply, automation strengthens fault tolerance, but the critical decisions remain with engineers.