What’s the most misunderstood thing about “Big Data”?
In my experience, corralling data is harder than people expect, and analyzing it is easier than they expect. Not that either is easy in absolute terms, just relative to each other.
With corralling, if there’s any interruption at the front end, it quickly affects things downstream. Anticipating those downstream consequences and designing ingestion to be fault tolerant is difficult to do. Additionally, it can be very difficult to distinguish dirty data from clean.
Another challenge is establishing rules to identify dirty data. You can set up hard bounds, but that doesn’t help in all cases. It’s preferable to take a more statistical approach with anomaly detection. Splunk offers some rudimentary tools, but even simple statistics are far better than absolute bounds.
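As a rough illustration of that statistical approach (my own minimal sketch, with hypothetical field names and thresholds, not any tool mentioned here), you can flag values that drift outside a rolling z-score band instead of rejecting on fixed limits:

```python
# Minimal sketch: flag incoming values that fall outside a rolling statistical
# band instead of a fixed absolute bound. Window size and z-threshold are
# illustrative, not recommendations.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    """Keeps a sliding window of recent values and flags outliers by z-score."""

    def __init__(self, window_size=500, z_threshold=4.0):
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def is_anomalous(self, value):
        # Not enough history yet: accept the value and just record it.
        if len(self.window) < 10:
            self.window.append(value)
            return False
        mu = mean(self.window)
        sigma = stdev(self.window) or 1e-9  # guard against a zero-variance window
        z = abs(value - mu) / sigma
        self.window.append(value)
        return z > self.z_threshold

# Usage: flag suspicious points during ingestion rather than dropping the feed.
detector = RollingAnomalyDetector()
stream = [100 + i % 5 for i in range(50)] + [5000]  # mostly steady, one spike
for value in stream:
    if detector.is_anomalous(value):
        print("possible dirty data point:", value)
```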
Why do people underestimate it?
The data organizations I’ve been part of spend far more time corralling data than analyzing it, in terms of the effort assigned. The ratio is probably close to 5:1. Making software work is 10% of the effort; making it work for exceptional cases (data errors, out-of-order events, network problems) is the other 90%.
Example?
When I worked at Quantitative Risk Management, we set up an API for one of our vendors to drop data into every day. We built pretty simple extraction rules and counted on our provider to send us good, clean data. One day, columns were missing and it broke everything. We were caught unawares because we hadn’t planned for the possibility of a schema change. In retrospect, that was naïve, but we’d been using this vendor for a while and had designed our software around the assumption that the schema would never change.
What’s the one secret not being talked about with respect to your role and data?
People overlook how difficult it is to handle an ever-increasing volume of data; where do you keep it and what do you do with it? At Nomi, we kept every observation our WiFi sensors detected, and we built our software around the assumption that we would keep every data point forever.
We understood that we’d face challenges, but we also underestimated how difficult it would be to disentangle ourselves from that approach. Through that experience, I learned that no matter how early-stage a start-up you are, establish up-front assumptions about when and how you plan to access your data. But if it’s already too late, data is pouring in, and performance is starting to suffer, there are mechanisms that can help. Depending on what kind of data store you’re using, there are tools to expire or archive data. MongoDB has capped collections, which drop the oldest data once a collection reaches a certain size threshold. Other data stores like S3 have lifecycle management mechanisms that keep data hot for a given period, say 60 days, and then automatically purge it thereafter.
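Here is a minimal sketch of both mechanisms mentioned above; the bucket, database, and collection names are hypothetical, and the sizes and retention windows are just examples:

```python
import boto3
from pymongo import MongoClient

# MongoDB: a capped collection is fixed-size and automatically drops the
# oldest documents once the size limit is reached.
client = MongoClient("mongodb://localhost:27017")
db = client["sensor_data"]
if "observations" not in db.list_collection_names():
    db.create_collection("observations", capped=True, size=10 * 1024 ** 3)  # ~10 GB cap

# S3: a lifecycle rule that expires objects under a prefix after 60 days.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-raw-observations",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-after-60-days",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": 60},
            }
        ]
    },
)
```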
What’s the most frustrating thing about your work with data?
The most frustrating thing is developing processes around data that end up being fragile or requiring constant maintenance. When I was on the Market Data team at QRM, we would get a data dump at 8am, run our process on it, and release it to clients at 9am. If there was a QA issue, we had a very short window to detect and resolve it. When things go well, clients don’t think much of it; they’re supposed to go well. When things go wrong, they’ll remember it forever. Even though our vendor was quite reliable and there were mistakes maybe 1% of the time, it didn’t matter; clients have a long memory.
What I learned from this is that the more fragile your processes are, the more leeway you want to bake in. Automate as much as you can. We automated detection of our input’s format so we could tell whether a given input was ingestible. We then established ranges around input values. In an ideal world, we would have implemented a more sophisticated approach using statistical ranges rather than absolute values as thresholds; there’s so much variance that hard bounds are not always effective.
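As a minimal sketch of that kind of automated pre-release check (the column names and bounds here are hypothetical, not QRM’s actual rules), you might validate a feed’s schema and value ranges before processing it:

```python
import csv

EXPECTED_COLUMNS = {"date", "instrument_id", "price", "yield"}
RANGES = {"price": (0.0, 1_000_000.0), "yield": (-5.0, 50.0)}

def validate_feed(path):
    """Return a list of problems; an empty list means the file looks ingestible."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Detect schema changes (e.g. missing columns) before touching any rows.
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"schema change: missing columns {sorted(missing)}"]
        # Apply simple range checks to each value.
        for line_no, row in enumerate(reader, start=2):
            for field, (lo, hi) in RANGES.items():
                try:
                    value = float(row[field])
                except ValueError:
                    problems.append(f"line {line_no}: {field} is not numeric")
                    continue
                if not lo <= value <= hi:
                    problems.append(f"line {line_no}: {field}={value} outside [{lo}, {hi}]")
    return problems

issues = validate_feed("vendor_drop.csv")
if issues:
    print("holding the release, feed failed validation:", issues[:5])
```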
What’s the career accomplishment you’re most proud of?
While I was at Nomi, we scaled from 300M data points a week to more than 1B. Trying to figure out how to push our systems to the limit was intellectually rewarding. We had to experiment with ways to get higher throughput given our constraints.
Who has been the most important mentor in your career?
I’ve been fortunate to have many people mentor me from a number of perspectives. I see being in technology as always 20% learning / 80% practice. One person in particular who was instrumental in my shift from the application side to infrastructure was Michael Hamrah (now at Uber). He gave a fascinating presentation on how he was able to facilitate experimentation and scalability by designing a decentralized, easy-to-implement infrastructure. His premise was that by decoupling components and designing with usability in mind, you could remove the barriers to experimentation.
Even though Michael didn’t put it this way, I think our friends on the Oscar Health infrastructure team (Mackenzie Kosut and Brent Langston) put it best: “Make everything simple and reproducible.” If you force yourself to make every operation repeatable, you automate more. As a result, you can test your system against a new data set without any penalty, in a way that encourages experimentation.
Adopting these principles can be easier said than done. In many cases, you are not designing and building something new from the ground up; you’re making incremental improvements to existing infrastructure. It’s difficult to support something that’s currently working while trying to conceptualize and design its next generation, and balancing those two competing concerns isn’t easy. I wouldn’t say I’ve been great at it, but what I have observed is that the big rewrite from the ground up is rarely successful. It’s better to find some way to incrementally reduce constraints or increase capabilities.
What company is doing “Big Data” right?
I am very excited by what’s going on with Kafka, led by Confluent (http://www.confluent.io/). Turning data producers into first-class citizens is the next generation of data architecture. I find it very interesting to apply a push model to data architecture, versus the more traditional pull approach. Much of a data system’s end goal is to produce a data product: an analysis, a data dump, or a table that can be queried. Generally that’s done through a batch job that pulls the data required to create the product. Kafka and other new architectures turn that model on its head, letting producers of data trigger the action (e.g. the arrival of a data point or an API call causes the creation of the final output). That sort of push-based architecture is more reproducible, and it lets you have a single root data stream consumed by two versions of a pipeline so you can compare them head to head, which you can’t easily do with other systems.
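As a minimal sketch of that last point (using kafka-python, with hypothetical topic, group, and transform names, not a description of any specific Confluent tooling), two pipeline versions can consume the same root stream independently because each consumer group tracks its own offsets:

```python
import json
from kafka import KafkaConsumer

def run_pipeline(version, transform):
    # Each version joins its own consumer group, so both read the full stream.
    consumer = KafkaConsumer(
        "observations",                      # the single root data stream
        group_id=f"pipeline-{version}",      # separate offsets per version
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        output = transform(message.value)    # each version applies its own logic
        print(version, output)

# e.g. run_pipeline("v1", legacy_transform) and run_pipeline("v2", new_transform)
# in separate processes, then compare the two result sets head to head.
```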
What’s the one piece of advice you would share with a younger version of yourself?
The person who is most responsible for your personal development is yourself. Other people are consultants. If you find yourself working in a place that does not emphasize your learning and growth, choose the most challenging projects whenever possible. Try to stay up to date outside of work: attend Meetups, talk to peers, and contribute to open source projects. Part of working is investing in yourself, and places that don’t recognize that are limiting your growth potential.
What would you like to ask readers of this interview?
How are other people keeping up? I always find it quite challenging so I’m curious to hear what people are reading or attending.
This interview is part of Schuyler’s Data Disruptors series over on the StrongDM blog.