For decades, companies have lived by the mantra “the customer is king”. But in the age of the Internet, when users generate hordes of data, not all of it useful or accurate, the rules of the game have changed. We recently spoke to Tye Rattenbury, Trifacta’s lead Data Scientist, about how he dealt with masses of user-generated data in his previous role at Facebook, as well as in his current role at Trifacta.
Tell us a little bit about yourself and your work.
I joined Trifacta at the beginning of June of last year. I was in grad school at Berkeley with the co-founders Jeff Heer and Joe Hellerstein, and I had worked with Joe at Intel, where he was managing the Intel Research Programming Lab. I was working for Intel in a lot of different groups, so we were aware of each other but not directly working on any projects together.
I had been living outside of San Francisco for some time and was just starting to make my way back when Jeffrey asked if I wanted to get involved, given the domain experience I have in that area. I think that experience comes in two flavors: if you think about Trifacta’s product, there’s a very explicit balance between data, in the engineered, mathematical, traditional sense, and designing a good interface, good interaction, and a good user experience.
My Ph.D. work was really about applied artificial intelligence projects that built interfaces for people, so I had that outlet starting in grad school. From grad school, I went to work at Intel in a group that essentially studied how people use technology. At the time, that mostly meant ethnographic, qualitative research methods. When I showed up, we started working on more hybrid methods that mixed what they were doing from a qualitative perspective with much more quantitative behavior tracking, modelling, and analysis. That balance between a quantitative, data-rich, analytical perspective and a more creative, qualitative, design perspective is something I’ve been bouncing between for a while.
This balance also came into play when I was working at R/GA. There I was co-leading a group that was essentially finding creative solutions around data. Sometimes that meant data sets, sometimes data-driven products. It required understanding, technically, what data we were working with and what insights we could derive from it, either to create an experience or to drive the product. But we had to balance that technical perspective with the design perspective: building a good experience on top of that data, and essentially identifying which data was useful from a user-experience standpoint.
With Trifacta, I think they ultimately decided, “You’re probably a good fit, both in terms of the DNA we’re building for the company and the domain experience we’re looking for.”
I understand you were on the team at Facebook that was tasked with understanding and cleaning up user-generated data, and with answering questions like how many people on the platform actually went to Harvard. Talk us through this process.
The generic flavor of the problem that data science solves is that when you’re working with a data set, you will often run into anomalies or discrepancies where, when you first see them, it is unclear what the appropriate response is. The appropriate response could be anything from, “That’s a really interesting insight, that has publication value, that’s amazing that we just found that,” to “That’s a complete problem in the way we collect our data, or some bias in our systems, or some breakdown in our system, so we need to go fix something.” It’s about discovering where along that spectrum the appropriate response lies.
On a generic level, Facebook’s dataset is split roughly into two parts. One is everything they know about people, such as names and pictures. The other is everything they know about things in the world, and all of the links between these things. So I log in to Facebook, and Facebook asks, “Where did you go to school?” I see a text box, and I type in “Harvard”. The system might then say, “We don’t actually have a Harvard in our database, so we’ll create an object that represents this thing you say you have a link to, stick it in our data set, and put a link between you and it.” Then someone else says, “Hey, I also went to Harvard,” and they connect to that same object that’s already been created.
Then someone new shows up and says, “Oh, I went to Harvard University,” or “I went to Harvard then Princeton,” or “I went to Harvardd.” How do we know, how does Facebook know, that all of these text entries are supposed to be the same thing? So Facebook ends up creating multiple entities, even though, if we think about what they’re meant to represent, they’re supposed to represent the same thing. Now you let this process run, with people creating these entities, these schools they’re associated with, and when you go and look at the data set you see something like a thousand entries that are all various misspellings of Harvard. You look at that and go, “You know, actually, that all corresponds to one thing.”
One statistic you can run is how many unique school names show up in the Facebook data set. It’s something on the order of a hundred million, and you think, “Well, there are definitely not a hundred million places someone could have gone to school, so clearly there’s some problem with the data.” Then you start to dig in to try and understand what those different problems are.
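Facebook’s actual pipeline isn’t public, but a minimal sketch of this kind of misspelling consolidation, using standard-library string similarity and an invented score threshold, might look like this. Variants that score below the threshold map to `None`, i.e. they are left for manual review:

```python
from difflib import SequenceMatcher


def canonicalize(entries, canonical_names, threshold=0.85):
    """Map noisy user-entered school names onto known canonical entities.

    Entries whose best similarity score falls below the threshold map to
    None, meaning they need manual review.
    """
    mapping = {}
    for entry in entries:
        best_name, best_score = None, 0.0
        for name in canonical_names:
            score = SequenceMatcher(None, entry.lower(), name.lower()).ratio()
            if score > best_score:
                best_name, best_score = name, score
        mapping[entry] = best_name if best_score >= threshold else None
    return mapping


variants = ["Harvard", "Harvardd", "Hardvard", "UC Boulder"]
merged = canonicalize(variants, ["Harvard"])
# Close misspellings collapse onto "Harvard"; "UC Boulder" stays unresolved.
```

Character-level similarity alone is a blunt instrument: longer variants like “Harvard University” fall below a strict threshold, which is part of why a residue of cases always ends up needing human judgment.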
You also have the opposite problem, where people put in acronyms for schools: someone might have gone to UC Boulder but written UCB. I also went to UCB, as in UC Berkeley. While you can call both UCB, in reality I’m pointing, or meant to point, to two different things. So now you also have this entity-sorting problem, and you need to start assigning people to the right entity.
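The reverse case, one string standing for several real entities, could be sketched as follows. The candidate lists and the “user city” signal here are hypothetical illustrations; in practice disambiguation would draw on many profile and network signals:

```python
# Hypothetical candidate table: one acronym points at several real schools.
ACRONYM_CANDIDATES = {
    "UCB": ["University of Colorado Boulder", "University of California, Berkeley"],
}


def disambiguate(acronym, user_city):
    """Pick the candidate school whose name mentions the user's city."""
    for school in ACRONYM_CANDIDATES.get(acronym, []):
        if user_city.lower() in school.lower():
            return school
    return None  # still ambiguous: leave for manual review
```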
So now we’re beginning to understand our problem: we have dozens of entities that mean the same thing, and entities that mean more than one thing. This is where domain knowledge comes in. At Facebook, domain understanding meant understanding the user experience that generated that data to begin with.
So it was important to go back and look at the text that Facebook had offered to users when it asked, “Where did you go to school?”, and at what was shown to those users based on what they typed into that text box. That tells you a lot about why people might answer the question one way or another.
I’ll give you one other example of that. If you looked at the Portuguese entities in that data set, there were a bunch of people who had said “Finished,” or “Completo,” or “Medio Completo”, that is, partially completed. What we eventually figured out was that when “Where did you go to school?” was translated, it actually became “What level of education do you have?” Basically a mistranslation. So we had a bunch of people answering the question with things like “I’m almost finished” or “I’m two years in.” Again, they were creating entities that were not appropriate responses to the question we set out to answer. Debunking that kind of thing is a lot of what we worked on. And once you really understood what the problems were, you could start to build up automated processes to address them.
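One such automated process could be a rule-based filter that flags entries answering “what level of education do you have?” rather than naming a school. The keyword list below is an illustrative assumption, not Facebook’s actual logic:

```python
# Hypothetical keyword list covering education-level phrases in English and
# Portuguese (substrings, so "complet" matches "Completo" and "Completed").
LEVEL_KEYWORDS = ("complet", "finish", "medio", "médio", "anos", "years in")


def looks_like_level_answer(entry):
    """Flag a user-entered 'school' that is really an education-level answer."""
    text = entry.lower()
    return any(keyword in text for keyword in LEVEL_KEYWORDS)
```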
Automation worked for about 80 percent of the problems. That left us with entities that were particularly unusual, where we had to talk with users to understand exactly why a value showed up. That’s the boundary where you don’t know enough about the specific nature of the problem to automate, and manual exploration becomes the most plausible option.
On to Trifacta; tell us a little more about your product.
Trifacta is focused on a process we call “data transformation,” which is also commonly referred to as data preparation, manipulation, cleaning, or wrangling. Trifacta allows users to work with data that lacks the structure or format required for analysis. The platform enables analysts to visualize the content of raw data in Hadoop and visually interact with that data to build a transformation script, which then defines a Hadoop job that outputs the data in the desired form.
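Trifacta compiles visual interactions into Hadoop jobs rather than Python, but the shape of the transformation step, raw delimited text into structured, analysis-ready records, can be sketched with the standard library (the field layout here is invented for illustration):

```python
# Invented example input: pipe-delimited raw lines with inconsistent
# casing and stray whitespace.
RAW = "alice|harvard university |2009\nbob| princeton|2011\n"


def transform(raw_text):
    """Turn raw delimited lines into structured, analysis-ready records."""
    records = []
    for line in raw_text.strip().splitlines():
        name, school, year = (field.strip() for field in line.split("|"))
        records.append({"name": name, "school": school.title(), "year": int(year)})
    return records


records = transform(RAW)
# Whitespace is trimmed, school names are title-cased, years become integers.
```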
It’s been widely publicized that 50-80 percent of the analysis process is spent cleaning or preparing data for analysis. Trifacta is focused on making this process more productive, efficient and even enjoyable. The platform is designed to help analysts successfully complete the most difficult and time-consuming part of the analysis process.
How do you differentiate yourselves from other competitors in the market?
We developed an entirely new approach to transforming data called Predictive Interaction™. In comparison to workflow-driven ETL tools, we allow the user to manipulate the content of the data to inform how it needs to be transformed – creating a more agile and productive process.
With Trifacta, users are no longer responsible for writing low-level code to transform data. Instead, we provide a familiar visual model that allows users to directly interact with the content of the dataset to prompt a prioritized list of recommended transformations to apply against the data.
Do you have any predictions in data science for 2015?
We will start to see data science (to the extent that it operates as a coherent entity) increasingly rely on the domain expertise of economists. The early days of data science were very math, statistics and programming oriented. Then there was the rise of the “computational social scientist,” which added sociology to the mix.
Many trend setting data science places are finding that sociology, and similar disciplines, tend to be retrospective, while other fields, like economics, offer simulation and auction modeling and other techniques to get more proactive and predictive with data. Of course, most economists don’t have the programming chops to land most data science jobs, but I think we’ll see that start to change significantly.
Are you currently looking for funding, or to hire any particular talent at the moment?
We’re always looking for top-notch talent across engineering, operations, sales/marketing, etc. Check out our open positions at http://www.trifacta.com/company/careers/.
(Image credit: Facebook network visualisation by Terry Chay)