Cameron is an open source contributor, a Pythonista and a data geek – he’s developed various cool libraries. His blog is worth a read, and I personally recommend his screencasts. Like me, he has a strong mathematical background, and he is currently Data Team Lead at Shopify. He’s probably best known in the Python community for his excellent Bayesian Methods for Hackers, a project I also had the honour of contributing to.
Follow Peadar’s series of interviews with data scientists here.
1. What project that you have worked on do you wish you could go back to and do better?
For sure, it was my projects during 2012 when I first started to enter Kaggle competitions. The two in particular I wish I could redo were the Twitter Psychopaths challenge and the US Census Return Rate challenge. In both challenges I made some serious high-level errors (but that’s the point of these challenges, to discover mistakes before they happen when it really matters!). I’ve detailed my mistake in the US Census challenge in my latest PyData presentation, “Mistakes I’ve Made”. Basically, I ignored population variance and replaced it with machine learning egotism.

Oh, I also remembered another project I would really love to go back to. In 2011, when I was doing research into stochastic processes, I started my first Python library (if you could even call it that) called PyProcess. You can still see it here. Notice that it is, embarrassingly, one large file filled with Python classes. The first iteration didn’t even use NumPy! I would love to go back and redo the entire thing, but two things hold me back: 1) it was a lot of work to test each stochastic process and make sure it was doing the right thing, and 2) I’m too far out of the field now.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
If you’re not already learning and using Python or Scala, do that. Similarly, if you’re not already learning some software engineering, do that. What are some examples of data science software engineering?

– writing (close to) professional-level code
– thinking about proper abstractions, writing testable pieces, thinking about reusability
– having your code reviewed, and reviewing code yourself
– writing tests

Why do I emphasize programming and software development so much? At a high level, data science is about using computers to do statistics for you. If you can’t properly use the former, then the most important tool in your toolbox is missing.
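To give a concrete (and entirely made-up) example of what I mean by testable pieces and tests, a minimal sketch might look like this, with the function and its tests sitting side by side and run under pytest:

```python
import pytest


def conversion_rate(conversions, trials):
    """Observed conversion rate; a small, pure, easily testable function."""
    if trials == 0:
        raise ValueError("cannot compute a rate from zero trials")
    return conversions / trials


# Tests live next to the code and are run with pytest.
def test_conversion_rate():
    assert conversion_rate(5, 100) == 0.05


def test_conversion_rate_rejects_zero_trials():
    with pytest.raises(ValueError):
        conversion_rate(1, 0)
```

Code like this is easy to review, easy to reuse, and hard to silently break.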
3. What do you wish you knew earlier about being a data scientist?
I wish I, and the rest of the field, knew about data cleaning. This is an important part of the whole data story and is often glossed over. Specifically, the ETL pipeline (extract-transform-load). What I used to do was use SQL for the T part, but this caused too many problems (untestable, unmaintainable, unscalable). Now that is done prior to me even using the data for anything remotely complicated. This saves me time later, and allows the entire team to scale and benefit from my work (yes, I am still writing ETLs – I expect all my team members to, too). The problem is, you can’t really teach ETLs until you have the data problem. Small companies (I mean really small companies) and tutorials online can assume the data is fine. Not until one is submerged in changing data does the ETL process start to make sense. So, though I wish I knew this earlier, I probably couldn’t have learned it anyway!
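As a rough illustration of what moving the T out of ad-hoc SQL and into testable code can look like (the table and column names here are invented), a small transform step plus its test might be:

```python
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """The 'T' step of a tiny ETL job: normalize types and drop unparseable rows."""
    df = raw.copy()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Rows that fail to parse are dropped explicitly rather than kept silently.
    return df.dropna(subset=["order_date", "amount"])


def test_clean_orders_drops_unparseable_rows():
    raw = pd.DataFrame(
        {"order_date": ["2015-01-02", "not a date"], "amount": ["19.99", "oops"]}
    )
    cleaned = clean_orders(raw)
    assert len(cleaned) == 1
    assert cleaned["amount"].iloc[0] == 19.99
```

Because the transform is an ordinary function, it can be reviewed, unit tested, and reused by the whole team, which is exactly what the ad-hoc SQL version made hard.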
4. How do you respond when you hear the phrase ‘big data’?
Sure, “Big Data” is a buzzword, but I think the issue with the name comes down to two camps: are you seeing “Big Data” as a solution (probably wrong) or as a problem (probably right)? For example, two common questions an organization might have are 1) find the number of unique visitors to our site in the past month, and 2) find me the median of this dataset. If your data is simply too big for memory, which is a good definition of big data, then we can’t solve either of these problems naively. What is really interesting about big data as a problem is the abundance of cool new algorithms and data structures being invented to solve these problems. For example, HyperLogLog estimates the number of unique values in a dataset too big for memory. And t-digest estimates the percentiles of data too big for memory (and hence too big to sort).
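For a flavour of how these streaming algorithms work, here is a toy sketch of the HyperLogLog idea (the register count and the stream are made up, and a real implementation adds range corrections this omits):

```python
import hashlib


def hyperloglog_estimate(items, b=10):
    """Toy HyperLogLog: estimate the distinct count using m = 2**b small registers."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (m - 1)   # low b bits pick a register
        rest = h >> b       # remaining bits feed the rank
        rank = 1            # position of the first 1-bit in those bits
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant for large m
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))


# One million rows containing exactly 200,000 distinct ids, seen one at a time.
stream = (i % 200_000 for i in range(1_000_000))
print(hyperloglog_estimate(stream))  # typically within a few percent of 200,000
```

The full dataset never has to fit in memory: only the 1,024 registers do.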
5. What is the most exciting thing about your field?
I’ve already mentioned the interesting new algorithms for big data problems, so I won’t go over them again, but I do think they are very exciting. Another exciting thing is the new problems being discovered, and the solutions being used. For example, the problem of what to recommend to visitors of a site is a new problem with massive impact, and it is being solved with data. I can’t imagine Fisher or Pearson ever asking the question “what should I recommend next to this user?”. In a similar vein, we *are* seeing the reemergence of classical statistics. Classical techniques like survival analysis, clinical trials, and logistic regression are seeing a major comeback because new problems have been identified.
6. How do you go about framing a data problem?
Honestly, I try to turn it into a binomial problem. I use the beta-binomial model as a large crutch far too often, but it’s a really good initial model of a problem. If I can turn the problem into a binomial problem, then I have lots of tools I can work with: Bayesian analysis, sample-size appropriate ranking techniques, Bayesian Bandits, etc. If I can’t turn it into a binomial problem, I go through the rest of my toolbox: survival analysis, lifetime value, Bayesian modeling, classification, association analysis, etc. If I still can’t find an appropriate solution, then I have to expand my scope (and often learn a new tool while doing that).
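To make that concrete, here is a minimal sketch of framing a conversion question as a beta-binomial problem (the variant names and counts are invented):

```python
from scipy.stats import beta

# Two hypothetical variants with their conversion counts.
variants = {
    "A": {"conversions": 42, "trials": 1000},
    "B": {"conversions": 59, "trials": 1100},
}

# With a flat Beta(1, 1) prior, each conversion rate has posterior
# Beta(1 + conversions, 1 + trials - conversions).
posteriors = {
    name: beta(1 + d["conversions"], 1 + d["trials"] - d["conversions"])
    for name, d in variants.items()
}

# Compare posterior samples rather than raw rates, which respects sample size.
samples = {name: p.rvs(10_000) for name, p in posteriors.items()}
print("P(B beats A) ≈", (samples["B"] > samples["A"]).mean())
```

Ranking variants by posterior samples rather than raw observed rates is one example of the sample-size-appropriate ranking mentioned above.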
(image credit: Jer Thorp)