Thomas is Data Science Lead at Quantopian Inc which is a crowd-sourced hedge fund and algotrading platform. Thomas is a cool guy and came to give a great talk in Luxembourg last year – which I found so fascinating that I decided to learn some PyMC3
Follow Peadar’s series of interviews with data scientists here.
What project have you worked on do you wish you could go back to, and do better?
While I was doing my masters in CS I got a stipend to develop an object recognition framework. This was before deep learning dominated every benchmark data set and bag-of-features was the way to go. I am proud of the resulting software, called Pynopticon, even though it never gained any traction. I spent a lot of time developing a streamed data piping mechanism that was pretty general and flexible. This was in anticipation of the large size of data sets. In retrospect though it was overkill and I should have spent less time coming up with the best solution and instead spend time improving usability! Resources are limited and a great core is not worth a whole lot if the software is difficult to use. The lesson I learned is to make something useful first, place it into the hands of users, and then worry about performance.
What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Spend time learning the basics. This will make more advanced concepts much easier to understand as it’s merely an extension of core principles and integrates much better into an existing mental framework. Moreover, things like math and stats, at least for me, require time and continuous focus to dive in deep. The benefit of taking that time, however, is a more intuitive understanding of the concepts. So if possible, I would advise people to study these things while still in school as that’s where you have the time and mental space. Things like the new data science tools or languages are much easier to learn and have a greater risk of being ‘out-of-date’ soon. More concretely, I’d start with Linear Algebra (the Strang lectures are a great resource) and Statistics (for something applied I recommend Kruschke’s Doing Bayesian Analysis, for fundamentals “The Elements of Statistical Learning” is a classic).
What do you wish you knew earlier about being a data scientist?
How important non-technical skills are. Communication is key, but so are understanding business requirements and constraints. Academia does a pretty good job of training you for the former (verbal and written), although mostly it is assumed that communicate to an expert audience. This certainly will not be the case in industry where you have to communicate your results (as well as how you obtained them) to people with much more diverse backgrounds. This I find very challenging.
[bctt tweet=”Communication is key, but so are understanding business requirements and constraints.”]
As to general business skills, the best way to learn is probably to just start doing it. That’s why my advice for grad-students who are looking to move to industry would be to not obsess over their technical skills (or their Kaggle score) but rather try to get some real-world experience.
How do you respond when you hear the phrase ‘big data’?
As has been said before, it’s quite an overloaded term. On one side, it’s a buzzword in business where I think the best interpretation is that ‘big data’ actually means that data is a ‘big deal’ — i.e. the fact that more and more people realize that by analyzing their data they can have an edge over the competition and make more money.
Then there’s the more technical interpretation where it means that data increases in size and some data sets do not fit into RAM anymore. I’m still undecided of whether this is actually more of a data engineering problem (i.e. the infrastructure to store the data, like hadoop) or an actual data science problem (i.e. how to actually perform analyses on large data). A lot of times, as a data scientist I think you can get by by sub-sampling the data (Andreas Müller has a great talk of how to do ML on large data sets, here).
Then again, more data also has the potential to allow us to build more complex models that capture reality more accurately, but I don’t think we are there yet. Currently, if you have little data, you can only do very simple things. If you have medium data, you are in the sweet spot where you can do more complex analyses like Probabilistic Programming. However, with “big data”, the advanced inference algorithms fail to scale so you’re back to doing very simple things. This “big data needs big models” narrative is expressed in a talk by Michael Betancourt, here.
What is the most exciting thing about your field?
The fast pace the field is moving. It seems like every week there is another cool tool announced. Personally I’m very excited about the blaze ecosystem including dask which has a very elegant approach to distributed analytics which relies on existing functionality in well established packages like pandas, instead of trying to reinvent the wheel. But also data visualization is coming along quite nicely where the current frontier seems to be interactive web-based plots and dashboards as worked on by bokeh, plotly and pyxley.
How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
I try to keep the loop between data analysis and communication to consumers very tight. This also extends to any software to perform certain analyses which I try to place into the hands of others even if it’s not perfect yet. That way there is little chance to ween off track too far and there is a clearer sense of how usable something is. I suppose it’s borrowing from the agile approach and applying it to data science.
(image credit: Lauren Manning, CC2.0)