Dataconomy

“Big Data”: A Problem, not a Solution – Interview with Data Scientist Cameron Davidson-Pilon

by Peadar Coyle
May 30, 2016
in Conversations

Cameron is an open source contributor, a Pythonista and a data geek who has developed various cool libraries. His blog is worth a read, and I personally recommend his screencasts. Like me, he has a strong mathematical background, and he is currently Data Team Lead at Shopify. He’s possibly most famous in the Python community for his excellent Bayesian Methods for Hackers. I also had the honour of contributing to that project.

Follow Peadar’s series of interviews with data scientists here.


Table of Contents

  • 1. What project have you worked on that you wish you could go back to and do better?
  • 2. What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?
  • 3. What do you wish you knew earlier about being a data scientist?
  • 4. How do you respond when you hear the phrase ‘big data’?
  • 5. What is the most exciting thing about your field?
  • 6. How do you go about framing a data problem?

1. What project have you worked on that you wish you could go back to and do better?

For sure, it was my projects during 2012, when I first started to enter Kaggle competitions. The two in particular I wish I could redo were the Twitter Psychopaths challenge and the US Census Return Rate challenge. In both challenges I made some serious high-level errors (but that’s the point of these challenges: to discover mistakes before they happen when it really matters!). I’ve detailed my mistake in the US Census challenge in my latest PyData presentation, “Mistakes I’ve Made”. Basically, I ignored population variance and replaced it with machine learning egotism.

Oh, I also remembered another project I would really love to go back to. In 2011, when I was doing research into stochastic processes, I started my first Python library (if you could even call it that), called PyProcess. You can still see it here. Notice that it is, embarrassingly, one large file filled with Python classes. The first iteration didn’t even use NumPy! I would love to go back and redo the entire thing, but two things hold me back: 1) it was a lot of work to test each stochastic process and make sure they were doing the right thing, and 2) I’m too far out of the field now.

2. What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?

If you’re not already learning and using Python or Scala, do that. Similarly, if you’re not already learning some software engineering, do that. What are some examples of data science software engineering?

  • writing (close to) professional-level code: thinking about proper abstractions, writing testable pieces, thinking about reusability
  • having code reviewed, and reviewing code yourself
  • writing tests

Why do I emphasize programming and software development so much? At a high level, data science is about using computers to do statistics for you. If you can’t properly use the former, then the most important tool in your toolbox is missing.



3. What do you wish you knew earlier about being a data scientist?

I wish I, and the rest of the field, knew about data cleaning. This is an important part of the whole data story and is glossed over. Specifically, the ETL pipeline (extract-transform-load). What I used to do was use SQL for the T part, but this caused too many problems (untestable, unmaintainable, unscalable). Now that is done prior to me even using the data for anything remotely complicated. This saves me time later, and allows the entire team to scale and benefit from my work (yes, I am still writing ETLs, and I expect all my team members to, too). The problem is, you can’t really teach ETLs until you have the data problem. Small companies (I mean really small companies) and tutorials online can assume data is fine. Not until one is submerged in changing data does the ETL process start to make sense. So, though I wish I knew this earlier, I probably couldn’t have learned it anyway!
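
To make the "testable transform" idea concrete, here is a minimal Python sketch of a T step written as plain functions rather than SQL. It is not taken from the interview: the record fields (email, amount, created_at) and the function names transform_order and transform are hypothetical, chosen only to show how a transform can be exercised with ordinary unit tests.

```python
from datetime import datetime, timezone


def transform_order(raw):
    """Clean one raw order record (a dict) into a normalized dict.

    Returns None when the record cannot be repaired.
    The field names are hypothetical, for illustration only.
    """
    email = (raw.get("email") or "").strip().lower()
    if "@" not in email:
        return None
    try:
        # Store money as integer cents to avoid float drift downstream.
        amount_cents = round(float(raw["amount"]) * 100)
        created_at = datetime.fromtimestamp(int(raw["created_at"]), tz=timezone.utc)
    except (KeyError, TypeError, ValueError):
        return None
    return {
        "email": email,
        "amount_cents": amount_cents,
        "created_at": created_at.isoformat(),
    }


def transform(rows):
    """Apply transform_order to every row and drop the unrepairable ones."""
    cleaned = (transform_order(row) for row in rows)
    return [row for row in cleaned if row is not None]
```

Because each function takes and returns plain data, a test is just a call with a small hand-written record, which is the property that is hard to get from a long SQL statement.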

4. How do you respond when you hear the phrase ‘big data’?

Sure, “Big Data” is a buzzword, but I think the issue with the name “Big Data” comes down to two camps: are you seeing “Big Data” as a solution (probably wrong) or as a problem (probably right)? For example, two common questions an organization might have are 1) find the number of unique visitors to our site in the past month, and 2) find me the median of this dataset. If your data is simply too big for memory, which is a good definition of big data, then we can’t solve either of these problems naively. What is really interesting about big data as a problem is the abundance of cool new algorithms and data structures being invented to solve these problems. For example, HyperLogLog estimates the number of unique values in a set of data too big for memory. And TDigest estimates the percentiles of data too big for memory (and hence can’t be sorted).
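
For readers who have not seen HyperLogLog, the following is a simplified Python sketch of the idea, not a production implementation and not code from the interview. It assumes SHA-1 (truncated to 64 bits) as the hash function and a default of 2^12 registers; the class name and the p parameter are choices made for this illustration. Each item is hashed, the first p bits pick a register, and the register remembers the longest run of leading zeros seen in the remaining bits.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog cardinality estimator (illustrative sketch)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha formula below
        # assumes m >= 128, i.e. p >= 7.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = x >> (64 - self.p)                     # first p bits: register index
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"visitor-{i}")
approx_unique = hll.estimate()  # close to 50,000, using only 4,096 small registers
```

With 4,096 registers the typical relative error is around 1.6%, and adding the same item again never changes the registers, which is why repeat visits do not inflate the count.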

5. What is the most exciting thing about your field?

I’ve already mentioned the interesting new algorithms for big data problems, so I won’t go over them again, but I do think they are very exciting. Another exciting thing is the new problems being discovered, and the solutions being used. For example, the recommendation problem of what to recommend to visitors to a site is a new problem that has massive impact, and is being solved by data. I can’t imagine Fisher or Pearson ever asking the question “what should I recommend next to this user?”. In a similar vein, we *are* seeing the reemergence of classical statistics. Classical techniques like survival analysis, clinical trials, and logistic regression are seeing a major comeback because new problems have been identified.

6. How do you go about framing a data problem?

Honestly, I try to turn it into a binomial problem. I use the beta-binomial model as a large crutch far too often, but it’s a really good initial model of a problem. If I can turn the problem into a binomial problem, then I have lots of tools I can work with: Bayesian analysis, sample-size appropriate ranking techniques, Bayesian Bandits, etc. If I can’t turn it into a binomial problem, I go through the rest of my toolbox: survival analysis, lifetime value, Bayesian modeling, classification, association analysis, etc. If I still can’t find an appropriate solution, then I have to expand my scope (and often learn a new tool while doing that).
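
As a rough illustration of why the beta-binomial model is such a convenient starting point, here is a minimal Python sketch using only the standard library. The function names and the uniform Beta(1, 1) prior are assumptions made for this example. With a Beta(a, b) prior and s successes in n trials, the posterior is Beta(a + s, b + n - s); ranking by posterior mean accounts for sample size, and drawing one sample per option from its posterior gives a simple Bayesian bandit (Thompson sampling).

```python
import random


def posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters after observing binomial data."""
    return prior_a + successes, prior_b + (trials - successes)


def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def thompson_pick(arms, rng):
    """Pick an arm by sampling once from each arm's posterior.

    arms maps a name to a (successes, trials) pair.
    """
    draws = {
        name: rng.betavariate(*posterior(s, n))
        for name, (s, n) in arms.items()
    }
    return max(draws, key=draws.get)


# Item A has a perfect raw rate (2/2) but almost no data;
# item B has 90/100. The posterior mean ranks B first.
mean_a = posterior_mean(*posterior(2, 2))      # 3 / 4 = 0.75
mean_b = posterior_mean(*posterior(90, 100))   # 91 / 102, about 0.892

rng = random.Random(0)
picks = [thompson_pick({"old": (5, 100), "new": (60, 100)}, rng) for _ in range(200)]
```

Thompson sampling still explores when two options have overlapping posteriors; in the example above the evidence is lopsided, so "new" is chosen essentially every time.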


(image credit: Jer Thorp)

Tags: Cameron Davidson-Pilon, Peadar Coyle, Shopify
