Dataconomy

The Problem With (Statistical) False Friends

by Ian White
March 10, 2017
in Data Science, Data Science 101

I recently stumbled across a research paper, Using Deep Learning and Google Street View to Estimate the Demographic Makeup of the US, which piqued my interest because derivative uses of data are an ongoing research interest of mine. The authors used a variety of deep learning techniques to draw conclusions about the relationships among car ownership, political affiliation and demographics. Headline skimmers may be led to believe that researchers have just uncovered a vastly cheaper and more timely way to perform the national census and make predictive claims about the population.

The researchers’ contention that official statistics are expensive and lagging is spot on. The principal US unemployment survey is performed in person or via telephone. Mystery shoppers still go into the field to purchase the underlying goods in the Consumer Price Index. Monthly government statistics are typically released several weeks after the close of the period and revised multiple times. The more infrequent the release, the longer the tabulation period. And for good reason.

These are national statistics, and by government mandate they are required to have a transparent, consistent and well-understood methodology. When countries lie, they get found out. Ask Argentina about bogus inflation statistics. And that wasn’t even the dumb part: the difference between provincial government and national stats (black line) during the time in question is obvious to anybody who can read a chart:

[Chart: Argentine inflation as reported in provincial government versus national statistics]

Or analyze online prices in Argentina, compute a price index and reach a similar conclusion. This initiative turned into the Billion Prices Project at MIT, one of the innumerable research projects that use novel or alternative approaches to measure macro trends in a timely manner. Other highlights include Google’s use of flu-related search terms to indicate current influenza rates (which worked until it didn’t) and near-real-time reporting of unemployment rates across EU member states. But I digress…
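To make the price-index idea concrete, here is a minimal sketch of one textbook way to turn matched online prices into an index: an unweighted geometric mean of price relatives (a Jevons-style index). This is not a description of the Billion Prices Project's actual methodology; the function name `jevons_index` and the product prices are made up purely for illustration.

```python
import math


def jevons_index(base_prices, current_prices):
    """Unweighted geometric mean of price relatives, with the base period = 100.

    Only products priced in both periods are compared (a "matched" sample).
    """
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no products are priced in both periods")
    # Average the log price relatives, then exponentiate: a geometric mean.
    mean_log_relative = sum(
        math.log(current_prices[p] / base_prices[p]) for p in matched
    ) / len(matched)
    return 100.0 * math.exp(mean_log_relative)


# Hypothetical scraped prices (illustrative numbers, not real data).
january = {"milk_1l": 10.0, "bread_loaf": 20.0, "rice_1kg": 40.0}
february = {"milk_1l": 11.0, "bread_loaf": 22.0, "yerba_500g": 35.0}

index = jevons_index(january, february)
print(f"Price index (January = 100): {index:.1f}")
```

With these toy numbers, the two products that appear in both months each rose 10%, so the index comes out at 110. Products that drop out of or enter the sample are simply ignored here, which a real index would have to handle more carefully.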

Relying on the Google Street View study cited above can lead to spurious claims when its findings are taken out of context. I’m sure the authors are rolling their eyes at the passage below, because nobody is suggesting that polling can be better performed by knowing automobile ownership (not to mention the bias).

For example, the vehicular feature that was most strongly associated with Democratic precincts was sedans, whereas Republican precincts were most strongly associated with extended-cab pickup trucks (a truck with rear-seat access). We found that by driving through a city for 15 minutes while counting sedans and pickup trucks, it is possible to reliably determine whether the city voted Democratic or Republican: if there are more sedans, it probably voted Democrat (88% chance) and if there are more pickup trucks, it probably voted Republican (82% chance).
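Stated as code, the counting rule in that passage is strikingly simple. The sketch below is only a paraphrase of the quoted heuristic, not the researchers' deep learning pipeline; the function name `guess_city_vote` and the tallies are hypothetical, and the two probabilities are the figures given in the quote.

```python
def guess_city_vote(sedans, pickups):
    """Apply the sedan-versus-pickup counting rule from the quoted passage.

    Returns (label, quoted_probability), or None on a tie, which the
    quote does not cover.
    """
    if sedans < 0 or pickups < 0:
        raise ValueError("counts must be non-negative")
    if sedans > pickups:
        return ("Democratic", 0.88)  # 88% figure taken from the quote
    if pickups > sedans:
        return ("Republican", 0.82)  # 82% figure taken from the quote
    return None


# Hypothetical tallies from a 15-minute drive (made-up numbers).
print(guess_city_vote(sedans=42, pickups=17))
print(guess_city_vote(sedans=9, pickups=30))
```

Note that the rule speaks only to a city-level outcome; nothing in it says how any individual car owner voted.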

Also, while the study is interesting, commercial market research vendors such as Experian Automotive can tell you much of the same information without the heavy probabilistic approach. Other research approaches also exist. It is clear there is more than one way to skin a cat, but it’s difficult to know which method will yield the desired results (this analogy is still under development).

Kudos to the research team on the technical side, but in the context of survey design and of synthesizing a body of research, they really missed the boat. With the flood of non-traditional data sources available, it is easier than ever to make inferences that lead to cognitive and statistical over-fitting. Chris Anderson’s WIRED essay on the topic from nearly a decade ago was prescient and should be required reading.
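The statistical half of that over-fitting risk is easy to demonstrate. The sketch below uses only Python's standard library to correlate a purely random target with purely random candidate features; the helper names and parameters are hypothetical, and the data is synthetic noise rather than anything from the study. The more candidates you screen, the stronger the best apparent relationship looks, even though none of them is real.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a random target and unrelated random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


few = best_spurious_correlation(n_samples=20, n_features=5)
many = best_spurious_correlation(n_samples=20, n_features=500)
print(f"best |r| among   5 noise features: {few:.2f}")
print(f"best |r| among 500 noise features: {many:.2f}")
```

Because both calls share a seed, the 500-feature run screens a superset of the 5-feature run, so its best correlation can only be equal or larger. A small sample combined with a wide set of candidate signals is exactly the setting in which a chance pattern can pass for a finding.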

Key findings from studies that rely on high-dimensional data can serve as hypotheses for further interrogating research where there are questions about data paucity or legitimacy. This is evident in the case of the Argentinian inflation rate, and there are countless examples across the global supply chain, human migration patterns and consumer preferences. Research into big data and novel analytics could be advanced by considering the impact of these proxy indicators on the domain(s) in question. This would compel researchers to be more robust in research design and foster cross-disciplinary thinking.

 

Tags: data science, Proxy Indicators, statistics
