Dataconomy

The Problem With (Statistical) False Friends

by Ian White
March 10, 2017
in Articles

I recently stumbled across a research paper, Using Deep Learning and Google Street View to Estimate the Demographic Makeup of the US, which piqued my interest in derivative uses of data, an ongoing research interest of mine. The authors used a variety of deep learning techniques to draw conclusions about the relationships among car ownership, political affiliation and demographics. Headline skimmers may be led to believe that researchers have just uncovered a vastly cheaper and more timely way to perform the national census and make predictive claims about the population.

The researchers’ contention that official statistics are expensive and lagging is spot on. The principal US unemployment survey is performed in person or via telephone. Mystery shoppers still go into the field to purchase the underlying goods in the Consumer Price Index. Monthly government statistics are typically released several weeks after the close of the period and revised multiple times. The more infrequent the release, the longer the tabulation period. And for good reason.

These are national statistics, and by government mandate are required to have a transparent, consistent and well-understood methodology. When countries lie, they get found out. Ask Argentina about bogus inflation statistics. And that wasn’t even the dumb part–the difference between provincial government and national stats (black line) during the time in question is obvious to anybody who can read a chart:

[Chart: Argentine inflation as reported by provincial governments versus the national statistics (black line)]

Or analyze online prices in Argentina, compute a price index and reach a similar conclusion. This initiative turned into the Billion Prices Project at MIT and is one of innumerable research projects that use novel or alternative approaches to measure macro trends in a timely manner. Other highlights include Google’s use of flu-related search terms to indicate current influenza rates (which worked until it didn’t) and near-real-time reporting of unemployment rates across EU member states. But I digress…
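To make "compute a price index" concrete, here is a minimal sketch of one common construction: a chained index built from the geometric mean of day-over-day price relatives (often called a Jevons index). This is an illustration only, not the Billion Prices Project's actual methodology. The function name `jevons_index`, the product names and the prices are all made up for the example.

```python
from math import exp, log


def jevons_index(prices_by_day, base=100.0):
    """Chain a daily index from the geometric mean of price relatives.

    prices_by_day: list of dicts mapping product id -> price, one dict per day.
    Only products quoted on both of two consecutive days contribute to a link.
    """
    index = [base]
    for prev, curr in zip(prices_by_day, prices_by_day[1:]):
        matched = [p for p in curr if p in prev]
        if not matched:
            raise ValueError("no products in common between consecutive days")
        # Averaging log price relatives and exponentiating gives the geometric mean.
        mean_log_rel = sum(log(curr[p] / prev[p]) for p in matched) / len(matched)
        index.append(index[-1] * exp(mean_log_rel))
    return index


# Hypothetical scraped prices, invented purely for illustration.
scraped = [
    {"milk": 10.0, "bread": 20.0},
    {"milk": 11.0, "bread": 22.0},
    {"milk": 11.0, "bread": 22.0, "rice": 5.0},
]
daily_index = jevons_index(scraped)
print([round(v, 2) for v in daily_index])
```

Matching products across consecutive days is a design choice in this sketch: a newly listed item (here, rice) has no previous price, so it only starts to influence the index from its second day onward.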

Relying on the Google Street View study cited above can lead to spurious claims when its findings are taken out of context. I’m sure the authors are rolling their eyes at the passage below, because nobody is suggesting that polling can be better performed by knowing automobile ownership (not to mention the bias).

For example, the vehicular feature that was most strongly associated with Democratic precincts was sedans, whereas Republican precincts were most strongly associated with extended-cab pickup trucks (a truck with rear-seat access). We found that by driving through a city for 15 minutes while counting sedans and pickup trucks, it is possible to reliably determine whether the city voted Democratic or Republican: if there are more sedans, it probably voted Democrat (88% chance) and if there are more pickup trucks, it probably voted Republican (82% chance).
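The percentages in that passage are statements of the form "given that the rule predicted X, how often was X correct?" A minimal sketch of the counting rule and of how such per-prediction hit rates are computed follows. The records are invented for illustration and are not the study's data, the names `predict` and `hit_rates` are hypothetical, and leaving ties unclassified is an assumption, since the quoted passage does not say how ties are handled.

```python
def predict(sedans, pickups):
    """Majority rule from the quoted passage; ties are left unclassified (assumption)."""
    if sedans > pickups:
        return "D"
    if pickups > sedans:
        return "R"
    return None


def hit_rates(records):
    """For each predicted label, the share of predictions that matched the actual vote."""
    calls = {"D": 0, "R": 0}
    hits = {"D": 0, "R": 0}
    for sedans, pickups, actual in records:
        guess = predict(sedans, pickups)
        if guess is None:
            continue  # skip cities where the counts are tied
        calls[guess] += 1
        if guess == actual:
            hits[guess] += 1
    return {label: hits[label] / calls[label] for label in calls if calls[label]}


# Hypothetical (sedans counted, pickups counted, actual vote) records.
records = [
    (30, 10, "D"),
    (25, 12, "D"),
    (18, 9, "R"),
    (8, 22, "R"),
    (5, 5, "D"),
    (11, 19, "R"),
]
rates = hit_rates(records)
print(rates)
```

Note that these rates condition on the prediction, not on the outcome, so they depend on how many Democratic and Republican cities are in the sample. That is one reason such figures travel poorly when quoted out of context.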

Also, while the findings are interesting, commercial market research vendors, such as Experian Automotive, can tell you much of the same information without the heavy probabilistic approach. Other research approaches also exist. It is clear there is more than one way to skin a cat, but it’s difficult to know which method will yield the desired results (this analogy is still under development).

Kudos to the research team for the technical work, but in terms of survey design and synthesizing the existing body of research, they really missed the boat. With the flood of non-traditional data sources available, it is easier than ever to make inferences that lead to cognitive and statistical over-fitting. Chris Anderson’s WIRED essay on the topic from nearly a decade ago was prescient and should be required reading.
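The over-fitting risk is easy to demonstrate. The sketch below, which uses purely synthetic random numbers and has no connection to the study's data, generates a random "target" and a thousand equally random "features", then reports the strongest correlation found. The sample and feature counts are arbitrary choices for the illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_features = 20, 1000

# Target and features are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]

correlations = [abs(pearson(f, target)) for f in features]
best = max(correlations)
print(f"strongest |r| among {n_features} noise features: {best:.2f}")
```

With few observations and many candidate variables, at least one noise feature will usually look like a respectable predictor. That is the false friend in the title: a relationship that appears only because enough candidates were searched.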

Key findings from studies that rely on high-dimensional data can be used as hypotheses to further interrogate research where there are questions about data paucity or legitimacy. This is evident in the case of the Argentinian inflation rate, and there are countless other examples across global supply chains, human migration patterns and consumer preferences. Research into big data and novel analytics could be advanced by considering the impact of these proxy indicators on the domain(s) in question. This would compel researchers to be more robust in research design and foster cross-disciplinary thinking.

 


Tags: data science, statistics, surveillance
