Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
  • AI
  • Tech
  • Cybersecurity
  • Finance
  • DeFi & Blockchain
  • Startups
  • Gaming
Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Tabs or spaces?

byFelipe Hoffa
January 4, 2017
in Articles
Home Resources Articles
Share on FacebookShare on TwitterShare on LinkedInShare on WhatsAppShare on e-mail
Google Preferred Source

Using tabs or spaces when writing a new line of code has been one of the fiercest battles ever fought among coders. Because we don’t live in a perfect world where everybody indents and aligns according to the same standards, the debate is ultimately reduced to how source-code is displayed in editing software. Conflict arises because code doesn’t display nicely if there is no consistency throughout the file, which can get especially tricky when several coders are involved in the same project.

The debate has been around for so long that it has become polarizing enough, so that coders now proudly self-classify to as “tab people or “space people.”

Tabs or spaces – A battle in BigQuery

Our long-time collaborator and Google developer advocate Felipe Hoffa set out to find an answer. Felipe’s data came from GitHub files stored in BigQuery. He considered the top 400,000 repositories, sorted by number of starts received on Github between January and May, 2016.  He only considered files with more than 10 lines of code, and excluded duplicates. In cases where files used both tabs and spaces, he gave the vote to whichever method was used more frequently.

Stay Ahead of the Curve!

Don't miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.

Below is his original post on Medium, where he parses a billion files among 14 programming languages to decide which one is on top – tabs, or spaces.


 

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Tabs or spaces?

The rules:

  • Data source: GitHub files stored in BigQuery.
  • Stars matter: We’ll only consider the top 400,000 repositories — by number of stars they got on GitHub during the period Jan-May 2016.
  • No small files: Files need to have at least 10 lines that start with a space or a tab.
  • No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
  • One vote per file: Some files use a mix of spaces or tabs. We’ll count on which side depending on which method they use more.
  • Top languages: We’ll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go).

Numbers

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Tabs or spaces?

How-to

I used the already existing [bigquery-public-data:github_repos.sample_files] table, that lists the files of the top 400,000 repositories. From there I extracted all the contents for the files with the top languages extensions:

SELECT a.id id, size, content, binary, copies, sample_repo_name , sample_path
FROM (
  SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name 
  FROM [bigquery-public-data:github_repos.sample_files] 
  WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
  GROUP BY id
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id=b.id
864.6s elapsed, 1.60 TB processed

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

In the [contents] table we have each unique file represented only once. To see the total number of files and size represented:

SELECT SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Tabs or spaces?
1 billion files, 14 terabytes of code

Then it was time to run the ranking according to the previously established rules:

SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
FROM (
  SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext, 
         SUM(best='tab') tabs, SUM(best='space') spaces, 
         COUNT(*) countext
  FROM (
    SELECT sample_path, sample_repo_name, IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
           COUNT(line) WITHIN RECORD c
    FROM (
      SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name 
      FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
      HAVING REGEXP_MATCH(line, r'[ \t]')
    )
    HAVING c>10 # at least 10 lines that start with space or tab
  )
  GROUP BY ext
)
ORDER BY countext DESC
LIMIT 100
16.0s elapsed, 133 GB processed

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.


Space people, rejoice

According to the data, the winner here is spaces. C is the only major language where spaces were not used in the most popular files on GitHub. This might just be as close as we will ever get in discovering whether tabs or spaces are more popular.

The reason for the popularity of spaces might be that they display consistently across all hardware and text viewing software. However, it is still unclear whether spaces are objectively, qualitatively better than tabs.

 

Like this article? Subscribe to our weekly newsletter to never miss out!

Follow @DataconomyMedia

Tags: Data analysisGithubGoogle BigQuerysurveillance

Related Posts

How Public Web Data Can Strengthen Environmental Protection

How Public Web Data Can Strengthen Environmental Protection

June 10, 2026
How automation tools are being integrated into professional networking

How automation tools are being integrated into professional networking

May 31, 2026
Autonomous agentic UI orchestration for high-throughput enterprise ecosystems

Autonomous agentic UI orchestration for high-throughput enterprise ecosystems

May 31, 2026
Freedom Holding Corp.: Competing through data and integration

Freedom Holding Corp.: Competing through data and integration

May 15, 2026
First Round Capital’s Network Shows Where Seed Capital Is Landing

First Round Capital’s Network Shows Where Seed Capital Is Landing

May 5, 2026
The silence in the machine: Reclaiming authority in the age of digital noise

The silence in the machine: Reclaiming authority in the age of digital noise

April 22, 2026
Please login to join discussion

LATEST NEWS

Google Gemini outage affects users reporting error 1076 and 1099

Geoffrey Hinton rethinks AI’s role in warfare after Ukraine conflict

Logitech launches foldable Mobi Fold mouse for mobile workers

Anthropic launches Claude Fable 5 ahead of $965 billion IPO

Hasbro launches AI licensing studio Sixth Wall for approved character use

EU moves to ban transactions on 11 crypto platforms tied to Russia

BEST AI MODELS LEADERBOARD

See the best AI models, ranked by intelligence, benchmark results, speed and token price. Find the most suitable LLMs, Text-to-Image, Image Editing, Text-to-Speech, Text-to-Video and Image-to-Video  artificial intelligence model for your tasks and business.

LATEST TOOLS

Roboto AI

Pickaxe

Pfpmaker

MindPal

Syllaby

ScreenApp

FinanceBrain

GitHub Spark

Hints

VisionStory AI

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy

Follow Us

  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI tools
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
No Result
View All Result
Subscribe

This website uses cookies to improve your experience. You can choose to accept or reject them. Visit our Privacy Policy.