The Large Hadron Collider (LHC) is a multibillion-dollar particle accelerator that was built to answer some of the most fundamental questions about nature, such as the origin of the universe. But what does this machine have in common with the digital advertising industry? The answer is quite simple: big data.
To give a sense of scale: for the Higgs boson discovery, over 300 trillion proton-proton collision events were analyzed, of which only a few thousand were tagged as Higgs boson candidates. An easy way to visualize this is to picture an Olympic-size swimming pool filled entirely with sand, in which a single grain represents a Higgs boson. The convention in particle physics is to require a five-sigma level of certainty: a signal hypothesis is accepted only if the probability that statistical fluctuations alone, under the background-only hypothesis, could produce the observed number of events is less than 3e-7. That is roughly equivalent to getting 22 tails in a row when tossing a fair coin. Hence data had to be collected for 3 years before enough Higgs bosons were produced to be found in the vast sea of background events. The storage and analysis of this data was enabled by the Worldwide LHC Computing Grid (WLCG), which is a cluster of more than 150 computing centers located in more than 40 countries. The WLCG is designed to process up to 25 petabytes of LHC data annually.
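The five-sigma threshold and the coin-toss comparison can be checked with a few lines of Python. This is only a sketch: it assumes the one-sided Gaussian tail convention, and the function names are made up for illustration.

```python
import math

def one_sided_p_value(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def equivalent_coin_tosses(p):
    """Smallest run length k for which 0.5**k is no larger than p."""
    return math.ceil(math.log2(1.0 / p))

p_five_sigma = one_sided_p_value(5.0)
tosses = equivalent_coin_tosses(p_five_sigma)
print(f"p = {p_five_sigma:.2e}, about {tosses} tails in a row")
```

Under these assumptions the tail probability comes out near 2.9e-7. A run of 21 tails has probability about 4.8e-7 and a run of 22 about 2.4e-7, so the threshold sits between the two and the exact run length quoted depends on how one rounds.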
People consume Internet content at such a rate that billions of ad impressions are served across various platforms every day. In digital advertising, newly emerging demand-side platforms may end up serving tens of thousands of ad impressions before they can expect a single conversion. Machine learning software often requires hundreds of thousands of signal events as a prerequisite. Taken together, these figures imply on the order of a billion impressions before the tools used to tackle big data can be deployed.
Once data is abundant, machine learning algorithms such as logistic regression, artificial neural networks and decision trees, to name a few, can be deployed to predict how the features of a dataset contribute to determining an event type: signal or noise. In the case of the Higgs boson discovery, the energy and direction of the decay particles were measured by detectors the size of football fields. These measurements are digitized and converted into particle types by offline software that is trained using machine learning on simulations and real data. Similarly, in digital advertising, a certain action by the user can be predicted using a plethora of information, such as the site on which the ad is shown, the historical behavior of the user and the actual creative banner.
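To make the signal-versus-noise idea concrete, here is a minimal sketch of logistic regression trained by gradient descent. Everything in it is an assumption for illustration: the events are synthetic, the two numeric features are made up, and the helper names are hypothetical. It is not the software used at CERN or in any ad platform.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_events(n, rng):
    """Synthetic events with two made-up features; label 1 = signal, 0 = noise."""
    events = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.0 if label else -1.0
        features = [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
        events.append((features, label))
    return events

def predict(weights, bias, x):
    """Estimated probability that the event with features x is signal."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(events, epochs=200, lr=0.1):
    """Full-batch gradient descent on the logistic (cross-entropy) loss."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(events)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in events:
            err = predict(weights, bias, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

rng = random.Random(0)  # fixed seed so the run is repeatable
train_set = make_events(400, rng)
test_set = make_events(200, rng)
weights, bias = train(train_set)
correct = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in test_set)
accuracy = correct / len(test_set)
print(f"weights={weights}, bias={bias:.3f}, test accuracy={accuracy:.2f}")
```

The learned weights indicate how strongly each feature pushes an event toward the signal class, which is the sense in which such a model shows how features contribute to the event type. Real datasets would have many more features and far more imbalanced classes than this toy example.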
In particle physics the datasets are rather clean, and the background processes that compete with the signal are better understood thanks to precise measurements from past experiments. In mobile advertising, on the other hand, the dataset represents human behavior, occasionally entangled with algorithmic action generated by bot networks that mimic clicks and conversions. The human behavior is, to say the least, an amalgamation of genuine action, accidental action and incentivized action. This makes the datasets noisier than in the scientific case.
This striking overlap in tools means that scientific and commercial research can bounce ideas off each other and speed up the development of data handling technologies, which will continue to disrupt businesses and scientific research worldwide.
Dr. Sahill Poddar is currently a data scientist at LiquidM GmbH in Berlin. LiquidM is a mobile advertising management platform providing a full-stack technological solution to advertisers and ad networks. Prior to LiquidM, Sahill obtained his doctorate in particle physics from the University of Heidelberg and the European Organization for Nuclear Research (CERN). His doctoral thesis involved analyzing several million proton-proton collision events with the ATLAS detector in search of new physics signals such as extra dimensions, black holes and dark matter. Sahill has a keen interest in the emergence of big data in all relevant fields, from health care to black holes.
Image credit: http://home.web.cern.ch/about/computing