Big Data 101: Intro To Probabilistic Data Structures

by Christopher Low
April 17, 2017
in Big Data, Data Science 101, Understanding Big Data

While analyzing big data, we often need to answer questions such as how many items a dataset contains, how many of them are unique, and how often each one occurs. Hash tables or hash sets are usually employed for this purpose. But when the dataset becomes so enormous that it cannot fit into memory all at once, we need special kinds of data structures known as probabilistic data structures. Streaming applications usually require processing the data in one pass, followed by incremental updates, and probabilistic data structures fit that processing model very well. They tolerate hash collisions, but keep the resulting error below a specified threshold, trading a small margin of error for a considerably smaller memory footprint and constant query time. This article discusses some commonly used probabilistic data structures:

Table of Contents

    • Bloom Filter
    • HyperLogLog
    • Count–min sketch
    • MinHash

1. Bloom Filter

A Bloom filter is a probabilistic data structure used to efficiently determine whether a given element is a member of a dataset. Structurally, it is a bit array of m bits, all initialized to 0. A query returns either “maybe in set” (with some margin of error) or “definitely not in set” (with full confidence), but never “definitely in set”. This means that false positives are possible but false negatives are not. The more elements are added, the higher the probability of false positives. Elements can only be added, never removed. To add an element, it is fed to n hash functions to obtain n array positions, and the bits at these positions are set to 1. When checking for membership, the element is again passed through the n hash functions to obtain n array positions. If any of the bits at these positions is 0, the element is definitely not in the dataset; if all of them are 1, the element might be.
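
To make the mechanics concrete, here is a minimal Python sketch of a Bloom filter. The class and parameter names (BloomFilter, m_bits, n_hashes) are illustrative, and the n hash functions are derived from two standard digests via double hashing, which is just one common way to obtain them:

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits=1024, n_hashes=5):
        self.m = m_bits
        self.n = n_hashes
        self.bits = [0] * m_bits          # bit array of m bits, all initialized to 0

    def _positions(self, item):
        # Derive n array positions from two independent digests (double hashing).
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.n)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1            # set the corresponding bits to 1

    def might_contain(self, item):
        # All bits set -> "maybe in set"; any 0 -> "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True ("maybe in set")
print(bf.might_contain("bob@example.com"))    # False, barring a false positive
```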

2. HyperLogLog

The count-distinct problem (or cardinality estimation problem) is the problem of finding the number of unique elements in a streaming dataset that may contain repeated elements. It comes up in many fields, for example when estimating the number of unique visitors to a website or of distinct motifs in a DNA sequence. Calculating the exact cardinality of a dataset requires memory proportional to the cardinality itself, which becomes prohibitively expensive for large datasets. HyperLogLog solves this by providing an estimate of the cardinality instead: it can count up to one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the observation that, for a stream of uniformly random hash values, a value whose first 1-bit appears at position k (i.e., preceded by k-1 leading zeros) is only expected to show up once roughly 2^k distinct values have been seen. In a stream of random binary numbers, the probability that a value starts with “1” is ~50%, and that it starts with “01” is ~25%. Observing “01” at the start of a value therefore suggests a cardinality of around 2^2 = 4.
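
The toy sketch below illustrates only the leading-zero intuition described above (a single Flajolet–Martin style estimator), not the full HyperLogLog algorithm, which averages many such estimators across registers and applies bias correction; the function names and 32-bit hash width are assumptions made for the example:

```python
import hashlib


def first_one_position(h, bits=32):
    # Position of the first 1-bit in a `bits`-wide hash value (1-indexed).
    for i in range(bits):
        if (h >> (bits - 1 - i)) & 1:
            return i + 1
    return bits


def rough_cardinality(stream):
    max_pos = 0
    for item in stream:
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_pos = max(max_pos, first_one_position(h))
    # If the rarest pattern seen has its first 1-bit at position k,
    # roughly 2^k distinct items were needed to produce it.
    return 2 ** max_pos


items = [f"user{i}" for i in range(1000)] * 3   # 1000 distinct items, each repeated
print(rough_cardinality(items))  # order-of-magnitude estimate of 1000; a single
                                 # estimator like this has high variance
```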

3. Count–min sketch

The count-min sketch records the frequency of element occurrences in a stream. Unlike a hash table, it gives up exact frequency counts in exchange for sub-linear memory: hash collisions cause some elements to be over-counted, but counts are never under-estimated. It is similar in spirit to a Bloom filter, but where a Bloom filter represents the dataset as a bitmap, a count-min sketch is a two-dimensional array of w columns and d rows, where the w and d parameters can be tuned to trade accuracy against the probability of achieving that accuracy.
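
Below is a minimal, illustrative count-min sketch in Python. The defaults for w and d and the way each row's hash function is derived (by salting a digest with the row index) are assumptions made for the example, not a reference implementation:

```python
import hashlib


class CountMinSketch:
    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]   # d rows of w counters

    def _column(self, item, row):
        # One hash function per row, obtained by salting the digest with the row index.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._column(item, row)] += count

    def estimate(self, item):
        # Collisions only ever inflate counters, so the minimum across the
        # d rows is the least over-counted (never under-counted) estimate.
        return min(self.table[row][self._column(item, row)] for row in range(self.d))


cms = CountMinSketch()
for word in ["spam", "spam", "ham", "spam"]:
    cms.add(word)
print(cms.estimate("spam"))  # 3, possibly slightly higher if collisions occur
```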

4. MinHash

MinHash is a technique for estimating how similar two datasets are, and it finds its applications in document clustering. It is a form of locality-sensitive hashing (LSH): it generates hash codes such that similar items tend to receive similar codes. Precomputed signatures are compared to decide whether two objects are similar enough to be examined in detail or whether the comparison should be discarded.
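
Here is a small illustrative MinHash example in Python that estimates the Jaccard similarity of two word sets. The salted-digest hash functions and the signature length of 100 are arbitrary choices for this sketch:

```python
import hashlib


def minhash_signature(items, num_hashes=100):
    # For each of the num_hashes hash functions, keep the minimum hash value seen.
    sig = []
    for i in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)
                       for x in items))
    return sig


def estimated_jaccard(sig_a, sig_b):
    # The fraction of positions where the two signatures agree approximates
    # the Jaccard similarity of the original sets.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = {"big", "data", "probabilistic", "structures", "stream"}
doc_b = {"big", "data", "hash", "structures", "stream"}
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
# close to the true Jaccard similarity of 4/6 ≈ 0.67
```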


Probabilistic data structures are a very powerful tool when working with big data in a streaming setting. The implementation details can be a bit daunting at first, but the underlying concepts are very simple. If you haven’t tried them before, head over to Amazon AWS or Microsoft Azure and fire up a VM to hack on. You may want to use Twitter’s Algebird library, which has implementations of the first three data structures described above.

 
