Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
  • AI
  • Tech
  • Cybersecurity
  • Finance
  • DeFi & Blockchain
  • Startups
  • Gaming
Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

Big Data 101: Intro To Probabilistic Data Structures

byChristopher Low
April 17, 2017
in Articles, Artificial Intelligence
Home Resources Articles
Share on FacebookShare on TwitterShare on LinkedInShare on WhatsAppShare on e-mail
Google Preferred Source

Oftentimes while analyzing big data we have a need to make checks on pieces of data like number of items in the dataset, number of unique items, and their occurrence frequency. Hash tables or Hash sets are usually employed for this purpose. But when the dataset becomes so enormous that it cannot fit inside the memory all at once, we need to use special kinds of data structures known as Probabilistic Data Structures. Streaming applications usually require data processing in one pass and then incremental updates. Fortunately, probabilistic data structures fit that processing model very well. Such data structures ignore collisions but errors are controlled under a certain specified threshold. They trade in a small margin of error for considerably less memory footprint and constant query time. This article discusses some commonly used probabilistic data structures:

  1. Bloom Filter

Bloom filter is a probabilistic data structure used to efficiently determine whether a given element is a member of the dataset or not. Structure-wise it is a bit array of m-bits initialized to 0. A query returns either “maybe in set” (with some margin of error) or “definitely not in set” (with full confidence) but not “definitely in set”. This means that false positive matches are possible but false negative matches are not. The more the number of elements, the greater is the probability of false positives. Only addition of an element is possible. To add an element, it is fed to n number of hash functions to obtain n array positions and the corresponding value at these positions is set to 1. When checking for membership, the element is again passed through n hash functions to obtain n array positions. If any of the corresponding values that these positions is 0, the element is definitely not present in the dataset. If all are 1, the element might be in the dataset.

  1. HyperLogLog

Count-distinct problem (or cardinality estimation problem) is the problem of finding number of unique elements in a streaming dataset with possibly repeated elements. This problem is a common occurrence in various fields like estimating the number of unique visitors to a website, or motifs in a DNA sequence. Calculating the exact cardinality of a dataset requires an amount of memory proportional to the cardinality – which would require a large amount of memory for large datasets. HyperLogLog solves this problem by providing an estimate of the cardinality instead. It can count up to one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the fact that for a stream of randomly distributed numbers, if there is a number with the maximum of leading 0 bits equal to k, the cardinality of the stream is likely equal to be 2^k. This means that in a stream of random binary numbers, the probability of encountering a “1” in the beginning is ~50%, and of encountering a “01” is ~25%. Thus an observation of “01” in the stream’s start indicates that the cardinality is likely to be 2^2 = 4.

Stay Ahead of the Curve!

Don't miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.

  1. Count–min sketch

The count-min sketch data structure records the frequency of element occurrence in a stream. Unlike a hash table, it trades the ability to obtain a perfect frequency count with using very less (sub-linear) memory space due to over-counting of elements in case there is a hash collision. It is similar to a Bloom filter, which represents the dataset as a bitmap, while Count-min sketch is a two-dimensional array of w columns and d rows – where the w and d parameters can be varied to obtain a desired matching point between accuracy and the probability of that accuracy.

  1. MinHash

Minhash is a technique to determine the level of similarity between two datasets. It finds its applications in document clustering. Minhash uses Locality sensitive hashing (LSH) which involves generating a hash code such that similar items will tend to get similar hash codes. Precomputed LSH are compared to determine if two objects are similar enough to be examined in detail or the comparison should be discarded. Exact implementation details can be found here.

Probabilistic data structures are a very powerful tool while working with big data in a streaming format. The implementation details can be a bit daunting at first but the underlying concepts are very simple. If you haven’t tried them before, head over to Amazon AWS or Microsoft Azure and fire up a VM to hack on. You may want to use Twitter’s Algebird library which has implementations for the first three data structures described above.

 

Like this article? Subscribe to our weekly newsletter to never miss out!

Follow @DataconomyMedia

Tags: analysisBig DataUSA

Related Posts

Samsung adopts ChatGPT Enterprise and Codex across global workforce

Samsung adopts ChatGPT Enterprise and Codex across global workforce

June 22, 2026
OpenAI improves health responses for free ChatGPT users

OpenAI improves health responses for free ChatGPT users

June 19, 2026
What 53,000 Churches Reveal About the Digital Transformation of Faith Communities

What 53,000 Churches Reveal About the Digital Transformation of Faith Communities

June 19, 2026
Xenco Medical wins back-to-back honors with Fast Company’s 2026 World Changing Ideas Award and Time Magazine 2026 Impact Award

Xenco Medical wins back-to-back honors with Fast Company’s 2026 World Changing Ideas Award and Time Magazine 2026 Impact Award

June 17, 2026
Steam Next Fest sees one in five demos labeled for generative AI

Steam Next Fest sees one in five demos labeled for generative AI

June 17, 2026
Anthropic adds multilingual and push-to-talk features to Claude Voice Mode

Anthropic adds multilingual and push-to-talk features to Claude Voice Mode

June 17, 2026
Please login to join discussion

LATEST NEWS

Samsung adopts ChatGPT Enterprise and Codex across global workforce

Samsung Galaxy S27 Pro leak points to built-in Privacy Display

Perseverance rover completes a marathon on Mars

Polymarket accused of paying creators to post misleading TikTok bet videos

OpenAI improves health responses for free ChatGPT users

Adobe expands Firefly AI across Premiere, Illustrator, InDesign and Frame.io

BEST AI MODELS LEADERBOARD

See the best AI models, ranked by intelligence, benchmark results, speed and token price. Find the most suitable LLMs, Text-to-Image, Image Editing, Text-to-Speech, Text-to-Video and Image-to-Video  artificial intelligence model for your tasks and business.

LATEST TOOLS

Moonbeam

Charisma AI

Essay Writer by Papertyper

Slite

Wonderin AI

Spur

Stenography

Calldesk

MaxAI.me

PhotoRestore

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy

Follow Us

  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI tools
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
No Result
View All Result
Subscribe

This website uses cookies to improve your experience. You can choose to accept or reject them. Visit our Privacy Policy.