Deduplicating Massive Datasets With Locality Sensitive Hashing
This article is part of a media partnership with PyData Berlin, a group helping support open-source data science libraries and tools. To learn more about this topic, please consider attending our fourth annual PyData Berlin conference on June 30-July 2, 2017. Matti Lyra and other experts will be giving talks on Natural Language Processing, Machine Learning, AI Ethics and many related data fields. You can also find a more detailed blog post on Matti Lyra's personal blog.

Many online platforms that deal with natural language documents face a big problem: thousands of duplicate documents. Duplicates are easily produced in news media when a content producer like Reuters or the Associated Press distributes an article to a number of publishers. Each publisher will typically add a paragraph or a footer, making each copy slightly different from the others. A web crawler that monitors the sites of the individual publishers will see many almost identical articles, as each copy was essentially written by Reuters and only slightly modified by the other publishers.

Quora also recently released the Quora Question Pairs dataset, which contains pairs of user-generated questions, some of which are semantically the same. Here the aim is specifically to find questions that are semantically the same; often those questions are just slightly paraphrased versions of each other.

This article will cover some simple approaches based on character similarity, as well as more advanced hashing techniques, for finding near duplicate documents.

Document Similarity and Duplicates

When we talk about similar documents we usually mean documents that are semantically related, for instance two different news articles about the same event. There are a number of ways for determining the semantic relatedness of documents, for instance Latent Dirichlet Allocation (LDA) or neural language models. Semantic similarity is relatively easy for humans to detect but extremely difficult to do algorithmically, especially for longer documents. This is an active area of research known as distributional semantics and specifically distributional composition. The second and much simpler kind of similarity is character based similarity, which quite simply measures the character overlap between two sentences or documents. Later in the article we’ll see that these two notions of similarity are interlinked in interesting ways.

Paul Jaccard and Roofing

Now that we know where near duplicates come from and how they differ from documents that are semantically related, let’s outline how one could go about detecting them. The method relies on two simple concepts, character shingles and the Jaccard similarity. ‘Shingle’ is a term that comes from roofing and refers to partially overlapping pieces of clay, stone, wood, asphalt or some such bits of roofing material.

Wooden roof shingles on an old Church roof in Finland. By Htm – Own work, CC BY-SA 3.0, Link

The idea behind character shingles is similar: create a document representation from consecutive overlapping character n-grams of each document. “The cat sat on the mat.”, when 4-shingled, becomes ('The ', 'he c', 'e ca', … ,'at.'). Notice that punctuation and whitespace are all part of the process. This representation preserves word order, to some extent, and allows comparing documents based on their sets of character shingles. The similarity of two documents can then simply be defined as the Jaccard similarity of their two sets of shingles: the number of elements (shingles) they have in common as a proportion of the combined size of the two sets, or the size of the intersection divided by the size of the union.

For two dummy sentences let's see how the length of the character shingle affects the similarity of the documents. We'll use Python's matplotlib and seaborn libraries to plot the similarities.
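As a rough sketch of the computation (the helper names and the exact example sentences here are my own, so the similarity values differ slightly from the figures quoted below):

```python
def shingles(text, size):
    """The set of overlapping character n-grams in `text`."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

doc_a = "the cat sat on the mat"
doc_b = "the red cat sat on the mat"

# Similarity drops as the shingle size grows, because longer shingles
# spanning the inserted word "red" no longer match.
for size in (2, 3, 4, 5):
    sim = jaccard(shingles(doc_a, size), shingles(doc_b, size))
    print(size, round(sim, 2))
```

In the original post these similarities are then plotted against the shingle size with matplotlib and seaborn.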

Leaving aside the discussion of whether a red cat sitting on a mat is in fact a “duplicate” of a cat sitting on a mat, what it means for something to be red or indeed a cat, I will declare those two “documents” as duplicates of each other. With a shingle size of 2 we get a similarity of ~0.81 (or 81%) and with a shingle size of 5 we get a similarity of ~0.62 or 62%.

The size of the shingle clearly has an impact on the effectiveness of this approach, one should however bear in mind that the two “documents” are very short especially in comparison to a shingle size of 5. For longer documents that better reflect actual language usage the shingling will produce much more reasonable outputs.

Where Character Similarity Fails

It is not difficult to come up with sentences that are semantically the same but share only a small proportion of their shingles, for example: “What’s the flight time from Berlin to Helsinki” and “How long does it take to fly from Berlin to Helsinki” are semantically exactly the same but have very few words or character n-grams in common. On the other hand “What’s the flight time from Berlin to Helsinki” and “What’s the flight time from Berlin to Oulu” are semantically not the same but have a large character overlap.

These two are again simple example sentences, but it is important to understand where the limits of any particular method or technology lie. We already know that the document and shingle length affect the success of this method, suggesting that it might not work so well for short one-sentence documents, for instance tweets. Equally, it's unlikely to work all that well for rephrased sentences or documents, as the semantics of rephrased or summarised information should not change but the character representation will. This should all be fine, however, as we've already defined the task to be about finding near duplicate documents, not semantically similar ones; for collections with longer documents this method should work very well.

A Real World Example – Deduplicating the Reuters RCV1 corpus

The Reuters Corpus, Volume 1 (RCV1) corpus is a commonly used resource for various NLP tasks, especially document classification. It was made available in 2000 by Reuters Ltd. and consists of ~800,000 English-language news stories collected between August 20th 1996 and August 19th 1997 from the Reuters news wire.

I’ve preprocessed the corpus so that it is all in a single file, one line per document. Each line has the format:


Some duplicate items are present in the corpus, so let's see what happens when we apply the shingling approach to it. There is a helper function that creates the document shingles and measures the Jaccard similarity. Let's first try this pure Python implementation on the first 500 documents.
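The helper itself isn't reproduced in this excerpt, but a brute-force version along these lines (all names are my own) captures the idea: shingle every document, then Jaccard-compare every pair:

```python
from itertools import combinations

def shingles(text, size=4):
    """The set of overlapping character n-grams in `text`."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def find_duplicates(docs, threshold=0.9, size=4):
    """Naive O(n^2) deduplication: compare every pair of documents."""
    sets = [shingles(d, size) for d in docs]
    duplicates = []
    for i, j in combinations(range(len(docs)), 2):
        sim = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
        if sim >= threshold:
            duplicates.append((i, j, sim))
    return duplicates
```

For 500 documents this is already 500 * 499 / 2 = 124,750 set comparisons, which is slow in pure Python.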

If we look at the documents themselves we can easily see how accurately the algorithm is detecting duplicate documents. Let's see what documents 2 and 3 look like. We don't really care what the first 6 characters are, as those are just the document ID and a <TAB>.

Judging by the first 350 characters of the articles it would seem that the documents are indeed exact duplicates. Let’s see what near duplicates look like, this time we’ll take documents 25 and 77 which have been assigned a similarity score of 0.80 or 80%.

There are more differences between the two, but they are essentially talking about the exact same thing. A few sentences have been paraphrased but otherwise the documents look identical. So the method seems to be working.

Approximately 7 percent of the first 500 documents are in fact duplicates, which translates to about 56,000 duplicate documents in the relatively small 800,000 document dataset overall. The problem is that finding those duplicates took quite a long time: computing the Jaccard similarity of the documents requires comparing every document to every other document, and this approach is clearly not scalable. This is where Locality Sensitive Hashing with minhash comes in.

Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a generic hashing technique that aims, as the name suggests, to preserve the local relations of the data while significantly reducing the dimensionality of the dataset. It can be used for approximating the Jaccard similarity of sets as well as the cosine similarity of vectors, depending on exactly which hash function is selected.

As it stands, each document is shingled into some number of shingles, the exact number depending on the length of the document and the size of the shingle. Each document therefore has an unpredictable memory footprint; ideally we'd have a document representation whose size is independent of the length of the document, without changing the semantics of document similarity. This is where minhash comes in: it's a specific hash function that has some desirable properties for this use case. Namely, it turns out that the probability of a hash collision for a minhash is exactly the Jaccard similarity of the two sets.
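A sketch of the idea, using random universal hash functions in place of explicit permutations (the usual trick; all names here are my own, not the article's implementation):

```python
import random

def minhash_signature(shingle_set, num_perm=128, seed=1):
    """Fixed-length minhash fingerprint of a set.

    Each slot holds the minimum hash of the set under a different random
    hash function h(x) = (a*x + b) mod p, a cheap stand-in for a random
    permutation of the shingle space.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_perm)]
    hashes = [hash(s) for s in shingle_set]
    return [min((a * h + b) % p for h in hashes) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing slots estimates the Jaccard similarity,
    since each slot collides with probability equal to that similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two identical sets always produce identical signatures, and the estimate for dissimilar sets concentrates around the true Jaccard similarity as `num_perm` grows.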

Using minhash we can create a fixed-length fingerprint for each document, where each item in the fingerprint is the minimum hash value under a different random permutation of the shingle space. The longer the fingerprint, the higher the likelihood that duplicate documents have a hash collision for at least one of the permutations. You're not guaranteed to get a collision, but you get to control the memory requirements of your document deduplication algorithm. The graph below shows the relation between the actual Jaccard similarity of a pair of documents and the probability that it will be discovered, for a few different parameter settings of LSH.
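The shape of that curve comes from the standard banding scheme: split the fingerprint into b bands of r rows each and call a pair a candidate when all rows of at least one band collide. A minimal sketch, assuming that scheme:

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with true Jaccard similarity `s` becomes a
    candidate under LSH banding: P = 1 - (1 - s**rows)**bands."""
    return 1 - (1 - s ** rows) ** bands

# With 20 bands of 5 rows (a 100-slot fingerprint), highly similar pairs
# are almost always found while dissimilar ones are almost always skipped.
for s in (0.2, 0.5, 0.8, 0.9):
    print(s, round(candidate_probability(s, bands=20, rows=5), 3))
```

Varying `bands` and `rows` shifts the threshold of the S-curve, which is what the different parameter settings in the plot correspond to.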

I've done a detailed comparison of LSH against an exact algorithm on a sample of 1,000 documents from RCV1 and found that out of 27 pairs with similarity greater than 90%, all 27 were found, that is 100.0%. In terms of false positives, documents flagged as duplicates that are not in fact duplicates, only 84 were found. That's 84 documents that were checked in vain, in comparison to the 499,500 pairs we would have had to go through for an all-pairs comparison. I'll gladly take those odds.

As we can see, LSH with minhash is a great tool for finding candidate pairs of near duplicate documents. The method scales to large quantities of data and, best of all, the slow operations are easy to parallelise.

The naive pure Python implementation is quite slow and not really usable in production. I've made a much more robust implementation that utilizes Cython and MurmurHash for very fast and memory-efficient creation of the document fingerprints; the implementation is freely available on GitHub.

