Welcome to Part 2 of How to use Elasticsearch for Natural Language Processing and Text Mining. It’s been some time since Part 1, so you might want to brush up on the basics before getting started.
This time we’ll focus on one very important type of query for Text Mining. Depending on the data it can solve at least 2 different kinds of problems. This magical query I’m referring to is the More Like This Query.
GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "content", "category", "tags"],
      "like": "Once upon a time",
      "max_query_terms": 12
    }
  }
}
Let’s have a look at the basic parameters:
The query takes its input either from a set of IDs of documents that are already in the index you’re querying or from an external text passed in directly (“like”).
Next question: which fields do you want to compare? You can list every field of interest in “fields”. Let’s assume your dataset consists of news articles.
The relevant fields would then be, for example: title, content, category, tags.
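As a rough sketch of the ID-based variant, the request body can be built as a plain Python dictionary. The index name `news` and the document ID `42` are made up for illustration, and the exact shape of a document reference inside `like` may differ between Elasticsearch versions.

```python
import json

def mlt_by_ids(index, doc_ids, fields, max_query_terms=12):
    """Build a More Like This body that uses already indexed documents as input."""
    # Each entry points at a document in the index instead of passing raw text.
    like = [{"_index": index, "_id": doc_id} for doc_id in doc_ids]
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": like,
                "max_query_terms": max_query_terms,
            }
        }
    }

# "news" and "42" are hypothetical values.
body = mlt_by_ids("news", ["42"], ["title", "content", "category", "tags"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a GET /news/_search request.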
What happens when that query is fired?
It will analyse your input text that comes either from the documents in the index or directly from the like text. It will extract the most important keywords from that text and run a Boolean Should query with all those keywords.
How does it know what a keyword is?
Keywords can be determined with a formula, given a set of documents. The formula compares a subset of the documents to all documents based on word probabilities. It is called Tf-Idf (term frequency, inverse document frequency), and despite several attempts to find something new and fancy it is still a very important formula for text mining.
It assigns a score to each term in the subset compared to the entire corpus of documents.
A high score indicates that it is more likely that a term identifies or characterizes the current subset of documents and distinguishes it clearly from all other documents.
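To make the idea concrete, here is a minimal sketch of Tf-Idf keyword extraction followed by a Boolean Should query built from the top terms. It assumes a naive whitespace tokenizer and one common smoothed variant of the formula; the scoring Elasticsearch uses internally differs in detail, and the three example sentences are invented.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real analyzer also strips punctuation, stems, etc.
    return text.lower().split()

def tf_idf(doc, corpus):
    """Score every term of `doc` against all texts in `corpus`."""
    tf = Counter(tokenize(doc))
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))  # count each term once per document
    n_docs = len(corpus)
    # Smoothed idf: a term that occurs in every document scores 0.
    return {
        term: count * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }

def top_keywords(doc, corpus, k=3):
    scores = tf_idf(doc, corpus)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

def should_query(field, keywords):
    # One optional clause per keyword, similar in spirit to what MLT generates.
    return {"bool": {"should": [{"term": {field: kw}} for kw in keywords]}}

corpus = [
    "the striker scored a late goal",
    "the parliament passed a budget",
    "the museum opened a new exhibition",
]
keywords = top_keywords(corpus[0], corpus)
query = should_query("content", keywords)
print(keywords)
```

Words such as "the" and "a" appear in every document, so they score 0 and never make it into the query.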
If you have a very clean dataset of — let’s continue with the example — news articles, you should easily be able to extract keywords that describe each section: Sports, Culture, Politics and Business.
But if you have to solve a real-world Big Data problem, you will probably have a lot of noise in your data: links, words from another language, tags etc. If that “garbage” is not equally distributed you will have a problem. Tf-Idf will give all those rare “mistakes” in your dataset a very high score, as they look very unique to the algorithm.
So you need to be aware of this and clean up your dataset.
Anyway. This logic is used under the hood when running a More Like This Query.
You can further configure the maximum number of query terms and some frequency cutoffs that can also help you with cleaning up the input.
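A possible configuration with such cutoffs might look like the sketch below. The parameter names are the ones the More Like This query documents; the concrete values and the stop word list are assumptions you would tune for your own data.

```python
import json

mlt = {
    "fields": ["title", "content", "category", "tags"],
    "like": "Once upon a time",
    "max_query_terms": 12,    # keep only the 12 highest scoring terms
    "min_term_freq": 2,       # ignore terms occurring less than twice in the input
    "min_doc_freq": 5,        # ignore terms found in fewer than 5 documents (rare noise)
    "max_doc_freq": 100000,   # ignore terms found in too many documents
    "min_word_length": 3,     # ignore very short tokens
    "stop_words": ["http", "https", "www"],  # hypothetical list of link debris
}
body = {"query": {"more_like_this": mlt}}
print(json.dumps(body, indent=2))
```

Raising min_doc_freq is the cutoff that helps most against the rare “garbage” terms described above, because those terms appear in only a handful of documents.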
The MLT query will return results most of the time if your document corpus (index) is large enough.
If you don’t trust the “magic” query or want to understand why it returns certain hits you can activate highlighting.
So you will be able to see the query terms that matched the documents.
That’s the best you can get. There is no option to return all the generated keywords from the input document.
To enable highlighting with the More Like This query you need to configure your mapping for the fields you want to be highlighted.
Just add this to the properties of the field:
"term_vector" : "with_positions_offsets"
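Put together, the mapping and a search request with highlighting could look roughly like this. The field name `content` is an assumption, and so is the field type: `text` is used in newer Elasticsearch versions, while older ones used `string`.

```python
import json

# Mapping fragment: store term vectors so the highlighter can use them.
mapping = {
    "properties": {
        "content": {
            "type": "text",  # assumption, depends on your Elasticsearch version
            "term_vector": "with_positions_offsets",
        }
    }
}

# Search body: the MLT query plus a highlight section for the same field.
search_body = {
    "query": {
        "more_like_this": {
            "fields": ["content"],
            "like": "Once upon a time",
            "max_query_terms": 12,
        }
    },
    "highlight": {"fields": {"content": {}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

Each hit in the response then carries a highlight section with the matched terms marked up.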
We talked a lot about the MLT query and maybe you already have a few applications in mind.
3. Recommendation Engine
The most basic text mining application for the MLT query is a recommendation engine.
There are usually 2 types of recommendation engines: social and content based. A social recommendation engine is also referred to as “Collaborative Filtering”, mostly known from Amazon’s “People who bought this product also bought…”
This works based on the assumption that a user will be interested in what other users with a similar taste liked. You need quite a lot of interaction data for this to work well.
The other type, the content based one, is also called an “item based recommendation engine”. It groups the entries of a dataset based on their properties. Think of novels or scientific papers as an example.
With Elasticsearch you can easily build an item based recommendation engine.
You just configure the MLT query template based on your data and that’s it. You use the ID of the current item as a starting point and recommend the most similar documents from your index.
You can add custom logic by running a bool query that combines the More Like This query with a function score query, for example to boost by popularity or recency.
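One possible arrangement is sketched below. It is a simpler variant that wraps the More Like This query directly in a function score query instead of going through a bool query. The index name `news`, the item ID and the fields `popularity` and `published_at` are hypothetical, as are the scale and decay values.

```python
import json

def recommend_query(index, item_id, fields):
    """Similar items first, nudged towards popular and recent documents."""
    return {
        "query": {
            "function_score": {
                # Base relevance comes from the More Like This query.
                "query": {
                    "more_like_this": {
                        "fields": fields,
                        "like": [{"_index": index, "_id": item_id}],
                        "max_query_terms": 12,
                    }
                },
                "functions": [
                    # Popularity boost; "popularity" is an assumed numeric field.
                    {"field_value_factor": {"field": "popularity", "modifier": "log1p", "missing": 0}},
                    # Recency boost; "published_at" is an assumed date field.
                    {"gauss": {"published_at": {"origin": "now", "scale": "30d", "decay": 0.5}}},
                ],
                "score_mode": "sum",       # add the two function values together
                "boost_mode": "multiply",  # then multiply with the MLT score
            }
        }
    }

body = recommend_query("news", "42", ["title", "content"])
print(json.dumps(body, indent=2))
```

With boost_mode set to multiply, similarity stays the main signal and the functions only reorder documents that already match.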
4. Duplicate Detection
Depending on your dataset that same MLT query will return all duplicates. If you have data from several sources (news, affiliate ads, etc.) you are pretty likely to run into duplicates. For most end user applications this is unwanted behaviour.
But for an expert system you could use this technique to clean up your dataset.
How does it work?
There are always 2 big problems with duplicate detection:
You need to compare all documents pairwise (O(n²))
The first inspected element will remain, all others will be discarded
So you need a lot of custom logic to choose the first document to look at, because that is the one that remains. It should be the best one.
As the complexity is very high, you might not want to detect duplicates offline in a batch process, but online, as they are needed.
The industry standard algorithms for duplicate detection are Simhash and Minhash (used by Google and Twitter, for example).
They generate hashes for all documents, store them in an extra datastore and use a similarity function. All documents that exceed a certain threshold are considered duplicates.
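The hashing idea can be illustrated with a small Minhash sketch. It is a toy version under several assumptions: word 3-grams as shingles, 64 seeded MD5 hashes per signature, signatures kept in memory instead of an extra datastore, and a threshold of 0.8 picked arbitrarily.

```python
import hashlib

def shingles(text, size=3):
    """Split a text into overlapping word n-grams."""
    tokens = text.lower().split()
    if len(tokens) < size:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function keep the smallest hash over all shingles."""
    parts = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{part}".encode("utf-8")).hexdigest(), 16)
            for part in parts
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    # The share of matching positions approximates the Jaccard similarity.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def is_duplicate(text_a, text_b, threshold=0.8):
    sim = estimated_similarity(minhash_signature(text_a), minhash_signature(text_b))
    return sim >= threshold

article = "the central bank raised interest rates by half a percentage point on thursday"
other = "a local bakery won the regional prize for its sourdough bread this weekend"
print(is_duplicate(article, article), is_duplicate(article, other))
```

In a real setup the signatures would be computed once per document and stored, so that only the short signatures have to be compared.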
For very short documents you can work with the Levenshtein distance or Minimum Edit Distance. But for longer documents you might want to rely on a token based solution.
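For the short document case, a plain Levenshtein distance is only a few lines of Python. This is the textbook dynamic programming version, shown as an illustration; the two titles and the cutoff you would apply to the distance are assumptions that depend on your data.

```python
def levenshtein(a, b):
    """Minimum number of single character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

# Two hypothetical titles that differ by one inserted character.
distance = levenshtein("Elasticsearch 2.0 released", "Elasticsearch 2.0 released!")
print(distance)
```

The running time grows with the product of both lengths, which is why this only makes sense for very short documents such as titles.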
The more like this query can help you here.
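One way to use it for this purpose, sketched under assumptions, is to require that most of the extracted terms match, so that only near-identical documents come back. The 90% value, the index name and the item ID are made up, and you would still have to pick a score cutoff by inspecting your own results.

```python
import json

def duplicate_candidates_query(index, doc_id, fields, min_match="90%"):
    """MLT body that only returns documents sharing most of the extracted terms."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,    # also consider terms that occur only once
                "max_query_terms": 25,
                "minimum_should_match": min_match,  # assumed value, tune per dataset
            }
        }
    }

body = duplicate_candidates_query("news", "42", ["title", "content"])
print(json.dumps(body, indent=2))
```

Because the query runs per document at request time, this fits the online approach described above rather than a pairwise batch comparison.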
I have the next blog post in the works, but don’t worry, you’ll have enough time to let all of the knowledge in Part 1 and Part 2 sink in. For now, stay tuned for Part 3 of the How to use Elasticsearch for NLP and Text Mining series, where we’ll tackle Text Classification and Clustering.