How to use ElasticSearch for Natural Language Processing and Text Mining — Part 2

by Saskia Vola
May 24, 2017
in Articles, Resources

Welcome to Part 2 of How to use Elasticsearch for Natural Language Processing and Text Mining. It’s been some time since Part 1, so you might want to brush up on the basics before getting started.

This time we'll focus on one very important type of query for text mining. Depending on the data, it can solve at least two different kinds of problems. The magical query I'm referring to is the More Like This query.

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "content", "category", "tags"],
            "like" : "Once upon a time",
            "max_query_terms" : 12
        }
    }
}

Let’s have a look at the basic parameters:

The like parameter can take free text (as in the example above), a set of document IDs from the index you're querying, or external documents.

Now the question: what do you want to compare it to? You can list all the fields that are interesting. Let's assume your dataset consists of news articles.

The relevant fields would be, for example: title, content, category and tags.
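
Instead of free text, you can also pass document IDs in the like parameter. Here is a sketch of such a query; the news index name and the IDs are made up for the example, and older Elasticsearch versions also expect a _type entry in each document reference:

GET /news/_search
{
    "query": {
        "more_like_this": {
            "fields": ["title", "content", "category", "tags"],
            "like": [
                { "_index": "news", "_id": "1" },
                { "_index": "news", "_id": "2" }
            ],
            "max_query_terms": 12
        }
    }
}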

What happens when that query is fired?

Elasticsearch analyses your input text, which comes either from the documents in the index or directly from the like text. It extracts the most important keywords from that text and runs a boolean should query with all of those keywords.

How does it know what a keyword is?

Keywords can be determined with a formula, given a set of documents. The formula compares a subset of the documents to all documents based on word frequencies. It is called TF-IDF, and despite several attempts to find something newer and fancier, it is still a very important formula for text mining.

It assigns a score to each term in the subset compared to the entire corpus of documents.

A high score indicates that a term characterizes the current subset of documents and distinguishes it clearly from all other documents.
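
How is that score computed? In its most common form, the weight of a term t in a document (or subset) d, given a corpus of N documents, is the product of the term frequency and the inverse document frequency:

tf-idf(t, d) = tf(t, d) * log(N / df(t))

Here tf(t, d) is how often t occurs in d, and df(t) is the number of documents that contain t. A term scores high when it is frequent in d but rare in the rest of the corpus. Lucene's similarity implementations add some refinements on top of this, but the core idea is the same.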

If you have a very clean dataset of news articles (to continue with the example), you should easily be able to extract keywords that describe each section: Sports, Culture, Politics and Business.

But if you have to solve a real world Big Data problem, you will probably have a lot of noise in your data: links, words from another language, tags etc. If that "garbage" is not equally distributed, you will have a problem: TF-IDF will give very high scores to all of those rare "mistakes" in your dataset, as they look very unique to the algorithm.

So you need to be aware of this and clean up your dataset.

Anyway, this logic is used under the hood when running a More Like This query.

You can further configure the maximum number of query terms and some frequency cutoffs, which can also help you clean up the input. A sketch follows below.
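
These are the most common tuning knobs; the values here are just illustrative starting points, not recommendations:

GET /news/_search
{
    "query": {
        "more_like_this": {
            "fields": ["title", "content"],
            "like": "Once upon a time",
            "max_query_terms": 25,
            "min_term_freq": 2,
            "min_doc_freq": 5,
            "max_doc_freq": 10000
        }
    }
}

min_term_freq and min_doc_freq drop input terms that are too rare in the input text or in the index, while max_doc_freq drops terms that appear in so many documents that they carry no signal. Together they filter out a lot of the "garbage" mentioned earlier.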

The MLT query will return results most of the time if your document corpus (index) is large enough.

If you don't trust the "magic" query or want to understand why it returns certain hits, you can activate highlighting.

That way you will be able to see the query terms that matched each document.

That's the best introspection you can get; there is no option to return all of the keywords generated from the input document.

To enable highlighting with the More Like This query, you need to configure the mapping for the fields you want to be highlighted.

Just add this to the properties of the field:

"term_vector": "with_positions_offsets"

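In context, the mapping and a highlighted MLT query might look like the following sketch (still assuming a news index with a content field; depending on your Elasticsearch version the mapping may need an extra document type level):

PUT /news
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "term_vector": "with_positions_offsets"
            }
        }
    }
}

GET /news/_search
{
    "query": {
        "more_like_this": {
            "fields": ["content"],
            "like": "Once upon a time"
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}

Each hit then carries a highlight section with the matched query terms wrapped in <em> tags.
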
We talked a lot about the MLT query and maybe you already have a few applications in mind.

3. Recommendation Engine

The most basic text mining application for the MLT query is a recommendation engine.

There are usually two types of recommendation engines: social and content based. A social recommendation engine is also referred to as "collaborative filtering", best known from Amazon's "People who bought this product also bought…"

This works based on the assumption that a user will be interested in what other users with a similar taste liked. You need quite a lot of interaction data for this to work well.

The other type is called an "item based recommendation engine". It tries to group the entries based on their properties. Think of novels or scientific papers as an example.

With Elasticsearch you can easily build an item based recommendation engine.

You just configure the MLT query template based on your data and that’s it. You will use the actual item ID as a starting point and recommend the most similar documents from your index.

You can add custom logic on top of the More Like This query, for example a function score query that boosts by popularity or recency, as in the sketch below.
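
A sketch of that combination; the published_at and popularity fields are assumptions about your mapping, and the decay scale and modifier are values you would tune:

GET /news/_search
{
    "query": {
        "function_score": {
            "query": {
                "more_like_this": {
                    "fields": ["title", "content"],
                    "like": [ { "_index": "news", "_id": "1" } ]
                }
            },
            "functions": [
                {
                    "gauss": {
                        "published_at": { "origin": "now", "scale": "30d" }
                    }
                },
                {
                    "field_value_factor": {
                        "field": "popularity",
                        "modifier": "log1p"
                    }
                }
            ],
            "boost_mode": "multiply"
        }
    }
}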

4. Duplicate Detection

Depending on your dataset, that same MLT query will also return duplicates. If you have data from several sources (news, affiliate ads, etc.), it is pretty likely that you will run into duplicates. For most end user applications this is unwanted behaviour.

But for an expert system you could use this technique to clean up your dataset.

How does it work?

There are always two big problems with duplicate detection:

  • You need to compare all documents pairwise, which is O(n²).
  • The first inspected element will remain and all others will be discarded, so you need a lot of custom logic to make sure the first document you look at is the best one.

As the complexity is very high, you might not want to detect duplicates offline in a batch process, but online, as they are needed.

The industry standard algorithms for duplicate detection are SimHash and MinHash (used by Google and Twitter, for example).

They generate hashes for all documents, store them in an extra datastore and compare them with a similarity function. All document pairs whose similarity exceeds a certain threshold are considered duplicates.

For very short documents you can work with the Levenshtein (minimum edit) distance. But for longer documents you might want to rely on a token based solution.

The More Like This query can help you here, for example:
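
A minimal sketch: feed the document in question into an MLT query and require a large share of its extracted terms to match. minimum_should_match is a regular MLT parameter; the 70% threshold is an assumption you would tune against your own data:

GET /news/_search
{
    "query": {
        "more_like_this": {
            "fields": ["title", "content"],
            "like": [ { "_index": "news", "_id": "1" } ],
            "min_term_freq": 1,
            "minimum_should_match": "70%"
        }
    }
}

Hits that score almost as high as the source document itself are your duplicate candidates.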

I have the next blog post in the works, but don't worry, you'll have enough time to let all of the knowledge in Part 1 and Part 2 sink in. For now, stay tuned for Part 3 of the How to use Elasticsearch for NLP and Text Mining series, where we'll tackle text classification and clustering.