How to use ElasticSearch for Natural Language Processing and Text Mining — Part 2

by Saskia Vola
May 24, 2017
in Articles, Resources

Welcome to Part 2 of How to use Elasticsearch for Natural Language Processing and Text Mining. It’s been some time since Part 1, so you might want to brush up on the basics before getting started.

This time we’ll focus on one very important type of query for Text Mining. Depending on your data, it can solve at least two different kinds of problems. The magical query I’m referring to is the More Like This query.

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "content", "category", "tags"],
            "like" : "Once upon a time",
            "max_query_terms" : 12
        }
    }
}

Let’s have a look at the basic parameters:


The “like” parameter can take either a set of document IDs from the index you’re querying or the text of an external document.

Now the question: what do you want to compare it to? You can list all the fields that are interesting. Let’s assume your dataset consists of news articles.

The relevant fields might be, for example: title, content, category and tags.
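
For example, here is a sketch of the same query pointed at a document that already lives in the index instead of free text (the index name news and the ID are just placeholders, and the syntax shown is for recent Elasticsearch versions):

GET /news/_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "content", "category", "tags"],
            "like" : [
                { "_index" : "news", "_id" : "1" }
            ],
            "max_query_terms" : 12
        }
    }
}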

What happens when that query is fired?

It will analyse your input text, which comes either from the referenced documents in the index or directly from the like text. It will then extract the most important keywords from that text and run a boolean should query with all of those keywords.

How does it know what a keyword is?

Keywords can be determined with a formula, given a set of documents. The formula compares a subset of the documents to all documents based on word frequencies. It is called Tf-Idf, and despite several attempts to find something new and fancy, it is still a very important formula for Text Mining.
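
A common formulation (one of several variants) scores a term t roughly as

tf-idf(t) = tf(t) × log(N / df(t))

where tf(t) is how often the term occurs in the subset you are looking at, df(t) is the number of documents in the corpus that contain it, and N is the total number of documents.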

It assigns a score to each term in the subset compared to the entire corpus of documents.

A high score indicates that it is more likely that a term identifies or characterizes the current subset of documents and distinguishes it clearly from all other documents.

If you have a very clean dataset of news articles (to continue with the example), you should easily be able to extract keywords that describe each section: Sports, Culture, Politics and Business.

But if you have to solve a real-world Big Data problem, you will probably have a lot of noise in your data: links, words from another language, tags etc. If that “garbage” is not evenly distributed, you will have a problem: Tf-Idf will give very high scores to those rare “mistakes” in your dataset, because they look very unique to the algorithm.

So you need to be aware of this and clean up your dataset.

Anyway, this is the logic used under the hood when running a More Like This query.

You can further configure the maximum number of query terms and some frequency cutoffs that can also help you with cleaning up the input.
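
As an illustration, the relevant knobs look like this (the thresholds below are made up and depend entirely on your data):

"more_like_this" : {
    "fields" : ["title", "content"],
    "like" : "Once upon a time",
    "max_query_terms" : 25,
    "min_term_freq" : 2,
    "min_doc_freq" : 5,
    "max_doc_freq" : 10000
}

min_term_freq and min_doc_freq drop terms that are too rare to be reliable keywords, while max_doc_freq drops terms that appear in so many documents that they carry no signal.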

The MLT query will return results most of the time if your document corpus (index) is large enough.

If you don’t trust the “magic” query or want to understand why it returns certain hits you can activate highlighting.

So you will be able to see the query terms that matched the documents.

That’s the best you can get. There is no option to return all the generated keywords from the input document.

To enable highlighting with the More Like This query you need to configure your mapping for the fields you want to be highlighted.

Just add this to the properties of the field:

"term_vector" : " with_positions_offsets"

We talked a lot about the MLT query and maybe you already have a few applications in mind.

3. Recommendation Engine

The most basic Text Mining application for the MLT query is a recommendation engine.

There are usually two types of recommendation engines: social and content-based. A social recommendation engine is also referred to as “Collaborative Filtering”, best known from Amazon’s “People who bought this product also bought…”


This works based on the assumption that a user will be interested in what other users with a similar taste liked. You need quite a lot of interaction data for this to work well.

The other type is the item-based recommendation engine. It tries to group entries based on their properties. Think of novels or scientific papers as an example.

With Elasticsearch you can easily build an item based recommendation engine.

You just configure the MLT query template based on your data and that’s it. You will use the actual item ID as a starting point and recommend the most similar documents from your index.

You can add custom logic by combining the more like this query with a function score query (inside a bool query, for example) to boost by popularity or recency.
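
One way to sketch that combination (the popularity and published_at fields are assumptions about your mapping):

GET /news/_search
{
    "query": {
        "function_score": {
            "query": {
                "more_like_this" : {
                    "fields" : ["title", "content"],
                    "like" : [ { "_index" : "news", "_id" : "1" } ]
                }
            },
            "functions": [
                { "field_value_factor": { "field": "popularity", "modifier": "log1p" } },
                { "gauss": { "published_at": { "origin": "now", "scale": "30d" } } }
            ],
            "boost_mode": "multiply"
        }
    }
}

The more like this part finds similar articles, the field_value_factor function boosts popular ones and the gauss decay favours recent ones.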

4. Duplicate Detection

Depending on your dataset that same MLT query will return all duplicates. If you have data from several sources (news, affiliate ads, etc.) it is pretty likely to run into duplicates. For most end user applications this is unwanted behaviour.

But for an expert system you could use this technique to clean up your dataset.

How does it work?

There are always two big problems with duplicate detection:

1. You need to compare all documents pairwise (O(n²)).
2. The first inspected element will remain; all others will be discarded.

So you need a lot of custom logic to choose the first document to look at. It should be the best one.

As the complexity is very high, you might not want to detect duplicates offline in a batch process but rather online, as they are needed.

The industry-standard algorithms for duplicate detection are Simhash and Minhash (used, for example, by Google and Twitter).

They generate hashes for all documents, store them in an extra datastore and use a similarity function. All document pairs whose similarity exceeds a certain threshold are considered duplicates.

For very short documents you can work with the Levenshtein distance or Minimum Edit Distance. But for longer documents you might want to rely on a token-based solution.

The more like this query can help you here.
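
A hedged sketch of that idea: raise minimum_should_match so a hit only counts when most of the extracted keywords match, and treat those hits as near-duplicate candidates (the 80% threshold is an assumption you would have to tune):

GET /news/_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "content"],
            "like" : [ { "_index" : "news", "_id" : "1" } ],
            "max_query_terms" : 50,
            "minimum_should_match" : "80%"
        }
    }
}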

I have the next blog post in the works, but don’t worry, you’ll have enough time to let all of the knowledge in Part 1 and Part 2 sink in. For now, stay tuned for Part 3 of the How to use Elasticsearch for NLP and Text Mining series, where we’ll tackle Text Classification and Clustering.

 

