
How to use ElasticSearch for Natural Language Processing and Text Mining — Part 1

ElasticSearch is a search engine and an analytics platform, but it also offers many features that are useful for standard Natural Language Processing and Text Mining tasks.

1. Preprocessing (Normalization)

Have you ever used the _analyze endpoint?

ElasticSearch ships with more than 20 built-in language analyzers. What does an analyzer do? It tokenizes the text, stems the tokens and removes stopwords.

That is often all the preprocessing you need for higher-level tasks such as machine learning or language modelling.

All you need is a running instance of ElasticSearch, without any additional configuration or setup. You can then use the _analyze endpoint as a REST API for NLP preprocessing.

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "english",
  "text" : "This is a test."
}'

{
  "tokens": [
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
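
If you would rather call the endpoint from code than from the shell, the same request can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not an official client: the function names are made up for illustration, and the URL assumes a local instance without authentication.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local instance, no authentication


def build_analyze_request(text, analyzer="english", base_url=ES_URL):
    """Build the HTTP request for the _analyze endpoint (nothing is sent yet)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response):
    """Return the token strings from a parsed _analyze response, in order."""
    return [entry["token"] for entry in response.get("tokens", [])]


def analyze(text, analyzer="english", base_url=ES_URL):
    """Send the text to ElasticSearch and return the normalized tokens."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))


# Usage (needs a running instance):
#     tokens = analyze("This is a test.")
```

The token list returned by analyze() can be fed straight into whatever comes next in your pipeline, for example a bag-of-words vectorizer.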

The ElasticSearch documentation has a list of all available built-in language analyzers.

2. Language Detection

Language detection is often called a “solved” NLP problem. All you need is a character n-gram language model derived from a relatively small plain-text corpus for each language you want to distinguish.
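
To make the idea concrete, here is a toy sketch of such a model in Python. It is only an illustration of the principle, not how any particular plugin works: the two one-sentence "corpora", the trigram size and the add-one smoothing are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Assumption: tiny hand-written samples; a real model needs far more text per language.
SAMPLES = {
    "en": "this is a simple test of the english language and the quick brown fox jumps over the lazy dog",
    "de": "dies ist ein einfacher test der deutschen sprache und der schnelle braune fuchs springt über den faulen hund",
}


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count character n-grams per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def detect(text, profiles, n=3):
    """Return the language whose profile gives the text the highest log-likelihood."""
    vocabulary = set()
    for profile in profiles.values():
        vocabulary.update(profile)
    grams = char_ngrams(text, n)

    def log_likelihood(profile):
        total = sum(profile.values())
        # Add-one smoothing so unseen n-grams do not zero out the score.
        return sum(math.log((profile[g] + 1) / (total + len(vocabulary))) for g in grams)

    return max(profiles, key=lambda lang: log_likelihood(profiles[lang]))


profiles = train(SAMPLES)
print(detect("the dog is here", profiles))    # en
print(detect("der hund ist hier", profiles))  # de
```

Even with one sentence of training text per language, the character trigrams are distinctive enough to separate these two inputs, which is why a relatively small corpus goes a long way.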

So there is no need to reinvent the wheel over and over.

If you already have ElasticSearch up and running, you can simply install a language detection plugin, which adds a _langdetect endpoint:


curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'
{
  "profile" : "/langdetect/",
  "languages" : [ {
    "language" : "en",
    "probability" : 0.9999971603535163
  } ]
}
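
The response is easy to consume from code as well. The following Python sketch mirrors the curl call above and picks the most probable language from the reply. The function names and the 0.5 probability threshold are illustrative assumptions, and the request format is taken from the example above, so it may differ for other plugin versions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local instance with the plugin installed


def build_langdetect_request(text, base_url=ES_URL):
    """Build the POST request; the raw text is the body, as in the curl call above."""
    # No Content-Type is set here, so urllib falls back to the same
    # form-encoded default that curl -d uses.
    return urllib.request.Request(
        base_url + "/_langdetect",
        data=text.encode("utf-8"),
        method="POST",
    )


def best_language(response, min_probability=0.5):
    """Return the most probable language code, or None if nothing is confident enough."""
    candidates = response.get("languages", [])
    if not candidates:
        return None
    top = max(candidates, key=lambda c: c["probability"])
    return top["language"] if top["probability"] >= min_probability else None


def detect_language(text, base_url=ES_URL):
    """Send the text to the _langdetect endpoint and return the best guess."""
    with urllib.request.urlopen(build_langdetect_request(text, base_url)) as reply:
        return best_language(json.load(reply))


# Usage (needs a running instance with the plugin):
#     language = detect_language("This is a test")
```

Returning None below the threshold lets the caller decide what to do with very short or mixed-language texts instead of silently trusting a weak guess.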

That’s it. It’s open source, free to use and super simple.

“How to use ElasticSearch for Text Mining” originally appeared on the textminers.io blog.


 


Image: born1945, CC 2.0


  • Stas Mossat

    Is that all? How about significant terms and etc?