According to his LinkedIn profile, Robert Muir is a Mongolia-based Ghostbuster for Elasticsearch. Any activities involving the elimination of supernatural entities aside, what we do know is that his work at Elasticsearch involves working on Apache Lucene and improving its reliability. He’s also an Apache Lucene committer; Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. We caught up with Robert at Berlin Buzzwords to discuss his work, how people are using Lucene and what we can expect from Lucene in the future. Sadly, there was no talk of ghosts.

Tell us a little bit about yourself and your work.

My name is Robert Muir and I’ve been an Apache Lucene committer for five years now. I work for Elasticsearch; I’m a developer there and I mostly work on Lucene.

The talk you gave today was on the new features of Apache Lucene. Do you want to give us a brief overview of that?

Essentially, Lucene has grown a lot since Lucene 4. It’s more than just a core indexing library. We have the features that people, driven by Google, have come to expect of search engines, like auto-suggest, highlighting and faceting. So in Lucene 4, we have all of this other stuff you need around search as part of the library. The idea of this talk is that you can go to the store and buy Lucene in Action and get a good description that’s maybe three or four years out of date, but it won’t tell you about all these cool features that you need to deal with. Like auto-suggest: you need it, because all users expect it. So the talk was to give people an idea of how it works in Lucene 4. A sort of up-to-date overview.

There have been a lot of talks about search engines this year; search seems to be the buzzword of this year’s Berlin Buzzwords. What is it about Lucene that you think makes it stand out?

I think the first thing that people are attracted to when they use Lucene is that it’s fast, much faster than you would expect. Maybe because it’s Java code, they expect it to be much slower than it is. It’s usually much faster than a database for a lot of the types of queries that users want to do these days. I think a part of dealing with lots of data is that you can’t deal with it all at once. So search is a more natural fit here, because you’re just saying, I want to look at the most relevant stuff because I can’t look at all of it.

One of the things that stood out to me during the talk was how customizable Lucene is. How important is the customization when you are developing Lucene? Is that one of the main priorities?

Lucene began as an API, which is different from, say, an Oracle database, where you have a server. Because of that, I think customization has always been a high priority. It’s built for just that. It’s built for when you want to embed search somewhere to do something custom. If you want to have something more out-of-the-box, you can get Solr or Elasticsearch, which are the server versions. We just make the customizable low-level engine and people use it in radically different ways for different purposes. So it’s definitely a huge priority.

Are there any particular use cases of Lucene that you find particularly interesting?

At Elasticsearch, we see a lot of people using it for log analysis. We see a lot of people doing stuff that’s more like analytics. And I think it’s really interesting because I just never thought about using Lucene for that, but it works pretty well and it solves a lot of real-world needs. I mean, I think we could probably make some improvements. We see these use cases and, as developers, we haven’t tuned it for that or thought about it. So it’s cool for that reason.

Can you tell us a little bit more as well about your work with Elasticsearch?

I just started working there about a month or two ago, and basically I work on Lucene. The first thing we did was work on improving the reliability of Lucene. It’s not that Lucene had bugs, but we just didn’t have features that you would expect a data store to have. And these features are things like adding detection of errors to improve reliability. You’ve got systems like Solr and Elasticsearch taking Lucene indexes and sending them around on the network, so we need to detect when something goes wrong. So we added file checksumming, for example, to Lucene. That’s one of the first things I did. I think we improved the robustness a lot just with that change. It’s changes like that which make working on Lucene exciting.
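For readers unfamiliar with the idea, here is a minimal sketch of file checksumming in general, not Lucene’s actual code or file format. It assumes a CRC32 checksum appended as a 4-byte footer; the function names `add_checksum` and `verify_checksum` are hypothetical and chosen only for illustration.

```python
import struct
import zlib

# Assumed layout for this sketch: payload followed by a 4-byte big-endian CRC32.
FOOTER = struct.Struct(">I")


def add_checksum(payload: bytes) -> bytes:
    """Return the payload with a CRC32 footer appended (hypothetical helper)."""
    return payload + FOOTER.pack(zlib.crc32(payload) & 0xFFFFFFFF)


def verify_checksum(blob: bytes) -> bytes:
    """Return the payload if the footer matches, otherwise raise ValueError."""
    if len(blob) < FOOTER.size:
        raise ValueError("blob too short to contain a checksum footer")
    payload, footer = blob[:-FOOTER.size], blob[-FOOTER.size:]
    (expected,) = FOOTER.unpack(footer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: data was corrupted")
    return payload
```

A receiver that gets a file copied over the network would call `verify_checksum` before trusting the bytes, so a flipped bit surfaces as an error instead of silently corrupting the index.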

What are you working on for the future of Lucene?

I can tell you what we’re working on right now, because we don’t really have a good idea of what’s coming. It’s open source, so it’s all up in the air. Currently I’m working on improving the way queries execute and, long-term, hopefully the way they work with positions, to have more power, more flexibility and greater speed. So hopefully this is something we’ll fix this year.

Big data has gained a huge amount of momentum and hype over the past couple of years. Where do you think this is headed?

There’s more and more information and we’re getting overloaded by it. I think search plays an important role here, as it allows you to sift through everything and find the needle in a haystack. As we’re drowning in data, I think improving the quality, performance and usability of search is really important.


Elasticsearch is a real-time search server based on Lucene, with high availability and multi-tenancy. Together with Logstash and Kibana, it forms an end-to-end “ELK” stack that delivers actionable insights in real time from almost any type of structured or unstructured data source.


(Image credit: Apache Lucene)
