
Chapter 10. Tika and the Lucene search stack

 

This chapter covers

  • The Apache Lucene ecosystem and the family of search projects that grew out of it
  • Tika's role as the "mortar" binding ManifoldCF, Open Relevance, Lucene, Solr, Nutch, Droids, and Mahout together

We're going to take a break from our in-depth tour of the Tika framework. By now, those topics should be second nature to you. But you may not be so comfortable with names like Mahout, Droids, or (eep!) Open Relevance.

Though these terms might sound foreign, they’re common terminology to those familiar with the Apache Lucene[1] family of search-related applications. Lucene is an Apache Top Level Project, or TLP, originally home to a number of search-related software products that themselves have grown to TLP-level status, including Tika.

1 The name Lucene was Doug Cutting’s wife’s middle name, and her maternal grandmother’s first name as detailed at http://mng.bz/XyTG.

It’s our job in this chapter to educate you about these projects, and frame your understanding of Tika’s usefulness and relationship to this family of software applications. We’ll keep it high-level, focusing more on the architecture and less on the actual implementations. Those are dutifully covered in other fine Manning books.[2]

2 Specifically, we encourage you to check out Mahout in Action, Lucene in Action (1st and 2nd editions), and Solr in Action, because they cover Tika in some form and will help as a supplement to this book.

View From the Top An Apache Top Level Project (TLP) signifies a level of maturity for a particular software product. TLP indicates that the project has attracted a diverse base of committers, across multiple organizations; made frequent software releases under the Apache license, adhering to Apache standards in terms of dependent libraries, attribution, and legal protection; and demonstrated the ability to self-govern, elect new committers, and effectively manage itself. Tika reached this tremendous milestone on April 21, 2010.

10.1. Load-bearing walls

We’ll begin by explaining the high-level diagram shown in figure 10.1, indicative of the rich and blossoming Lucene ecosystem.

Figure 10.1. The Apache Lucene ecosystem and its family of software products. Some of the software products (such as Mahout, Nutch, Tika, and Hadoop) have graduated to form their own software ecosystems, but they all originated from Lucene. Tika is the third dimension in the stack, servicing each layer in some form or fashion.

Each of the boxes shown in the diagram represents a current Apache Lucene subproject, or Apache Top Level Project, with its own diverse community, release cycle, and set of software products released under its umbrella.

The diagram is layered to demonstrate the architectural properties of the system. In traditional software architecture, the layered architectural style has the following characteristics:

  • Each layer represents some component (or set of related components) providing computation and functionality.
  • Communication may occur intralayer (between components in the same layer) or interlayer (between two adjacent layers).
  • Interlayer communication may only occur between adjacent layers, originating from the top layer (the service consumer) and responded to by the adjacent bottom layer (the service provider).
  • Layers at the bottom of the architecture have little abstraction: they provide the core functionality upon which all upper layers rely.

Layers at upper levels of the architecture have increasing levels of abstraction, relying on the services provided by the directly adjacent service-provider layer below them. Some layers are cross-cutting (Hadoop/HBase and Solr) and are shown spanning multiple levels in the architecture. In addition, Tika is shown as the three-dimensional layer, since its applicability spans each of the service-consuming and service-providing layers in the Lucene ecosystem.

The technologies at the lower portion of the stack in figure 10.1 form the load-bearing walls on which the rest of the ecosystem stands tall. In this section, we’ll restrict our focus to ManifoldCF and Open Relevance and their relationship to Tika. As can be seen from the diagram, even though ManifoldCF and Open Relevance form the load-bearing walls, there’s still room for some Tika “mortar” to hold those walls together!

10.1.1. ManifoldCF

The Apache Manifold Connectors (or ManifoldCF) project[3] is an Apache Incubator podling focused on building connections between external enterprise document repositories (for example, SharePoint, Documentum, and so on) and higher-level content technologies such as Apache Solr (which we’ll talk about in section 10.2.2). ManifoldCF was originally conceived and implemented as a closed source set of software made available by the MetaCarta company, but was donated to the Apache Software Foundation in January 2010. The home page for the project is shown in figure 10.2.

3 For more information see http://incubator.apache.org/connectors/ or check out ManifoldCF in Action at http://manning.com/wright/.

Figure 10.2. The Apache ManifoldCF home page from the Apache Incubator

Peas in a Pod An Apache Incubator podling is a project not yet fully endorsed by the Apache Software Foundation. All projects enter Apache through the Apache Incubator, a super-project whose sole responsibility is to guide new podlings through the ins and outs of Apache. Specifically, the goal is to attract a diverse set of committers, encourage frequent releases under the Apache license, and to move toward the ability to self-manage and self-govern. It should be no surprise that the next step for projects after graduating from the Incubator is Apache TLP status.

The project originally entered the Apache Incubator under the title Lucene Connectors Framework, or LCF, but was later renamed ManifoldCF to avoid confusion with other Apache connector-related products, including the Apache Tomcat connector framework.

The main goal of ManifoldCF is to make it easy to connect to existing enterprise-level document sources, including Microsoft SharePoint, EMC Documentum, and Windows file shares, to name a few. Once connected, ManifoldCF extracts content from those sources and provides a set of tools that make the extracted content easy to send to output sinks, with a specific focus on Lucene and Solr. In addition, ManifoldCF extracts security-related information and passes it along to the Lucene and Solr index for use in downstream policies.

There has been some light discussion[4] in the ManifoldCF community about using Tika’s MIME detection capabilities (recall chapter 4) to identify content as it travels from input source to output sink, but nothing beyond discussion has materialized to date. Beyond identifying content, Tika may also prove useful in ManifoldCF as an output content transformer, extracting information from content traveling across the wire and making that extracted information easily available to the ManifoldCF framework.

4 See, in particular, http://mng.bz/4018.

We’ll see in section 10.2.2 how ManifoldCF currently integrates Tika via Apache Solr’s ExtractingRequestHandler, more commonly known as SolrCell. ManifoldCF sends output content as Document constructs directly to SolrCell, which then uses Tika to parse out metadata and text to send to Apache Solr.

Let's take a look at the Open Relevance project (ORP) next. Open Relevance is a community of volunteers whose goal is to make large document collections easily available for analysis and relevancy identification.

10.1.2. Open Relevance

Open Relevance (http://lucene.apache.org/openrelevance/) started out as an Apache Lucene subproject in June 2009, with the stated goal of making large corpuses of web content available under the Apache license. Search ranking techniques require these corpuses in order to train their algorithms to identify content relevant enough to return in search results. Since search ranking must be fairly content agnostic, corpuses of web content such as those provided by Open Relevance must be comparatively large and representative of the entire web.[5] The home page for the Open Relevance project is shown in figure 10.3.

5 Though smaller corpuses are also useful for specific relevancy training and algorithms.

Figure 10.3. The Apache Open Relevance home page

To date, three data sets are part of the Open Relevance collection in Apache SVN (http://mng.bz/04Tk):

  1. Hamshahri corpus— This is a moderately sized data set (~350MB) of newspaper articles from 1996 to 2002 covering 82 categories of interest including politics, arts, and so on.
  2. OHSUMED corpus— This is a larger data set (~850MB) of analyzed medical documents from 1987 to 1991.
  3. Tempo corpus— This is a small data set (~45MB) of newspaper articles from 2000 to 2002.

In making these data sets available, and providing a community for discussing them, Open Relevance serves to inform other search-related projects in the Lucene ecosystem. For example, Lucene Analyzer classes[6] can be trained to recognize the patterns and concepts identified as part of each corpus, and Solr Analyzers can take advantage of the same data. Nutch also uses a custom set of Lucene Analyzers, which can be further informed by the Open Relevance data sets. Finally, ranking algorithms in Solr, Lucene, and Nutch can be tuned according to the suggested importance and relevancy of the documents identified in each Open Relevance training corpus.

6 Don’t worry—we’ll explain what these are shortly.

So, where does Tika play into this equation? Open Relevance expects as input a number of document collections containing document IDs, relevant textual summaries, and queries that help ascertain the relevancy of documents to common categories and queries of interest. This is precisely the type of information that Tika can extract from a corpus of documents. For example, consider the following listing:

Listing 10.1. Integrating Tika into Open Relevance
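A minimal sketch of that integration follows. TrecDocument here is a hypothetical holder class standing in for the actual ORP utility code, and Metadata.DATE reflects the Tika 1.x-era metadata keys current when this book was written:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class TrecDocumentMaker {

    private final Tika tika = new Tika();

    // TrecDocument is a hypothetical stand-in for the ORP utility code;
    // it holds a document identifier, a date, and a text summary
    public TrecDocument summarize(File file) throws Exception {
        Metadata met = new Metadata();
        met.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        InputStream stream = new FileInputStream(file);
        try {
            // A single call to the Tika facade extracts the text summary
            // and populates the metadata along the way
            String summary = tika.parseToString(stream, met);
            return new TrecDocument(
                met.get(Metadata.RESOURCE_NAME_KEY),  // document identifier
                met.get(Metadata.DATE),               // associated date, if any
                summary);                             // text summary
        } finally {
            stream.close();
        }
    }
}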

The only code developed so far for Open Relevance is a set of utilities for representing documents according to TREC (Text Retrieval Conference) standards. These documents contain three attributes: a document identifier, a date associated with the document, and a text summary.

By this point in the book, you should know that Tika excels at extracting all three of the TREC document attributes modeled by ORP. The program in listing 10.1 shows how simple the integration with Tika can be: a single call to the Tika facade takes care of all the work!

So far, we’ve shown you the load-bearing walls on which the rest of the Lucene ecosystem stands, and how Tika helps those walls (and can be thought of as their mortar). In the next section, we’ll discuss the core search technologies in the next layer of the Lucene ecosystem: Lucene Core and Solr.

10.2. The steel frame

The "bread and butter" technologies that stand on top of the load-bearing walls in the Lucene ecosystem are the flagship Apache Lucene library itself (sometimes called Lucene Core) and Apache Solr, which builds on top of Lucene but still belongs at this level.

10.2.1. Lucene Core

Apache Lucene[7] is a Java-based library that provides a few basic constructs which, when brought together, form a powerful, flexible mechanism to implement search. Its home page is shown in figure 10.4.

7 See http://lucene.apache.org/ or check out Lucene in Action at http://manning.com/hatcher3/.

Figure 10.4. The Apache Lucene Top Level Project home page

At its core is the Document model, allowing for the arbitrary storage of named Fields per Document, with multiple values per Field. This allows metadata to be stored per Document in the index, as shown in table 10.1.

Table 10.1. A table-oriented view of a Lucene Document

Field              Value(s)
Title              Tika in Action
Author             Chris A. Mattmann, Jukka Zitting
Number of Pages    250

Table 10.1 represents a Lucene Document that itself captures metadata about an upcoming important book. The Document contains three Fields: Title, Author, and Number of Pages. Author is a multivalued Field containing two values (shown separated by a comma), and the other two Fields are single-valued.
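Here's a minimal sketch of how the Document in table 10.1 might be built, using the Lucene 3.x Field API that was current when this chapter was written (the lowercase field names are our own choices):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BookDocumentBuilder {

    // Builds the Lucene Document shown in table 10.1
    public static Document build() {
        Document doc = new Document();
        doc.add(new Field("title", "Tika in Action",
            Field.Store.YES, Field.Index.ANALYZED));

        // A multivalued Field: just add two Fields with the same name
        doc.add(new Field("author", "Chris A. Mattmann",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("author", "Jukka Zitting",
            Field.Store.YES, Field.Index.NOT_ANALYZED));

        doc.add(new Field("pages", "250",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}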

In addition to the Document model for representing content in a search index, Lucene provides a query model and a set of tools for analysis and tokenization of both text and numeric data. Lucene also contains a number of additional modules, for highlighting (partial) word matches, for indexing content from dictionaries like WordNet,[8] and even for geographic information system (or spatial) search!

8 A large lexical database of English words. Read more about it at http://wordnet.princeton.edu/.

Tika has grown to provide a number of useful features to the core Lucene library. We saw some of these in action with the LuceneIndexer from chapter 5 and the MetadataAwareLuceneIndexer from chapter 6. In short, Tika can feed both text and metadata to a Lucene index for any type of file that Tika knows about.[9] Not only can Tika extract text and metadata to feed into a Lucene index, it can also dynamically pick and choose the types of files (using its MIME detector, which you read about in chapter 4) to send to the Lucene index.
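The following sketch shows the general pattern (it assumes a Lucene 3.x IndexWriter supplied by the caller, and isn't the chapter 5 LuceneIndexer verbatim): Tika's detect() decides which files to index, and parseToString() supplies the full text.

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.Tika;

public class TypeAwareIndexer {

    private final Tika tika = new Tika();
    private final IndexWriter writer;

    public TypeAwareIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    public void indexFile(File file) throws Exception {
        // Use Tika's MIME detection to pick and choose what gets indexed
        String type = tika.detect(file);
        if (type.startsWith("image/")) {
            return;  // skip images, for example
        }
        Document doc = new Document();
        doc.add(new Field("filename", file.getName(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("type", type,
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fulltext", tika.parseToString(file),
            Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}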

9 And by now, we know that includes many types of files, more than 1200!

Once files have been indexed in Lucene, Tika can also help out, as we saw in the RecentFiles example from chapter 6, where Tika's standard metadata field names were used to automatically determine which document metadata fields to query on.
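For instance, a query against a "title" field (named after Tika's standard Dublin Core metadata key) might look like the following sketch, assuming an IndexSearcher already opened over the index and the Lucene 3.x query parser:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

public class TitleQuery {

    // "title" matches Tika's standard Dublin Core metadata key, so the
    // fields Tika extracted and the fields we query agree on their names
    public static TopDocs findByTitle(IndexSearcher searcher, String words)
            throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_36,
            "title", new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse(words);
        return searcher.search(query, 10);  // top-10 matching Documents
    }
}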

Tika's utility doesn't stop at Lucene Core. One of Tika's most frequent usages within the Lucene family is in connection with Apache Solr, as we'll read about in the next section.

10.2.2. Solr

Apache Solr (http://lucene.apache.org/solr/) builds on top of Lucene, exposing many of the same capabilities (highlighting, query parsing, tokenization, analysis, and so forth) over a RESTful HTTP interface. Solr also extends Lucene to support concurrent index writing and reading, leveraging a servlet container such as Apache Tomcat or Jetty to assist in concurrency and transaction management. The home page for the Apache Solr project is shown in figure 10.5.

Figure 10.5. The Apache Solr Project home page

Solr originally began as an internal project at CBS Interactive (or CNET), but was donated to the Apache Software Foundation in January 2006 via the Apache Incubator. After graduating from the Incubator, Solr became a Lucene subproject. Over the years, the Lucene and Solr communities have grown closer together, resulting in a merge of their development activities in March 2010.

One of Solr's flagship capabilities is its plugin mechanism, and one of the most useful plugins developed for Solr to date directly integrates Tika into Solr's toolkit. The ExtractingRequestHandler, or SolrCell as it's more commonly known, is a Solr request handler that allows any arbitrary document to be sent to Solr via its HTTP update interface. Once the document arrives in Solr, Tika is leveraged to extract text and metadata from it, and to map that text and metadata into fields stored per Document in the Lucene/Solr index. Recall from section 10.1.1 that one of ManifoldCF's key features is its easy integration with SolrCell and Tika: it sucks documents out of proprietary enterprise document and content repositories and ingests them into Solr via Tika and SolrCell.
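The client side of that exchange is simple. Here's a rough SolrJ sketch (assuming the Solr 3.x-era client API and a Solr instance at localhost:8983 with the /update/extract handler configured):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class SolrCellClient {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // POST the raw document to SolrCell; Tika does the parsing server-side
        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File(args[0]));       // any file type Tika understands
        req.setParam("literal.id", args[0]);  // supply a unique document id
        req.setParam("commit", "true");       // make the document searchable now

        solr.request(req);
    }
}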

Other areas of integration between Tika and Solr include a recent project that brings Tika's language identifier (recall chapter 7) into Solr as an UpdateProcessor implementation known as the LanguageIdentifierUpdateProcessor. More information on LanguageIdentifierUpdateProcessor can be found in the Solr JIRA system.[10]

10 http://issues.apache.org/jira/browse/SOLR-1979

Now that we’ve covered the steel frame of the Lucene search ecosystem, it’s time to talk about some of the advanced applications that sit on top of the frame. You probably won’t be surprised at this, but Tika is used a lot in each of the applications and software systems we’re about to discuss.

10.3. The finishing touches

With a strong foundation and core, it's no wonder that higher-level applications and frameworks have blossomed in the Lucene ecosystem. The oldest of these frameworks, the Apache Nutch project, was the original home of Apache Hadoop. Nutch's goal is to leverage Lucene, Solr, and various content-loading and extraction technologies to provide web-scale search (tens of billions of web pages) in an efficient and effective manner. Apache Droids is an Incubator podling focused on developing a lightweight, extensible crawler that can integrate into projects such as Nutch, Lucene, and Solr without all the complex features and functions those technologies provide. Finally, though we discussed Mahout earlier (in section 3.3), we'll revisit it here in the context of the Lucene ecosystem, alongside the other applications that sit on top of Lucene's core and load-bearing walls.

The best thing about our upcoming foray into these technologies? They all leverage Tika!

10.3.1. Nutch

Apache Nutch entered the Apache Incubator in January 2005, and quickly graduated that June to Lucene subproject status. At its core, Nutch’s primary goal was (and remains) opening up the “black box” that is web search and allowing for infinite tinkering and exploration in order to improve user experience and advance the state of the practice. According to Doug Cutting (Nutch’s creator), Nutch came about for this reason:

Nutch provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.) All existing major search engines have proprietary ranking formulas, and will not explain why a given page ranks as it does. Additionally, some search engines determine which sites to index based on payments, rather than on the merits of the sites themselves. Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible.

Doug Cutting
Founder of Nutch, 2004

After a period of years and an eventual 1.0 release under the Lucene umbrella, Nutch graduated to Top Level Project status in April 2010. Its home page (http://nutch.apache.org/) is shown in figure 10.6.

Figure 10.6. The Apache Nutch Top Level Project home page

Nutch is the integration architecture that leverages most or all of the components from the Lucene ecosystem as shown in figure 10.7.

Figure 10.7. The Apache Nutch2 Architecture. A major refactoring of the overall system, Nutch is now a delegation framework, leaving the heavy lifting to the other systems in the Lucene ecosystem.

At its core, Nutch provides a crawling framework (similar to what we'll discuss when we talk about Droids) that leverages different Protocol plugins responsible for downloading file content (over HTTP, FTP, SCP, and so on). Once content is obtained, it's fed through Tika for parsing and metadata extraction, and the extracted metadata and text are passed along to Solr for indexing and made available for search via Solr's REST APIs. The original content is cached in Apache Gora (http://incubator.apache.org/gora/), a new Apache Incubator podling responsible for data storage and object-relational mapping. Nutch's crawling process runs on top of Apache Hadoop as a set of distributed crawling jobs, efficiently spreading the load of crawling billions of web pages across a set of clustered computing resources.

What we’ve just described is the current Nutch2 architecture, and it represents a huge advancement over the 1.x series. Nutch2’s goal is to leverage the rest of the projects in the Lucene ecosystem to do its heavy lifting, and to make construction and experimentation with a web-scale search engine possible by forging the necessary connections between these powerful (but complex) software technologies.

Clearly, from figure 10.7, Tika is a huge part of the Nutch architecture. Besides handling parsing, content extraction, and metadata extraction, Tika's MIME detection system is heavily used to help determine which content should be pulled down and crawled, how it should be parsed (which parser to leverage), and how to route the extracted information into Solr. As we've seen with the search engine examples (from chapters 1 and 9, and as you'll see in chapter 15), it's hard to build a search engine without leveraging a framework like Tika, since parsing, metadata extraction, MIME detection, and language identification are all critical functions of search.

Next we’ll cover Apache Droids, an Incubator podling whose focus is restricted to extensible file crawling and delivery to systems such as Solr and Nutch. Don’t worry; Tika will pop up there again, too!

10.3.2. Droids

One of the more serious complaints about Nutch over the years came from users who felt it was too configurable.[11] Nutch's plentiful configuration parameters threw off users who simply wanted to start crawling and indexing information about files and documents out of the box. Additionally, many potential users didn't have access to a 100-node cluster to reap the benefits of deploying over Hadoop, and thus wanted a more minimal, out-of-the-box crawler to begin experimenting with a corpus of documents.

11 Yes, it’s possible for users to get annoyed with too much extensibility in a system. Good software frameworks make the appropriate trade-off between sensible defaults and existing functionality at the expense of total configuration, but usually that lesson is learned over a looong time.

Enter Apache Droids, an effort to refactor and reconfigure the crawler portion of Nutch into an independent, easy-to-use framework for text extraction and crawling. Droids entered the Apache Incubator in October 2008, and has been an Incubator project ever since. The home page for Droids is shown in figure 10.8.

Figure 10.8. The Apache Droids Project home page

Droids has no qualms about leveraging Tika as a core component in its framework. The Droids home page (http://incubator.apache.org/droids/) says it all:

Apache Tika, the parser component, is just a wrapper for Tika since it offers everything we need. No need to duplicate the effort.

A Handler is a component that uses the original stream and/or the parser (ContentHandler coming from Tika) and the url to invoke arbitrary business logic on the objects.

That pretty much covers web-crawling and file-crawling frameworks built on top of the Lucene stack. At this point, we hope you appreciate how Tika fills in as the mortar connecting most of these technologies, providing the common functionality that Lucene components require to implement world-class search software.

We’ll wrap up the section with a short revisit to the Apache Mahout project, and its role in the overall Lucene ecosystem.

10.3.3. Mahout

Apache Mahout[12] started out as a Lucene subproject in January 2008, when a number of Lucene project members realized that they shared a common interest in implementing scalable machine learning algorithms on top of the Apache Hadoop framework. The home page for Mahout is shown in figure 10.9. In April 2010, Mahout became an Apache Top Level Project.

12 See http://mahout.apache.org/ or check out Mahout in Action at http://manning.com/owen/.

Figure 10.9. The Apache Mahout Top Level Project home page

Since its inception, Mahout has grown to focus on the field of machine learning, adding capabilities for collaborative filtering (finding common products and recommending them), clustering, and (automatic) categorization. Mahout gels nicely with Lucene in that it can load data from Lucene indexes and feed that information into its machine learning algorithms to run analyses and assist in decision making in software applications, such as suggesting a book you should buy on Amazon (based on your purchase history) or categorizing a new product you've added to your website based on its features and extracted information.

We covered Mahout pretty extensively in section 3.3, but let’s summarize in case you’ve forgotten by this point.[13] Tika can be leveraged in Mahout’s algorithms as a means of turning files and documents into extracted text, which is in turn fed into Mahout’s software framework for collaborative filtering, clustering, and so forth. Since Mahout algorithms are Hadoop-enabled, Mahout represents another real-world example (akin to Nutch) of bringing Tika to Hadoop, which remains a large load-bearing wall in the Lucene ecosystem.
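One common pattern, sketched below under the assumption that Mahout's text-vectorization tools (such as seqdirectory and seq2sparse) are pointed at the output directory afterward, is to use Tika as the front end that turns arbitrary documents into plain text Mahout can consume:

import java.io.File;
import java.io.FileWriter;
import org.apache.tika.Tika;

public class CorpusExtractor {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File inputDir = new File(args[0]);   // arbitrary documents
        File outputDir = new File(args[1]);  // plain text for Mahout
        outputDir.mkdirs();

        for (File file : inputDir.listFiles()) {
            // Tika turns each document into plain text...
            String text = tika.parseToString(file);
            FileWriter out = new FileWriter(
                new File(outputDir, file.getName() + ".txt"));
            try {
                out.write(text);  // ...which Mahout's text-vectorization
            } finally {           // tools can then cluster and classify
                out.close();
            }
        }
    }
}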

13 We wouldn’t blame you, seven chapters later!

For an in-depth look at Mahout and Tika, we recommend heading back to section 3.3 and checking out figure 3.8. Now that we’ve covered all of the core portions of the Lucene architecture and ecosystem, it’s time for a quick recap and wrap-up of the chapter.

10.4. Summary

The goal of this chapter was to introduce you to the vibrant Lucene ecosystem and its supporting cast, including Mahout, Lucene, ManifoldCF, and others. We tried to keep it high-level and focus on the architecture and broader details of each of these projects, as an in-depth treatment of them is beyond the scope of this book.

We framed Tika's relationship to each of these technologies and tried to convey the overall layered architecture and the commonalities among these software products, taking special care to show you where Tika fits in along the way. The key takeaways should include these points:

  1. The architecture of the Lucene ecosystem— Identifying which technologies fit where, and why.
  2. The broad infection of Tika into each layer of the architecture— There’s no getting around it—Tika forms the mortar that holds the “bricks” or layers of the architecture together.

Now that you can tell your Tikas from your Solrs, it’s time to wrap up this part of the book and discuss some advanced usage of Tika, focusing on the cases where Tika as shipped requires some extensions and additional functionality to meet your needs. It’s called “Extending Tika” and it’s up next!