Chapter 13. Content management with Apache Jackrabbit – Tika in Action

 

This chapter covers

  • The Apache Jackrabbit Content Repository
  • The use of Tika in Jackrabbit
  • File detection and parsing for Jackrabbit WebDAV

 

Apache Jackrabbit, http://jackrabbit.apache.org, is a content repository that provides a rich storage layer on which to build content and document management systems like the ones we discussed earlier in chapter 9. Full-text search and WebDAV integration are two key features of a content repository. In this case study we’ll learn how Jackrabbit uses Tika to help implement these features.

We’ll start by briefly describing the key features of Apache Jackrabbit and the Content Repository for Java technology (JCR) API (http://www.jcp.org/en/jsr/detail?id=170) that it implements. Armed with this background, we’ll then look deeper into how Jackrabbit’s search feature uses a pool of Tika threads to achieve the illusion of being able to index arbitrarily large documents nearly in real time. We’ll also look at how Tika’s type detection feature is used to add smarts to Jackrabbit’s WebDAV integration layer. We’ll end this case study with a brief summary.

13.1. Introducing Apache Jackrabbit

Apache Jackrabbit is an implementation of a special kind of database called a content repository. Defined in Java Specification Requests (JSRs) 170 and 283, a content repository is a hierarchically organized storage engine that combines features from advanced file systems and relational databases.

Documents, files, records, and all other kinds of information entities are stored as nodes in the content tree inside a repository. Each node consists of any number of properties and child nodes. Properties contain numbers, strings, byte streams, or other types of data, including arrays of such values. Figure 13.1 shows how this content model looks in practice.

Figure 13.1. Example content in a content repository
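
To make the node-and-property model concrete, here is a minimal in-memory sketch that uses only JDK classes. The ContentNode class and its methods are hypothetical and only loosely echo the shape of the JCR API; in particular, this addNode() returns an existing child instead of creating a second one, which is a simplification chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory model of a content node; this is not the JCR API. */
final class ContentNode {
    private final String name;
    private final Map<String, Object> properties = new LinkedHashMap<>();
    private final Map<String, ContentNode> children = new LinkedHashMap<>();

    ContentNode(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    /** Returns the child with the given name, creating it if it does not exist yet. */
    ContentNode addNode(String childName) {
        return children.computeIfAbsent(childName, ContentNode::new);
    }

    /** Stores a property value: a number, a string, a byte array, or an array of such values. */
    ContentNode setProperty(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    Object getProperty(String key) {
        return properties.get(key);
    }

    /** Resolves a relative path such as "docs/report.pdf"; returns null if a step is missing. */
    ContentNode getNode(String relativePath) {
        ContentNode current = this;
        for (String step : relativePath.split("/")) {
            current = current.children.get(step);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        ContentNode root = new ContentNode("");
        root.addNode("docs").addNode("report.pdf")
            .setProperty("title", "Quarterly report")
            .setProperty("pages", 12L)
            .setProperty("data", new byte[] {'%', 'P', 'D', 'F'});
        System.out.println(root.getNode("docs/report.pdf").getProperty("title"));
    }
}
```

A real repository adds typing, persistence, versioning, and search on top of this basic tree shape.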

In addition to storing such hierarchical data, the content repository also makes it searchable, keeps track of past versions of content, sends notifications of changes, and supports a number of other features that make life easier for an application developer. As the reference implementation of JSR 170 and 283, Apache Jackrabbit implements all these features and more.

The most interesting features for this case study are full-text search and WebDAV integration. Jackrabbit uses an integrated Lucene search index to make all repository content, including binary properties, searchable as soon as it has been stored. The WebDAV integration lets users access and modify repository content over the web or mount a repository as a part of their normal file systems. Tika is an integral part of these features, as we’ll see in the next two sections.

13.2. The text extraction pool

One feature that separates a Jackrabbit content repository from a relational database is the ease with which it can handle normal files. You can drop digital documents such as PowerPoint presentations or PDFs into a content repository and have them searchable by content without any custom indexing setup. Let’s see how Jackrabbit does this.

Whenever a node is added, modified, or removed in Jackrabbit, the integrated Lucene index is updated to match the change. If the node contains binary properties, the contents of those properties are extracted with Tika and added to the index as text. Since text extraction can be time-consuming for some documents, Jackrabbit uses a set of background threads for this purpose. This allows the index to be updated immediately during a save, and then reupdated as soon as the extracted text becomes available. Together these updates create an illusion of a super-fast index whose accuracy improves incrementally over time.

So how does this work in practice? When an index update is needed, new text extraction tasks are created for all binary properties and scheduled for execution by a pool of background threads. The essential Jackrabbit code for the text extraction task is shown next.

Listing 13.1. Background text extraction task in Jackrabbit

The code starts with a standard use of the parse() method as described in chapter 5. The more interesting bits happen next. The first catch block silently ignores problems caused by a deployment omitting some parser libraries, which is an easy way to customize and streamline an installation. The second block catches any other problems, including a special STOP exception that’s thrown by a specially instrumented ContentHandler instance to signal that the given maximum number of characters has already been extracted from the binary stream. The extraction process is terminated when the STOP event is received; otherwise an exception is logged and any extracted text is replaced with a TextExtractionError token to make such problems easy to locate within a repository. Finally, the extracted text is made available for use by the indexer.
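
The error-handling pattern described here can be sketched with nothing but the SAX classes that ship with the JDK. This is an illustration under stated assumptions, not Jackrabbit's code: the CappedExtraction class, the TextSource interface (a stand-in for a parser), and the choice of LinkageError as the symptom of a missing parser library are all hypothetical. Only the STOP sentinel idea and the TextExtractionError token come from the description above.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of length-capped text extraction with a STOP sentinel. */
final class CappedExtraction {

    /** Sentinel thrown by the handler once the character limit has been reached. */
    static final SAXException STOP = new SAXException("maximum length reached");

    /** Token stored instead of text when extraction fails. */
    static final String ERROR_TOKEN = "TextExtractionError";

    /** Stand-in for a parser: pushes character events into a content handler. */
    interface TextSource {
        void parse(ContentHandler handler) throws Exception;
    }

    /** Collects character data and aborts the parse once maxLength characters are stored. */
    static final class CappedHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int maxLength;

        CappedHandler(int maxLength) {
            this.maxLength = maxLength;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int remaining = maxLength - text.length();
            if (length < remaining) {
                text.append(ch, start, length);
            } else {
                // Keep only what still fits, then signal that the limit was hit.
                text.append(ch, start, remaining);
                throw STOP;
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Runs the source and returns the extracted text, or ERROR_TOKEN on failure. */
    static String extract(TextSource source, int maxLength) {
        CappedHandler handler = new CappedHandler(maxLength);
        try {
            source.parse(handler);
        } catch (LinkageError e) {
            // Assumption: a missing parser library shows up as a LinkageError; ignore it.
        } catch (Throwable t) {
            if (t != STOP) {
                // A real task would also log the problem here.
                return ERROR_TOKEN;
            }
        }
        return handler.text();
    }

    public static void main(String[] args) {
        TextSource hello = h -> {
            char[] chars = "hello world".toCharArray();
            h.characters(chars, 0, chars.length);
        };
        System.out.println(extract(hello, 5));
    }
}
```

Comparing the caught exception to the sentinel by identity keeps the check cheap and avoids mistaking an unrelated SAXException for the limit signal.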

Meanwhile, as the extraction task is running, the indexer first waits for a fraction of a second to see whether the extracted text is already available for use in the first index update. If not, the node is first indexed with an empty text extraction result and a new index update is scheduled for when the extraction task is complete. This way a new or updated document is immediately searchable by its nonbinary properties, and by the extracted full-text contents normally within a few seconds from when the changes were saved.
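
The wait-briefly-then-reindex behavior can be sketched with the JDK's concurrency utilities. The TwoPhaseIndexer class below is hypothetical: the updates list stands in for the Lucene index, the Supplier stands in for a text extraction task, and the wait time is whatever the caller passes in. It is one plausible way to get the described effect, not a copy of Jackrabbit's implementation.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical two-phase indexer; the names are illustrative, not Jackrabbit's. */
final class TwoPhaseIndexer implements AutoCloseable {
    private final ExecutorService pool;
    private final long waitMillis;

    /** Stand-in for the search index: each entry records one update as "nodeId=text". */
    final List<String> updates = new CopyOnWriteArrayList<>();

    TwoPhaseIndexer(int threads, long waitMillis) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.waitMillis = waitMillis;
    }

    /**
     * Indexes the node right away. If extraction does not finish within waitMillis,
     * the node is indexed with empty text and indexed again once the text arrives.
     * The returned future completes after the last index update for this call.
     */
    CompletableFuture<Void> index(String nodeId, Supplier<String> extractor) {
        CompletableFuture<String> extraction = CompletableFuture.supplyAsync(extractor, pool);
        try {
            String text = extraction.get(waitMillis, TimeUnit.MILLISECONDS);
            updates.add(nodeId + "=" + text);
            return CompletableFuture.completedFuture(null);
        } catch (TimeoutException e) {
            // Too slow for the first update: index without text, reindex on completion.
            updates.add(nodeId + "=");
            return extraction.thenAccept(text -> updates.add(nodeId + "=" + text));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for extraction", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        try (TwoPhaseIndexer indexer = new TwoPhaseIndexer(2, 100)) {
            indexer.index("small.txt", () -> "hello").join();
            indexer.index("large.pdf", () -> {
                try {
                    Thread.sleep(500); // simulate a slow parse
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return "lots of text";
            }).join();
            System.out.println(indexer.updates);
        }
    }
}
```

The short wait is a trade-off: it costs a little latency on every save, but it spares small documents a second index update.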

13.3. Content-aware WebDAV

WebDAV, or Web-based Distributed Authoring and Versioning protocol as it’s officially called, is an extension of the Hypertext Transfer Protocol (HTTP) designed for remotely managing files and other resources on web servers. WebDAV makes a web server work like an advanced remote file system, and is thus a good fit as a remote access protocol for a Jackrabbit content repository. Most operating systems have a Connect to Server feature that allows a WebDAV server to be mounted as a part of the file system, and many applications ship with integrated WebDAV support for accessing and modifying remote resources. Figure 13.2 shows the WebDAV mount feature in Windows Vista.

Figure 13.2. Doing a WebDAV mount in Windows Vista

Jackrabbit implements WebDAV in two varieties: one focused on integration with traditional WebDAV clients like the ones mentioned here, and another more complicated one that makes nearly all repository functionality available to advanced remote clients. The traditional clients are often fairly simple, and Jackrabbit needs to do some extra work to fill in details that the clients fail to provide. This is where Tika comes in.

A classic use case for WebDAV integration in Jackrabbit is being able to copy files to and from the repository by dragging and dropping them in a normal file explorer window. Another use case is browsing and downloading these files using a web browser. The trouble is that the latter use case needs accurate media type information so that the browser can easily associate a file with the correct application, whereas the WebDAV mount features in operating systems typically don’t provide such type information along with added files. The solution is to use Tika to automatically detect the types of incoming files.

If you remember the type detection examples from chapter 4, then the related Jackrabbit code in the following listing will look familiar.

Listing 13.2. Automatic type detection in Jackrabbit

The code takes advantage of the possible media type hint provided by the WebDAV client, the name of the uploaded file, and the contents of the file itself to automatically detect its type. With this code in place, a user who copies a PDF document to the repository with no associated type information will be able to access it as a properly annotated application/pdf resource later on. Small touches like this are essential for a smooth user experience.
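
To show how the three inputs can be combined, here is a deliberately tiny, JDK-only sketch. The SimpleTypeDetector class, its three-entry extension table, its two magic-byte checks, and its precedence (content first, then name, then client hint) are all assumptions made for this example. Tika's own detection logic and type database are far more thorough.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, greatly simplified type detector; this is not Tika's algorithm. */
final class SimpleTypeDetector {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Tiny illustrative extension table. */
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "pdf", "application/pdf",
        "txt", "text/plain",
        "png", "image/png");

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G'};

    /**
     * Detects a media type from the first bytes of the content, the file name,
     * and the type hint sent by the client. Any argument may be null.
     */
    static String detect(byte[] prefix, String fileName, String clientHint) {
        // 1. The content itself is the most trustworthy evidence.
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, PNG_MAGIC)) {
            return "image/png";
        }
        // 2. Fall back to the file name extension.
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0) {
                String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                String type = BY_EXTENSION.get(extension);
                if (type != null) {
                    return type;
                }
            }
        }
        // 3. Use the client's hint unless it is missing or the generic default.
        if (clientHint != null && !clientHint.isEmpty() && !OCTET_STREAM.equals(clientHint)) {
            return clientHint;
        }
        return OCTET_STREAM;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] upload = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        // A WebDAV mount that sends no useful name or type still gets a sensible result.
        System.out.println(detect(upload, "upload", OCTET_STREAM));
    }
}
```

Treating application/octet-stream as "no hint" matters in practice, because it is the generic fallback type and carries no information about the file.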

13.4. Summary

The purpose of a content repository like Apache Jackrabbit is to make it easy to manage all kinds of content, including collections of digital documents. To do this well, Jackrabbit needs a way to understand and look inside the documents stored in the repository. Tika is the perfect tool for this purpose.

We started this chapter with a brief introduction to Apache Jackrabbit and the content repository model. Then we looked at two ways in which Jackrabbit uses Tika. The first was text extraction for use with the Lucene-based search index in Jackrabbit, and the second was automatic type detection for smooth WebDAV integration.

The Jackrabbit content repository powers many high-end content management systems that are used for purposes ranging from large-scale digital asset management to high-profile web content management. Tika might already be there working behind the scenes when you next visit a large website!

This concludes our discussion of Tika in Apache Jackrabbit. Our next case study is also related to the management of digital assets, but in a different way than in a generic tool like Jackrabbit. Read on to find out how the National Cancer Institute uses Tika to help manage the vast amounts of data it collects.