Chapter 11. Extending Tika – Tika in Action

Chapter 11. Extending Tika

 

There are thousands of document formats in the world and new ones are constantly being introduced, so it’s impossible for a library like Tika to support all of them out of the box. Thus even though each Tika version adds support for new formats, there will be times when Tika won’t be able to extract content from or even detect the type of a document you’re trying to use. This chapter is about what you can do in such a situation.

Imagine that you’re working with a new XML-based file format for medical prescriptions. Each file describes a single prescription and consists of a set of both fixed and free-form fields of information. Optionally the prescription documents can be digitally signed and encrypted for better security and privacy. Figure 11.1 shows how such digital prescriptions can be used in practice.

Figure 11.1. Illustration of how a digital prescription document can be used to securely transfer accurate prescription information from a doctor to a pharmacy. A digital signature ensures that the document came from someone authorized to make prescriptions, and encryption is used to ensure the privacy of the patient.

It’d be useful to make such documents searchable based on both free-form text and selected metadata fields like the patient name or identifier. The easiest way to implement such a search engine is to use an existing search stack like the one we described in the previous chapter, and to do that you simply need to teach Tika how these documents should be parsed.

We’ll use such digital prescription documents as our example for extending Tika. First we’ll teach Tika how to detect and identify such documents, and then we’ll see how to make Tika correctly parse these documents.

11.1. Adding type information

The first step in dealing with a new document type is identifying it with a media type. Let’s tentatively name our prescription format application/x-prescription+xml. The x- prefix marks this as an experimental type that hasn’t been officially registered, and the +xml suffix signals that the type is XML-based.

We also need some information to help automatic detection of prescription documents. As discussed in chapter 4, the file extension and the XML root element are good hints for type detection. So let’s assume that the prescription files are named with a .xpd extension for extensible prescription document. Furthermore let’s assume that the XML documents start with an <xpd:prescription> element whose prefix xpd is mapped to the namespace http://example.com/2011/xpd. The following listing shows what such a document might look like:

<xpd:prescription xmlns:xpd="http://example.com/2011/xpd">
  <xpd:doctor>...</xpd:doctor>
  <xpd:patient>...</xpd:patient>
  <xpd:medicine>...</xpd:medicine>
  <xpd:instructions>...</xpd:instructions>
</xpd:prescription>

All this type information can be described in the media type record shown next. Please refer back to chapter 4 where we covered the MIME-info database for more details about the media type record structure:

<mime-info>
<mime-type type="application/x-prescription+xml">
  <sub-class-of type="application/xml"/>
  <acronym>XPD</acronym>
  <expanded-acronym>Extensible Prescription Document</expanded-acronym>
  <comment xml:lang="en">Digital prescription</comment>
  <glob pattern="*.xpd"/>
  <root-XML localName="prescription"
            namespaceURI="http://example.com/2011/xpd"/>
</mime-type>
</mime-info>
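
To make the two detection hints in this record more concrete, here's a minimal JDK-only sketch that checks a filename against the *.xpd glob and reads the root element of an XML document with the standard StAX API. It isn't how Tika evaluates MIME-info records internally, and the PrescriptionHints class and its method names are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper mirroring the glob and root-XML hints of the record above. */
class PrescriptionHints {

    // Namespace and local name taken from the media type record
    static final QName ROOT =
            new QName("http://example.com/2011/xpd", "prescription");

    /** Rough equivalent of the glob pattern *.xpd. */
    static boolean matchesGlob(String fileName) {
        return fileName != null && fileName.endsWith(".xpd");
    }

    /** Returns the qualified name of the first element, or null if there is none. */
    static QName rootElement(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // No DTD processing is needed just to look at the root element
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    /** Rough equivalent of the root-XML hint; the namespace prefix is ignored. */
    static boolean matchesRoot(byte[] xml) {
        try {
            return ROOT.equals(rootElement(xml));
        } catch (XMLStreamException e) {
            return false; // not well-formed XML, so not a prescription
        }
    }
}
```

Note that the comparison is made on the namespace URI and local name only, so a document that maps a different prefix to http://example.com/2011/xpd still matches.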

The most obvious way to teach Tika about new document types is to extend the existing media type database, so that’s what we’ll focus on first.

11.1.1. Custom media type configuration

Let’s look at the shared MIME-info database file we covered earlier in chapter 4. The database contains details of all the media types known to Tika, so to support a new type you’ll need to add it to the database. This section shows how to do that.

By default Tika will load this database from the org/apache/tika/mime/tika-mimetypes.xml file inside the tika-core JAR. But you can also instruct Tika to load an alternative file using the MimeTypesFactory class. For example, the following listing shows how to load an alternative MIME-info database and use it to set up a Tika facade instance for use in type detection:

String path = "file:///path/to/prescription-type.xml";
MimeTypes typeDatabase = MimeTypesFactory.create(new URL(path));
Tika tika = new Tika(typeDatabase);
String type = tika.detect("/path/to/prescription.xpd");

When executed with the custom settings described above, this code snippet will return the expected application/x-prescription+xml media type. You can also use the MimeTypes object returned by the MimeTypesFactory for constructing AutoDetectParser instances or anywhere else you need a Detector object.

Tika currently doesn’t support merging multiple MIME-info databases, so the best way to create a customized database is to start with the default version included in the tika-core JAR. This unfortunately means that you should update your customized version whenever a new Tika release is made. A future Tika release will no doubt add support for incremental database updates to make it easier to maintain these kinds of custom extensions.

Now that Tika knows our custom types, our next step is to look at adding more generic type detection strategies through custom Detector classes.

11.2. Custom type detection

Customizing the MIME-info database is all it takes to teach Tika about new types and new type detection rules based on common features such as file extensions, magic bytes, or XML elements. But what if you’re dealing with a more complex format for which none of these simple detection mechanisms work? The answer lies in Tika’s Detector interface, which allows you to plug in custom type detection algorithms.

To better understand the Detector interface and how to use it as an extension point, we’ll first go through a quick overview of how the interface works. Then we’ll dive in and implement a complete custom type detector for encrypted prescription documents. Finally we’ll see how custom detectors can be plugged into Tika.

11.2.1. The Detector interface

The Detector interface specifies a generic API for type detection algorithms. The detect method defined in this interface detects the type of a document based on the document’s raw byte stream and any available document metadata. The diagram in figure 11.2 outlines how this works.

Figure 11.2. Overview of a generic type detector

One detector implementation can look at the byte stream for known byte patterns while another can inspect the available document metadata for known filename suffixes or other media type hints. Detector implementations should also be prepared for the absence of either of these inputs, for example, when dealing with just a filename or a raw byte stream. If the detector can’t determine the document type based on the available information, it should return the generic application/octet-stream media type.

Tika will automatically load all available detector implementations using Java’s service provider mechanism. When detecting the type of a document, all these available detectors are invoked in sequence and the most specific media type is returned to the client application as the detection result. You can add custom detection algorithms by implementing the Detector interface and adding the required service provider settings. The next section shows how this is done in practice.
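
The contract described above can be illustrated without Tika on the classpath. The following sketch uses a made-up SimpleDetector interface that takes a filename instead of Tika's Metadata object, so treat all names in it as hypothetical stand-ins. It shows one detector that relies only on the name, one that relies only on the bytes and restores the stream with mark/reset, and a composite that runs them in sequence. As a simplification, the composite lets the last non-fallback answer win rather than comparing how specific the types are.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/** Hypothetical JDK-only stand-ins; Tika's own Detector takes a Metadata object. */
class DetectorSketch {

    static final String FALLBACK = "application/octet-stream";

    interface SimpleDetector {
        /** Either argument may be null; the stream must end up where it started. */
        String detect(InputStream stream, String fileName) throws IOException;
    }

    /** Uses only the filename hint. */
    static class SuffixDetector implements SimpleDetector {
        @Override
        public String detect(InputStream stream, String fileName) {
            boolean xpd = fileName != null && fileName.endsWith(".xpd");
            return xpd ? "application/x-prescription+xml" : FALLBACK;
        }
    }

    /** Uses only the leading bytes, restoring the stream with mark/reset. */
    static class PdfMagicDetector implements SimpleDetector {
        @Override
        public String detect(InputStream stream, String fileName) throws IOException {
            if (stream == null || !stream.markSupported()) {
                return FALLBACK;
            }
            stream.mark(4);
            try {
                byte[] head = new byte[4];
                int n = stream.readNBytes(head, 0, 4);
                boolean pdf = n == 4 && head[0] == '%' && head[1] == 'P'
                        && head[2] == 'D' && head[3] == 'F';
                return pdf ? "application/pdf" : FALLBACK;
            } finally {
                stream.reset(); // back to the original position
            }
        }
    }

    /** Runs every detector; simplification: the last non-fallback answer wins. */
    static class SequenceDetector implements SimpleDetector {
        private final List<SimpleDetector> detectors;

        SequenceDetector(List<SimpleDetector> detectors) {
            this.detectors = detectors;
        }

        @Override
        public String detect(InputStream stream, String fileName) throws IOException {
            String best = FALLBACK;
            for (SimpleDetector detector : detectors) {
                String type = detector.detect(stream, fileName);
                if (!FALLBACK.equals(type)) {
                    best = type;
                }
            }
            return best;
        }
    }
}
```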

11.2.2. Building a custom type detector

For example, let’s assume that the pharmacy automation system we described earlier is supposed to automatically detect and process digital prescriptions sent as encrypted email attachments. We have the decryption key but can’t rely on things such as file extensions or other external type hints for detecting these documents.

So how would we go about detecting such documents? As described earlier, the solution is to create a custom Detector class and plug it into Tika’s type detection mechanism. The example class shown next does exactly this. Take a moment to study the code, and read on for a more detailed description.

Listing 11.1. Custom type detector for encrypted prescription documents

What’s happening here? Let’s go through the code in steps.

1.  First, detector classes are instantiated using the public default constructor and can’t access extra settings through the ParseContext object that the Parser classes can. Thus in this case we need a static reference to the pharmacy’s decryption key.

2.  Then in the detect() method we start trying to detect the document type. This method can assume that the given stream supports the mark feature, and is only expected to reset the stream to its original position before returning. The org.apache.tika.io.LookaheadInputStream (introduced in Tika 1.0) utility class is a perfect tool for this, as it takes care of all the details of properly managing the stream state. See the Javadocs of that class for more details.

3.  We then try to decrypt the lookahead stream using the standard cryptography API in Java. If the decryption fails for whatever reason, we can assume that the document is either not encrypted or that we don’t have the correct key for that document. In either case this detector can return application/octet-stream as the fallback type.

4.  If we do manage to decrypt the stream, our next task is to check whether it looks like XML and starts with the xpd:prescription element. The org.apache.tika.detect.XmlRootExtractor utility class is designed for this purpose, and is also used by the default type detection code in Tika.

5.  Finally, if all signs point to this being an encrypted digital prescription, we can inform Tika of that fact by returning the application/x-prescription media type. Note that we’ve dropped the +xml suffix from the type name, as the encryption makes the document unusable for standard XML-processing tools. This new media type should also be added to the MIME-info database as a sibling of the already declared XML type.
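
The steps above can be approximated with nothing but the JDK. The sketch below is not a Tika Detector: it uses plain mark/reset instead of LookaheadInputStream and a StAX root-element check instead of XmlRootExtractor. The encryption scheme is also an assumption, since the format description doesn't specify one: we assume a 16-byte IV followed by AES/CBC ciphertext, and the key is passed in as a parameter to keep the example testable. The EncryptedPrescriptionSniffer name is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.Key;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** JDK-only approximation of the detection steps; not a Tika Detector. */
class EncryptedPrescriptionSniffer {

    static final String FALLBACK = "application/octet-stream";
    static final String ENCRYPTED = "application/x-prescription";
    private static final QName ROOT =
            new QName("http://example.com/2011/xpd", "prescription");
    private static final int IV = 16;          // AES block size in bytes
    private static final int LOOKAHEAD = 1024; // a whole number of blocks

    /** Assumed layout: a 16-byte IV followed by AES/CBC ciphertext. */
    static String detect(InputStream stream, Key key) throws IOException {
        if (stream == null || !stream.markSupported()) {
            return FALLBACK;
        }
        stream.mark(LOOKAHEAD);
        try {
            byte[] head = new byte[LOOKAHEAD];
            int read = stream.readNBytes(head, 0, head.length);
            int body = (read - IV) / IV * IV; // complete cipher blocks after the IV
            if (body <= 0) {
                return FALLBACK;
            }
            // NoPadding: only a prefix is decrypted, so the padded final
            // block may not be part of the lookahead at all
            Cipher cipher = Cipher.getInstance("AES/CBC/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(head, 0, IV));
            byte[] plain = cipher.update(head, IV, body);
            return ROOT.equals(rootElement(plain)) ? ENCRYPTED : FALLBACK;
        } catch (GeneralSecurityException | XMLStreamException | RuntimeException e) {
            return FALLBACK; // unusable key, not encrypted, or not XML
        } finally {
            stream.reset(); // leave the stream where we found it
        }
    }

    /** Returns the qualified name of the first element, or null if there is none. */
    private static QName rootElement(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

With this scheme a wrong key doesn't make the decryption itself fail. It produces bytes that don't parse as a prescription, which leads to the same fallback result.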

11.2.3. Plugging in new detectors

The one last thing you need after compiling this custom detector class is to plug it into Tika. The easiest way to do that is to place the compiled class into a JAR archive together with a META-INF/services/org.apache.tika.detect.Detector file that contains the fully qualified name of this class on a line by itself. Then include that JAR in your classpath, and Tika will automatically pick up and use the new detector.

If you want more control over the set of detectors used by your application, you can also use the CompositeDetector class to explicitly compose a combination of them. The following code snippet shows how to extend our previous detection example with support for encrypted prescription documents:

String path = "file:///path/to/prescription-type.xml";
MimeTypes typeDatabase = MimeTypesFactory.create(new URL(path));
Tika tika = new Tika(new CompositeDetector(
        typeDatabase,
        new EncryptedPrescriptionDetector()));
String type = tika.detect("/path/to/tmp/prescription.xpd");

By now you’ve learned how to extend Tika’s media type database and type detection capabilities to cover pretty much any new document type you encounter. The next step is to let Tika parse such documents, and that’s what we’ll focus on in the next section.

11.3. Customized parsing

Knowing the type of a document is useful, but even better is being able to extract information from the document. To do this you need to be able to parse the document format, and for that you use the Parser interface described in chapter 5. To enable Tika to extract information from a new document type, the first step is to implement a new parser class or to extend an existing one. In this section we’ll do both.

Consider the digital prescription documents we’ve been discussing. In their unencrypted form they’re XML documents with a specific structure, and the encrypted form wraps a digital signature and encryption around the underlying XML document. To best handle such documents, we need two new parser classes, one for the basic XML format and another for the encrypted form. The relationship between these custom parser classes and the greater Tika parser design is outlined in figure 11.3.

Figure 11.3. Custom parser classes for handling digital prescription documents

As shown in this diagram, we’ll first implement support for the unencrypted prescription documents by extending the standard XMLParser class in Tika. Once we have that class, it’ll be easy to combine it with our earlier work on detecting encrypted documents to implement a custom parser for encrypted digital prescriptions.

11.3.1. Customizing existing parsers

Let’s start with the existing parsers that you can already find in Tika. Most of the time they work pretty well, but what if you need to make some minor adjustments to help them better understand the kinds of documents you’re working with? The ParseContext object that we covered in chapter 5 allows some level of customization, but sometimes you need to make more extensive changes to parser behavior.

Many parser classes in Tika have been designed with such customization in mind, so you can often extend them with a subclass that overrides selected methods. See the Javadocs of the parser classes for more information on the ways in which they can be extended. A good example of such extensibility is the XMLParser class that we’ll be using next.

Remember the simple XML outline of a basic prescription document in section 11.1? The document contains separate elements for the doctor, the patient, the medicine, and any instructions associated with the prescription. The default XML parser will take the text content of all these elements and make it available as the output of the parsing process. Wouldn’t it be useful if at least some of these fields were also made available as structured metadata fields?

An example class that does this is shown in the following listing. It retains the default behavior of the XMLParser class while also mapping the contents of the xpd:doctor and xpd:patient elements into similarly named metadata fields.

Listing 11.2. An XMLParser subclass for parsing prescription documents

Let’s step through the code to better understand how it works.

1.  To start with, the class extends the existing XMLParser class and overrides the protected getContentHandler() method. This method controls how the SAX events from the parsed XML document are mapped to Tika’s XHTML output. The default implementation strips out all elements and passes only the text content to the client. In this case we customize this process to map selected parts of the XML document into corresponding metadata fields.

2.  To achieve this we use the ElementMetadataHandler utility class from the org.apache.tika.parser.xml package. This class interprets an incoming SAX event stream and maps the text content of selected elements to a given metadata field. In our case we’re interested in the names of the doctor and the patient mentioned in the prescription, so we construct two such handlers.

3.  We use the TeeContentHandler class to tie these metadata handlers together with the default XMLParser behavior as returned by the superclass method. See chapter 5 for more details on how the TeeContentHandler class works.

4.  Finally we override the getSupportedTypes() method to only return the application/x-prescription+xml media type. This allows our custom class to coexist with the default XMLParser class that supports just the standard application/xml media type.
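
The idea behind steps 2 and 3 can be sketched with the plain SAX API that ships with the JDK. The classes below are hypothetical stand-ins written for this illustration, not Tika's ElementMetadataHandler or TeeContentHandler: ElementToField copies the text of one element into a map of metadata fields, TextCollector gathers all text, and Tee forwards every event to each handler in turn.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** JDK-only sketch of teeing SAX events into text output and metadata fields. */
class PrescriptionFields {

    static final String XPD = "http://example.com/2011/xpd";

    /** Stores the text of one xpd element under a metadata field name. */
    static class ElementToField extends DefaultHandler {
        private final String localName;
        private final String field;
        private final Map<String, String> metadata;
        private StringBuilder text; // non-null only while inside the element

        ElementToField(String localName, String field, Map<String, String> metadata) {
            this.localName = localName;
            this.field = field;
            this.metadata = metadata;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (XPD.equals(uri) && localName.equals(local)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (text != null && XPD.equals(uri) && localName.equals(local)) {
                metadata.put(field, text.toString().trim());
                text = null;
            }
        }
    }

    /** Collects all character data, similar to a text-only output. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Forwards each event to every handler, in order. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] handlers;

        Tee(DefaultHandler... handlers) {
            this.handlers = handlers;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler h : handlers) h.startElement(uri, local, qName, atts);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler h : handlers) h.characters(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler h : handlers) h.endElement(uri, local, qName);
        }
    }

    /** Parses the document once, feeding the text handler and both field handlers. */
    static Map<String, String> parse(byte[] xml, DefaultHandler textHandler) throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new ByteArrayInputStream(xml), new Tee(
                textHandler,
                new ElementToField("doctor", "xpd:doctor", metadata),
                new ElementToField("patient", "xpd:patient", metadata)));
        return metadata;
    }
}
```

The document is read only once, yet the text output and the metadata map are both populated, which is the main reason for teeing the event stream instead of parsing twice.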

That wasn’t too hard, was it? Let’s move on to creating an entirely new parser class.

11.3.2. Writing a new parser

You probably guessed it already: we also need a way to parse the encrypted prescription documents. Since there’s currently no generic parser in Tika for encryption formats, we need to write a new one to be able to extract information from encrypted prescriptions.

We already have all the basic building blocks we need from previous examples, so the only thing left to do is to put those blocks together into a fresh new parser class. The result is shown in the following listing.

Listing 11.3. Parser class for encrypted prescription documents

You can probably tell what each part in this code does, but let’s still go through it so we don’t miss any details.

1.  Instead of implementing the Parser interface directly, we start by extending the AbstractParser base class. This simple class comes with default implementations for deprecated old methods so we don’t need to worry about them in our code.

2.  The main functionality goes into the parse() method whose behavior we covered in detail in chapter 5. Here we want to first decrypt the encrypted document stream and then pass the XML content to the extended XML parser we already created.

3.  Since we’re delegating detailed processing to another parser class, we don’t need to worry about producing XHTML output in this class. Otherwise we could use the XHTMLContentHandler utility class that we also covered in chapter 5.

4.  And like before, we implement the getSupportedTypes() method in this class to tell Tika about the kinds of documents it should be using this parser class for. The returned media type should match the type returned by the corresponding detector.
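
The "decrypt, then delegate" pattern from step 2 can be sketched with JDK classes alone. This is not a Tika Parser: a SAX DefaultHandler stands in for the delegate parser, and the encryption layout (a 16-byte IV followed by AES/CBC/PKCS5Padding ciphertext) is an assumption made for the example, as is the EncryptedPrescriptionReader name.

```java
import java.io.DataInputStream;
import java.io.InputStream;
import java.security.Key;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** JDK-only sketch of "decrypt, then delegate"; not a Tika Parser. */
class EncryptedPrescriptionReader {

    /** Assumed layout: a 16-byte IV followed by AES/CBC/PKCS5Padding ciphertext. */
    static void parse(InputStream encrypted, Key key, DefaultHandler handler)
            throws Exception {
        // Read exactly the IV bytes; DataInputStream does no extra buffering
        byte[] iv = new byte[16];
        new DataInputStream(encrypted).readFully(iv);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));

        // The wrapped stream yields plain XML, so the XML-level
        // handler needs no knowledge of the encryption
        InputStream decrypted = new CipherInputStream(encrypted, cipher);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(decrypted, handler);
    }
}
```

Because the decryption is applied as a stream wrapper, the whole document never needs to be held in memory, and all format-specific logic stays in the delegate.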

Now we have two new parser classes: one for encrypted and one for unencrypted digital prescriptions. We still need to tell Tika to use these parsers, which is what we’ll do next.

11.3.3. Plugging in new parsers

Parser plugins are just like new detectors, in that Tika by default uses the service provider mechanism to load all available implementations from the classpath. To tell Tika about your two new parsers, you need to place the compiled classes into a JAR file together with a META-INF/services/org.apache.tika.parser.Parser file that lists the fully qualified names of these two classes on separate lines. When you include that JAR in your classpath, Tika will automatically start using these new parsers.

JAR archives like this are an easy way to extend Tika. For example, you can put all the code from this chapter into a new tika-xpd-1.0.0.jar file together with the two service provider files in META-INF/services. Then you’ll have a complete Tika plugin that you can easily use to enable support for digital prescriptions in any system that uses Tika for metadata and content extraction.
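
As a rough illustration, such a plugin JAR might be laid out as follows. The com.example.xpd package and the two parser class names are hypothetical, since the chapter doesn't name them; only the service file names and the EncryptedPrescriptionDetector class name come from the text above. The indented lines under each service file show that file's contents.

```
tika-xpd-1.0.0.jar
  META-INF/services/org.apache.tika.detect.Detector
      com.example.xpd.EncryptedPrescriptionDetector
  META-INF/services/org.apache.tika.parser.Parser
      com.example.xpd.PrescriptionParser
      com.example.xpd.EncryptedPrescriptionParser
  com/example/xpd/EncryptedPrescriptionDetector.class
  com/example/xpd/PrescriptionParser.class
  com/example/xpd/EncryptedPrescriptionParser.class
```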

But what happens if two parsers both claim to support the same media type (such as application/x-prescription+xml) and you want yours to override the existing one?

11.3.4. Overriding existing parsers

When you have two Parsers that both claim to support the same type, a few lines of code can ensure that Tika selects the Parser you want, as shown next.

Listing 11.4. Overriding Parsers in Tika

In listing 11.4, you first create an instance of MyCustomPrescriptionParser. This is the Parser that you’d like Tika to call instead of the default parser for the application/x-prescription+xml type. Then, to link that Parser to the media type, you decorate it with ParserDecorator and create an AutoDetectParser instance with the decorated MyCustomPrescriptionParser as the first Parser in the list provided to the constructor. The combination of the ParserDecorator and the ordered set of Parsers passed to the AutoDetectParser constructor helps ensure that, no matter which MIME types MyCustomPrescriptionParser itself claims to handle, it’ll be selected as the Parser for the application/x-prescription+xml type.

That’s a lot of functionality packed into a few small classes and some lines of configuration, and a good place to end our coverage of how to extend Tika. It’s time to summarize what we’ve learned here.

11.4. Summary

In this chapter we learned about a digital prescription document format, which despite being fictional is a good example of the kinds of new document formats that are being developed and used every day. Being able to easily detect, index, and search such documents is often an important requirement. Tika and the tools it integrates with can be a major help in implementing such requirements. You only need to extend Tika to understand such new document formats, which is what we’ve done in this chapter.

After briefly explaining our example document format, we looked at how to add information about that type into Tika’s media type database along with basic type detection details. Then, we covered more complex detection strategies by writing a custom Detector class. Finally, we implemented two custom Parser classes, one extended and one standalone, to allow Tika to extract text and metadata from our example documents. All this functionality was wrapped into a simple JAR archive that can be used as a drop-in plugin to extend the capabilities of any Tika-enabled system.

This concludes the third part of this book. By now you should know pretty much everything there is to know about Tika and should be able to start using it even in complex ways in your own applications. In the next part of this book we’ll discuss how others are using Tika.