Chapter 4. Document type detection – Tika in Action

 

This chapter covers

  • Introduction to MIME types
  • Working with MIME types in Tika
  • Identifying file formats

 

Let’s talk about taxonomy. Taxonomy is the science of classification. Taxonomies are used to identify and classify concepts in order to better understand them and to have a shared vocabulary for describing things. For example, the Linnaean taxonomy[1] is the classical system of naming all biological organisms using two-part Latin names that identify both the genus, or category, and the specific species within that category. The term Homo sapiens identifies the modern human species as a member of the genus Homo, the group of human-like species that also includes the extinct Homo neanderthalensis. A similar taxonomy, called the internet media type system, is used to identify digital document formats.

1 Carl Linnaeus, a famous Swedish scientist, wrote Systema Naturae in 1735, in which he describes and categorizes plants, animals, and minerals. The seminal work was one of the first widely known uses of rank-based classification, in which certain categories can be ranked higher or lower than others. In Linnaeus’s taxonomy, plants, animals, and minerals are first ranked by class, then by order, then by genus, and finally by species. Relating back to this chapter, the Internet Assigned Numbers Authority’s (IANA’s) classification of internet media types mentioned in section 1.1.1 is a modern example of a rank-based classification system. MIME types are broken down into top-level categories, then specialized as subtypes within those categories.

Figure 4.1. Table of the Animal Kingdom (Regnum Animale) from an early 1735 edition of Carolus Linnaeus’s Systema Naturae. This and Linnaeus’s other seminal book, Species Plantarum, laid the groundwork for most of the biological nomenclature in use today. A similar classification of types can also be found in the internet media type system.

Taxonomies are often associated with ways of identifying or detecting specific things. For example, biological taxonomies come with details such as descriptions of the appearance of species, their behavior or growth patterns, or ultimately their DNA structure as ways to identify the species of any single animal or plant. Similar mechanisms exist for detecting formats of digital documents.

In this chapter, we’ll dive deep into the taxonomy of document formats and explain how to use the taxonomy and other mechanisms to determine a document’s true classification. The first stop on our journey is an introduction to the internet media type system and how media types are handled by Tika. Then, we’ll look at the different type detection mechanisms that are included in Tika. Finally, we’ll put these things together in a simple example application to give you a feel for using Tika’s document type detection system.

4.1. Internet media types

As you may remember from section 1.1.1, the internet media type system documented in RFC 2046 is the best available standard for identifying document types. Media types (or MIME types, as they’re often called based on the Multipurpose Internet Mail Extensions (MIME) standard that defined the concept) play a crucial role in the underlying interactions whenever you browse the web or read your email. In short, MIME types make the right applications run on your computer whenever you interact with a particular file. For example, have you ever wondered how your browser knows that when it encounters a QuickTime movie, rather than displaying the movie as binary or text content in your browser, it should load up your QuickTime player and start playing the file?

Most browsers either explicitly (as shown in figure 4.2 demonstrating Firefox’s media type to application mapping) or implicitly have to understand the underlying media type of a file, and then know what to do with it. Without an understanding of media types on the internet and their associated applications, your internet browsing experience would still be composed mostly of plain ASCII text, which wouldn’t be much fun at all.

Figure 4.2. The document media type to application mapping from Mozilla Firefox. This panel can be brought up on a Mac by clicking on the Firefox menu, then selecting Preferences, then clicking on the Applications tab (note: this sequence depends on the operating system used, but is likely similar for platforms other than Mac). Each listed media type is mapped to one or more handler applications, which Firefox tries to send the content to when it encounters the document on the internet.

We’re ready to dive into the naming scheme for internet media types. After that, you’ll be introduced to the eight top-level internet media types, as defined by the seminal Internet Assigned Numbers Authority (IANA) media type registry (http://www.iana.org/assignments/media-types/index.html). We’ll briefly describe the IANA registry as well as a few other ones, and how Tika leverages the information present in any media type registry to accurately and reliably detect media types.

4.1.1. The parlance of media type names

The name of a media type consists of a type/subtype pair and an optional set of name=value parameters, as shown in figure 4.3. This two-level naming follows the Linnaean (and more generally rank-based) approach to building a taxonomy. The type/subtype part and the parameter names are restricted to a subset of printable US-ASCII characters and are always treated case-insensitively.

Figure 4.3. Railroad diagram of the syntax of media type names. See section 5.1 of RFC 2045 for the full details of the format.

The type/subtype part tells you the document format you’re dealing with, and the optional parameters add format-specific information needed to properly process the document. For example, the media type text/plain; charset=UTF-8 identifies a plain text document with Unicode characters encoded using the UTF-8 character encoding. Similarly, the image/jpeg type identifies an image stored in the JPEG/JFIF image format.
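
To make the naming rules concrete, here is a minimal standalone sketch that splits a media type name into its parts using only the JDK. The MediaTypeName class and its fields are hypothetical names invented for this illustration; this isn't Tika's parser, and it doesn't check the allowed character set described in RFC 2045.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: splits "type/subtype; name=value" into its parts. */
final class MediaTypeName {
    final String type;
    final String subtype;
    final Map<String, String> parameters;

    private MediaTypeName(String type, String subtype, Map<String, String> parameters) {
        this.type = type;
        this.subtype = subtype;
        this.parameters = parameters;
    }

    /** Returns the parsed name, or null if the input doesn't start with type/subtype. */
    static MediaTypeName parse(String text) {
        String[] parts = text.split(";");
        String[] names = parts[0].trim().split("/");
        if (names.length != 2 || names[0].isEmpty() || names[1].isEmpty()) {
            return null;
        }
        Map<String, String> parameters = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                // Parameter names are case-insensitive; values are kept as written.
                String name = parts[i].substring(0, eq).trim().toLowerCase(Locale.ROOT);
                if (!name.isEmpty()) {
                    parameters.put(name, parts[i].substring(eq + 1).trim());
                }
            }
        }
        // Type and subtype are case-insensitive, so normalize them to lowercase.
        return new MediaTypeName(names[0].toLowerCase(Locale.ROOT),
                names[1].toLowerCase(Locale.ROOT), parameters);
    }

    public static void main(String[] args) {
        MediaTypeName name = parse("Text/Plain; Charset=UTF-8");
        System.out.println(name.type + "/" + name.subtype + " " + name.parameters);
    }
}
```

Because the sketch insists on the type/subtype part coming first, a reversed name such as charset=utf-8; text/html is simply rejected with null rather than repaired.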

The Odd Box When dealing with lots of documents and media types, you’re bound to encounter some abnormal cases sooner or later. A common mistake is to reverse the order of the parts in a media type name, for example, charset=utf-8; text/html. A toolkit such as Tika shields your application from having to deal with the complexities of such anomalies.

Now that we know what a media type looks like, it’s natural to ask what kinds of types are being used out there and how the set of known media types is managed. Read on to find out.

4.1.2. Categories of media types

There are currently eight official top-level types as shown in table 4.1, and thousands of registered or otherwise known subtypes. Similar to Linnaeus’s animal taxonomy, these top-level types form the basis for classifying and organizing a taxonomy of internet media types.

Table 4.1. Officially specified top-level media types by IANA. These types form the basis for a detailed classification framework of available document types. Children are allowed for each top-level type, indicating some specialization of the parent (a more specific schema, a slightly different encoding format, and so on).

Top-level type

Description

text/* Text-based documents such as HTML (text/html) and Cascading Style Sheets (CSS, text/css) files, comma-separated values data (CSV, text/csv), and unformatted plain text (text/plain). All text documents are processed primarily as characters instead of as bytes, so a text media type is often accompanied by a charset parameter that identifies the character encoding used in a specific document.
image/* Image formats such as JPEG (image/jpeg) and Portable Network Graphics (PNG, image/png). Most image documents share some basic characteristics like image size and resolution, color space and depth, and compression ratio (including whether the used image compression is lossy). All of this information is normally embedded within the image document in a format-specific way, so media type parameters are usually not used or needed for image types.
audio/* Music and other audio formats such as MP3 (audio/mpeg) and Ogg audio (audio/ogg). There are also many audio formats that are designed for things like internet telephony and are usually used for transmitting rather than storing audio.
video/* Video formats such as QuickTime (video/quicktime) and Ogg video (video/ogg). Typical characteristics of video formats are frame rate and size, and the possible inclusion of synchronized audio and text tracks.
model/* File formats for expressing physical or behavioral models in various domains. The best-known example is the Virtual Reality Modeling Language (VRML, model/vrml) format used to express 3D models.
application/* Application-specific document formats that don’t necessarily fit any of the other top-level categories. Well-known examples include PDF (application/pdf) and Microsoft Word (application/msword) documents. The generic application/octet-stream type is used as a fallback for any documents whose exact type is unknown (the document can only be processed as a stream of bytes).
message/* Email and other message types sent over the internet and other networks.
multipart/* Container formats for multiple consecutive, alternative, or otherwise related component documents. Like message/* types, multipart/* documents are normally used for messages transmitted over the network, whereas packaging formats like Zip archives (application/zip) are categorized as application types.

In addition to the official top-level types, there’s a reserved example/* category for use only in examples. Some experimental applications may also use unregistered top-level types of the format x-*/*, though more frequently you see applications using unregistered subtypes with names that match formats like application/x-* or image/x-*.

As media types are identified, they need to be persisted in some manner so that others can look up their definitions and understand their relationships. Media types are stored in a media type registry for this purpose. There are a few canonical media type registries, so before you go out and try creating your own, it’s worth understanding some of the existing registries of media types, including the largest, most comprehensive source, the IANA registry.

4.1.3. IANA and other type registries

Among its other responsibilities, the Internet Assigned Numbers Authority maintains a list of officially registered media types. This list is publicly available on the web at http://www.iana.org/assignments/media-types/, and anyone may register new types by following the procedure described in RFCs 4288 and 4289.

There are hundreds of officially registered types, and more are constantly being added. Besides being one of the largest and best-maintained media type registries in existence, the IANA registry is significant because the media types defined in it are of high quality, both in terms of the number of relationships captured (parent and child types) and in terms of the peer-reviewed attributes captured for each type (magic byte patterns, file extensions, and so forth). IANA is a well-respected internet standards body, with many data curators responsible for ensuring that the information captured in its registries is useful to its consumers.

There are also many widely adopted types that haven’t been officially registered and thus haven’t been as extensively vetted by the broader community. Information about such types may at times be hard to come by, may require searching through both online and offline resources, and may also require vetting of misleading or even incorrect information. A few websites, such as http://filext.com, http://file-extension.net/, and TrID (http://mark0.net/onlinetrid.aspx), maintain huge file format databases that often provide the best hints about unknown media types that you may encounter, or at least have information that may not be present in the higher-quality, harder-to-get-into registries (like IANA). Unfortunately, such information is often incomplete or contradictory. Luckily, Tika addresses some of these problems in three ways: it combines information from multiple existing media type registries; it lets you add and curate media types in a well-known format, XML, which has excellent tool support; and it adopts a comprehensive specification for representing media types, which makes them easy to compare, extend, and manage. Now that we’ve covered the basics, let’s take a deep dive into Tika’s techniques for taming the complexity of media types.

4.2. Media types in Tika

Media types are the basic atomic building blocks of interaction with files and your computer’s software—they tell your computer what applications to associate with what files. Detecting media types accurately and reliably is of the utmost importance, and something Tika happens to excel at (no pun intended).

Now that you know a bit about the hassle of dealing with media types (the eight top-level types and their countless children, how types are named and classified, and the registries of varying quality in which they’re stored), it’s time we told you how Tika simplifies all of this.

First, Tika maintains a rich, easy-to-update, easy-to-understand MIME database internal to the project, reducing external dependencies to existing registries. Second, Tika provides Java API and class-level support for interacting with the Tika MIME database, exposing management APIs for the database but also exposing all sorts of methods of media type detection (by magic byte patterns, file extensions, and so on) that we’ll cover later in the chapter. The methods for media type detection are entirely driven by the richness of the underlying Tika MIME database that we’ll explain in this section ad nauseam. Read on!

Alert: Source Code Ahead Before getting too deep into the source code examples and MIME-info database in this chapter, we’d like to remind you to refresh your memory regarding working with the Tika source code and building the Tika codebase by reviewing section 2.1.

The Tika project maintains its own media type registry that contains both official IANA-registered types and other known types that are being used in practice. The Tika type registry also keeps track of associated information such as type relationships and key characteristics of the file formats identified by the media types. This section covers the basics of this registry and the key classes you can use to access the included type information.

4.2.1. The shared MIME-info database

Unix environments have traditionally had no standard way of sharing document type information among applications. This was a problem for popular open source desktop environments such as Gnome and KDE that are distributed with Linux. These environments strive to make the user experience more consistent with standard icons and program associations for all document types, akin to their commercial counterparts (Windows or the Mac desktop environment). To manage such document type information in a platform-independent manner, they came up with the Shared MIME-info Database specification (http://mng.bz/7Ylh), which among other things defines an XML format for media type information. This format, shown next, is also used by Tika.

Listing 4.1. Basic MIME-info database file

A mime-info file contains a sequence of mime-type records that each describe a single media type. A type record specifies the official name of the type as well as any known aliases. For example, many officially registered media types are also known by experimental x-* names that predate the official type registration. A type record can also contain informal type names that are frequently used in human communications. Just like most people would call a domestic cat a cat rather than a “member of the Felis catus species,” a term like PDF document is usually preferred to the more technically accurate application/pdf in informal language.

Capturing the media types in the mime-info file (called tika-mimetypes.xml in Tika’s source) provides a single point of access for managing Tika’s knowledge about media types. Tika ships with a rich, well-curated mime-info file, but nothing prevents you from adding to or removing from it to suit your needs. Just make sure that you try to fill in as much of the information shown in listing 4.1 as you can; it’ll help Tika to detect the right file type, and your programs and operating systems to map that file to the right application.
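
To get a feel for what a type record holds, the following sketch reads a tiny mime-info document with the JDK’s built-in XML parser and prints the type name, its aliases, and its glob patterns. It’s only an illustration under stated assumptions: the SAMPLE record is hand-written for this example rather than copied from tika-mimetypes.xml, it omits the XML namespace that real mime-info files declare, and the MimeInfoReader class is a hypothetical name. Tika ships its own loader for this format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for a single mime-type record. */
final class MimeInfoReader {
    // Hand-written sample record (an assumption for illustration only).
    static final String SAMPLE =
          "<mime-info>"
        + "<mime-type type=\"application/pdf\">"
        + "<alias type=\"application/x-pdf\"/>"
        + "<comment>PDF document</comment>"
        + "<magic priority=\"50\">"
        + "<match value=\"%PDF-\" type=\"string\" offset=\"0\"/>"
        + "</magic>"
        + "<glob pattern=\"*.pdf\"/>"
        + "</mime-type>"
        + "</mime-info>";

    /** Parses the XML and returns its first mime-type element. */
    static Element firstMimeType(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (Element) document.getElementsByTagName("mime-type").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse mime-info XML", e);
        }
    }

    /** Collects one attribute from every descendant element with the given tag name. */
    static List<String> attributes(Element parent, String tag, String attribute) {
        List<String> values = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(((Element) nodes.item(i)).getAttribute(attribute));
        }
        return values;
    }

    public static void main(String[] args) {
        Element record = firstMimeType(SAMPLE);
        System.out.println("type:    " + record.getAttribute("type"));
        System.out.println("aliases: " + attributes(record, "alias", "type"));
        System.out.println("globs:   " + attributes(record, "glob", "pattern"));
    }
}
```

The same record carries everything the later detection mechanisms rely on: the glob element drives filename matching, and the magic element describes the byte pattern to look for inside the file.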

Before going further into all the detailed type information that can be included in a mime-info database, let’s first take a look at how you can access the recorded type information using Tika’s APIs.

4.2.2. The MediaType class

Tika uses the MediaType class to represent media types. Instances of this class are immutable and contain only the media type’s type/subtype pair and optional name=value parameters. The type and parameter names are all normalized to lowercase and the MediaType class supports the standard Java object equality and order comparison methods for easy use in all kinds of data structures. The class is depicted visually in the Unified Modeling Language (UML) notation in figure 4.4.

Figure 4.4. Basic UML class diagram that summarizes the key features of the MediaType class. The class implements both the Comparable and Serializable standard Java interfaces. The type name, its subtype, and the associated type parameters are all available through getter methods, and the MediaType can be serialized to human-readable form by calling the toString method.

The static MediaType.parse(String) method is used to turn media type strings such as text/plain; charset=UTF-8 into MediaType instances. The type parser is flexible and tries to return a valid media type even for malformed inputs, but will return null if passed a string like “this is not a type” that simply can’t be interpreted as a media type.

The following example shows how to use the key methods of the MediaType class. Full details of the class can be found in the API documentation on the Tika website:

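import java.util.Map;
import org.apache.tika.mime.MediaType;

// Parse a media type string into an immutable MediaType instance
MediaType type = MediaType.parse("text/plain; charset=UTF-8");

System.out.println("type:    " + type.getType());
System.out.println("subtype: " + type.getSubtype());

// Print each name=value parameter attached to the type
Map<String, String> parameters = type.getParameters();
System.out.println("parameters:");
for (Map.Entry<String, String> entry : parameters.entrySet()) {
    System.out.println(" " + entry.getKey() + "=" + entry.getValue());
}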

Individual MediaType instances don’t do much, but they form the basis for higher-level concepts such as the MediaTypeRegistry class we’ll encounter in the next section.

4.2.3. The MediaTypeRegistry class

The type information included in mime-info XML databases and other sources can be accessed through the MediaTypeRegistry class. As the name indicates, an instance of this class is a registry of media types and related information. The MediaTypeRegistry class and its important features are described in figure 4.5.

Figure 4.5. UML class diagram that summarizes the key features of the MediaTypeRegistry class. The class allows the set of loaded MediaType object instances to be returned as a SortedSet, and allows a user to obtain a SortedSet of aliases belonging to a particular MediaType.

Tika contains a fairly extensive media type database that you can access using the static MediaTypeRegistry.getDefaultRegistry() method. The following example uses this method to print out all the media types and type aliases known to Tika. That’s more than a thousand types!

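import java.util.Set;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MediaTypeRegistry;

MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();

// Walk every known type and list the aliases registered for it
for (MediaType type : registry.getTypes()) {
    Set<MediaType> aliases = registry.getAliases(type);
    System.out.println(type + ", also known as " + aliases);
}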

Now that we’ve studied the media type registry, we’ll show you how the power and flexibility of Tika’s media type detection mechanism is driven by the richness of the information captured in that registry (aka the MIME database, the mime-info file, and the other names we’ve given it so far). In other words, the more accurate, complete, and easy to update Tika’s media type registry is, the better the software that builds on Tika can pick the right application to handle the files you encounter.

A key part of the richness of the media type registry is the notion of media type hierarchies. Type hierarchies tell your applications things like the fact that application/xml is a specialization of plain text (text/plain), so an XML document can be viewed in a text editor rather than in something like QuickTime, which is meant for viewing movies.

4.2.4. Type hierarchies

Many media types are based on a more generic format. For example, all text/* types like text/html are supposed to be understandable even if treated as plain text, like when using the View Source feature included in most web browsers. It’s thus accurate to say that text/html is a specialization of the more generic text/plain type.

These kinds of type hierarchies (parent-child relationships, or specializations) are different from the type/subtype categorization encoded in the standard internet media type system. Even though text/plain can be seen as a supertype of all text/* types, there’s no similar generic format for all image/* types. In fact the Scalable Vector Graphics (SVG, image/svg+xml) format is based on XML (application/xml) and thus SVG images can also be processed as XML documents, the gist of which is demonstrated in figure 4.6. Such type relationships are often indicated with a name suffix like +xml. For example, the Electronic Publication (Epub, application/epub+zip) format used by many electronic books is actually a Zip archive (application/zip) with some predefined content.

Figure 4.6. Four levels of type hierarchy with the image/svg+xml type. The SVG image can be processed either as a vector image, as a structured XML document, as plain text, or ultimately as a raw sequence of bytes.

Tika has built-in knowledge about handling text types and types with name suffixes such as +xml and +zip. Tika also knows that ultimately all documents can be treated as raw application/octet-stream byte streams. But more specific type hierarchy information needs to be explicitly encoded in the type database using sub-class-of elements. For example, the record for Apple Keynote presentations (application/vnd.apple.keynote) can declare <sub-class-of type="application/zip"/> to mark the format as a specialization of Zip.

This kind of type hierarchy information is highly useful when trying to determine how a particular document can best be processed. For example, even if you don’t have the required tools to process Keynote presentations, you may still be able to extract some useful information about the presentation by looking at the contents of the Keynote Zip archive.

Tika supports such use cases by making type hierarchy information easily available through the getSupertype() and isSpecializationOf() methods of the MediaTypeRegistry class. The former Java API method returns the closest supertype of a given media type (or null if the given type happens to be application/octet-stream), whereas the latter method checks whether a given type is a specialization of another more generic type. The use of the getSupertype() method is illustrated next:

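import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MediaTypeRegistry;

MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();

// Print the type and then each of its supertypes, from most to least specific
MediaType type = MediaType.parse("image/svg+xml");
while (type != null) {
    System.out.println(type);
    type = registry.getSupertype(type);
}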
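
If you’re curious how the built-in rules and the explicit sub-class-of declarations might fit together, here’s a rough standalone sketch of the lookup order. The TypeHierarchy class and its methods are hypothetical names made up for this illustration, and the rule order (explicit declaration first, then name suffix, then text/*, then the byte-stream fallback) is our assumption rather than a description of Tika’s actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the hierarchy part of a media type registry. */
final class TypeHierarchy {
    // Explicit parent declarations, playing the role of sub-class-of entries.
    private final Map<String, String> declaredParents = new HashMap<>();

    void declare(String type, String parent) {
        declaredParents.put(type, parent);
    }

    /** Returns the closest supertype, or null for application/octet-stream. */
    String supertypeOf(String type) {
        if (type.equals("application/octet-stream")) {
            return null;                       // top of the hierarchy
        }
        String declared = declaredParents.get(type);
        if (declared != null) {
            return declared;                   // explicit declaration wins
        }
        if (type.endsWith("+xml")) {
            return "application/xml";          // name suffix rule
        }
        if (type.endsWith("+zip")) {
            return "application/zip";          // name suffix rule
        }
        if (type.startsWith("text/") && !type.equals("text/plain")) {
            return "text/plain";               // all text types are plain text
        }
        return "application/octet-stream";     // everything is a byte stream
    }

    /** True if the candidate parent appears anywhere above the given type. */
    boolean isSpecializationOf(String type, String candidateParent) {
        for (String t = supertypeOf(type); t != null; t = supertypeOf(t)) {
            if (t.equals(candidateParent)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TypeHierarchy hierarchy = new TypeHierarchy();
        hierarchy.declare("application/xml", "text/plain");
        for (String t = "image/svg+xml"; t != null; t = hierarchy.supertypeOf(t)) {
            System.out.println(t);
        }
    }
}
```

With the single declaration shown in main, the loop walks the same four levels as figure 4.6: image/svg+xml, application/xml, text/plain, and application/octet-stream.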

That’s all there is to say about media types themselves. Let’s move on to figuring out how you can tell the media type of any given file or document, using what we’ve covered about how media types are captured and represented. Even armed with this huge knowledge base of media type information, detecting the media type of a given file can be more complicated than you might expect. Tika simplifies this for you, and we’ll show you how.

4.3. File format diagnostics

Biologists use details such as the shapes of leaves to detect different species of trees, and the color and patterns of feathers for bird species. Similarly, a researcher of document formats can use characteristic features of digital documents to detect the media types of those documents. This section is a guidebook for such a researcher, and provides you a full range of tools for detecting even the most unusual types of documents.

We’ll begin with filename glob patterns, one of the simplest and most widely used methods for media type detection. We’ll then cover content-type hints, magic bytes, and character encodings, which together form a comprehensive set of digital fingerprinting techniques provided by Tika for identifying file types. The latter part of the chapter contains advanced techniques such as exploiting the structure of XML or combining both filename patterns and digital fingerprinting for detecting the underlying type of a file.

This entire section builds on the guidebook and information recorded and made available by Tika and its media type registry. Think of that registry as the biologist’s sketchbook and additional literature that provide the necessary hints to make the species identification for the leaves that they’re examining. See table 4.2 for a roadmap of the different detection types we’ll cover in this chapter.

Table 4.2. Methods for detecting the type of a file using Tika. The methods build on top of the media type information curated in the Tika media type registry.

Method

Good for

Covered in

File globs Well-known file types, with common extensions like *.txt, *.png. Section 4.3.1
Content-type hints When an application will touch a file before you do, and when it correctly identifies the right type (sometimes it won’t!). Section 4.3.2
Magic bytes The general case. This approach works in most cases because nearly all file types have a unique digital fingerprint. Section 4.3.3
Character encodings When an odd charset was used and can be exploited in a fashion similar to magic bytes as a digital fingerprint. Section 4.3.4
Other mechanisms If you’re dealing with XML, whose digital fingerprint isn’t always unique, but also whose schema can give away the underlying type. Section 4.3.5
Combined approaches In the most general case, as it combines the best capabilities of all of the underlying approaches. Section 4.3.5

4.3.1. Filename globs

The simplest and most widely used mechanism for detecting file formats is to look at the filename. Most modern operating systems and applications use filename extensions such as .txt or .png to indicate the file type, even though this is mostly an informal practice with few guarantees that the extension actually matches the format of a file.

April Fools’ You can easily trick your computer using the concept of file extensions. For example, independent of whether you’re using Windows or Mac, try taking an image file and changing its extension to .txt. Now, double-click on the file. What happened? More than likely your computer tried to open the image file in a text editor program. It based this decision on the file extension, which is as easy to change as the filename. Some modern operating systems try to use more information than the file extension to decide what application to open the file with.

Table 4.3 lists the name extensions of some of the more popular file formats.

Table 4.3. Popular file formats and their filename extensions

Extension

File format

Media type

.txt Text document text/plain
.html HTML page text/html
.xls Microsoft Excel spreadsheet application/vnd.ms-excel
.jpg JPEG image image/jpeg
.mp3 MP3 audio audio/mpeg
.zip Zip archive application/zip

The practice of using filename extensions dates back some 40 years, to when the operating systems of computers built by the Digital Equipment Corporation (DEC) started splitting filenames into a base name and a type extension. This practice was adopted by other vendors, including Microsoft, which popularized the 8.3 filename format in its Disk Operating System (DOS) and early versions of Windows. Modern versions of Microsoft Windows no longer limit the filename length (in reality they limit the length of the file path), but filename extensions are still used to determine which application should be used to process a file. Modern Mac OS and Unix systems handle filenames similarly.

In addition to filename extensions, there are also some more specific filenames and filename patterns that can be used to identify the type of a file. For example, many software projects contain text files such as README, LICENSE, and Makefile without any filename extensions. Unix systems also widely use textual configuration files whose names match the filename pattern .*rc (where * signifies any sequence of characters).

These and hundreds of other known filename patterns and extensions are included in Tika as <glob pattern="..."/> entries in the media type registry described in section 4.2. For example, here’s how Tika represents the various file extensions typically used by C and C++ source files.

Listing 4.2. C and C++ filename patterns
<mime-type type="text/x-c">
  <glob pattern="*.c"/>
  <glob pattern="*.cc"/>
  <glob pattern="*.cxx"/>
  <glob pattern="*.cpp"/>
  <glob pattern="*.h"/>
  <glob pattern="*.hh"/>
  <glob pattern="*.dic"/>
  <sub-class-of type="text/plain"/>
</mime-type>

If you use file formats with specific extensions or filename patterns that Tika doesn’t already know about, you can extend Tika with this information by modifying the tika-mimetypes.xml configuration file present in tika-core (recall section 2.3.7). Adding information to this file is as simple as pulling up the XML file in your favorite editor, copying and pasting an existing media type block, and then modifying the copy to describe your new type.
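
The idea behind glob-based detection fits in a few lines of plain Java. The sketch below is a simplified illustration, not Tika’s implementation: the GlobDetector class is a hypothetical name, only the * wildcard is supported, and matching filenames case-insensitively is our own assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical detector that maps filename globs to media types. */
final class GlobDetector {
    // Compiled glob patterns in registration order, each mapped to a media type.
    private final Map<Pattern, String> globs = new LinkedHashMap<>();

    /** Registers a glob such as "*.txt" or "README"; '*' matches any run of characters. */
    void register(String glob, String mediaType) {
        String[] literals = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");                    // each '*' becomes a wildcard
            }
            regex.append(Pattern.quote(literals[i]));  // everything else is literal
        }
        globs.put(Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE), mediaType);
    }

    /** Returns the type of the first matching glob, or the generic fallback type. */
    String detect(String filename) {
        for (Map.Entry<Pattern, String> entry : globs.entrySet()) {
            if (entry.getKey().matcher(filename).matches()) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        GlobDetector detector = new GlobDetector();
        detector.register("*.c", "text/x-c");
        detector.register("*.cpp", "text/x-c");
        detector.register("*.txt", "text/plain");
        detector.register("README", "text/plain");
        for (String name : new String[] {"main.cpp", "README", "photo.xyz"}) {
            System.out.println(name + " -> " + detector.detect(name));
        }
    }
}
```

Note that the detector never opens the file. That’s what makes this approach fast, and also what makes it easy to fool by renaming a file.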

Next, we’ll study how to determine a file’s media type leveraging information besides just the file extension, including leveraging hints inside the file.

4.3.2. Content type hints

Sometimes a document’s filename isn’t available, or the name lacks a type extension. This is common when the document is stored in a database, accessed over the network, or included as an attachment in another document. In such cases it’s typical for the document to be associated with some external type information, most often an explicit media type.

For example, the HTTP protocol used by web browsers to request HTML pages and other documents from web servers specifies a Content-Type header that a server is expected to add to its response whenever it returns a document to the client.

Another example of this situation is when some application sets the Content-type metadata of a file, as when Microsoft Word saves a Word document for you. In these cases, regardless of the underlying extension that the application saves the Word document with (it could be named myfile.foo for all that it matters), Microsoft Word has still provided a hint to any other software that tries to detect the file’s type.

We can exploit this information as part of our toolbelt, and we should when possible. But sometimes, this information isn’t set, and even when it is, we’re still not absolutely sure how much we can trust these content type hints, so Tika still goes the extra mile and employs more advanced techniques, such as magic byte detection.

4.3.3. Magic bytes

Filename extensions and other content type hints are usually fairly accurate, but there’s no guarantee of that. In some cases such external information is either not available or is incorrect, so the only way to determine the type of a document is to look inside it and try to detect the document type based on its content.

A file format is just that: a format for expressing information in a file. Almost all file formats have some characteristic features or patterns that can be detected when looking at a file’s raw byte contents. Many formats even include a magic byte prefix that’s designed to accurately identify the file format. For example, the contents of GIF images always start with the ASCII characters GIF87a or GIF89a depending on the version of the GIF format used. More such magic byte patterns of common file formats are listed in table 4.4.

Table 4.4. Magic byte patterns in popular file formats. Some of the patterns are represented as plain ASCII text, whereas others are shown in their hexadecimal equivalent.

Magic bytes          File format                Media type
%PDF- (ASCII)        PDF document               application/pdf
{\rtf (ASCII)        Rich Text Format           text/rtf
PK (ASCII)           Zip archive                application/zip
FF D8 FF (hex)       JPEG image                 image/jpeg
CA FE BA BE (hex)    Java class file            application/java-vm
D0 CF 11 E0 (hex)    Microsoft Office document  application/vnd.ms-excel, application/msword, etc.
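Prefix matching of this kind can be sketched in a few lines of plain Java. The MagicSniffer class below is a hypothetical illustration rather than Tika’s implementation: it hardcodes the GIF prefixes and the patterns from table 4.4, and falls back to application/octet-stream when nothing matches. For the OLE prefix it returns application/x-tika-msoffice, the generic type Tika uses for OLE containers, because the magic bytes alone can’t tell a Word document from an Excel one.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: maps a file's leading bytes to a media type. */
class MagicSniffer {

    // Patterns and types are parallel arrays: MAGIC[i] identifies TYPE[i]
    private static final byte[][] MAGIC = {
        ascii("%PDF-"),
        ascii("{\\rtf"),
        ascii("GIF87a"),
        ascii("GIF89a"),
        ascii("PK"),
        hex(0xFF, 0xD8, 0xFF),
        hex(0xCA, 0xFE, 0xBA, 0xBE),
        hex(0xD0, 0xCF, 0x11, 0xE0),
    };

    private static final String[] TYPE = {
        "application/pdf",
        "text/rtf",
        "image/gif",
        "image/gif",
        "application/zip",
        "image/jpeg",
        "application/java-vm",
        "application/x-tika-msoffice", // generic OLE container, needs further inspection
    };

    /** Returns the type of the first matching pattern, or the catch-all type. */
    static String sniff(byte[] prefix) {
        for (int i = 0; i < MAGIC.length; i++) {
            if (startsWith(prefix, MAGIC[i])) {
                return TYPE[i];
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] pattern) {
        if (data.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    private static byte[] hex(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return bytes;
    }
}
```

Only the first few bytes of a document are needed, so a caller can read a small prefix from a stream and pass it to sniff() without loading the whole file.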

Using magic bytes as a means for media type detection is great, but it’s only half of the problem. Another obstacle that presents itself is accurately identifying a file’s character encoding, often referred to as its charset. In the next section, we’ll explore this in detail.

4.3.4. Character encodings

After the complexities of detecting magic bytes, file extensions, and content type hints, you might assume that at least the handling of plain text files should be simple. If only! The big problem with text is that there are so many ways of representing it as bytes. These representations are called character encodings, and there are hundreds of different encodings in active use.

As discussed earlier, the text/plain media type is often accompanied by a charset parameter that indicates the character encoding used in a text document. But even when this information is available, it’s often incorrect. A better way of detecting the character encoding of a text document is clearly needed.

BOM Markers

The easiest way to detect a character encoding is to look for the optional byte order mark (BOM) used by Unicode encodings to indicate the order in which the encoded bytes are stored in the document. The Unicode character U+FEFF is reserved for this purpose and is included as the first character of an encoded Unicode stream. Table 4.5 shows how the BOM looks in the commonly used Unicode encodings.

Table 4.5. BOM in common Unicode encodings

Encoded BOM (hex)    Unicode encoding
EF BB BF             UTF-8
FE FF                UTF-16 (big endian)
FF FE                UTF-16 (little endian)
00 00 FE FF          UTF-32 (big endian)
FF FE 00 00          UTF-32 (little endian)

If the first few bytes of a document match a known BOM pattern, you can be fairly confident that you’re dealing with a text document in the character encoding indicated. Otherwise you’re out of luck, since few of the other character encodings use byte order marks, and there are no other easy markers to be relied on.
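A BOM check is easy to sketch in plain Java. The hypothetical BomSniffer class below isn’t taken from Tika; it compares the leading bytes against the five patterns listed above and returns a charset name, or null when no BOM is present. The only subtlety is the order of the checks: the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM, so the longer pattern has to be tested first.

```java
/** Hypothetical sketch: detects a Unicode encoding from a leading byte order mark. */
class BomSniffer {

    /** Returns the charset name indicated by the BOM, or null if there is no BOM. */
    static String detect(byte[] data) {
        // Four-byte patterns first: FF FE 00 00 would otherwise match UTF-16LE.
        // (UTF-16LE text that starts with U+0000 right after its BOM is
        // indistinguishable here; that case is assumed to be rare.)
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) {
            return "UTF-32BE";
        }
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) {
            return "UTF-32LE";
        }
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... bom) {
        if (data.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            // Mask to compare the unsigned byte value
            if ((data[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }
}
```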

Byte Frequency

The best approach to detecting the type and encoding of such documents is to look at the frequency of different bytes within, say, the first few kilobytes of the document. Plain ASCII text hardly ever contains control characters except newline and tab, and most other character encodings avoid using those bytes for normal text. So if you see many control bytes (characters with code < 32), you can assume that you’re not dealing with a plain text document.
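The control-byte test might look like the following sketch. The TextHeuristic class, the 4 KB sample size, and the 2 percent threshold are all assumptions chosen for illustration, not values taken from Tika. Note that UTF-16 and UTF-32 text contains many zero bytes, so a check like this belongs after the BOM test rather than before it.

```java
/** Hypothetical sketch: guesses whether a byte sample looks like plain text. */
class TextHeuristic {

    private static final int SAMPLE_SIZE = 4096;      // assumed sample size
    private static final int MAX_CONTROL_PERCENT = 2; // assumed threshold

    static boolean looksLikeText(byte[] data) {
        int sampled = Math.min(data.length, SAMPLE_SIZE);
        int control = 0;
        for (int i = 0; i < sampled; i++) {
            int b = data[i] & 0xFF;
            // Tab, newline, and carriage return are normal in text files
            if (b < 32 && b != '\t' && b != '\n' && b != '\r') {
                control++;
            }
        }
        // Integer arithmetic: control / sampled <= MAX_CONTROL_PERCENT / 100.
        // An empty sample contains no control bytes and passes.
        return control * 100 <= sampled * MAX_CONTROL_PERCENT;
    }
}
```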

If the document does look like plain text, you still need to determine the character encoding. There are a few tricks for detecting encodings such as UTF-8 that use easily recognizable bit patterns when encoding multibyte characters, and some character encodings never use certain byte values (for example, ASCII only uses the lowest seven bits).
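Both tricks can be sketched with the standard Java charset API. The hypothetical EncodingChecks class below tests for pure ASCII by looking at the high bit of each byte, and tests for well-formed UTF-8 by asking a strict decoder to report malformed sequences instead of silently replacing them. One caveat of this sketch: a sample cut off in the middle of a multibyte character is also reported as invalid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: cheap checks that rule encodings in or out. */
class EncodingChecks {

    /** True if no byte has its high bit set, that is, every byte is in 0..127. */
    static boolean isAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) { // a negative Java byte means the high bit is set
                return false;
            }
        }
        return true;
    }

    /** True if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Text that passes isAscii() is also valid UTF-8, so the two checks are best applied in that order: ASCII first, then UTF-8, and only then the more expensive statistical methods.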

Statistical Matching

After you’ve checked for easy matches and ruled out impossible alternatives, the last resort is to use statistical matching to determine which character encoding is most likely to produce the bytes and byte sequences in the input document. Many character encodings are associated with a specific language or a group of languages for which the encoding is particularly designed, so the frequency of encoded characters or character pairs can be used for a reasonably accurate estimate of the language and encoding used in a document.
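As a deliberately crude illustration of the statistical idea, consider telling two Cyrillic encodings apart. KOI8-R and windows-1251 place lowercase and uppercase Cyrillic letters in swapped byte ranges, so text decoded with the wrong one of the two comes out mostly in uppercase. The hypothetical CaseFrequencyGuesser below scores each candidate by how much lowercase outweighs uppercase in the decoded text and picks the best score. Real detectors rely on far richer per-language statistics; this sketch only assumes that natural text is mostly lowercase and that both charsets are available in the running JVM.

```java
import java.nio.charset.Charset;

/** Hypothetical toy: picks the candidate charset whose decoding looks most natural. */
class CaseFrequencyGuesser {

    /** Lowercase letters add to the score, uppercase letters subtract from it. */
    static int score(byte[] data, Charset charset) {
        String text = new String(data, charset);
        int score = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLowerCase(c)) {
                score++;
            } else if (Character.isUpperCase(c)) {
                score--;
            }
        }
        return score;
    }

    /** Returns the best-scoring candidate name; ties go to the earlier candidate. */
    static String guess(byte[] data, String... candidates) {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (String name : candidates) {
            int s = score(data, Charset.forName(name));
            if (s > bestScore) {
                bestScore = s;
                best = name;
            }
        }
        return best;
    }
}
```

A call such as CaseFrequencyGuesser.guess(bytes, "KOI8-R", "windows-1251") returns the name of the more plausible of the two encodings for a sample of lowercase-dominated Russian text.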

Tika’s MediaTypeRegistry implements all of the aforementioned detection mechanisms and allows you to leverage them in your application. In the next section, we’ll explore Tika’s final type detection mechanisms, including XML root detection.

4.3.5. Other mechanisms

Some document formats are based on more generic formats like Zip archives (application/zip), XML (application/xml), or Microsoft’s format for Object Linking and Embedding (OLE) or Compound File Binary File Format (MS-CFB: see http://mng.bz/gU1C) documents. Even if such a container format can be easily detected using magic bytes or other details, it may be difficult to determine if the format is used to host a more specific kind of document. The container format needs to be parsed to determine whether the content matches that of a more specific document type.

XML Format

The most notable of such formats is XML, which has been used for countless more-specific document types such as XHTML (application/xhtml+xml) and SVG (image/svg+xml). To detect the specific type of a given XML document, the root element of the document is parsed and then matched against known root element names and namespaces.
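Root element matching can be sketched with the StAX API that ships with Java. The hypothetical XmlRootSniffer class below isn’t Tika’s detector: it reads only as far as the first start tag, then maps the two namespace and element name pairs for XHTML and SVG to their media types, and treats any other well-formed root as generic application/xml. It returns null when the input can’t be parsed as XML.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: refines application/xml by inspecting the root element. */
class XmlRootSniffer {

    /** Returns a media type based on the root element, or null if the input isn't XML. */
    static String detect(InputStream in) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Don't process DTDs from untrusted documents
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    // Stop at the first start tag: that's the root element
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return classify(reader.getNamespaceURI(), reader.getLocalName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML; fall through
        }
        return null;
    }

    private static String classify(String namespace, String localName) {
        if ("http://www.w3.org/1999/xhtml".equals(namespace) && "html".equals(localName)) {
            return "application/xhtml+xml";
        }
        if ("http://www.w3.org/2000/svg".equals(namespace) && "svg".equals(localName)) {
            return "image/svg+xml";
        }
        return "application/xml";
    }
}
```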

OLE Format

Microsoft’s OLE format is another troublesome format to detect. Used by default by all Microsoft Office versions released between 1995 and 2003, many of which are still in production use, the OLE format is one of the most widely used document formats on the planet. The OLE format is essentially a miniature file system within a single file. Specifically named directories and file entries within such a file are used by specific programs, so the type of a document can be determined by looking at the directory tree of the OLE container. Unfortunately, the OLE format is somewhat complicated and requires random access to the document, which makes OLE type detection difficult for documents that are being streamed, for example, from a web server. Tika uses a best-effort approach for OLE detection that works pretty well in practice even within these constraints.

Combined Heuristics

These and other custom detectors are constantly being developed as Tika encounters new document formats that can’t be detected using one or more of the simpler mechanisms we discussed earlier.

That’s quite a load of different type detection mechanisms, and none of them promise to be absolutely accurate! Are we to announce defeat in the face of such complexity? Luckily the situation isn’t that bad, as many of the preceding approaches can be used independently to verify the results of another detection method. Then, by combining the various detection heuristics at hand, we can come up with a highly accurate estimate of the media type of almost all kinds of documents. And the best part is that Tika does this automatically for you. The next section shows you how.
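One way to picture this cross-checking is the following sketch, in which the content-based result is trusted most and the filename and content type hints are accepted only when they refine it. The CombinedDetector class and its small parent table are hypothetical simplifications made for illustration; they aren’t Tika’s actual data or code.

```java
import java.util.Map;

/** Hypothetical sketch: combines three detection results into one estimate. */
class CombinedDetector {

    private static final String OCTET_STREAM = "application/octet-stream";

    // Assumed child -> parent relations between media types
    private static final Map<String, String> PARENT = Map.of(
        "application/vnd.oasis.opendocument.text", "application/zip",
        "application/xhtml+xml", "application/xml",
        "image/svg+xml", "application/xml",
        "application/xml", "text/plain",
        "text/html", "text/plain");

    /** True if candidate is a more specific kind of base. */
    static boolean isSpecialization(String candidate, String base) {
        if (base.equals(OCTET_STREAM)) {
            return true; // every type refines the catch-all type
        }
        for (String t = PARENT.get(candidate); t != null; t = PARENT.get(t)) {
            if (t.equals(base)) {
                return true;
            }
        }
        return false;
    }

    /** Any argument may be null when that source of information is missing. */
    static String combine(String fromMagic, String fromName, String fromHint) {
        String type = fromMagic != null ? fromMagic : OCTET_STREAM;
        if (fromName != null && isSpecialization(fromName, type)) {
            type = fromName; // the extension narrows down the content-based guess
        }
        if (fromHint != null && isSpecialization(fromHint, type)) {
            type = fromHint; // so does an external content type hint
        }
        return type;
    }
}
```

With this ordering, a Zip archive named report.odt is reported as an OpenDocument text, whereas a PDF file mislabeled as text/html is still reported as application/pdf, because the hint contradicts the content instead of refining it.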

4.4. Tika, the type inspector

As you may remember from chapter 2, the Tika facade class has a detect() method that returns the detected media type of a given document. The SimpleTypeDetector class shows how this works in practice.

Listing 4.3. Simple type detector example
import java.io.File;

import org.apache.tika.Tika;

public class SimpleTypeDetector {

  public static void main(String[] args) throws Exception {
      Tika tika = new Tika();

      for (String file : args) {
          String type = tika.detect(new File(file));
          System.out.println(file + ": " + type);
      }
  }
}

Pretty simple, right? Now let’s look at what else you can do with type detection. The first thing is to switch to a customized type registry that contains extra type information that you need in your application. The following shows how you can specify which media type configuration file is used by Tika. The default type configuration is included as an embedded classpath resource at /org/apache/tika/mime/tika-mimetypes.xml:

String config = "/org/apache/tika/mime/tika-mimetypes.xml";
Tika tika = new Tika(MimeTypesFactory.create(config));

In addition to passing java.io.File instances to the detect() method, you can also give it input streams, URLs, or even nothing but a filename string. In each of these cases Tika will do its best to combine all the available type information it has with the document details you’ve given. The result is usually the type you were looking for.

4.5. Summary

This completes our discussion of the taxonomy of document formats and the associated ways in which document types can be detected. We started by introducing the internet media type system and looking at how media types are handled in Tika using the mime-info database and the MediaType and MediaTypeRegistry classes. We then covered several heuristics for detecting document types, and finally brought it all together into the detect() method of the Tika facade.

By now you should know not only how to use Tika to detect document types, but also how Tika achieves this task internally and how you can extend Tika with custom type information. This knowledge will come in handy in the next chapters as we look at how to proceed from knowing the type of a document to being able to extract content and metadata from it.