List of Figures – Tika in Action

List of Figures

Chapter 1. The case for the digital Babel fish

Figure 1.1. Computer programs usually specialize in reading and interpreting only one file format (or family of formats). To deal with .pdf files, .psd files, and the like, you’d purchase Adobe products. If you needed to deal with Microsoft Office files (.doc, .xls, and so on), you’d turn to Microsoft products or other office programs that support these Microsoft formats. Few programs can understand all of these formats.

Figure 1.2. The hierarchy of seven top-level MIME types as defined by IANA’s RFC 2046 (an eighth top-level type, model, was added later in RFC 2077). Top-level types can have subtypes (children), and so on, as new media types are defined over the years. The multiplicity notation denotes that multiple children may be present at the same level in the hierarchy, and the ellipses indicate that the remainder of the hierarchy has been elided for brevity.

Figure 1.3. A snippet of HTML (at bottom) for the home page. Note the top-level category headings for sports (All Sports, Commentary, Page 2) are all surrounded by <li> HTML tags that are styled by a particular CSS class. This type of structural information about a content type can be exploited and codified using the notion of structured text.

Figure 1.4. An image of Mars (the data), and the metadata (data about data) that describes it. Three sets of metadata are shown, and each set of metadata is influenced by metadata models that prescribe what vocabularies are possible, what the valid values are, what the definitions of the names are, and so on. In this example, the metadata ranges from basic (file metadata like filename) to image-specific (EXIF metadata like resolution-unit).

Figure 1.5. Revisiting the search engine example, this time with Tika in tow. Tika provides the four canonical functions (labeled as software components in the figure) necessary for content detection and analysis in the search engine component. The remainder of the search engine’s functions (crawling, fetching, link analysis, scoring) are elided in order to show the data flow between the search engine proper, the files it crawls from the shared network drive, and Tika.

Figure 1.6. A visual timeline of Tika’s history. Its early beginnings formed from the Apache Nutch project, which itself spawned several children and grandchildren, including Apache Hadoop and its subprojects. After some steps along the way, as well as the work of a few individuals who kept the fire lit, Tika eventually moved into its current form as a top-level Apache project.

Figure 1.7. High-level Tika architecture. Note explicit components exist that deal with MIME detection (understanding how to identify the different file formats that exist), language analysis, parsing, and structured text and metadata extraction. The Tika facade (center of the diagram) is a simple, easy-to-use frontend to all of Tika’s capabilities.

Chapter 2. Getting started with Tika

Figure 2.1. Tika GUI window

Figure 2.2. Overview of the Tika facade

Figure 2.3. The Tika component stack. The bottom layer, tika-core, provides the canonical building blocks of Tika: its Parser interface, MIME detection layer, language detector, and the plumbing to tie it all together. tika-parsers insulates the rest of Tika from the complexity and dependencies of the third-party parser libraries that Tika integrates. tika-app exposes tika-parsers graphically and from the command line to external users for rapid exploration of content using Tika. Finally, tika-bundle provides an Open Services Gateway initiative (OSGi)-compatible bundle of tika-core and tika-parsers for using Tika in an OSGi environment.

Chapter 3. The information landscape

Figure 3.1. A postulated movie distribution system. Movies are sent via the network (or hard media) to a consumer movie distribution company. The company stores the electronic media on their hard disk (the movie repository), and then metadata and information is extracted (using Tika) from the movie files and stored in a movie metadata catalog. The metadata and movie files are made available electronically to consumers from the company’s user interface. An average user, Joe User, accesses the movie files; potentially, to save bandwidth across the wire, Tika can be called on the server to extract summaries, provide ratings, and so on from the streamed movie for the end user’s console systems.

Figure 3.2. An estimate of the size of the World Wide Web. The graph is estimated and generated dynamically by inspecting results from Google, Yahoo!, and Bing from 2008–2010. The size of the web has remained relatively constant, with a large gap between those results that are weighted using Google’s estimate (GYWA) and those weighted using Yahoo!’s estimate (YGWA) (a 20–30 billion page difference, with Google having the more conservative estimate). Still, the scale is representative both of the amount of information out there and of the difficulty in understanding it (your home library likely comes nowhere near 10 billion pages in size!).

Figure 3.3. The amount of website growth per year (in millions of websites) over the last decade, estimated with data provided by Netcraft. There was steady growth (tens of millions of sites per year) through the latter part of the 1990s and into the early 2000s, but between 2005 and 2008 the number of new websites per year roughly tripled (from ~10 million to ~30 million).

Figure 3.4. A sampling of well-known content types from the as many as 51,000 in existence. As a user of the modern internet, you’ll likely encounter some of these documents and files while navigating and searching for your topic of interest. Even more likely, custom applications are required to view, modify, or leverage these documents and files in your particular task.

Figure 3.5. Results from a test run by Bixolabs and its Public Terabyte Dataset (PTD) project. The dataset contains 50–250 million representative pages crawled from the top million US-traffic domains. The test involved running a large-scale crawl job programmed in Cascading (a concurrent workflow construct and API for running jobs on a Hadoop cluster) on Amazon EC2. One part of the crawl job used Tika to evaluate the charset of each document being crawled. The Y axis shows detection accuracy, and the points show each charset and its frequency within documents in the dataset. The results of the test demonstrate decent (60%) accuracy on charsets of median frequency in the dataset, and mixed results (30%) on some common charsets such as UTF-8.

Figure 3.6. The architecture of a web search engine. The circular structures in the middle of the diagram are websites that the crawler (the eight-legged creature labeled with a bold C) visits during a full web crawl. The crawler is itself made up of several functional components, shown magnified in the upper-right corner of the figure. The components include URL filtering (for narrowing down lists of URLs to sites that the crawler must visit); deduplication (for detecting exact or near-exact copies of content, so the crawler doesn’t need to fetch the content); a protocol layer (for negotiating different URL protocols such as http://, ftp://, scp://); a set of parsers (to extract text and metadata from the fetched content); and finally an indexer component (for taking the parsed text and metadata, and ingesting them into the search index, making them available for end users). The crawler is driven by configurable policy containing rules and semantics for politeness, for identification of the crawler, and for controlling the behavior of the crawler’s underlying functional components. The stars labeled with T indicate areas where Tika is a critical aspect of the crawler’s functionality.

Figure 3.7. An example of collaborative filtering as provided by Amazon.com. Recommendations are automatically suggested upon entering the site, through collection and processing of past purchases and user preferences. In the bottom portion of the figure, Amazon explicitly solicits feedback and ratings for items in a category from the user to use in future recommendations.

Figure 3.8. Tika’s utility in machine learning (ML) applications. The dashed line in the middle of the figure delineates two use cases. The first is within the Apache Mahout project, whose goal is to use ML to provide collaborative filtering, clustering, and categorization. Mahout algorithms usually take vectors as input—descriptions of the clustering, user or item preferences, or categorizations of incoming content, where content can be arbitrary electronic documents. An emerging use case is to take files and use Tika to extract their textual contents and metadata, which can be translated into Mahout vectors to feed into its ML algorithms. At the bottom of the figure is a use case for Apache UIMA, a reference implementation of the UIMA standard being developed by OASIS. In this use case, Tika is used to implement a UIMA annotator that extracts features from incoming content, and then classifies those features in a UIMA Common Analysis Structure (CAS) model.

Chapter 4. Document type detection

Figure 4.1. Table of the Animal Kingdom (Regnum Animale) from an early 1735 edition of Carolus Linnaeus’s Systema Naturae. This and Linnaeus’s other seminal book, Species Plantarum, laid the groundwork for most of the biological nomenclature in use today. A similar classification of types can also be found in the internet media type system.

Figure 4.2. The document media type to application mapping from Mozilla Firefox. This panel can be brought up on a Mac by clicking on the Firefox menu, then selecting Preferences, then clicking on the Applications tab (note: this sequence depends on the operating system used, but is likely similar for platforms other than Mac). Each listed media type is mapped to one or more handler applications, which Firefox tries to send the content to when it encounters the document on the internet.

Figure 4.3. Railroad diagram of the syntax of media type names. See section 5.1 of RFC 2045 for the full details of the format.

Figure 4.4. Basic UML class diagram that summarizes the key features of the MediaType class. The class implements both the Comparable and Serializable standard Java interfaces. The type name, its subtype, and the associated type parameters are all available through getter methods, and the MediaType can be serialized to human-readable form by calling the toString method.

Figure 4.5. UML class diagram that summarizes the key features of the MediaTypeRegistry class. The class allows the set of loaded MediaType object instances to be returned as a SortedSet, and allows a user to obtain a SortedSet of aliases belonging to a particular MediaType.

Figure 4.6. Four levels of type hierarchy with the image/svg+xml type. The SVG image can be processed either as a vector image, as a structured XML document, as plain text, or ultimately as a raw sequence of bytes.

Chapter 5. Content extraction

Figure 5.1. Overview of Tika’s parsing process

Figure 5.2. Information flows between the parse() method and its arguments. The input stream and metadata arguments are used as sources of document data, and the results of the parsing process are written out to the given content handler and metadata object. The context object is used as a source of generic context information from the client application to the parsing process.

Figure 5.3. Class diagram that summarizes some of the most prominent implementations of the Parser interface. The generic classes in the org.apache.tika.parser package aren’t tied to any specific document types, unlike the format-specific concrete parser classes organized in various subpackages.

Figure 5.4. Class diagram that shows the extra functionality provided by the TikaInputStream class

Figure 5.5. Structural breakdown of the beginning of the Wikipedia article on hyperlinks. In addition to the obvious heading, the authors of the text have used hyperlinks and different forms of emphasis to highlight key concepts in the opening paragraphs. These words—computing, hyperlink, link, reference, document—could well be treated as keywords of the document.

Figure 5.6. Grammatical breakdown of a simple sentence. Some computer programs and libraries can perform this kind of analysis of text, at times even more accurately than an average human, but the value of such analysis is limited without extensive knowledge about the meaning of words and their relationships. For now Tika doesn’t attempt to parse such grammatical structures.

Chapter 6. Understanding metadata

Figure 6.1. The search engine process and metadata. Metadata about a page, including its title, a short description, and its link, is used to determine whether to “click” the link and obtain the content.

Figure 6.2. Classes of metadata models. Some are general, such as ISO-11179 and Dublin Core. Others are content-specific: they’re unique to a particular file type, and only contain metadata elements and descriptions which are relevant to the content type.

Figure 6.3. A content creator (shown in the upper left portion of the figure) may author some file in Microsoft Word. During that process, Word annotates the file with basic MSOffice metadata. After the file is created, the content creator may publish the file on an Apache HTTPD web server, where it will be available for downstream users to acquire. When a downstream user requests the file from Apache HTTPD, the web server will annotate the file with other metadata.

Figure 6.4. The code-level organization of the Tika metadata framework. A core base class, Metadata, provides methods for getting and setting metadata properties, checking whether they’re multivalued, and representing metadata in the correct units.

Chapter 7. Language detection

Figure 7.1. The language detection pipeline. Incoming documents with no language metadata are analyzed to determine the language they’re written in. The resulting language information is associated with the documents as an extra piece of metadata.

Figure 7.2. The first sentence of the first article of the Universal Declaration of Human Rights, written in Arabic, Chinese, English, French, Russian, and Spanish—the six official languages of the United Nations.

Figure 7.3. Title page of a 16th century printing of Romeo and Juliet by William Shakespeare

Figure 7.4. Frequency of letters in many languages based on the Latin alphabet

Figure 7.5. Class diagram of Tika’s language detection API

Chapter 8. What’s in a file?

Figure 8.1. Several areas where content can be gleaned from a file

Figure 8.2. A postulated satellite scenario, observing the earth and collecting those observations in data files represented in the Hierarchical Data Format (HDF). HDF stores data in a binary format, arranged as a set of named scalars, vectors, and matrices corresponding to observations over some space/time grid.

Figure 8.3. The CNN Really Simple Syndication (RSS) index page. CNN provides a set of RSS files that users can subscribe to in order to stay up to date on all of their favorite news stories, categorized by the type of news that users are interested in.

Figure 8.4. The Tika FeedParser’s parsing process. The ROME API is used to access the file in a streaming fashion, making the output available to Tika.

Figure 8.5. A side-by-side comparison of HDF and NetCDF. HDF supports grouping of keys like putting Latitude and Longitude inside of the Geometry group. NetCDF doesn’t support grouping, and the keys are all flattened and ungrouped.

Figure 8.6. The semantics of extracting file and directory metadata

Figure 8.7. A software deployment scenario in which the system is pulled out of configuration management, run through a deployment process that copies and installs the software to a directory path, and codified with the unique software version number. A symlink titled current points to the latest and greatest installed version of the software.

Chapter 9. The big picture

Figure 9.1. Overview of a search engine. The arrows indicate flows of information.

Figure 9.2. Architecture of a search engine. Blocks identify key components and the arrows show how data flows between them. Tika is typically used in the starred extraction component.

Figure 9.3. Overview of a document management system

Figure 9.4. Overview of a text mining system

Figure 9.5. Tika deployment with parser implementations from multiple different sources

Figure 9.6. Distributing a large workload over multiple computers can dramatically improve system throughput.

Figure 9.7. Building an inverse index as a map-reduce operation

Chapter 10. Tika and the Lucene search stack

Figure 10.1. The Apache Lucene ecosystem and its family of software products. Some of the software products (such as Mahout, Nutch, Tika, and Hadoop) have graduated to form their own software ecosystems, but they all originated from Lucene. Tika is the third dimension in the stack, servicing each layer in some form or fashion.

Figure 10.2. The Apache ManifoldCF home page from the Apache Incubator

Figure 10.3. The Apache Open Relevance home page

Figure 10.4. The Apache Lucene Top Level Project home page

Figure 10.5. The Apache Solr Project home page

Figure 10.6. The Apache Nutch Top Level Project home page

Figure 10.7. The Apache Nutch2 architecture. After a major refactoring of the overall system, Nutch is now a delegation framework, leaving the heavy lifting to other systems in the Lucene ecosystem.

Figure 10.8. The Apache Droids Project home page

Figure 10.9. The Apache Mahout Top Level Project home page

Chapter 11. Extending Tika

Figure 11.1. Illustration of how a digital prescription document can be used to securely transfer accurate prescription information from a doctor to a pharmacy. A digital signature ensures that the document came from someone authorized to make prescriptions, and encryption is used to ensure the privacy of the patient.

Figure 11.2. Overview of a generic type detector

Figure 11.3. Custom parser classes for handling digital prescription documents

Chapter 12. Powering NASA science data systems

Figure 12.1. The flow of data through NASA’s Planetary Data System

Figure 12.2. The resultant architecture of the PDS search engine redesign. Metadata is dumped from the PDS-D catalog, transformed to RDF by a custom Tika PDS parser, and then sent to Lucene/Solr for indexing.

Figure 12.3. The NASA Planetary Data System (PDS) main web page and its drill-down (facet-based) search interface

Figure 12.4. NASA’s Earth Science Enterprise, consisting of three families of software systems: SIPS take raw data and process them; DAACs distribute the processed data to the public; proposal systems perform ad hoc analyses.

Figure 12.5. Tika’s use in the NASA Earth Science Enterprise. Tika helps classify files for file management, metadata extraction, and cataloging as shown in the upper left of the diagram. In the lower right, Tika helps workflow tasks share metadata and information used to trigger science algorithms.

Chapter 13. Content management with Apache Jackrabbit

Figure 13.1. Example content in a content repository

Figure 13.2. Doing a WebDAV mount in Windows Vista

Chapter 14. Curating cancer research data with Tika

Figure 14.1. Simplified view of the EDRN data model, showing the relationship between protocols, specimens, science data sets, biomarkers, and instruments

Figure 14.2. EDRN’s eCAS architecture. The components on the left side of the diagram use Tika to prepare data for ingestion into a file manager. The components on the right side of the diagram use Tika to classify incoming files. The components that are directly implemented by Tika are shaded in grey.

Figure 14.3. The EDRN curation cockpit. The left side of the web application focuses on the staging area, allowing a curator to perform metadata extraction and manipulation. The right side of the webapp deals with data curation and ingestion in the File Manager component, allowing for classification of different file types. Most of the underlying extraction and classification functionality is driven by Tika.

Chapter 15. The classic search engine example

Figure 15.1. Searching the web for restaurant reviews discussing Chinese food by geographic location

Figure 15.2. Bixo data flow

Figure 15.3. Evaluating Tika’s charset detection with a web-scale data set