List of Tables – Tika in Action

Chapter 1. The case for the digital Babel fish

Table 1.1. Tika’s main methods of media type detection. These techniques can be used in isolation or combined to form a powerful and comprehensive automatic file detection mechanism.

Table 1.2. Tika’s key design goals, numbered and described briefly for reference. Each design goal is elaborated upon in detail in this section, and ties back to the overall necessity for Tika.

Chapter 2. Getting started with Tika

Table 2.1. Information included in views of the Tika GUI window

Table 2.2. Key methods of the Tika facade

Chapter 3. The information landscape

Table 3.1. Some underlying principles of the REST architecture and their influence on the web’s scalability. These are only a cross-section of the full description of REST from Fielding’s dissertation.

Table 3.2. Information representative of the type collected about users of e-commerce sites. This information would then be fed into a collaborative filtering, clustering, or categorization technique to provide recommendations, find similarities between your purchasing history and that of other users, and so on.

Chapter 4. Document type detection

Table 4.1. Top-level media types officially specified by IANA. These types form the basis for a detailed classification framework of available document types. Children are allowed for each top-level type, indicating some specialization of the parent (a more specific schema, a slightly different encoding format, and so on).

Table 4.2. Methods for detecting the type of a file using Tika. The methods build on top of the media type information curated in the Tika media type registry.

Table 4.3. Popular file formats and their filename extensions

Table 4.4. Magic byte patterns in popular file formats. Some of the patterns are represented as plain ASCII text, whereas others are shown in their hexadecimal equivalent.

Table 4.5. BOM in common Unicode encodings

Chapter 5. Content extraction

Table 5.1. The arguments for the parse() method of org.apache.tika.parser.Parser. Some of the arguments are only read (such as the InputStream and the ParseContext), some are callbacks (such as the ContentHandler), and some are actually written to (such as the Metadata argument).

Table 5.2. Potential problems that can be encountered during the parse() method. Apart from SAX parsing errors and I/O errors, Tika wraps all parsing exceptions in its own custom TikaException class.

Table 5.3. Tika’s SAX helper utility classes. These helper classes allow easy extensibility and customization of the output from Tika’s text extraction functionality.

Chapter 6. Understanding metadata

Table 6.1. Relevant components of a metadata standard (or metadata model). Metadata standards help to differentiate between metadata fields, allow for their comparison and validation, and ultimately clearly describe the use of metadata fields in software.

Chapter 8. What’s in a file?

Table 8.1. Simplified representation of content within Hierarchical Data Format (HDF) files. HDF represents observational data and metadata information using a small set of constructs: named scalars, vectors, and matrices.

Chapter 10. Tika and the Lucene search stack

Table 10.1. A table-oriented view of a Lucene Document

Chapter 12. Powering NASA science data systems

Table 12.1. A PDS label for Cassini

Appendix A. Tika quick reference

Table A.1. Key methods of the Tika facade class

Table A.2. Document arguments to the Tika facade methods

Table A.3. Tika command-line options

Table A.4. ContentHandler utility classes