By now you should have a fairly good understanding of what Tika is, what it can do, and where it fits in the bigger picture of information-processing systems. If you read through chapter 2 and tried out the examples, you’ve seen Tika in action and written your first Tika-based application. But if you’re anything like us, you’re wondering how this toolkit is put together and what programming APIs it provides. Wait no more, because that’s what we’ll be covering in this part of the book!
We’ll start in chapter 4 by describing the internet media type system and how Tika can detect the type of virtually any kind of document. Once the type is known, Tika can parse the document to extract its content and any associated metadata. Content extraction with Tika is covered in chapter 5, and metadata handling in chapter 6. In chapter 7, we’ll show how Tika can help deduce information like the natural language in which a document is written. Finally, chapter 8 looks at some of the more popular file formats and the details that you should know when dealing with such files.
That’s a lot of ground to cover, so let’s get started!