Chapter 15. The classic search engine example – Tika in Action

What better way to close out the book than the way we started it: with a classic search engine example?

You’re in for a treat. We interviewed Ken Krugler and his team from Bixo Labs about their recent Public Terabyte Dataset Project, http://mng.bz/gYOt, and how Tika was a core component of a large-scale series of tests that helped shed some light on variations between languages, charsets, and other content available on the internet.

This chapter will show you even more of Tika in action, especially how you can use Tika inside a workflow system such as Cascading, which is built on top of Hadoop, to analyze a representative (by today’s standards) data set that many other internet researchers are also exploring. The tests run by Bixo Labs that we’ll describe in the rest of the chapter should identify areas of further refinement in Tika, particularly in charset detection and language identification (recall chapter 7). Heck, they may even motivate you to get involved in improving Tika and working within the community.

Let’s hear more about it!

15.1. The Public Terabyte Dataset Project

The web contains a staggering amount of useful source data for a variety of interesting programming projects (for example, analyzing the geographical distribution of Chinese restaurants in the United States, as shown in figure 15.1). In order to make use of that data, you must enumerate a target set of URLs, make connections to each web server, and then individually download each page of content. Web crawlers are employed to automate this web mining process, but the complexity of developing a web crawler or even using an existing web crawling tool requires a large time investment before any work can be done to process the data collected.

Figure 15.1. Searching the web for restaurant reviews discussing Chinese food by geographic location

In 2010, Bixo Labs, Inc., Amazon Web Services, and Concurrent, Inc., decided to sponsor a large-scale crawl of the top 1–2 million web domains, based on traffic from clients in the United States. The goal was to fetch approximately 100–500 million pages from these domains and then put the content into the public domain on Amazon’s S3, in a format that would be easy to import into other applications, particularly those using Hadoop for scalability. This Public Terabyte Dataset would therefore constitute a very large corpus of both high value and relatively easily accessible web content.

Although most web content is delivered in HTML, a great deal of potentially useful content is available in other formats such as Microsoft Word, Adobe Portable Document Format (PDF), and so on. Parsing this web content to extract the text (and graphics) after it’s fetched makes the resulting data set much more useful than it would be in its raw form. Tika provides an attractive architecture for parsing arbitrary web content because its AutoDetectParser feature (recall chapter 5) automates the process of selecting an appropriate parser for each fetched document.

In addition, an essential part of any web crawler is its ability to collect outbound links from each fetched page and then add the target URLs to a database that it’ll use for subsequent content fetching. The Tika HtmlParser (recall the discussion of customizing parts of this component in chapter 8) also provides excellent support for link extraction while each page is being processed.
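The bookkeeping that follows link extraction can be sketched with nothing but the JDK. The class below is hypothetical (LinkFrontier is not part of Tika or Bixo), and it assumes the hrefs have already been pulled out of the page. It resolves each href against the page URL, keeps only http and https targets, drops fragments, and reports the URLs it has not seen before. An in-memory set stands in for the crawl database.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Tika or Bixo. */
final class LinkFrontier {
    // URLs already queued; an in-memory stand-in for a crawl database.
    private final Set<String> known = new LinkedHashSet<>();

    /** Resolves hrefs against the page URL and returns only newly seen http(s) URLs. */
    List<String> addOutlinks(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> added = new ArrayList<>();
        for (String href : hrefs) {
            URI resolved;
            try {
                resolved = base.resolve(href.trim()).normalize();
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not syntactically valid URIs
            }
            String scheme = resolved.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                continue; // ignore mailto:, javascript:, and similar targets
            }
            // Drop the fragment so page.html#top and page.html count as one URL.
            String url = resolved.toString();
            int hash = url.indexOf('#');
            if (hash >= 0) {
                url = url.substring(0, hash);
            }
            if (known.add(url)) {
                added.add(url);
            }
        }
        return added;
    }
}
```

A production crawler would also canonicalize hosts and query strings and persist the set, which this sketch leaves out.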

Now that we have some background on the Public Terabyte Dataset Project, we’ll explain a bit about Bixo, the open source web mining toolkit from Bixo Labs that is designed to exploit such a corpus.

15.2. The Bixo web crawler

Bixo (see http://openbixo.org/) is an open source web mining toolkit based on Hadoop, the dominant open source implementation of the MapReduce algorithm. Bixo uses the Cascading open source project (see http://www.cascading.org/) to define the web crawling workflow. The use of Cascading allows Bixo to focus on the mechanics of web crawling and the associated data flow rather than Hadoop/MapReduce implementation details.

Cascading provides a rich API for defining and implementing scale-free and fault-tolerant data processing workflows on a Hadoop cluster. The Cascading workflow model is one of operations that are connected via “pipes,” much like classic Unix tools. Bixo consists of a number of Cascading operations and subassemblies, which can be combined to form a data processing workflow that (typically) starts with a set of URLs to be fetched and ends with some results extracted from parsed HTML pages. The entire Bixo workflow is shown in figure 15.2.

Figure 15.2. Bixo data flow

The Fetch subassembly is the component where the heavy lifting is done. URLs enter via the input data pipe, and its two tail (results) pipes emit the raw fetched content and status information (such as whether the fetch was successful or failed due to a particular transient or permanent error condition).

The Parse subassembly is used to process the raw fetched content. As mentioned, it uses Tika to handle the details of extracting text from various formats and to help extract outbound links.

Bixo also takes care to crawl in a “polite” manner, honoring the directives in each web server’s robots.txt file (a file specifying areas of the website to be excluded from crawling, how long to wait after completing a small set of requests to the web server, and so forth). Pages with noarchive HTML meta tags are also automatically excluded from the data set.
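To make the robots.txt directives concrete, here is a deliberately simplified sketch of a rules parser. It is not Bixo’s implementation, and the class name SimpleRobotsRules is hypothetical. It honors only the wildcard (User-agent: *) group, treats Disallow values as plain path prefixes, and reads Crawl-delay in seconds. Real-world handling (Allow lines, wildcards in paths, agent-specific groups) is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt rules for the "*" user agent only. */
final class SimpleRobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();
    private long crawlDelayMillis = 0;

    static SimpleRobotsRules parse(String robotsTxt) {
        SimpleRobotsRules rules = new SimpleRobotsRules();
        boolean inWildcardGroup = false;
        boolean previousWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            int hash = rawLine.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            boolean isUserAgent = field.equals("user-agent");
            if (isUserAgent) {
                // Consecutive User-agent lines share one group of rules.
                inWildcardGroup = (previousWasUserAgent && inWildcardGroup) || value.equals("*");
            } else if (inWildcardGroup && field.equals("disallow") && !value.isEmpty()) {
                rules.disallowedPrefixes.add(value);
            } else if (inWildcardGroup && field.equals("crawl-delay")) {
                try {
                    rules.crawlDelayMillis = (long) (Double.parseDouble(value) * 1000);
                } catch (NumberFormatException ignored) {
                    // keep the default delay when the value is not a number
                }
            }
            previousWasUserAgent = isUserAgent;
        }
        return rules;
    }

    /** True if no Disallow prefix in the wildcard group matches the path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    long crawlDelayMillis() {
        return crawlDelayMillis;
    }
}
```

A crawler would consult isAllowed before queuing a URL and wait at least crawlDelayMillis between requests to the same host.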

15.2.1. Parsing fetched documents

When Bixo was first developed, the team considered incorporating Nutch’s complete parsing architecture. But they were pleased to discover that Tika provided most of the required support, eliminating the hassle of maintaining such a large, complex body of code and keeping it synchronized with the Nutch project. Since then, the folks at Bixo have been encouraged to see Tika adopted by both Nutch and Apache Droids, as this can only help to improve Tika’s stability and performance, as well as support for features of particular interest to crawler developers (such as language detection, which you read about in chapter 7).

Spreading the Wealth: Apache Nutch was the progenitor of many of the modern popular open source web and big data technologies, including most notably Apache Hadoop, Droids, and Tika. The ability to use these descendants without having to pull in all of the Nutch core has greatly increased not only the individual user bases of Hadoop, Droids, and Tika, but Nutch’s as well.

Most web pages fall short of compliance with any HTML standard, and so web browsers are extremely forgiving when displaying content. Tika’s HtmlParser makes use of the TagSoup software library to perform the actual parsing, and TagSoup’s ability to handle and clean up badly broken HTML documents is essential when parsing web content. Because a web server can also return an arbitrarily long document, Bixo is configured to abort the fetch after a user-configured limit (such as 128 KB of text). Accordingly, TagSoup’s forgiving nature is also important in collecting content from such truncated pages.
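The size cap on a fetch can be illustrated with a small bounded read. This is a minimal sketch using only the JDK, not Bixo’s fetcher. The class and method names are hypothetical, and the 128 KB figure in the usage note simply mirrors the example limit mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: reads a response body up to a fixed byte limit. */
final class BoundedReader {
    /** Returns at most maxBytes from the stream; anything beyond the limit is left unread. */
    static byte[] readUpTo(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        try {
            while (remaining > 0) {
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // end of stream reached before the limit
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Calling BoundedReader.readUpTo(stream, 128 * 1024) yields a document that may stop mid-tag, which is exactly the kind of input a forgiving parser has to cope with.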

Bixo uses Tika’s TeeContentHandler to employ two separate ContentHandlers simultaneously: one to extract the content itself and another to extract outbound links, as demonstrated in the following listing. If content language detection is desired, Tika’s ProfilingHandler that you might recall from chapter 7 can be used simultaneously as well.

Listing 15.1. Linking together link extraction and language detection

If language detection is desired, Bixo uses the ProfilingHandler, which hands off SAX events containing the text extracted by Tika to the language detection mechanism (remember this from chapter 7). The TeeContentHandler then joins the ProfilingHandler to the existing LinkHandler. If language detection isn’t desired, Bixo joins only the LinkHandler, which extracts links from the existing content handler stream.
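The tee pattern itself can be shown with plain JDK SAX classes. The sketch below is neither Bixo’s code nor Tika’s TeeContentHandler. The names Tee, TextCollector, and LinkCollector are hypothetical, and the JDK’s XML parser stands in for Tika, so the input must be well-formed markup. What it illustrates is the core idea: one stream of SAX events is forwarded to several handlers, each of which extracts something different.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical JDK-only illustration of fanning SAX events out to several handlers. */
final class TeeSketch {
    /** Forwards element and character events to every delegate, in order. */
    static final class Tee extends DefaultHandler {
        private final ContentHandler[] delegates;

        Tee(ContentHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : delegates) h.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (ContentHandler h : delegates) h.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : delegates) h.characters(ch, start, length);
        }
    }

    /** Accumulates all character data. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Records the href of every <a> element. */
    static final class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            String href = atts.getValue("href");
            if ("a".equals(qName) && href != null) links.add(href);
        }
    }

    /** Parses well-formed markup and drives the given handler. */
    static void parse(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Passing new TeeSketch.Tee(text, links) to parse fills both collectors in a single pass over the document. A third delegate, such as a language profiler, could be added the same way.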

Although not used directly for the Public Terabyte Dataset Project, Tika’s BoilerpipeContentHandler can be useful for focusing on the meat of each HTML page by ignoring banner advertisements and navigation controls that typically appear in the header, footer, and margin areas.

Parsing is always a relatively CPU-intensive operation, especially when compared with other jobs in a Bixo workflow. For example, fetch performance tends to be I/O bound (say, by the bandwidth of the cluster’s internet connection and the constraints imposed by polite fetching), so it can be accomplished easily with a small army of relatively inexpensive machines. We found instead that parsing was best accomplished using a cluster of higher-end machines (dual-core CPUs with 7.5 GB of RAM, aka m1.large instances in Amazon’s EC2 environment).

Despite the best efforts of the Tika development community, arbitrary web content sometimes causes a parser to hang. Accordingly, Bixo runs the Tika parsers in a separate thread via a TikaCallable object whose FutureTask will attempt to abort the parsing thread after a user-configured time limit has expired (say, 30 seconds). Unfortunately, it’s difficult to reliably terminate parsing threads. Note that there’s an outstanding JIRA issue (TIKA-456) to further insulate the client application from such zombie threads. Jukka Zitting has also developed a way to perform Tika parsing within a separate child JVM that could be killed off completely if necessary.
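The time-limited parsing approach can be sketched with java.util.concurrent alone. This is not Bixo’s TikaCallable. The class TimedParse and its fallback parameter are hypothetical, and any Callable stands in for the actual parse call. The sketch also shows why termination is unreliable: cancelling the future only interrupts the worker thread, and code that never checks for interruption keeps running.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a task on a worker thread with a time limit. */
final class TimedParse {
    /** Returns the task's result, or the fallback if it fails or exceeds the limit. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parser");
            worker.setDaemon(true); // a hung parser must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a parser stuck in a tight loop may ignore this.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }
}
```

For example, TimedParse.callWithTimeout(() -> parse(document), 30_000, null) would give up after 30 seconds, where parse is whatever method performs the real extraction. Running the parser in a child process, as mentioned above, is the only way to guarantee that a stuck parse can be killed.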

In order for TagSoup to successfully parse HTML content, the encoding of the text stream must also be provided. Although the HTML 5 specification states that clients should trust the charset declared in any Content-Type response header returned by the web server, we’ve found that Tika’s “trust but verify” approach is far more robust when faced with arbitrary web content. Bixo puts the Content-Type and Content-Language from the server response headers into the metadata it passes to Tika. Tika first searches the initial 8 KB of the content for <meta> tags that declare a Content-Type or charset. If the charset isn’t found in the <meta> tags, it uses its CharsetDetector (along with any hints from the server response header) to pick the best charset, using statistical analysis of short character sequences.
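The lookup order described above can be sketched as follows. This is an illustration under stated assumptions rather than Tika’s implementation. The class CharsetSniffer and its regular expressions are hypothetical, and the statistical detection step is omitted entirely: where Tika would run its CharsetDetector, this sketch falls back to the charset named in the Content-Type response header, and then to a caller-supplied default.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a "meta tag first, header hint second" charset lookup. */
final class CharsetSniffer {
    private static final int META_SCAN_BYTES = 8 * 1024;

    // Matches both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches the charset parameter of a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name from the meta tags, else the header, else the fallback. */
    static String resolve(byte[] content, String contentTypeHeader, String fallback) {
        int length = Math.min(content.length, META_SCAN_BYTES);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META_CHARSET.matcher(head);
        if (meta.find()) {
            return meta.group(1);
        }
        if (contentTypeHeader != null) {
            Matcher header = HEADER_CHARSET.matcher(contentTypeHeader);
            if (header.find()) {
                return header.group(1);
            }
        }
        return fallback;
    }
}
```

The returned name is not validated. A caller would typically pass it through java.nio.charset.Charset.isSupported before decoding the page.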

Next, we’ll hear about an interesting analysis result we arrived at directly as a benefit of using Tika in the Public Terabyte Dataset Project. Yes, this example is more than just cool technology—it’s also produced a genuine scientific result that will be used to further improve charset detection mechanisms within the domain of internet-scale web crawling. Read on to find out!

15.2.2. Validating Tika’s charset detection

As an interesting dataset use case, we took a sample set of several thousand pages that each provided the charset via <meta> tags, and then examined the accuracy of the Tika CharsetDetector (we assumed that the charset provided in the <meta> tags was correct). The results are summarized in figure 15.3.

Figure 15.3. Evaluating Tika’s charset detection with a web-scale data set

It would be ideal if the most common character sets (particularly UTF-8) also enjoyed the highest detection accuracy, but unfortunately this wasn’t the case, as indicated in figure 15.3. Instead, there seem to be significant biases in the Tika CharsetDetector toward unusual character sets. For example, many UTF-8 pages were incorrectly classified as gb2312. We hope that future analysis using the full Public Terabyte Dataset will help support efforts to improve Tika’s character set detection support. Similar analysis could be performed to diagnose and improve Tika’s language detection support, which typically provides even poorer results despite requiring a great deal of extra processing time.
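An evaluation of this kind boils down to comparing two labels per page and tallying the matches per declared charset. The sketch below shows one way to compute that tally. It is not the code used in the study, and the class name is hypothetical. Each input pair holds the charset declared in the <meta> tags (treated as ground truth, as in the experiment) and the charset reported by the detector.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: per-charset detection accuracy from (declared, detected) pairs. */
final class CharsetAccuracy {
    /** Returns, for each declared charset, the fraction of pages detected correctly. */
    static Map<String, Double> accuracyByDeclared(List<String[]> declaredDetectedPairs) {
        Map<String, int[]> counts = new TreeMap<>(); // value = {correct, total}
        for (String[] pair : declaredDetectedPairs) {
            // Charset names are case-insensitive, so compare lowercased forms.
            String declared = pair[0].toLowerCase(Locale.ROOT);
            String detected = pair[1].toLowerCase(Locale.ROOT);
            int[] tally = counts.computeIfAbsent(declared, key -> new int[2]);
            if (declared.equals(detected)) {
                tally[0]++;
            }
            tally[1]++;
        }
        Map<String, Double> accuracy = new TreeMap<>();
        counts.forEach((charset, tally) -> accuracy.put(charset, (double) tally[0] / tally[1]));
        return accuracy;
    }
}
```

The sketch compares names literally after lowercasing, so aliases of the same charset would be counted as mismatches. A fuller analysis would canonicalize names first and also record which wrong charset was chosen, since that is what exposes a bias toward particular character sets.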

15.3. Summary

We hope you’ve enjoyed hearing about the way in which Bixo Labs, Inc., Amazon Web Services, and Concurrent, Inc., leveraged Tika to generate the public internet-scale web dataset called the Public Terabyte Dataset Project. To date, they’ve already produced an interesting scientific result, and we anticipate this result will help further efforts in improving the Tika system, especially in the areas of charset support and language detection.

To summarize, we heard about

  1. The goal of the Public Terabyte Dataset Project: to build a semi-processed web-scale dataset where the text and links have been pre-extracted, making the data more easily processed and amenable to analysis.
  2. The Bixo web crawler, and its layering on top of technologies like Tika, Hadoop, and Cascading. We heard about the Bixo crawler workflow, and how it separates out the fetching and the parsing steps.
  3. How Bixo’s web crawler allows link extraction and language detection via Tika.
  4. An interesting real-world science result demonstrating the need for improvement in Tika’s charset detection mechanism, especially on not-so-common character sets.

We’d also like to acknowledge you for sticking with us to the end. The development of Tika has been a tremendously challenging, intellectually stimulating body of work over the past five years, and describing it here over the past 15 chapters has been a blast. We hope you’ve had as much fun as we have and that you find Tika truly useful in your own software development. Stop by the Author Online forum or the user@tika.apache.org or dev@tika.apache.org Apache mailing lists and share your experiences with the rest of the community!