Index – Tika in Action

Index

A

<a> tag
AbstractParser class
add() method
agglutinative languages
alias
analyzers.
    See search engines.
annotations
Ant build.
    See also source code.
Apache Droids
Apache Gora
Apache Hadoop
  Bixo
Apache Incubator
  podlings
Apache Lucene
  Document class
  ecosystem
  Field class
  Lucene Core
Apache Mahout
Apache Manifold Connectors.
    See ManifoldCF.
Apache Nutch
  and Bixo
  Apache Gora
  Protocol plugins
Apache PDFBox
Apache Solr
Apache Tika, history of
Apache UIMA
  annotations
application programming interfaces (APIs)
  Java ROME API
  Parser API
  pull APIs
  push APIs.
    See also Content Repository for Java API.
application/* MIME type
Architectural Styles and the Design of Network-based Software Architectures
audio/* MIME type
AutoDetectParser
AutoDetectParser class

B

Babel fish
Behemoth
/bin/ls output
biomarkers
Bixo
  and Apache Nutch
  and TagSoup
  Cascading
  Fetch subassembly
  Parse subassembly
  parsing documents
  robots.txt file
black lists
BodyContentHandler class
BoilerpipeContentHandler class
BOM markers
Brin, Sergey
build.xml file
byte frequency matching
byte order marks.
    See BOM markers; magic bytes.

C

callback functions
cancer research
  biomarkers.
    See also Early Detection Research Network.
Cardinality property
Cascading
Cascading Style Sheets
categorization
CF.
    See Climate Forecast model.
character encodings
  BOM markers
  byte frequency matching
  statistical encoding detection.
    See also character sets; charsets.
character sets
  validating character set detection.
    See also character encodings.
CharsetDetector class
charsets.
    See also character encodings; character sets.
classpath
Climate Forecast model
ClimateForecast interface
cloud computing
clustering
collaborative filtering
command-line interface
  --language option.
    See also Tika CLI.
composite design pattern
CompositeDetector class
CompositeParser class
compression
content
  how Tika extracts it
  organization of
  random access
  streaming
  types
content management.
    See document management systems.
content repositories
  text extraction pool
Content Repository for Java API
content type hints
Content-Encoding headers
ContentHandler argument
ContentHandler interface
  in Apache Jackrabbit
content-specific metadata standards
  compared to general standards
  difficulty of comparing across standards
Content-Type header.
    See content type hints.
context-free interaction
COO.
    See Orbiting Carbon Observatory.
corpus
  distance from
  Hamshahri
  OHSUMED
  Tempo
Content-Type header
crawlers.
    See search engine.
CSS.
    See Cascading Style Sheets.
custom parsers
  creating
  customizing existing parsers

D

DAACs.
    See Distributed Active Archive Centers.
data curation
data mining, text mining
data models, Planetary Data System
data, linked
databases, MIME-info
deduplication
Definition property
DelegatingParser class
dependencies, managing
DeploymentAreaParser class
design goals
  fast processing
  flexible metadata
  flexible MIME type detection
  language detection
  low memory footprint
  MIME database
  parser libraries
  unified parsing interface
detect() method
detecting custom MIME types
  custom type detectors
detecting file formats
detecting MIME types
Detector interface
  custom type detectors
dictionary-based profiling
digital asset management.
    See also document management systems.
Distributed Active Archive Centers
document analysis
Document class
document management systems
  Content-Type headers
Document Object Model
documents
  analyzing
  as text
  custom
  document management systems
  input stream
  language detection
  parsing with Bixo
  text mining.
    See also files.
DOM.
    See Document Object Model.
downloading
  Git
  Subversion
  Tika source code
drag and drop
Droids.
    See Apache Droids.
Dublin Core

E

Early Detection Research Network
  data model
  data sets
  eCAS Curator
  EDRN Catalog and Archive Service
  identifying MIME types
  linked data
  metadata extraction
  protocols
  scientific data curation
  use of Tika
Earth Science Enterprise
  Distributed Active Archive Centers
  how Tika fits in
  principal investigator
  Science Information Processing Systems
eCAS Curator
  Ingester
  references.
    See also EDRN Catalog and Archive Service.
e-commerce, useful user data
EDRN Catalog and Archive Service
  eCAS Curator.
    See also Early Detection Research Network.
ElementMetadataHandler class
embedding Tika
  Tika facade
encoding, output encoding
endDocument function
endElement function
environment settings
errors, TextExtractionError.
    See also exceptions.
events, STOP
example/* MIME type
exceptions
  IOException
  SAXException
  TikaException
extending Tika
  adding MIME types
Extensible Hypertext Markup Language
  in Tika CLI
  structured output
Extensible Markup Language (XML)
  Resource Description Framework.
    See also XML files.
Extensible Metadata Platform (XMP)
  properties and property types
extracting text
  full text
  with Apache Jackrabbit
ExtractingRequestHandler class

F

facade class
Facebook
fast processing
FeedParser class
Fetch subassembly
Field class
Fielding, Roy
file extensions.
    See also glob patterns.
file formats
  combined heuristics
  content type hints
  detecting
  filename globs
  HDF
  headers
  magic bytes
  OLE
  RSS
  XML
file headers
File Manager catalog
file naming conventions
file storage.
    See storage.
FileInputStream class
filenames, glob pattern
formatted text
full-text extraction
  incremental parsing
  indexing
full-text indexes, for large-scale systems

G

general metadata standards
  compared to content-specific standards
  Dublin Core
Geographic interface
get() method
getContentHandler() method
getDefaultRegistry() method
getFile() method
getLanguage() method
getLinks() method
getSupertype() method.
    See MediaTypeRegistry class.
getSupportedTypes() method
Git
glob patterns
graphical user interface.
    See Tika GUI.

H

Hamshahri corpus
handling custom documents
hasFile() method
HDF
  matrix data
  organization of content
  scalar data
  vector data.
    See also Hierarchical Data Format.
HDFParser class
<head> tag
heuristics
Hierarchical Data Format
  organization of content
Hitchhiker’s Guide to the Galaxy
HTML.
    See Hypertext Markup Language.
HtmlMapper interface
HTMLParser class
HtmlParser class
Hypertext Markup Language
  <head> tag
  in Tika CLI

I

IANA.
    See Internet Assigned Numbers Authority.
IdentityHtmlMapper class
image/* MIME type
implementing parsers
incremental language detection
incremental parsing, streaming
indexers
  full-text indexing.
    See also search engines.
indexing, full-text search
IndexReader class
IndexWriter class
information overload
Ingester
  references
input, standardizing
InputStream argument
InputStream class
  and parse() method
intermediaries, promotion
International Organization for Standardization
Internet Assigned Numbers Authority
  MIME type registry
inverse indexes
IOException
  input error
isMultiValued() method
ISO 639
isReasonablyCertain() method
isSpecializationOf() method.
    See MediaTypeRegistry class.

J

Jackrabbit.
    See Apache Jackrabbit.
Java
  embedding Tika
  managing dependencies
  ROME API
  service providers
Java Beans
Java ROME API
java.io.Writer class
java.util.zip package
JCR.
    See Content Repository for Java API.

L

language detection
  advanced algorithms
  agglutinative languages
  corpus
  distance
  in Tika
  incremental
  ISO 639 standards
  language profiles
  N-gram algorithm
  profiling algorithms
  theory
  UDHR example
language detection theory
--language option
language profiles
LanguageIdentifier class
LanguageIdentifierUpdateProcessor class
LanguageProfile class
LinkContentHandler class
  getLinks() method
linked data
LinkHandler class
links, between files
Linnaean taxonomy.
    See taxonomy.
Linnaeus, Carl
locale
LookaheadInputStream class
Lucene
Lucene Core
Lucene ecosystem
  Apache Droids
  Apache Mahout
  Apache Nutch
  Apache Solr
  ManifoldCF
  Open Relevance
LuceneIndexer class
  and metadata
  converting metadata to RSS
Luke

M

machine learning
  categorization
  clustering
  collaborative filtering
  predicting user likes and dislikes
  real-world examples
magic bytes
Mahout.
    See Apache Mahout.
ManifoldCF
mark feature
mark() method
matrix data
Maven build.
    See also source code.
Maven, memory problems
media type registries
  MediaTypeRegistry class
media types.
    See also MIME types.
MediaType class.
    See also media types.
MediaTypeRegistry class
memory footprint
message/* MIME type
<meta> tag, name attribute
metadata
  and Early Detection Research Network
  and LuceneIndexer
  and REST
  and Tika CLI
  and Tika facade
  Cardinality property
  challenges of acquiring
  Climate Forecast model
  Content-Type header
  converting to RSS
  Definition property
  Extensible Metadata Platform
  flexibility
  how it’s created
  in Lucene Document objects
  instances
  metadata models
  Metadata.LANGUAGE entry
  Name property
  practical uses for
  quality of
  Relationships property
  representing
  standards
  transforming
  Valid values property
Metadata argument
Metadata class
metadata instances
  representing
  transforming
metadata models
  Climate Forecast model
  Dublin Core.
    See also metadata standards.
metadata quality
metadata schema
metadata standards
  content-specific standards
  Dublin Core
  general standards
MIME database
MIME type identifiers
MIME types
  adding new types to Tika
  adding to MIME-info database
  and Early Detection Research Network
  and Parser interface
  application/*
  audio/*
  categories of
  custom
  custom MIME type detectors
  detecting
  example/*
  identifiers
  image/*
  Internet Assigned Numbers Authority
  media type registries
  MediaType class
  MediaTypeRegistry class
  message/*
  MIME database
  MIME-info database
  model/*
  multipart/*
  parent and child types
  registration
  syntax
  text/*
  Tika MIME repository
  top-level
  video/*
  working with
MIME-info database
  adding new types to
MimeType class
MimeTypes class
MimeTypesFactory class
ML.
    See machine learning.
model/* MIME type
modularity
multipart/* MIME type
Multipurpose Internet Mail Extensions.
    See MIME types.

N

Name property
NASA
  Earth Science Enterprise
  how they use Tika
  National Polar-orbiting Operational Environmental Satellite System
  Orbiting Carbon Observatory
  PDS search redesign
  Planetary Data System
  Product Evaluation and Analysis Tool Element
  Soil Moisture Active Passive
National Cancer Institute, Early Detection Research Network
National Polar-orbiting Operational Environmental Satellite System
NetCDF
N-gram algorithm
nodes
NPOESS.
    See National Polar-orbiting Operational Environmental Satellite System.
Nutch.
    See Apache Nutch.

O

Object Linking and Embedding
OHSUMED corpus
OLE format
OLE.
    See Object Linking and Embedding.
OODT
Open Relevance
  Hamshahri corpus
  OHSUMED corpus
  Tempo corpus
Open Services Gateway Initiative (OSGi)
Orbiting Carbon Observatory
  computing resources
org.apache.tika.language package
org.apache.tika.metadata package
org.apache.tika.mime package
org.apache.tika.parser.Parser interface
org.apache.tika.parser package
org.apache.tika.sax package
org.xml.sax.ContentHandler interface
organization of content
OSGi.
    See Open Services Gateway Initiative.
output serialization
overriding parsers

P

packages
  org.apache.tika.language
  org.apache.tika.metadata
  org.apache.tika.mime
  org.apache.tika.parser
  org.apache.tika.sax.
    See also java.util.zip.
Page, Lawrence
Parse subassembly
parse() method
  and input streams
  ContentHandler argument
  in Apache Jackrabbit
  InputStream argument
  Metadata argument
  ParseContext argument
ParseContext
ParseContext argument
ParseContext class
Parser API
Parser class
parser libraries
parser override
parser selection
ParserDecorator class
parseToString() method
parsing context
  environment settings
  locale
ParsingReader class
PDF files, parsing
PDFBox library
PDFParser class
PdfParser class
PDFTextStripper class
PDS
PDSRDFParser class
PEATE.
    See Product Evaluation and Analysis Tool Element.
plain text
Planetary Data System
  data model
  Instruments
  labels
  Missions
  PDS Data Distribution System
  products
  search redesign
  Targets
Planetary Data System Data Distribution System (PDS-S).
    See Planetary Data System.
plugins, parser plugins
podlings
principal investigator.
    See Science Information Processing Systems.
Product class
Product Evaluation and Analysis Tool Element
  and Tika
profiling algorithms
  advanced
  N-gram algorithm
ProfilingHandler class
ProfilingWriter class
promotion of intermediaries
properties, in Apache Jackrabbit
Property class
property types
property values
PropertyType class
PropertyType enum
PropertyValue class
Protocol plugins
protocols
  in Early Detection Research Network
provider configuration files
Public Terabyte Dataset (PTD)
Public Terabyte Dataset Project
pull API
purchase history
push API

R

random access
ratings
RDF.
    See Resource Description Framework format.
Reader class
Really Simple Syndication (RSS)
  from metadata
  organization of content
Reference class
Relationships property
Representational State Transfer.
    See REST.
reset() method
Resource Description Framework format
resource management, close() method
REST
  context-free interaction
  principles of
  promotion of intermediaries
  use of metadata
RFC 5646
robots.txt file
ROME
root element detection, XML root detection
root elements
RSS
  channels
  organization of content.
    See also Really Simple Syndication.

S

SAX events
  parsing.
    See also Simple API for XML.
SAXException
  output error
SAXTransformerFactory class
scalability
scalar data
Science Information Processing Systems (SIPS)
  how Tika fits in
  principal investigator
search engine
search engines
  analyzers
  and Tika
  Bixo
  black lists
  crawlers
  deduplication
  indexers
  inverse indexes
  Public Terabyte Dataset Project
  structure of
  URL filtering
  web crawlers
  white lists
service providers
  provider configuration files
set methods
setMaxStringLength() method
setMediaTypeRegistry() method
shared MIME-info database.
    See MIME-info database.
Simple API for XML
  callback functions
  parse() method ContentHandler argument
  structured output
SimpleTypeDetector class
SIPS.
    See Science Information Processing Systems.
SMAP.
    See Soil Moisture Active Passive.
social media
Soil Moisture Active Passive
Solr.
    See Apache Solr.
SolrCell
source code
  downloading
  Git
  Subversion
Spring framework
  bean configuration
startDocument function
startElement function
statistical encoding detection
STOP event, in Apache Jackrabbit
storage
  how it affects extraction
  logical representation
  physical representation
streaming
structured text
  as SAX output
  semantic structure
sub-class-of
Subversion
  trunk checkout

T

TagSoup
Taste.
    See Apache Mahout.
taxonomy
TeeContentHandler class
Tempo corpus
text mining
Text Retrieval Conference.
    See TREC standards.
text, structured
text/* MIME type
TextExtractionError
TIFF interface
Tika Annotator.
    See Apache UIMA.
Tika application
  documentation
  tika-app.
    See also Tika CLI; Tika GUI.
Tika bundle.
    See Open Services Gateway Initiative.
Tika facade
  and metadata
  detect() method
  parse() method
  parseToString() method
  setMaxStringLength() method
Tika MIME repository
tika-app
tika-bundle
TikaCallable class
tika-core
TikaException
  parse error
TikaInputStream class.
    See also document stream.
tika-mimetypes.xml.
    See media type registries.
tika-parent
tika-parsers
top level project
TransformerHandler class
transforming metadata
TREC standards
Twitter
type hints, content type hints
type/subtype

U

UDHR.
    See Universal Declaration of Human Rights.
Unicode
  BOM markers
uniform resource locators, URL filtering
Universal Declaration of Human Rights (UDHR)
Unix pipeline.
    See Tika CLI.
unravelStringMet function
UpdateHandler class
updateVersion function
URL filtering
users
  characteristics
  item ratings
  purchase history

V

Valid values property
ValueType enum
vector data
video/* MIME type

W

web browsers
web crawlers
  protocol layer
web servers
Web-based Distributed Authoring and Versioning Protocol (WebDAV)
  when to use
white lists
World Wide Web
  architecture
  complexity of
  scale and growth of

X

XHTML output
XHTML.
    See Extensible Hypertext Markup Language.
XML files, root elements
XML.
    See Extensible Markup Language.
XMLParser class
XmlRootExtractor class
XMP dynamic media
XMP.
    See Extensible Metadata Platform.
xmpDM.
    See XMP dynamic media.