Apache Tika

By Maurizio Farina | Posted on September 2017

This tutorial is dedicated to Apache Tika, an Apache toolkit for detecting and extracting metadata and content from different file types.

What is Apache Tika?

Apache Tika is a toolkit, originally developed as part of Apache Nutch, for detecting and extracting metadata and content from different file types.

Apache Tika reuses existing Java libraries such as PDFBox or Apache POI to handle as many different file formats as possible. The supported file formats are listed here.

Apache Tika manages encrypted PDF using Bouncy Castle.

Apache Tika can be used as a Java library, as a server application, or directly from the command line.

Apache Tika artifacts

A list of artifacts, taken from the Apache Tika website:

Artifact Brief Description
tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika
tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
tika-server-*.jar Tika JAX-RS REST application. This is a Jetty web server running Tika REST services as described in this page.
tika-bundle-*.jar Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified parser libraries to make them easy to deploy in an OSGi environment.

Artifacts can be downloaded from the Apache Tika download web page.

Maven dependencies

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.16</version>
  </dependency>

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.16</version>
  </dependency>

Apache Tika's Java API

The API is divided into two layers:

  • Tika facade: handles documents for the most common use cases
  • Low-level interfaces: interfaces such as Detector and Parser, for managing documents with fine-grained control
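As a sketch of the facade layer, type detection is a one-liner (the file name and XML snippet below are illustrative, not from this article):

```java
import org.apache.tika.Tika;

public class DetectExample {
    public static void main(String[] args) {
        Tika tika = new Tika();
        // Name-based detection: no I/O, the file extension drives the result
        System.out.println(tika.detect("report.pdf")); // application/pdf
        // Content-based detection works on raw bytes (or streams) as well
        System.out.println(tika.detect("<?xml version=\"1.0\"?><root/>".getBytes()));
    }
}
```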

A simple method using Tika facade:

public String parse(String htmlContent) throws IOException, TikaException {
    Tika tika = new Tika();
    // toInputStream without an explicit charset is deprecated in Commons IO
    try (InputStream stream = IOUtils.toInputStream(htmlContent, StandardCharsets.UTF_8)) {
        return tika.parseToString(stream);
    }
}

Apache Tika parser and XHTML SAX events

Apache Tika uses a single Parser interface to hide the complexity of managing different file formats:

void parse(
    InputStream stream, ContentHandler handler, Metadata metadata,
    ParseContext context) throws IOException, SAXException, TikaException;

Under the hood, the content parsed by Tika is returned as a sequence of XHTML SAX events directly to the ContentHandler instance.
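To make these events visible, a minimal sketch can serialize them back into an XHTML document with ToXMLContentHandler from org.apache.tika.sax (the input string is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public class XhtmlEventsExample {
    public static void main(String[] args) throws Exception {
        // ToXMLContentHandler serializes the SAX events back into XHTML markup
        ToXMLContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(
                new ByteArrayInputStream("Hello XHTML events".getBytes(StandardCharsets.UTF_8)),
                handler, new Metadata(), new ParseContext());
        System.out.println(handler.toString());
    }
}
```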

Tika Parser

The exceptions thrown:

  • IOException: the input stream could not be read
  • TikaException: the input stream could not be parsed
  • SAXException: an error occurred while processing the SAX events produced during parsing

Here is an example of how to parse an HTML document and return the main content:

InputStream stream = IOUtils.toInputStream(htmlContent, StandardCharsets.UTF_8);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(stream, new BoilerpipeContentHandler(textHandler), metadata, context);

ContentHandler and Decorators

The Apache Tika toolkit includes several handlers and decorators that allow extracting the desired content from a document. In the example above, BoilerpipeContentHandler extracts the main content from an HTML page.
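One handler detail worth knowing: BodyContentHandler stops writing after 100,000 characters by default, and passing -1 to its constructor disables the limit. A minimal sketch (the input string is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WriteLimitExample {
    public static void main(String[] args) throws Exception {
        // -1 disables the default 100,000-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        new AutoDetectParser().parse(
                new ByteArrayInputStream("large document body".getBytes(StandardCharsets.UTF_8)),
                handler, new Metadata(), new ParseContext());
        System.out.println(handler.toString().trim());
    }
}
```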

For an overview of all parsers implemented by Apache Tika, refer to the Apache Tika Parser JavaDoc page. For example:

  • FeedParser
  • HtmlParser
  • ImageParser
  • LanguageDetectingParser
  • OfficeParser
  • PDFParser
  • SpreadsheetMLParser

Apache Tika hosted in a Java class

The simplest usage is the Tika facade method shown earlier.

Something more complicated, using BoilerpipeContentHandler to extract the main content and metadata:

public String parseHtml(String htmlContent) throws IOException, SAXException, TikaException {
    InputStream stream = IOUtils.toInputStream(htmlContent, StandardCharsets.UTF_8);
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    AutoDetectParser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    parser.parse(stream, new BoilerpipeContentHandler(textHandler), metadata, context);

    return textHandler.toString();
}

Apache Tika command line utility

For the complete command line description, refer directly to the Apache documentation:

java -jar tika-app.jar [option...] [file|port...]

Main Options:
    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -l  or --language      Output only language
    -d  or --detect        Detect document type
    -z  or --extract       Extract all attachments into current directory
    --extract-dir=<dir>    Specify target directory for -z

Example:

java -jar tika-app.jar --text [file]

Tika configuration

Apache Tika allows you to configure all the components used for parsing and detection. The Apache Tika configuration page gives all the information needed.

Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
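A configuration file like the one above can be loaded in Java through TikaConfig and passed to both the facade and the low-level parsers. In this sketch the XML is written to a temporary file purely to keep the example self-contained; in practice you would point TikaConfig at your real configuration file:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;

public class ConfigExample {
    public static void main(String[] args) throws Exception {
        // Write a minimal configuration to a temp file, for illustration only
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<mime-exclude>application/pdf</mime-exclude>"
                + "</parser>"
                + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
                + "<mime>application/pdf</mime>"
                + "</parser>"
                + "</parsers></properties>";
        Path configFile = Files.createTempFile("tika-config", ".xml");
        Files.write(configFile, xml.getBytes(StandardCharsets.UTF_8));

        TikaConfig config = new TikaConfig(configFile.toFile());

        // Both the facade and the low-level parser honour the configuration
        Tika tika = new Tika(config);
        AutoDetectParser parser = new AutoDetectParser(config);
        System.out.println("configuration loaded");
    }
}
```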

Using Tika to feed Lucene

A complete example, from the ListFeeds.com project, showing how to use Apache Tika to add documents to a Lucene index:

package com.listfeeds.crawler.readers.html;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.slf4j.LoggerFactory;

public class TikaFeedsLucene {

    static final org.slf4j.Logger log = LoggerFactory.getLogger(TikaFeedsLucene.class);

    public void feed(String luceneIndexPath, String documentPath) throws IOException {

        //Opening or creating lucene index
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        conf.setOpenMode(OpenMode.APPEND);
        Directory directory = FSDirectory.open(new File(luceneIndexPath));
        IndexWriter iwriter = new IndexWriter(directory, conf);
        log.debug("IndexWriter created and open index at: [{}]", luceneIndexPath);

        //creating lucene document
        Document doc = new Document();
        Metadata metadata = new Metadata();
        Reader content = new Tika().parse(new FileInputStream(documentPath), metadata);
        doc.add(new TextField("content", content));
        // metadata.get(...) may return null if the document has no title
        String title = metadata.get(Metadata.TITLE);
        doc.add(new StringField("title", title != null ? title : "", Field.Store.YES));

        //adding document to lucene index
        iwriter.addDocument(doc);
        iwriter.commit();

        //close lucene index
        iwriter.close(true);
        directory.close();
    }

}

Apache Camel Integration

Apache Tika is also available as an Apache Camel component (camel-tika); see the component documentation:

https://github.com/apache/camel/blob/master/components/camel-tika/src/main/docs/tika-component.adoc