Tika 0.1 (incubating) released!

Release 0.1 of Tika is out; grab it while it’s hot!

There’s almost no standalone documentation currently, but between the javadocs and the simple and fairly complete test cases, it shouldn’t be hard to figure out how to use it.

What does Tika do? Basically, it extracts data and metadata from various binary formats. It’s meant to be a framework for plugging in more binary decoders, and we expect it to be used in various Apache projects, so that each project doesn’t have to reinvent the wheel.
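
To make the "framework for plugging in decoders" idea concrete, here is a minimal sketch in plain Java. It is not Tika's actual API: the names Extractor, ExtractorRegistry and PlainTextExtractor are hypothetical, chosen only to illustrate how decoders for different formats might be registered behind one common interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface (not part of Tika): one decoder per format.
interface Extractor {
    String extract(byte[] data);
}

// Hypothetical registry that maps a MIME type to a pluggable decoder.
class ExtractorRegistry {
    private final Map<String, Extractor> byType = new HashMap<>();

    // Plug in a decoder for the given MIME type.
    void register(String mimeType, Extractor extractor) {
        byType.put(mimeType, extractor);
    }

    // Look up the decoder for the MIME type and delegate to it.
    String extract(String mimeType, byte[] data) {
        Extractor extractor = byType.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No decoder for " + mimeType);
        }
        return extractor.extract(data);
    }
}

// Trivial example decoder: treats the bytes as UTF-8 text and trims it.
class PlainTextExtractor implements Extractor {
    public String extract(byte[] data) {
        return new String(data, StandardCharsets.UTF_8).trim();
    }
}
```

A project using such a registry would only call extract() with a type and some bytes; supporting a new format means registering one more decoder, without touching the callers.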

Here’s an example, from the ExcelParserTest:


public void testExcelParser() throws Exception {
  // Load the sample spreadsheet from the test resources
  InputStream input =
    ExcelParserTest.class.getResourceAsStream(
      "/test-documents/testEXCEL.xls");
  try {
    Metadata metadata = new Metadata();
    // The extracted text is written to this StringWriter
    StringWriter writer = new StringWriter();
    ContentHandler handler = new WriteOutContentHandler(writer);
    new ExcelParser().parse(input, handler, metadata);

    // Metadata filled in by the parser
    assertEquals(
      "application/vnd.ms-excel",
      metadata.get(Metadata.CONTENT_TYPE));

    assertEquals("Simple Excel document", metadata.get(Metadata.TITLE));
    assertEquals("Keith Bennett", metadata.get(Metadata.AUTHOR));

    // Text content extracted from the worksheet
    String content = writer.toString();
    assertTrue(content.contains("Sample Excel Worksheet"));
    assertTrue(content.contains("Numbers and their Squares"));
    assertTrue(content.contains("9.0"));
    assertTrue(content.contains("196.0"));
  } finally {
    input.close();
  }
}

Thanks to Chris Mattmann (Tika’s release manager) and the whole team for all the work that made this first release possible!

3 Responses to Tika 0.1 (incubating) released!

  1. leo says:

    Very cool library! I’m sure Tika might help a lot of projects.

    When it comes to the interface of the classes, I’m not sure, but I think it has some potential for optimization. Why do I have to create an ExcelParser myself instead of Tika finding the right parser for me through a factory? Why does parse() return nothing? Could it not return a Metadata object instead of taking one as a parameter?

  2. leo, there is an AutoDetectParser that is smart enough to detect the type of file and use the appropriate parser.

  3. @Leo, the Metadata object is passed in as it can also contain options for the parser.
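
The two replies above can be illustrated with a small plain-Java sketch. It is not Tika code: DocInfo and DetectingParser are hypothetical stand-ins, and detecting the type from a file-name suffix is an assumption made only to keep the example short. The point is that a single metadata-like object can carry a hint into the parser and carry results back out, which is why parse() can return nothing.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata object: a simple string map.
class DocInfo {
    static final String RESOURCE_NAME = "resourceName"; // hint set by the caller
    static final String CONTENT_TYPE = "Content-Type";  // value set by the parser

    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }
}

// Hypothetical parser that picks a format on its own and reports it back.
class DetectingParser {
    // Returns nothing: text is appended to 'out', properties are set on 'info'.
    void parse(byte[] data, StringBuilder out, DocInfo info) {
        // Input direction: read the caller's hint to choose a format.
        String name = info.get(DocInfo.RESOURCE_NAME);
        String type = (name != null && name.endsWith(".csv"))
            ? "text/csv"
            : "text/plain";

        // Output direction: record the detected type for the caller.
        info.set(DocInfo.CONTENT_TYPE, type);

        String text = new String(data, StandardCharsets.UTF_8);
        // For CSV input, separate the cells with tabs; otherwise copy as-is.
        out.append("text/csv".equals(type) ? text.replace(',', '\t') : text);
    }
}
```

A caller would set the resource name on a DocInfo, call parse(), and then read the detected content type from the same object, much like the test above reads the content type, title and author from its Metadata after parsing.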
