Lucene rocks

I’ve been testing Lucene to index a corpus of one million XML files (relatively small, about 1700 bytes each). The code is dead simple and the performance is good to great.

Indexing does take some time, but without much optimization we indexed 12,000 files per minute on a 2.5GHz Pentium system, which translates to about 80 minutes to index the whole thing. Not bad at all.

Search is really fast: many queries execute in less than 200 msec.

Once again, Lucene rocks!

2 Responses to Lucene rocks

  1. Stephane Bailliez says:

    Out of curiosity, what are your timings for parsing and indexing separately? I recently ran some tests, as I had to figure out performance for more than 500,00 XML files of ~3-6KB.

    The major performance problem for me was parsing, so I ended up trying different solutions to compare their relative performance: Xerces, Piccolo, a hand-made regexp (JDK 1.4) and XPP.

    The goal was to retrieve the content of a specific XML element which is itself an XHTML block, so I strip the XHTML markup before indexing it with Lucene.

    I get the following timings for 10000 documents of ~7KB, with 1KB of indexed content each and the whole document stored in the index (P4 3GHz (6121 bogomips), Debian, Sun JVM 1.4.2-b28, no specific JVM settings except -server):

    Parser    Parsing (ms)   Indexing (ms)
    XPP3              7469           45255
    Xerces          259645           44185
    RegExp            7596           36827
    Piccolo         162211           43194

    The RegExp indexing time looks a bit too fast to be a mere fluctuation, so maybe something is wrong, but my unit tests work fine so far.

    Xerces parsing is of course SAX-based, with a caching entity resolver (there are doctypes and DTDs) and no validation.
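To make the setup in the first comment concrete, here is a minimal sketch of non-validating SAX parsing with an entity resolver that serves DTDs from memory instead of fetching them for every document. It uses only the JDK's built-in SAX API. The class names, the `put` helper and the example system ID are assumptions made for illustration, not the commenter's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CachingSaxParse {

    /** Hypothetical resolver: returns DTD bytes held in memory, keyed by system ID. */
    public static class CachingResolver implements EntityResolver {
        private final Map<String, byte[]> cache = new HashMap<>();

        /** Preload a DTD; a real resolver might instead fetch once and remember the bytes. */
        public void put(String systemId, String dtdText) {
            cache.put(systemId, dtdText.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            byte[] dtd = cache.get(systemId);
            if (dtd == null) {
                return null; // unknown entity: let the parser use its default resolution
            }
            InputSource source = new InputSource(new ByteArrayInputStream(dtd));
            source.setSystemId(systemId);
            return source;
        }
    }

    /** Collects all character data seen during the parse. */
    public static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the XML without validation and returns its concatenated text content. */
    public static String parse(String xml, CachingResolver resolver) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // the DTD is still read, but the document is not validated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        TextHandler handler = new TextHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.text.toString();
    }
}
```

A caller would preload the resolver once, for example with `resolver.put("http://example.org/doc.dtd", ...)`, and reuse it for every document whose DOCTYPE points at that system ID. Even without validation the parser reads the DTD to pick up entity declarations, which is why repeated DTD lookups can end up dominating parse time.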

  2. I didn’t do very precise measurements, but commenting out the indexing code made the processing time for a single file go down to 4 msec, from 18 msec with indexing enabled. That was tested on my PowerBook with the latest release of Xerces-J; the Linux system where I first ran the tests is faster.

    So the time needed to walk the directory and parse the files is not that big, but my files have a very simple structure and no DTD at this point.
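Rather than commenting code out, the parse/index split discussed in these comments could be measured in a single run by accumulating wall-clock time per phase. The following is only a sketch: `PhaseTimer` and the phase names are hypothetical, and the lambdas in `main` are stand-ins for real parsing and indexing calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class PhaseTimer {
    // Accumulated nanoseconds per named phase, in first-use order.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the named phase, and returns its result. */
    public <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total time recorded for a phase, in milliseconds (0 if the phase never ran). */
    public long totalMillis(String phase) {
        return nanos.getOrDefault(phase, 0L) / 1_000_000;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int i = 0; i < 1000; i++) {
            // Stand-ins only: replace with the real parse and index steps.
            String parsed = timer.time("parse", () -> "<doc/>".trim());
            timer.time("index", () -> parsed.hashCode());
        }
        System.out.println("parse: " + timer.totalMillis("parse") + " ms");
        System.out.println("index: " + timer.totalMillis("index") + " ms");
    }
}
```

With the figures quoted above (4 msec without indexing, 18 msec with it), such a split would attribute roughly 14 msec per file to indexing. Timings taken this way on a JVM are sensitive to warm-up, so a long run or a discarded first batch gives more trustworthy numbers.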
