Crowbar: scrape JavaScript-generated pages via Gecko and REST!

Careful with buzzwords on that one... From the extremely clever hacks department, Stefano strikes again with crowbar, a RESTish page-scraping component based on the Gecko rendering engine.

As a first proof of concept, scraping crowbar's test page, where some content is generated via JavaScript, works just fine.

Using a non-JavaScript scraper gets you the raw page, obviously (you lame crawler!):

$ curl -s

Hi lame crawler

Using crowbar as a proxy, the page is rendered by the Gecko engine (called via a simple XULRunner app), exactly as a client browser would render it (of course: it is the same engine that Firefox uses):

$ curl -s --data "url=" | xml fo -s 2

<!-- (stuff added by crowbar for its test page omitted)... -->

Hi Crowbar!

Note the use of XMLStarlet to format the resulting document: as it's a DOM dump from Gecko, the output is well-formed in all cases.
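
For scripted use, the curl call above translates to a few lines of Python. This is only a sketch: the endpoint address is a placeholder of mine (point it at wherever your crowbar instance listens), the function names are made up, and the only thing taken from the example above is the form-encoded `url` parameter sent via POST.

```python
import urllib.parse
import urllib.request

# Placeholder, not crowbar's documented address: adjust to your own instance.
CROWBAR_ENDPOINT = "http://localhost:8080/"


def build_crowbar_request(target_url, endpoint=CROWBAR_ENDPOINT):
    """Build the POST request that `curl --data "url=..."` would send."""
    # Percent-encode the target so its own query string survives the trip.
    body = urllib.parse.urlencode({"url": target_url}).encode("ascii")
    # Passing `data` makes urllib issue a POST with a form-encoded body.
    return urllib.request.Request(endpoint, data=body)


def scrape(target_url, endpoint=CROWBAR_ENDPOINT):
    """Return the raw bytes of the DOM dump produced for target_url."""
    request = build_crowbar_request(target_url, endpoint)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

`scrape()` returns bytes rather than text on purpose, so the caller decides how to decode them.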

The only thing missing seems to be the encoding declaration in the XML output: crawling for example (one of my favorite references on the Web), didn't produce parsable XML, as the output is serialized using the iso-8859-1 encoding (at least here on my Mac OS X system) but the XML declaration doesn't mention this.
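
Until the declaration is fixed, one workaround is to decode the bytes yourself before handing them to an XML parser. A minimal sketch, assuming the output really is iso-8859-1 as I saw here (it may differ on other systems, hence the parameter); the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_crowbar_output(raw, encoding="iso-8859-1"):
    """Parse a DOM dump whose XML declaration omits its real encoding.

    `encoding` is an assumption based on what was observed on one system.
    """
    # Without an encoding declaration the parser assumes UTF-8 and chokes
    # on accented characters, so decode explicitly and parse the text.
    return ET.fromstring(raw.decode(encoding))
```

Feeding the same bytes straight to `ET.fromstring()` raises a `ParseError` as soon as a non-ASCII character shows up, which matches the failure described above.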

The code is at and can be installed directly under XULRunner.

Very clever! This will take some polishing of course, but if you need to index or analyze JavaScript-generated content (automated testing, anyone?), there's no better way than getting it straight from the browser horse's mouth!

Update: the brand new Crowbar web site has more info.