I love the Web

November 9, 2007

Especially when pmarca uses his blog to announce his (and his wife’s) $27.5 million donation to Stanford Hospital. Citizen media, really.


Now even I can understand HTTP error codes

September 20, 2007

Thanks to Adam Koford’s excellent set of Illustrated HTTP errors.

Via Attila Szegedi.

Update: as Leo indicates, this post has been featured in today’s heute magazine. So let’s see: heute links to me who links to Attila who links to Adam…isn’t the web fun?

Crowbar: scrape javascript-generated pages via Gecko and REST!

February 24, 2007

Careful with buzzwords on that one…from the extremely clever hacks department, Stefano strikes again with crowbar, a RESTish page-scraping component based on the Gecko rendering engine.

As a first proof of concept, scraping crowbar’s test page, where some content is generated via javascript, works just fine.

Using a non-javascript scraper gets you the raw page, obviously (you lame crawler!):

$ curl -s http://simile.mit.edu/crowbar/test.html
<html>
<head>
<script>
function init() {
document.getElementById("message").innerHTML = "Hi Crowbar!";
}
</script>
</head>
<body onload="init()">
<h1 id="message">Hi lame crawler</h1>
</body>
</html>

Using crowbar as a proxy, the page is rendered using the Gecko engine (called via a simple XULRunner app), exactly as a client browser would do (of course – it is the same browser engine that Firefox uses):

$ curl -s --data "url=http://simile.mit.edu/crowbar/test.html" http://127.0.0.1:10000/  | xml fo -s 2
<?xml version="1.0"?>
<html>
<!-- (stuff added by crowbar for its test page omitted)... -->
<HEAD>
<SCRIPT>
function init() {
document.getElementById("message").innerHTML = "Hi Crowbar!";
}
</SCRIPT>
</HEAD>
<BODY onload="init()">
<H1 id="message">Hi Crowbar!</H1>
</BODY>
</html>

Note the use of XMLStarlet to format the resulting document: as it’s a DOM dump from Gecko, the output is well-formed in all cases.
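
For use from a script rather than the shell, the same call can be sketched in Python. This is only an illustration under assumptions taken from the curl example above: Crowbar listening on 127.0.0.1:10000 and expecting a form field named url. The function names are made up for this sketch, and it assumes Crowbar decodes a percent-encoded form body the same way it accepts curl's raw one.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: Crowbar runs locally on port 10000, as in the curl example.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(page_url, endpoint=CROWBAR_ENDPOINT):
    """Build the POST request that asks Crowbar to render page_url.

    Hypothetical helper: the only thing taken from the original post is
    the 'url' form field and the endpoint address.
    """
    body = urlencode({"url": page_url}).encode("ascii")
    # Supplying a body makes urllib send a POST, like curl --data does.
    return Request(endpoint, data=body)


def fetch_rendered_dom(page_url, endpoint=CROWBAR_ENDPOINT, timeout=30):
    """Send the request and return Crowbar's serialized DOM as raw bytes."""
    request = build_crowbar_request(page_url, endpoint)
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

With a Crowbar instance running, fetch_rendered_dom("http://simile.mit.edu/crowbar/test.html") should return the same document that the curl command prints, before XMLStarlet formatting.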

The only thing missing seems to be the encoding declaration in the XML output: crawling http://www.perdu.com for example (one of my favorite references on the Web) didn’t produce parsable XML, as the output is serialized using the iso-8859-1 encoding (at least here on my Mac OS X system), but the XML declaration doesn’t mention this.
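
Until that is fixed, one possible client-side workaround is to patch the declaration before parsing. The sketch below is not part of Crowbar: the function name is hypothetical, and it assumes the output really is iso-8859-1, which is only what I observed on my system.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of the document.
_XML_DECLARATION = re.compile(rb'^\s*<\?xml[^>]*\?>')


def add_encoding_declaration(raw, encoding="iso-8859-1"):
    """Return raw (bytes) with an XML declaration that names its encoding.

    Assumption: the caller knows the real encoding of the bytes;
    iso-8859-1 is only the value observed in this post.
    """
    declaration = '<?xml version="1.0" encoding="{}"?>'.format(encoding).encode("ascii")
    if _XML_DECLARATION.match(raw):
        # Replace the existing declaration, which lacks the encoding.
        return _XML_DECLARATION.sub(declaration, raw, count=1)
    # No declaration at all: prepend one.
    return declaration + b"\n" + raw


def parse_crowbar_output(raw, encoding="iso-8859-1"):
    """Parse Crowbar's output after patching the declaration."""
    return ET.fromstring(add_encoding_declaration(raw, encoding))
```

Without the patch, a parser assumes UTF-8 and rejects any accented character serialized as a single iso-8859-1 byte.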

The code is at http://simile.mit.edu/repository/crowbar/trunk/ and can be installed directly under XULRunner.

Very clever! This will take some polishing of course, but if you need to index or analyze javascript-generated content (automated testing anyone?), there’s no better way than getting it straight from the browser horse’s mouth!

Update: the brand new Crowbar web site has more info.


Missing LIFT once again…aaarrghh

February 7, 2007

For the second time in a row, I’m missing the LIFT conference, due to collisions with teaching gigs that were planned way too early.

I guess I’ll mark the dates in red for next year’s conference as soon as they are published.

Too bad…especially as the list of participants includes a lot of people that I’d have loved to meet there. Next time.


The eBay architecture

December 19, 2006

Ugo points to the slides of a recent SAMSIG presentation on eBay’s architecture. A fascinating read…and it’s comforting to learn that they threw away most of J2EE: eBay scales on servlets and a rewritten connection pool.


The Venice Project opens up

November 16, 2006

Several of the cool people who have been working on the super secret Venice project are finally talking about it publicly. Just follow the links.

Reading between the lines: Open Source rocks. And Open Source developers even more…


Behold, Internet Explorer 7 is upon us

October 19, 2006

The M$ download page (no I’m not linking to that;-) says “upgrade with confidence”. According to my colleagues who have tested it, this might indicate that IE7 faithfully reproduces most bugs of its predecessors, while adding a new set of fun and original ones. There’s a better way.

Worse, it seems like Windows Update will push this new and improved pile of…software to masses of unsuspecting customers. Be ready for some of these “how come I suddenly cannot read your site anymore?” messages.