jk's blog

Perl Mechanize Screen Scraper Makes it Easy to Copy Data from Web Pages

I wrote this to learn some of the newer Perl tools for scraping pages. What’s cool about this demo script is that it barely uses regular expressions. Instead, it uses HTML::TreeBuilder::XPath to treat the HTML as a hierarchical data structure you can query with XPath. How cool is that?!
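
To show the querying idea on its own, here is a minimal sketch that runs an XPath query against an inline HTML string instead of a live page. The markup and the cell values are made up for illustration; only the `middle_column` id is borrowed from the script further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up markup standing in for a real page.
my $html = <<'HTML';
<html><body>
<div id="sidebar"><table><tr><td>ignore me</td></tr></table></div>
<div id="middle_column">
  <table>
    <tr><td>Example Job Fair</td><td>Example Date</td><td>Example City</td></tr>
  </table>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# In list context findnodes returns the matching elements;
# as_text gives the text inside each one.
my @cells = map { $_->as_text }
            $tree->findnodes('//div[@id="middle_column"]//table//td');

print join(' | ', @cells), "\n";

$tree->delete;   # free the tree when done
```

The query picks only the cells inside the `middle_column` div, so the sidebar table is skipped without any pattern matching on the raw HTML.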

It also uses Date::Parse to convert textual dates into machine-usable dates.
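
As a rough sketch of what that looks like, the two Date::Parse functions used in the script can be tried on a sample string. The string below is one of the formats listed in the Date::Parse documentation, not a date taken from the scraped page, so treat it as an assumption about the input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;   # exports str2time and strptime

# Sample string in a format listed in the Date::Parse docs (an assumption,
# not the format the scraped page uses).
my $text = 'Wed, 9 Nov 1994 09:50:32 -0500';

# str2time returns seconds since the Unix epoch, or undef if parsing fails.
my $epoch = str2time($text);
defined $epoch or die "Could not parse '$text'";

# strptime returns the individual fields instead. The month is 0-based and
# the zone is the offset from UTC in seconds.
my ($ss, $mm, $hh, $day, $month, $year, $zone) = strptime($text);

print "epoch: $epoch\n";
print "time:  $hh:$mm:$ss, zone offset $zone seconds\n";
print "UTC:   ", scalar gmtime($epoch), "\n";
```

str2time is the one to use when you want a single number to store or compare; strptime is handier when you only care about some of the fields, such as the hour and minute.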

And, finally, it uses a good meta-pattern: don’t use PHP to retrieve data from websites – use a script like this, and copy the data into a neutral location. You shouldn’t run “cron” jobs through the web; you shouldn’t allow PHP net access in general; and you should use Perl’s superior libraries.

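#!/usr/bin/perl

use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use Date::Parse;

# The scraped text can contain non-ASCII characters such as the en dash.
binmode STDOUT, ':encoding(UTF-8)';

# autocheck makes a failed request die instead of failing silently.
my $mech = WWW::Mechanize->new( agent => 'jlabot', autocheck => 1 );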

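my $url = 'http://www.edd.ca.gov/jobs_and_training/job_fairs_and_events.htm';

# Fetch the page once and cache it, so repeated test runs don't hit the site.
if (! -f 'tmp.html') {
  $mech->get($url);
  open my $fh, '>:encoding(UTF-8)', 'tmp.html' or die "Can't write tmp.html: $!";
  print $fh $mech->content();
  close $fh;
}

# Parse the cached copy into a tree that can be queried with XPath.
open my $in, '<:encoding(UTF-8)', 'tmp.html' or die "Can't read tmp.html: $!";
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file($in);
close $in;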

my $p = $tree->findnodes('//div[@id="middle_column"]//table');

my $n = $p->get_node(1);

# The leading dot keeps the search inside this table; a bare '//td' searches the whole document.
my $td = $n->findnodes('.//td');

# Print the first three cells; format() renders each one as plain text.
print $td->get_node(1)->format();
print "--\n";
my $date = $td->get_node(2)->format();
print $date;
print "--\n";
print $td->get_node(3)->format();

# This assumes the date cell has the date on its first line and the time range on its second.
my @lines = split /\n/, $date;
print str2time($lines[0]);
print "\n";

print $lines[1] . "\n";

# Split the time range on the en dash (U+2013) into start and end times.
my @times = split /\x{2013}/, $lines[1];

my ($ss,$mm,$hh,$day,$month,$year,$zone) = strptime( $times[0] );

print "\n$hh $mm \n";

($ss,$mm,$hh,$day,$month,$year,$zone) = strptime( $times[1] );

print "\n$hh $mm \n";
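
The script stops at printing what it found. To follow the "copy the data into a neutral location" pattern, the parsed fields could be written to a file that the PHP side only ever reads. Here is a minimal sketch using core modules; the `write_neutral_copy` helper, the `events.json` file name, and the record fields are all hypothetical, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: write $data as JSON to $path. The data goes to a
# temporary file first and is then renamed into place, so a reader never
# sees a half-written file.
sub write_neutral_copy {
    my ($path, $data) = @_;
    my ($fh, $tmp) = tempfile(DIR => dirname($path), UNLINK => 0);
    binmode $fh;   # the utf8 option below already produces UTF-8 bytes
    print {$fh} JSON::PP->new->utf8->canonical->encode($data);
    close $fh or die "Can't close $tmp: $!";
    chmod 0644, $tmp;   # tempfile creates the file readable by owner only
    rename $tmp, $path or die "Can't rename $tmp to $path: $!";
    return $path;
}

# Made-up record; in the scraper these values would come from the table cells.
my $event = {
    title    => 'Example Job Fair',
    start    => 784392632,
    location => 'Example City',
};

write_neutral_copy('events.json', [ $event ]);
```

Run from cron on the command line, something like this keeps the fetching and parsing out of the web server entirely.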