web page update notifier
Juerd
created: 2004-06-17 16:51:32
#!/usr/bin/perl -w
use strict;
use File::Copy qw(copy);
use LWP::Simple qw(mirror is_error);

my $url = 'http://...';
my $file = '/home/juerd/tmp/foo.html';

copy $file, "$file.old" or warn $!;

my $status = mirror $url, $file;
warn "HTTP $status" if is_error $status;

system qw(diff -u), "$file.old", $file;
 #!/usr/bin/perl -w
 use strict;
+use File::Copy qw(copy);
-use LWP::Simple qw(mirror is_success);
+use LWP::Simple qw(mirror is_error);

 my $url = 'http://...';
 my $file = '/home/juerd/tmp/foo.html';

-rename $file, "$file.old" or warn $!;
+copy $file, "$file.old" or warn $!;

 my $status = mirror $url, $file;
-warn "HTTP $status" unless is_success $status;
+warn "HTTP $status" if is_error $status;

 system qw(diff -u), "$file.old", $file;
Re: web page update notifier
created: 2004-06-17 17:04:40

In this snippet, you actually download the whole file each time you (cron) run(s) the script. Wouldn't it be nicer if you'd just ask for a HEAD and check the "Last-Modified" header and do some local testing on that?

$ HEAD http://www.server.tld/page.htm | grep "Last-Modified"
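
A rough Perl version of that check, as a sketch only. It relies on head from LWP::Simple, which in list context returns the content type, document length, modification time, expiry time and server. The helper names remote_is_newer and page_modified are made up for illustration, and treating a missing Last-Modified header as "changed" is an assumption.

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Hypothetical helper: true if the server's Last-Modified time (epoch seconds)
# is newer than the local file, or if either piece of information is missing.
sub remote_is_newer {
    my ($remote_mtime, $file) = @_;
    return 1 unless defined $remote_mtime && -e $file;
    my $local_mtime = (stat $file)[9];
    return $remote_mtime > $local_mtime ? 1 : 0;
}

# Hypothetical wrapper: send a HEAD request and compare against the local copy.
# Returns nothing (undef in scalar context) if the request fails.
sub page_modified {
    my ($url, $file) = @_;
    my @info = head($url) or return;
    return remote_is_newer($info[2], $file);
}
```

Something like `print "changed\n" if page_modified($url, $file);` would then replace the grep.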
--
[b10m]

All code is usually tested, but rarely trusted.
Re^2: web page update notifier
created: 2004-06-17 17:16:36

In this snippet, you actually download the whole file each time you (cron) run(s) the script.

Not true.

From LWP::UserAgent, that LWP::Simple uses under the hood:

$ua->mirror( $url, $filename ) This method will get the document identified by $url and store it in file called $filename. If the file already exists, then the request will contain an "If-Modified-Since" header matching the modification time of the file. If the document on the server has not changed since this time, then nothing happens. If the document has been updated, it will be downloaded again. The modification time of the file will be forced to match that of the server.
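
As an illustration of that behaviour, here is a minimal sketch (not the script from the root node) that only runs diff when mirror reports a fresh download. The sub names classify_status and check_page are hypothetical; RC_NOT_MODIFIED and is_success are the HTTP::Status names that LWP::Simple exports.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use LWP::Simple qw(mirror is_success RC_NOT_MODIFIED);

# Hypothetical helper: classify the status code returned by mirror().
sub classify_status {
    my ($status) = @_;
    return 'unchanged' if $status == RC_NOT_MODIFIED;  # 304: local copy is current
    return 'updated'   if is_success($status);         # 2xx: a new copy was stored
    return 'error';
}

# Hypothetical wrapper: keep the previous copy, mirror, and diff only on change.
sub check_page {
    my ($url, $file) = @_;
    if (-e $file) {
        copy($file, "$file.old") or warn "copy failed: $!";
    }
    my $result = classify_status(mirror($url, $file));
    if ($result eq 'updated' && -e "$file.old") {
        system qw(diff -u), "$file.old", $file;
    }
    return $result;
}
```

Calling `check_page($url, $file)` from cron would then stay silent on a 304.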

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re^3: web page update notifier
created: 2004-06-17 17:35:25

Aren't you defeating the mirror check by renaming $file to "$file.old" before giving $file to the mirror call, by which time it won't exist?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^4: web page update notifier
created: 2004-06-17 18:27:36

Aren't you defeating the mirror check by renaming $file to "$file.old" before giving $file to the mirror call, by which time it won't exist?

Oops; yes. Updated.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: web page update notifier
ihb
created: 2004-06-17 21:25:05

I'd be even happier if you used Text::Diff or something equivalent instead of a system call. :-)
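
A minimal sketch of what that could look like, assuming Text::Diff is installed from CPAN. The wrapper name unified_diff and the sample strings are only for illustration; diff and the STYLE option come from Text::Diff itself.

```perl
use strict;
use warnings;
use Text::Diff qw(diff);

# Hypothetical helper: return a unified diff of two inputs as a string.
# Each input may be a filename or a reference to a string.
sub unified_diff {
    my ($old, $new) = @_;
    return diff($old, $new, { STYLE => 'Unified' });
}

# With the files from the root node this would be:
#   print unified_diff("$file.old", $file);
# Demonstration on in-memory strings:
my $old_page = "line one\nline two\n";
my $new_page = "line one\nline 2\n";
print unified_diff(\$old_page, \$new_page);
```

This avoids depending on an external diff binary, at the cost of a non-core module.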

ihb

Re^2: web page update notifier
created: 2004-06-18 03:10:42

I'd be even happier if you used Text::Diff or something equivalent instead of a system call. :-)

For something that runs once per day, it's not worth the trouble. I even use `cat foo` in scripts like this one.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re^3: web page update notifier
ihb
created: 2004-06-18 09:52:32

In short, my point is that when sharing it with other monks I'd be happier to see a portable snippet, since it doesn't take much work to make it portable. Of course, it's better to share a non-portable snippet than not to share at all; that's why I said "happier" and not "happy".

It's not worth the trouble for you when you use it, but since this post isn't targeted to you I just figured it would be nice if you patched it so that more could benefit from it. Just as you'd do with any CPAN module you publish.

ihb

Re: web page update notifier
created: 2004-06-18 03:13:59
I previously posted something similar on this site as node 130402, which may also be of interest (although it is somewhat dated now). That code differs in that it uses the Last-Modified header or, where this is unavailable, a message digest of the page to detect changes.

 

perl -le "print unpack'N', pack'B32', '00000000000000000000001011100100'"

Re: web page update notifier
zby
created: 2004-06-18 04:36:10
In my spare time I am developing a more complicated notifier with a web interface. Its additional feature is that it lets you add regexps to ignore certain changes (useful for pages that, for example, show the current date somewhere). I plan for it to evolve into something like RSS by extracting what is new on the page (with a kind of HTML diff). You can read the documentation, download it, or try it on my home server at Active Bookmarks Manual.
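
The regexp idea can be sketched roughly as follows. This is not code from the notifier itself: the sub name strip_ignored and the "Generated on" pattern are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every line that matches one of the ignore patterns.
sub strip_ignored {
    my ($text, @ignore) = @_;
    my @kept;
    LINE: for my $line (split /^/, $text) {
        for my $re (@ignore) {
            next LINE if $line =~ $re;
        }
        push @kept, $line;
    }
    return join '', @kept;
}

# Example pattern (an assumption): ignore a "Generated on ..." timestamp line.
my @ignore = (qr/^Generated on /);

my $old = "Hello\nGenerated on 2004-06-17\n";
my $new = "Hello\nGenerated on 2004-06-18\n";

my $same = strip_ignored($old, @ignore) eq strip_ignored($new, @ignore);
print $same ? "no relevant change\n" : "page changed\n";
```

Both versions are filtered before they are compared, so a page that differs only in the ignored line counts as unchanged.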

I wanted to use it as a replacement for the Personal Nodelet, so it has a special (undocumented) feature: links to Perl Monks are internally converted to links to the appropriate The Pen pages.

By the way, most current web browsers can notify you about changes to pages in your bookmarks.

Re^2: web page update notifier
created: 2004-06-18 08:27:16

By the way, most current web browsers can notify you about changes to pages in your bookmarks.

I don't just want to know that it changed, I want to know exactly which lines were added and removed. There are numerous scripts that do something like this, but creating a new one is MUCH easier than reading manuals of other scripts, because they're all bloated with features I don't need right now.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: web page update notifier
created: 2004-07-09 11:31:16

You could use an MD5 hash to check whether the file has
been modified. It is much more accurate and safe than
only using a diff. In addition, storing just the MD5 hash
takes much less space than storing the file.
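
A minimal sketch of that approach, using the core Digest::MD5 module. The sub name content_changed and the idea of keeping the digest in a separate file are assumptions made for the example; the content is assumed to be a byte string.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: compare the MD5 digest of $content (a byte string)
# with the digest saved in $digest_file, and save the new digest if it differs.
# Returns 1 if the content changed (or no digest was stored yet), 0 otherwise.
sub content_changed {
    my ($content, $digest_file) = @_;
    my $new = md5_hex($content);

    my $old = '';
    if (open my $in, '<', $digest_file) {
        $old = <$in> // '';
        chomp $old;
        close $in;
    }
    return 0 if $old eq $new;

    open my $out, '>', $digest_file or die "Cannot write $digest_file: $!";
    print {$out} "$new\n";
    close $out or die "Cannot close $digest_file: $!";
    return 1;
}
```

Note that a digest only says that something changed, not which lines.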

*someone said to use "HEAD", to check the last modified
date. This value is not safe/trustworthy.

[]'s

-DBC
Re^2: web page update notifier
created: 2004-07-12 06:54:37

It is much more accurate and safe than only using a diff.

Accuracy is irrelevant for text documents. Either a line is the same, or it is not. Besides that, I'm especially interested in *which* lines are different, and how they changed. diff tells me exactly that.

the last modified date. This value is not safe/trustworthy.

It has proven to be worthy of my trust.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

perlmonks.org content © perlmonks.org and b10m, BrowserUk, danielcid, ihb, Juerd, rob_au, zby
