Seeking to avoid re-inventing the wheel: can anyone point me towards anything pre-rolled to grab a web page and return the size of the page in bytes, both the source alone and the total (e.g. including the referenced images)? If I need to roll my own, I'm thinking of a mix of WWW::Mechanize, HTML::Parser, and Image::Size. But I would rather use something existing, either Perl or a Perl wrapper around a Linux command-line app. Thanks for any ideas.
Update: After thinking about this on a long drive today (and checking the W3C website to make sure), I realized that all you're going to be able to parse out of the HTML are the allotted pixel sizes of a web page. I believe that other monks in the thread have pointed out ways to get image sizes.
You might be able to determine a rough size without having to download all the images. If you download the HTML page, you could then send a HEAD request for each image it references and add up the reported sizes. Of course, this relies on the web server reporting a proper size in the Content-Length header, and I don't know how reliable this is, but it seems like it should be pretty good. (Doesn't HTTP use the Content-Length to determine how much to download? If so, then the Content-Length would probably be pretty accurate.)
HTH
Update: after re-reading, I'm not sure if I really answered your questions at all, but this is a thought I had. Perhaps it will at least give you some ideas...
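The HEAD-request idea above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not a tested tool: the names `ImageCollector`, `head_content_length`, and `estimate_page_size` are made up for this example, only `<img src>` references are counted (no CSS, scripts, or background images), and any image whose server sends no Content-Length is reported separately rather than guessed at.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def head_content_length(url):
    """Send a HEAD request; return the reported Content-Length or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def estimate_page_size(page_url, html_bytes, head=head_content_length):
    """Return (source_bytes, estimated_total_bytes, urls_without_length).

    `head` is injectable so the arithmetic can be checked without a network.
    """
    parser = ImageCollector()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()

    # Resolve relative links against the page URL; count each image once.
    image_urls = {urljoin(page_url, src) for src in parser.sources}

    total = len(html_bytes)
    unknown = []
    for url in sorted(image_urls):
        size = head(url)
        if size is None:
            unknown.append(url)  # server did not report a size
        else:
            total += size
    return len(html_bytes), total, unknown
```

To use it, fetch the page body yourself (for instance with `urlopen(page_url).read()`) and pass the bytes in. The result is only as good as the headers the server returns, which is the caveat already raised above.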
Using wget might be much easier in this situation. It has a bunch of switches to download all the graphics related to the page. curl and HTTrack are other options, but I am not familiar with them.
Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd
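For the wget route, one possible way to get a total is to let wget save the page and its requisites into an empty directory and then add up the file sizes. The sketch below assumes the download was already done with something like `wget --page-requisites --directory-prefix=page_dump URL` (`page_dump` is a hypothetical directory name, and images hosted on other domains would probably also need `--span-hosts`). The helper `directory_size` is an illustrative name, not part of any existing tool.

```python
import os


def directory_size(root):
    """Sum the sizes in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so nothing outside the dump is counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling `directory_size("page_dump")` would then give the combined size of the HTML source and everything wget fetched alongside it. The size of the saved HTML file on its own gives the source-only figure.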
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
get_sizes( URL, BASE_URL )
The get_sizes function is like get_size, but for HTML pages it also fetches all of the images and adds their sizes to the size of the original page. It returns the total download size. In list context it returns the total download size and a hash reference whose keys are the URLs of the images found in the HTML and whose values are hash references with these keys:
I couldn't tell from the docs -- does the module handle redirects? Does it count included external CSS and JavaScript files, etc.?
Curious, thanks!
perlmonks.org content © perlmonks.org and brian_d_foy, data64, merlyn, mhi, Popcorn Dave, revdiablo, water
prlmnks.org © 2006 edmund von der burg (eccles & toad)