determine web page size, w/ and w/o images
water
created: 2004-06-14 21:43:05
Hi --

Seeking to avoid re-inventing the wheel, can anyone point me towards anything pre-rolled to grab a web page and return the size of the page in bytes, both source and total (e.g. including the referenced images)? If I need to roll my own, I'm thinking of a mix of WWW::Mechanize, HTML::Parser, and Image::Size. But I would rather use something existing, either Perl or a Perl wrapper around a Linux command-line app. Thanks for any ideas.

water

Re: determine web page size, w/ and w/o images
created: 2004-06-14 22:51:51
If you had a page that followed the W3C standards, couldn't you grab the page with LWP::Simple, store that to a file, get the size, and use HTML::TokeParser to parse out the image sizes?

Update: After thinking about this on a long drive today (and checking the W3C website to make sure), I realized that all you're going to be able to parse out of the HTML are the allotted pixel sizes of the images on a web page, not their sizes in bytes. I believe that other monks in the thread have pointed out ways to get image sizes.
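To illustrate the point of the update, here is a minimal sketch of what parsing the page source can and cannot tell you. It is written in Python with only the standard library rather than with HTML::TokeParser, and the names ImageCollector and find_images, as well as the example.com URLs, are made up for illustration. The parser recovers each image's URL and its declared width and height in pixels, and nothing about bytes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect (absolute_url, width, height) for every <img> tag that has a src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if src:
            # width/height are the declared pixel dimensions (strings), or None if absent
            self.images.append(
                (urljoin(self.base_url, src), attr.get("width"), attr.get("height"))
            )


def find_images(html, base_url):
    """Hypothetical helper: return the image entries found in an HTML string."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images


# Illustrative page; the URLs are placeholders.
page = '<body><img src="/logo.png" width="120" height="60"><img src="pics/cat.jpg"></body>'
for url, width, height in find_images(page, "http://example.com/dir/index.html"):
    print(url, width, height)
```

The byte size of each image still has to come from the server, either by downloading it or by asking for its headers.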

There is no emoticon for what I'm feeling now.
Re: determine web page size, w/ and w/o images
created: 2004-06-14 23:20:28

You might be able to determine a rough size without having to download all the images. If you download the HTML page, you could then get a HEAD for each image it has, and add up all the reported sizes. Of course, this relies on the web server reporting a proper size, and I don't know how reliable this is, but it seems like it should be pretty good. (Doesn't HTTP use the Content-Length to determine how much to download? If so, then the Content-Length would probably be pretty accurate.)
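A rough sketch of that idea, in Python with the standard library rather than LWP: send a HEAD request per image and add up whatever Content-Length comes back. The function names head_content_length and estimate_total are hypothetical, and counting each distinct URL only once is my assumption about how a browser would fetch the page. Servers are free to omit Content-Length (for example when they use chunked transfer encoding), so the sketch reports those URLs separately instead of guessing.

```python
import urllib.request


def head_content_length(url, timeout=10):
    """Return the Content-Length reported for url via HEAD, or None if the header is absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def estimate_total(page_size, image_urls, lookup=head_content_length):
    """Add reported image sizes to page_size; return (total, urls_without_a_size)."""
    total = page_size
    unknown = []
    # Assumption: each distinct image is fetched once, however often it is referenced.
    for url in sorted(set(image_urls)):
        size = lookup(url)
        if size is None:
            unknown.append(url)
        else:
            total += size
    return total, unknown


# Offline demonstration with a stand-in lookup table (placeholder URLs and sizes);
# leave out the lookup argument to issue real HEAD requests instead.
fake_sizes = {"http://example.com/a.png": 2048, "http://example.com/b.gif": None}
total, unknown = estimate_total(5000, list(fake_sizes), lookup=fake_sizes.get)
print(total, unknown)
```

Any URL that ends up in the second return value would have to be downloaded in full to learn its size.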

HTH

Update: after re-reading, I'm not sure if I really answered your questions at all, but this is a thought I had. Perhaps it will at least give you some ideas...

Re: determine web page size, w/ and w/o images
created: 2004-06-15 00:42:10

Using wget might be much easier in this situation. It has a bunch of switches to download all the graphics related to the page. Curl and HTTrack also do the same thing, but I am not familiar with them.


Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd

Re: determine web page size, w/ and w/o images
mhi
created: 2004-06-15 01:08:14
See node 362257 for a more in-depth view on this topic and a couple of good hints on the possible complexities involved.
•Re: determine web page size, w/ and w/o images
created: 2004-06-15 11:15:48
The code may be a bit old and crufty but I think my column on calculating download time may be right up your alley.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re: determine web page size, w/ and w/o images
created: 2004-06-18 02:16:45
I wrote HTTP::Size to do just this. From the docs:
get_sizes( URL, BASE_URL )

The get_sizes function is like get_size, although for HTML pages it also fetches all of the images then sums the sizes of the original page and image sizes. It returns a total download size. In list context it returns the total download size and a hash reference whose keys are the URLs of the images found in the HTML and whose values are hash references with these keys:
--
brian d foy
Re^2: determine web page size, w/ and w/o images
created: 2004-06-18 16:51:39
Great! Thanks!

I couldn't tell from the docs -- does the module handle redirects? What about included external CSS and JavaScript files, etc.?

Curious, thanks!

perlmonks.org content © perlmonks.org and brian_d_foy, data64, merlyn, mhi, Popcorn Dave, revdiablo, water
