This isn't so much about Mechanize as it is about being gentle with webservers. Say I'm spidering a domain, looking for text. When examining links to decide where to go next, would this be the best way to fetch only HTML while staying gentle on the webserver?
1. Regex the links for non-HTML extensions and drop those, keeping everything else as a possibility.
2. HEAD $url and check the Content-Type for text/html and charset info.
3. GET, but only if the HEAD says it is HTML with a good charset or no charset.
Instead of doing the HEAD, I could just do a GET and check the header info and body for good HTML, but I thought fetching only the headers would be easier on the webserver, even if it means making two separate connections.
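
The HEAD-then-GET steps above can be sketched roughly as follows. This is an illustration in Python using only the standard library, not WWW::Mechanize code; the names `looks_like_html`, `fetch_if_html`, and `ACCEPTED_CHARSETS` are hypothetical, and the set of "good" charsets is an assumption to adjust.

```python
import urllib.request
from email.message import Message

# Assumption: which charsets count as "good" is up to the spider's author.
ACCEPTED_CHARSETS = {"utf-8", "iso-8859-1", "us-ascii"}

def looks_like_html(content_type_header):
    """True if the header value names text/html with an accepted or absent charset."""
    if not content_type_header:
        return False
    msg = Message()
    msg["Content-Type"] = content_type_header
    if msg.get_content_type() != "text/html":
        return False
    charset = msg.get_content_charset()  # lowercased, or None if not given
    return charset is None or charset in ACCEPTED_CHARSETS

def fetch_if_html(url, timeout=10):
    """HEAD first; GET the body only when the headers look like usable HTML."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        if not looks_like_html(resp.headers.get("Content-Type")):
            return None
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Some servers reject HEAD or answer it differently from GET, in which case `urlopen` raises an error or the check misleads, so a real spider would likely need a fallback to a plain GET.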
Re: WWW::Mechanize treading lightly
Since the choice between GET and HEAD would be very website-dependent, filtering out non-HTML extensions would seem the way to go.
Judicious use of sleep between requests would also be a simple way to be gentle.
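
A minimal sketch of that kind of pause, in Python rather than Perl: a small helper that enforces a minimum gap between requests. The `Throttle` class and its parameters are hypothetical names for illustration, and the right interval depends on the site being crawled.

```python
import time

class Throttle:
    """Sleep as needed so consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the delay logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request means time already spent parsing the previous page counts toward the delay, so the spider never sleeps longer than it has to.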
Re: WWW::Mechanize treading lightly
How can you make _any_ decision based upon the extension of the URL?!?!? Any of these (and many others) could produce HTML ... you really need to look at the Content-Type.
http://perlmonks.org/?parent=565470;node_id=3333
http://example.com
http://example.com/blah.html
http://example.com/blah.foo
http://example.com/blah.htm
http://example.com/blah.php
http://example.com/blah.cgi
http://example.com/blah.pl
http://example.com/blah.asp
http://example.com/blah/foo/
http://example.com/blah.exe # even this, if someone so configured the web server
Re^2: WWW::Mechanize treading lightly
Right, but I was thinking that you could at least drop .mp3, .gif, .jpeg just for an easy first cut, no?
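
As a rough sketch of that first cut (Python, standard library only): look at the extension of the URL's path and skip only the obvious media types, leaving everything else to a Content-Type check. The `SKIP_EXTENSIONS` set and the `worth_checking` name are hypothetical, and the list is limited to the extensions mentioned in this thread.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative skip list; extend it to taste.
SKIP_EXTENSIONS = {".mp3", ".gif", ".jpg", ".jpeg"}

def worth_checking(url):
    """False only when the URL path ends in a known non-HTML extension."""
    path = urlsplit(url).path              # ignores the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext not in SKIP_EXTENSIONS
```

URLs with no extension, with a query string only, or with an unfamiliar extension such as .php, .cgi, or .foo all pass through, so the filter removes requests without deciding anything about what is HTML.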
Re^3: WWW::Mechanize treading lightly
That is exactly what I was thinking. Any site whose webserver is configured to spit out HTML from .jpg URLs isn't one I want to bother with. Extensions serve a purpose, and while they can be abused, a site that abuses them like that isn't one whose text I need to see.
Thanks for the input.