Update: I guess I wasn't clear. This question is not "how can I download a lot of Web pages quickly" or "how do I cache a Web page" or "how do I check a Web resource most efficiently using HTTP." (Although I do appreciate the efforts made and answers issued along these lines.)
The question is: given a collection of past download times, how would you determine the best "typical" time to download a collection of Web pages? I offer my planned approach below. If you can think of something better, I'd love to hear about it. /Update
I have a script like the one described above, and do indeed run it once each morning. It takes 5-10 seconds to download the Web pages and parse (in the case of RSS/Atom feeds) or scrape (in the case of HTML) the pages and amalgamate the bits of info I care about into one page.
I recently got greedy and thought, "how can I make this even faster?" Like, make it run in 2 seconds or less.
I thought perhaps of caching the results of my script's HTTP fetches, so that subsequent runs of the script are faster. But since most of the Web pages change on a daily basis, and I rarely check more than once each day, and am the sole user (for now), this seemed like a waste of time. The cache would always be out of date.
Then I realized I usually check at roughly the same time every day. In a given 20 weekdays, I might check within 15 minutes of 8:30 am 15 times, closer to 6:30 am once, closer to 7:30 am once, and closer to 10 am three times.
The ideal time to pre-fetch those Web pages would probably be about 8 am -- early enough to be before 18 of the visits, so I hit the cache, and late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh.
I am presently thinking about a simple, rough approach -- take the last two weeks of downloads, compute a time that would come before 80 percent of the downloads, and subtract 30 minutes from that.
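A minimal sketch of that rule, assuming each past download is stored as minutes since midnight. The sub name `prefetch_minute` and the sample data (which mirror the 20-weekday example above) are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: one visit near 6:30, one near 7:30,
# fifteen near 8:30 and three near 10:00 (minutes since midnight).
my @visits = (6 * 60 + 30, 7 * 60 + 30, (8 * 60 + 30) x 15, (10 * 60) x 3);

# Return a prefetch time (minutes since midnight) that falls $lead minutes
# before a time preceding at least $coverage_pct percent of the visits.
sub prefetch_minute {
    my ($visits, $coverage_pct, $lead) = @_;
    my @sorted = sort { $a <=> $b } @$visits;
    return undef unless @sorted;

    # Skip the earliest (100 - coverage) percent of visits. Integer
    # percentages avoid floating-point surprises such as 1 - 0.8.
    my $skip = int(scalar(@sorted) * (100 - $coverage_pct) / 100);
    $skip = $#sorted if $skip > $#sorted;

    my $minute = $sorted[$skip] - $lead;
    $minute += 24 * 60 if $minute < 0;    # wrap past midnight
    return $minute;
}

my $m = prefetch_minute(\@visits, 80, 30);
printf "prefetch at %02d:%02d\n", int($m / 60), $m % 60;   # prints "prefetch at 08:00"
```

With this sample, the fifth-earliest visit is at 8:30, so the rule lands on 8:00, the same time picked by eye above.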
I have two concerns:
1. Am I reinventing the wheel? I have looked into tools like squid but do not believe I have found any existing tools, inside or outside the Perl world, to do what I want.
2. Is there a more flexible approach to be had without adding too much complexity or having to go back to university for a proper math/stats/cs/ai schooling (I do not program for a living)? I have looked at AI::Fuzzy* modules (see for example AI::FuzzyInference) but not played around with them yet.
Flexibility could help if my needs change. For example, say I start adding new collections of aggregated Web pages that I check more than once per day or less than once per day? Obviously I would need a more sophisticated system.
Or I might add another user who turns out to be much less predictable. It might be nice if the script could say, "bleh, you are too random, let's not pre-fetch at all and suck unneeded bandwidth."
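One way the "too random" test could look, as a sketch only: score each candidate prefetch time by the fraction of past visits that would have found a fresh cache, and give up if even the best candidate scores poorly. The sub names, the freshness window and the cutoff are all assumptions, not settled parts of the plan:

```perl
use strict;
use warnings;

# Fraction of visits arriving after $prefetch while the cached copy is
# still no older than $max_age minutes. Times are minutes since midnight.
sub hit_rate {
    my ($visits, $prefetch, $max_age) = @_;
    return 0 unless @$visits;
    my $hits = grep { $_ >= $prefetch && $_ - $prefetch <= $max_age } @$visits;
    return $hits / @$visits;
}

# Try a prefetch $lead minutes before each observed visit. Return the
# best-scoring time, or undef if no candidate reaches $min_rate, which
# is the "you are too random, let's not pre-fetch" case.
sub choose_prefetch {
    my ($visits, $lead, $max_age, $min_rate) = @_;
    my ($best_time, $best_rate) = (undef, 0);
    for my $candidate (map { $_ - $lead } @$visits) {
        my $rate = hit_rate($visits, $candidate, $max_age);
        ($best_time, $best_rate) = ($candidate, $rate) if $rate > $best_rate;
    }
    return $best_rate >= $min_rate ? $best_time : undef;
}

# Hypothetical users: one regular, one scattered across the day.
my @regular = (390, 450, (510) x 15, (600) x 3);
my @erratic = (300, 480, 700, 900, 1200);

for my $user (\@regular, \@erratic) {
    my $t = choose_prefetch($user, 30, 45, 0.6);
    print defined $t ? "prefetch at minute $t\n" : "too random, skip prefetch\n";
}
```

The cutoff trades wasted bandwidth against cache misses, so it would need tuning per user.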
After all, if I *only* want this pre-fetching to help just me in this one use scenario, I can just eyeball my own script invocation and pick a time (like 8 am) and implement the cache. I'd like to come up with something that can be fast for other people.
Any general thoughts appreciated. Obviously, I am not yet at the coding stage.
Is it really worth your time to save a couple of seconds?
You could cache the data, then send conditional GET requests to fetch the resources whenever you need them. If the server tells you the page hasn't been updated, you use the cache. If it has been updated, the server sends you the new data and you use that.
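A rough sketch of that conditional GET flow using the core HTTP::Tiny module (LWP would work along the same lines). The sub name `fetch_with_cache` and the in-memory hash layout are assumptions for illustration; a real cache would be stored on disk between runs:

```perl
use strict;
use warnings;
use HTTP::Tiny;

# $cache is a plain hash: url => { content, etag, last_modified }.
# $http is an HTTP::Tiny object (or anything with a compatible get()).
sub fetch_with_cache {
    my ($http, $cache, $url) = @_;
    my $entry = $cache->{$url};

    # Send the validators from the previous response, if there was one.
    my %headers;
    if ($entry) {
        $headers{'If-None-Match'} = $entry->{etag}
            if defined $entry->{etag};
        $headers{'If-Modified-Since'} = $entry->{last_modified}
            if defined $entry->{last_modified};
    }

    my $res = $http->get($url, { headers => \%headers });

    # 304 Not Modified: the server sent no body, so use the cached copy.
    return $entry->{content} if $entry && $res->{status} == 304;

    die "fetch of $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};

    # Fresh content: remember it along with its validators.
    $cache->{$url} = {
        content       => $res->{content},
        etag          => $res->{headers}{etag},
        last_modified => $res->{headers}{'last-modified'},
    };
    return $res->{content};
}
```

Note that this still costs one round trip per page even when nothing has changed; it saves the transfer, not the connection.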
I'm also planning to use Keep-Alive in cases where I need multiple items from the same server, if that's of interest to you as well. But it's sort of beside the point.
And yes, it would be worth it to save a couple of seconds, if I learned lessons that would let me implement such a cache for an arbitrary collection of Web pages for an arbitrary user. The difference between 1-2 seconds response time and 5-10 seconds makes all the difference in the world for a Web application.
Obviously, with multiple users, the value of a conventional cache goes up. But I am interested in pre-fetching, so I reduced my question to the simplest case (which happens to be the only real one at the moment).
Something that will work is parallelizing the retrieval of the pages/feeds. Create an application, say with Parallel::ForkManager, that creates multiple processes, each one fetching one site and processing it. Then assemble the results from all the children into your composite feed. The time taken will be only a little longer than the slowest website/feed.
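To show the shape of this, here is a sketch that uses plain fork and pipes from core Perl rather than Parallel::ForkManager itself, so it runs without extra modules. `run_parallel` and the `$worker` callback are hypothetical names; the callback stands in for whatever fetches and parses one site:

```perl
use strict;
use warnings;
use POSIX ();

# Run $worker->($item) in one child process per item and return a hash
# reference mapping each item to the string its child produced.
sub run_parallel {
    my ($worker, @items) = @_;
    my %reader;
    for my $item (@items) {
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                 # child: do the work, write, leave
            close $r;
            print {$w} $worker->($item);
            close $w;                    # flushes the result to the parent
            POSIX::_exit(0);             # skip the parent's END blocks
        }
        close $w;                        # parent keeps only the read end
        $reader{$item} = $r;
    }

    # Collect each child's output; all children run while we wait.
    my %result;
    for my $item (@items) {
        my $fh = $reader{$item};
        local $/;                        # slurp everything the child wrote
        $result{$item} = <$fh>;
        close $fh;
    }
    1 while wait() != -1;                # reap the children
    return \%result;
}
```

Called as `run_parallel(sub { fetch_and_parse($_[0]) }, @urls)`, where `fetch_and_parse` is a placeholder for the existing per-site code, the parent then assembles the returned strings into the composite page.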
-Mark
The benefit is that, once a page is cached, I do not have to connect to the server and download it at all. With 30 pages, that overhead adds up.
I'm already parallelizing the retrieval. I'm using LWP::Parallel after finding little additional speed benefit from either POE or HTTP::GHTTP with P::ForkManager.
Thanks anyway.
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
I think you need to work on your specification more. You said something about averaging the last two weeks of downloads and putting the prefetch time 30 minutes before 80% of them. Why? You haven't specified. Why not just do it before all of those times? For that matter, why not prefetch at midnight the night before? There's presumably some constraint, like you need the latest content possible. If so, then you need to specify the maximum tolerable oldness of the content. Or the maximum average oldness. Then you need to specify whether you care if the user sometimes fetches un-prefetched content, or what percent of the time that's allowed to happen. The way you've presented it here, it seems to me that simply caching the first download would be sufficient, or as others suggested using a normal caching proxy like squid.
I think after you've really specified the problem, the solution will probably fall out naturally. It seems to me, however, that you're less concerned about solving a problem than trying to find a problem. If so, then maybe studying up on math and AI really is what you want to do.
Why not do it at midnight? Freshness. Note the part where I say "late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh."
Many of the sources I read update during the night. Think of a page of online newspaper links, typically updated around 3 am in whatever time zone the newspaper is located. But I'm also mixing in blog feeds, updated on a less predictable schedule. So the goal is to cache as soon as possible before a likely visit.
Simply caching -- with a conventional ttl scheme like you describe -- is, as I explained in my post, not going to cut it. Note I am not dealing with images or other static content that could live happily in such a cache -- or links to other pages, some of which may be static -- only the text of the Web pages, most of which, again, change every single day.
I appreciate your reply.
perlmonks.org content © perlmonks.org and brian_d_foy, ForgotPasswordAgain, ioannis, kvale, merlyn, perrin, ryantate
prlmnks.org © 2006 edmund von der burg (eccles & toad)