Perl 5.6.1

Total Elapsed Time = 0.080048 Seconds
  User+System Time = 0.080048 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 87.4   0.070  0.070     40   0.0018 0.0018  main::extract
 12.4   0.010  0.010      1   0.0100 0.0100  warnings::BEGIN
 0.00   0.000  0.010      2   0.0000 0.0050  main::BEGIN
 0.00   0.000  0.000      1   0.0000 0.0000  warnings::import
 0.00   0.000  0.000      1   0.0000 0.0000  strict::import
 0.00   0.000  0.000      1   0.0000 0.0000  strict::bits
 0.00   0.000  0.000      1   0.0000 0.0000  Exporter::import
 0.00   0.000  0.000      1   0.0000 0.0000  warnings::bits

Perl 5.8.0

Total Elapsed Time = 123.5199 Seconds
  User+System Time = 39.62993 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 97.1   38.49 38.520     40   0.9622 0.9630  main::extract
 0.05   0.020  0.020      1   0.0200 0.0200  utf8::SWASHNEW
 0.03   0.010  0.010      1   0.0100 0.0100  utf8::AUTOLOAD
 0.00       - -0.000      1        -      -  utf8::SWASHGET
 0.00       - -0.000      1        -      -  Exporter::import
 0.00       - -0.000      1        -      -  warnings::unimport
 0.00       - -0.000      2        -      -  warnings::import
 0.00       - -0.000      1        -      -  warnings::BEGIN
 0.00       - -0.000      2        -      -  strict::unimport
 0.00       - -0.000      4        -      -  strict::bits
 0.00       - -0.000      2        -      -  strict::import
 0.00       - -0.000      3        -      -  main::BEGIN
 0.00       - -0.000      5        -      -  utf8::BEGIN

Perl 5.8.5

%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 98.4   0.630  0.630     40   0.0157 0.0157  main::extract
 1.56   0.010  0.010      1   0.0100 0.0100  warnings::BEGIN
 0.00       - -0.000      1        -      -  warnings::import
 0.00       - -0.000      1        -      -  strict::import
 0.00       - -0.000      1        -      -  strict::bits
 0.00       -  0.010      2        - 0.0050  main::BEGIN

The main::extract subroutine takes about 9 times longer under Perl 5.8.5, and 549 times longer under Perl 5.8.0, compared to Perl 5.6.1. The program itself took 1,543 times longer to finish under Perl 5.8.0 than it did under Perl 5.6.1. You may be wondering what the Perl program is:
    use strict;
    use warnings;

    open (FILE, "a.txt");
    my $text = "";
    while (<FILE>) {
        $text .= $_;
    }
    close (FILE);

    while (my ($one, $two) = extract ($text)) {
        $text = $one . $two;
    }

    sub extract {
        my ($text) = @_;
        if ($text =~ /(.*?)whatever(.*)/is) {
            return ($1, $2);
        }
        return ();
    }

As you can see, this code slurps a file and removes all occurrences of a certain word (`whatever'). If you're wondering why Perl 5.8.0 took 2 minutes, it's not because the file was large: its size was exactly 11,221 bytes (about ten thousand).
Perl 5.6.1

%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 88.1   0.670  0.670      1   0.6700 0.6700  main::extract

Perl 5.8.0

%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 95.0   2.490  2.510      1   2.4900 2.5100  main::extract

Perl 5.8.5

%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 19.5   0.080  0.080      1   0.0800 0.0800  main::extract

It's obvious that the little hat balanced out the differences between the three releases (although 3.7 times longer with Perl 5.8.0 is reason enough NOT to upgrade). Perl 5.8.5, in its current build, was faster than Perl 5.6.1. The differences exist on account of different versions and different build parameters. To be more exact, here are the configuration summaries for the three releases:
Perl 5.6.1 Configuration Summary
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
Perl 5.8.0 Configuration Summary
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio= d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
Perl 5.8.5 Configuration Summary
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
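The conclusion is that all regular expressions written like this:
    $text =~ /(.*?)<whatever>/
take a thousand times longer on 5.8.0. The same expressions written as
    $text =~ /^(.*?)<whatever>/
which obviously mean the same thing (look for the first occurrence of <whatever> and save the text preceding it in the corresponding variables), have the same performance implications across these two versions.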
"The conclusion is that all regular expressions written like this: $text =~ /(.*?)<whatever>/ take a thousand times more on 5.8.0. The same expressions written as $text =~ /^(.*?)<whatever>/ which obviously means the same thing (look for the first occurrence of <whatever> and save the text preceding it in the corresponding variables) has the same performance implications across these two versions."

Sadly, that is not true, and that is exactly what I had to change in the source of perl. You say that /(.*)X/ and /^(.*)X/ mean the same thing, but that is a half-truth. Consider this case:
    "xxyyyRyyy" =~ /(.*)R\1/
If, as you state, the leading ^ is implied, the regex fails, because "xxyyy" cannot be found after the "R" as my regex requires. Only by not anchoring that regex can it ever match ($1 is "yyy").
There is no "easy" way to fix this problem in the source of perl; you have to explicitly state the anchor yourself. The reason is that perl has no way of knowing whether or not you'll end up using what you captured as a backreference, so anchoring has an unknown effect. The problem is not only when the .* is captured, either; any capturing in the regex causes a problem.
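The "xxyyyRyyy" case can be checked directly. The following is a minimal sketch, not code from the perl source; the helper name first_capture is made up for illustration. It runs the same pattern with and without an explicit ^ and reports what $1 holds.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $1 on success, undef on failure.
sub first_capture {
    my ($string, $regex) = @_;
    if ($string =~ $regex) {
        my $capture = $1;
        return $capture;
    }
    return;
}

my $string = "xxyyyRyyy";

# Unanchored: the engine may start the match at offset 2, so (.*) can be "yyy".
my $unanchored = first_capture($string, qr/(.*)R\1/);

# Anchored: (.*) must start at offset 0, so it can only be "xxyyy" before the R,
# and "xxyyy" does not follow the R.
my $anchored = first_capture($string, qr/^(.*)R\1/);

print "unanchored: ", (defined $unanchored ? "matched, \$1 = '$unanchored'" : "no match"), "\n";
print "anchored:   ", (defined $anchored   ? "matched, \$1 = '$anchored'"   : "no match"), "\n";
```

On a perl with the fix described above, the unanchored form should match with $1 set to "yyy" and the anchored form should fail.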
(The case of "abc\ndef1" =~ /.*\d/ is already handled by the engine so as not to fail. It would fail if the regex were treated as /^.*\d/, but the engine makes it (?m:^) if necessary.)
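The multi-line case can be sketched the same way. This only shows the observable behaviour of the three spellings, not how the engine implements the optimization; matched_text is a hypothetical helper that uses @- and @+ to return the matched substring.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the whole match, or undef on failure.
sub matched_text {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    return substr($string, $-[0], $+[0] - $-[0]);
}

my $string = "abc\ndef1";

# No anchor: the match is free to start on the second line.
my $plain = matched_text($string, qr/.*\d/);

# Start-of-string anchor: "." cannot cross the newline, so no digit is reachable.
my $string_anchored = matched_text($string, qr/^.*\d/);

# Start-of-line anchor: behaves like the unanchored pattern here.
my $line_anchored = matched_text($string, qr/(?m:^).*\d/);

for my $result ([plain => $plain], ['^' => $string_anchored], ['(?m:^)' => $line_anchored]) {
    my ($label, $text) = @$result;
    print "$label: ", (defined $text ? "matched '$text'" : "no match"), "\n";
}
```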
And since perldeveloper removes the "whatever"s from the string the regexp will fail in the last iteration. So I would not be surprised if most of the wasted time was in the last iteration :-)
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
    #reg.pl
    $s = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyRRRRyyyy\n" x 500;
    $n = 0;
    $n++ while ($s =~ /(.*?)RRRR\1/sg);
    print "$n matches\n";

    time ~/bin/perl5.8.0 reg.pl
    500 matches
    real 0m4.836s
    user 0m4.800s
    sys  0m0.010s

    time ~/bin/perl5.6.1 reg.pl
    0 matches
    real 0m0.020s
    user 0m0.020s
    sys  0m0.000s

So, in fact, you are complaining that a bug got fixed. The problem is that these are extremely inefficient regular expressions because they involve a lot of backtracking. I recommend reading Mastering Regular Expressions for a detailed explanation.
    time ~/bin/perl5.8.0 reg.pl
    500 matches
    real 0m0.018s
    user 0m0.010s
    sys  0m0.010s

    time ~/bin/perl5.6.1 reg.pl
    1 matches
    real 0m0.015s
    user 0m0.010s
    sys  0m0.000s
So at least in this case Perl 5.8.0 doesn't have a speed problem. I don't know exactly what's going on in your code though.
    $s = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyRRRyyyy\n" x 300;
    $n = 0;
    $n++ while ($s =~ /(.*?)RRRR/isg);
    print "$n matches\n";
To summarize: 5.6.1: 0 matches, 0.32 s; 5.8.0: 0 matches, 2.2 s.
But note that if I change the regex to /x(.*?)RRRR/isg the results are reversed: 5.6.1: 9.2 s; 5.8.0: 1.4 s. That's because now 5.6.1 can't get away with the fake anchor. Interesting...
Unfortunately you began with the rude shock of seeing an amazing slowdown. Therefore while in other circumstances you might agree that you want the right answer, anything below the speed which you were accustomed to is bad.
On the specific optimization that you offer, you're right and wrong. You're right that you can optimize that one regular expression that way and it would be good for that regular expression. But it wouldn't speed up the one that you did want to run. Furthermore adding a check for that special case would slow down the compilation of every other regular expression out there (including the one that you wanted to run). Furthermore you've just added a code path that has potential bugs which might not get caught.
This is not to say that you never want to speed up special cases - of course you do and the regular expression engine has a lot of special tricks. But you have to balance out what is sped up by any one trick against how it slows other people down and causes opportunities for bugs to lurk.
That said, I'd like to point out why the optimization that you point out would not solve your problem. It would tell how to solve a particular expression that you weren't running. The one that you tried to run is different enough that the optimization would probably not run. What you actually would have benefited from is an optimization that says, "Check that there are no backreferences within the RE, then turn on the old special case optimizations." Which might or might not work out to be worthwhile. (And I do not wonder that japhy just chose to turn the optimization off rather than put a test that is that complicated in.)
Ok so japhy told you why that is slow, here's a way to make your code fast regardless - don't even bother with the capturing.
Capturing is always slow because it has to make a copy of the source string. $1, internally, is just substr( $safe_copy_of_match, $-[1], $+[1] - $-[1] ). So the largest speed hit (that I'm aware of) is the memory operation of making a safe duplicate of the data that was just matched. COW (copy on write) may mitigate this if/when it ever gets into perl.
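The relationship between $1 and the offset arrays can be seen from ordinary Perl code. The sketch below makes no claim about the interpreter's internals; it only shows that the text of each capture group can be rebuilt from the original string with @- and @+. The sample string is made up.

```perl
use strict;
use warnings;

my $text = "keep this WHATEVER and this too";
my ($from_capture, $from_offsets);

if ($text =~ /(.*?)whatever(.*)/is) {
    # The usual way: read the capture variables.
    $from_capture = "$1|$2";

    # The same text, rebuilt from the start (@-) and end (@+) offsets
    # of groups 1 and 2.
    $from_offsets = join "|",
        map { substr($text, $-[$_], $+[$_] - $-[$_]) } 1, 2;
}

print "captures: $from_capture\n";
print "offsets:  $from_offsets\n";
```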
Likely to be fastest. This was my second thought.
    my $whatever_index = index lc $text, $whatever;
    return( substr( $text, 0, $whatever_index ),
            substr( $text, $whatever_index + length $whatever ) );
This may be the fastest. It was my third thought.
    my $whatever_index = index lc $text, $whatever;
    my $whatever_length = length $whatever;
    return unpack "a" . $whatever_index . "x" . $whatever_length . "a*", $text;
This was my first thought. Use a plain regex to *locate* the thing in the string and then just substr() the equivalent of the captures out. This happens to be simplest to look at so it wins on the visual-complexity scale. This is a great general technique to avoid capturing on regexes and as such is a great post-benchmarking optimization.
    if ( $text =~ /whatever/i )
    {
        return( substr( $text, 0, $-[0] ),
                substr( $text, $+[0] ) );
    }
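For anyone who wants to try these side by side, here is a sketch that wraps the three ideas as complete subs and feeds each through the same removal loop as the original program. The sub names, the remove_all helper, the sample string and the not-found checks are my additions, not part of the snippets above. The index-based variants assume $whatever is lowercase and that lc does not change the length of the text (true for plain ASCII).

```perl
use strict;
use warnings;

my $whatever = "whatever";    # assumed lowercase, because the text is lc'd

# index + substr
sub extract_index {
    my ($text) = @_;
    my $i = index(lc($text), $whatever);
    return () if $i < 0;                      # added: index returns -1 when absent
    return (substr($text, 0, $i), substr($text, $i + length $whatever));
}

# index + unpack: "a<N>" takes N bytes, "x<M>" skips M bytes, "a*" takes the rest
sub extract_unpack {
    my ($text) = @_;
    my $i = index(lc($text), $whatever);
    return () if $i < 0;
    return unpack "a" . $i . "x" . length($whatever) . "a*", $text;
}

# plain regex to locate, then substr using the match offsets
sub extract_offsets {
    my ($text) = @_;
    return () unless $text =~ /whatever/i;
    return (substr($text, 0, $-[0]), substr($text, $+[0]));
}

# Same loop shape as the original program, parameterised on the extractor.
sub remove_all {
    my ($extract, $text) = @_;
    while (my ($before, $after) = $extract->($text)) {
        $text = $before . $after;
    }
    return $text;
}

my $sample = "alpha Whatever beta WHATEVER gamma";
for my $variant ([index => \&extract_index], [unpack => \&extract_unpack], [offsets => \&extract_offsets]) {
    my ($name, $extract) = @$variant;
    print "$name: '", remove_all($extract, $sample), "'\n";
}
```

Which of the three is actually fastest depends on the perl build and the data, so it is worth measuring with the Benchmark module before committing to one.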
    open (FILE, "a.txt");
    $/=undef;
    $txt = <FILE>;
    $txt =~ s/whatever//sig;
"As you can see, this code slurps a file and removes all occurences of a certain word (`whatever')."
Perl 5.8.0 is slow on your system because Red Hat compiled it with threads and debugging turned on (which you didn't do in your 5.8.5 compile) and because they set the locale to use unicode and folded in a bunch of patches for unicode that were not in the official 5.8.0 release. This has been written about extensively. See the Red Hat bugzilla for more details. This was fixed in 5.8.1. The remaining slowdown of 3.7 is probably due to the regex change that japhy mentioned.
The whole point of my post is that it wasn't caused by a new version of Perl, but rather by things that Red Hat did when packaging Perl for RH 8/9. The regex change was a difference in Perl itself, but it was also fixing a bug so it seems like a reasonable choice to include it.
I can tell you one thing: if IBM had written Perl, this would have never happened. Maybe there aren't enough alpha and beta testers, maybe developers don't have the time to write enough warning messages. What's certain is that Perl is not seen as a product, and the members of the community it attempts to serve are not being looked upon as customers. And that's the very difference between Open source and closed source software. What good is it being free, if it is deceiving its users about the problems it claims to solve?
Do you often find that insinuating that people are ignorant, malicious, sloppy, or stupid makes them likely to help you?
I understand the frustration. It's sometimes difficult to remember that dozens of people have put thousands of hours into a project given away freely for other people to use when you find an apparent bug, but it's very wise to keep that in mind.
Your description of the problem was very good, though.
I've talked to proprietary vendors before. Maybe some don't cause you to worry, but those I can think of did not inspire me with confidence.
Barnraising your IT might be an interesting read.
Make sure to read the comments on sentiments about commercial vendors and contracts.
Makeshifts last the longest.
This is an issue of good Perl and bad Perl
You are correct. .*? is perhaps one of the least efficient singular regex constructs available. Why are you matching text you are not keeping, anyways? Are you unaware that there is an entirely separate construct (s/whatever//) made for removing text?
Have you not read the extensive perlre documentation for the product that you are using? Just because something is free doesn't mean that you automatically know how to use it right-out-of-the-box.
Also, if IBM had written Perl, it would probably take over a minute to start while it loaded its built in WSADIE plugins for J2EE development.
I do not understand your reasoning. This loose collective that you are calling the perl community is supposed to police the entire tech sector to make sure that perl is implemented correctly everywhere? Even the 8000 pound Gorilla of Microsoft can't do something like that (I've seen pretty horrible stuff done with their tools). That is akin to saying the fact that your house is defective is the fault of the company who made the hammers that were used.
What do you expect? RedHat have packaged a completely broken version of GCC. RedHat have packaged problematically patched Linux kernels. RedHat have packaged a half-broken TeX distribution. That RedHat would seriously bork a Perl package seems hardly surprising.
It's not the GCC project's fault and not the Linux kernel team's fault and not the TeX project's fault that RedHat broke their software and it isn't the Perl5 porters' fault that RedHat broke their Perl package either.
Makeshifts last the longest.
... RedHat ships with badly packaged Tcl/Tk, (this was discussed in Tcl::Tk module development list and this makes supporting of Tcl::Tk considerably harder on RedHat)
Not sure I agree with *that*... As an old mainframe programmer, I can tell you that when COBOL went from COBOL to VS COBOL II back in the late 80s, we practically had to recompile our entire mainframe library.
And it wasn't like it was obscure stuff... They did away with the EXAMINE statement, which was a staple of COBOL development.
It did away with the ON statement... and would no longer accept LABEL RECORDS...
Worst of all, the TRANSFORM statement vanished.
They had (supposedly) good reasons for making those kind of fundamental changes, but it didn't change the fact that COBOL, arguably the de facto programming standard of the time, was fundamentally changed long after it was a mature product.
So, don't be so sure that IBM wouldn't have done the same thing... they've done it before :)
Trek
There was a *little* of that... it wasn't so much that code ran slower, but the amount of memory that was taken up by the reserved levels (77 level, if I recall correctly... it was a *long* time ago...) did increase, which had its own set of side effects.
But I'll at least grant that we had prior warnings, and did receive a transformation guide from IBM
Trek
In some places IBM is still best remembered as the company that made typewriters which came with a service contract that guaranteed they were replaced or repaired within 24 hours. They broke down a lot, but were repaired/replaced even faster. The price of the service contract was such that you actually paid for a new machine every two years!
Wonderful piece of company PR: how to turn bad quality and expensive servicing into a strong selling point!
At least we don't have that with Open Source!
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
I am going to guess that you have a UTF-8 LANG set in your environment.
A consequence of this in 5.8.0, but not in 5.6.1 or 5.8.5, is that your file is implicitly opened as UTF-8. This may seem minor, but because you included the /i modifier, it probably slowed it down a lot, since case insensitivity in Unicode is a lot more complicated. You could test this by modifying your environment and rerunning, or by explicitly opening the file as latin-1, or by removing the /i.
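One way to run the "explicitly opening the file as latin-1" test is sketched below, using the three-argument open with an encoding layer. The sub name slurp_latin1 is made up, "a.txt" is the file name from the original program, and whether this actually removes the slowdown on the poster's system is untested.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file, decoding it as ISO-8859-1 (latin-1)
# regardless of what the locale or LANG would otherwise imply.
sub slurp_latin1 {
    my ($path) = @_;
    open(my $fh, '<:encoding(iso-8859-1)', $path)
        or die "Cannot open $path: $!";
    local $/;                 # undef input separator: read everything at once
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Only runs when the thread's a.txt is present in the current directory.
if (-e "a.txt") {
    my $text = slurp_latin1("a.txt");
    $text =~ s/whatever//gi;
    print length($text), " characters left after removal\n";
}
```

The other two checks need no code changes beyond the obvious: run the script with LANG=C in the environment, or drop the /i modifier and compare timings.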
You seem to discount the speed up you saw between 5.6.1 and 5.8.5 with your second regex version. I don't think this is really fair. I suspect that the regex engine really is faster in the later versions, when they are actually doing the same thing.
The problem really seems to be that due to some subtleties in how certain things work in different versions of perl, the regex engine is not doing the same things in each of your cases. Since you are so willing to criticize the Perl community, I will gladly turn around and criticize you. This is not particularly obscure information. It's pretty well explained in perldelta, perlunicode, and other man pages. You apparently made the decision to upgrade perl versions without taking the time to research what changed. 5.6 to 5.8 is not a minor change: there are significant changes between the two which you would have been well advised to consider before making the switch.
Furthermore, did you even stop to wonder why there were additional functions being called in one case and not the others? Don't you think this ought to have been a clue that things were not as simple as you would like to think?
perlmonks.org content © perlmonks.org and Aristotle, chromatic, CountZero, Courage, diotalevi, itub, japhy, Jenda, jryan, kscaldef, perldeveloper, perrin, PhilHibbs, sleepingsquirrel, synistar, tilly, TrekNoid