Here's a snippet of my code
my @patterns = (
qr/\bcreate\b/,
qr/\bdrop\b/,
qr/\bdelete\b/,
qr/\bupdate\b/,
qr/\binsert\b/,
);
open a database connection here (using Sybase::DBD) and
create a statement string to execute
$sth=$dbh->prepare("@sqlstatement");
$sth->execute;
while ($data = $sth->fetchrow_arrayref()) {
next if($data->[10] =~ /tempdb/i);
for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
if($data->[13] =~ /$patterns[$loop_index]/i) { print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
}
}
}
This extracts an audit trail from a sybase database and examines each row via the regex above. It produces the output I want but I can't help feeling there's a more efficient way of doing the regex bit. The first question is , if I wanted to get the keywords from a file (so I can build up a dictionary of keywords to search for) instead of hardcoding them as complied regular expressions as in the code above , how would I do this ? e.g. assume my keyword input file would look like this
create delete insert update drop
Is there a more efficient way of doing the regex ?... I read somewhere about the possibility of using the study function to improve the performance.
Any help appreciated, thanks in advance
how would I do this ? e.g. assume my keyword input file would look like this
Something along these lines:
my @patterns = map { chomp; qr/\b\Q$_\E\b/ } <$filehandle>;
Is there a more efficient way of doing the regex?
There are various trade-offs; as long as you have few and simple patterns, combining them is more likely to slow things down than speed them up, because a combined pattern will cause the engine to waste a lot of effort backtracking to check alternatives. [doc://study] has never been of any use to me, either. If you have a lot of patterns, you may want to look into [cpan://Regexp::Assemble] as a way of combining them.
Makeshifts last the longest.
In modern Perls -- I'm not sure which versions qualify here, maybe 5.6+ -- Perl will check whether the contents of the variable has changed. If the content of the variable has not changed, the regexp is not recompiled.
For example, compare
my @words = (
'foo',
'bar',
);
foreach (@array) {
foreach my $word (@words) {
if (/\Q$word/) { # BAD!! $word always changes.
print;
last;
}
}
}
to
my @words = (
'foo',
'bar',
);
foreach my $word (@words) {
foreach (@array) {
if (/\Q$word/) { # GOOD!! regexp only recompiled when needed.
print;
last;
}
}
}
It's not always practical to change the order of the loops. For example, when one of them reads from a file. In that case, the solution is to precompile the regexps. For example, compare
my @words = ( 'foo', 'bar', ); while () { foreach my $word (@words) { if (/\Q$word/) { # BAD!! $word always changes. print; last; } } }
to
my @words = (
'foo',
'bar',
);
# Precompile the regexps.
my @regexps = map { qr/\Q$_/ } @words;
while () {
foreach my $regexp (@regexps) {
if (/$regexps/) { # GOOD!! $regexp is a compiled regexp.
#if ($_ =~ $regexp) { # GOOD!! Alternate syntax.
print;
last;
}
}
}
If you're trying to match constant strings rather than regexps, then I recommend [mod://Regexp::List]:
use Regexp::List (); my @words = ( 'foo', 'bar', ); my $regexp = Regexp::List->new()->list2re(@words); while () { print if /$regexp/; #print if $_ =~ $regexp; # Alternate syntax. }
By the way,
for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
is much less readable and no more efficient than
for my $loop_index (0..$#patterns) {
You could also have used
foreach (@patterns) {
Finally, in your case, I'd use
my @patterns = qw(
create
drop
delete
update
insert
);
my $regexp;
$regexp = Regexp::List->new(modifiers => 'i')->list2re(@patterns);
$regexp = qr/\b(?:$regexp)\b/;
...
while ($data = $sth->fetchrow_arrayref()) {
# index is faster than regexps on constant strings.
next if index(lc($data->[10]), 'tempdb') >= 0;
if ($data->[13] =~ $regexp) {
print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
last;
}
}
Update: Bug Fix: Changed $word to $_ in map's code block.
--
[ e d @ h a l l e y . c c ]
In modern Perls -- I'm not sure which versions qualify here, maybe 5.6+ -- Perl will check whether the contents of the variable has changed. If the content of the variable has not changed, the regexp is not recompiled.Hi [ikegami],
...foreach my $word (@words) { if (/\Q$word/) { # BAD!! $word always changes.
Thanks for the very informative post! A quick question: I thought that in a loop like the one above, the iterator variable ($word) was temporarily aliased to each value of the array. So I would expect that as long as the array's contents didn't change, perl would know not to recompile the RE, and so both of your above examples would be the same speed.
But a quick Benchmark agrees with you: putting the RE in the outer loop is about twice as fast as in the inner loop.
Any hints as to what's wrong with my understanding of variable aliasing or RE caching?
Thanks!
What matters is the value of $word, because we're using the value of $word to compile the regexp. While the value of the variable to which $word is/was aliased doesn't change, the value of $word itself does change.
In the first pass, $word is "foo". The regexp was thus compiled with to /foo/. In the second pass, $word is "bar". We obviously need to recompile the regexp because we want /bar/ and it's currently /foo/. Whether $words[0] and $words[1] changed or not is completely irrelevant.
You might be thinking that the compiled regexp is stored with the variable used in the regexp. It's not. That wouldn't work when no variables or multiple variables are used to create the regexps. Instead, the uncompiled regexp or the values of the variables used to compile the regexp -- I don't know which -- is stored along with the compiled regexp in the code.
That depends on a whole lot of factors. In old perls, that was true; modern perls have optimisations to avoid recompiling patterns if the interpolated variables havent changed.
However, note that a match such as $foo =~ /$bar/ is magical if $bar is a pattern precompiled with qr, in that the pattern compilation phase is skipped completely. (This applies when the pattern consists of nothing but the interpolated variable.)
And finally, the (?{}) delayed evaluation construct lets you embed pre-compiled patterns in another pattern without recompiling the embedded pattern.
Makeshifts last the longest.
modern perls have optimisations to avoid recompiling patterns if the interpolated variables havent changedModern perls are even smarter than that. They look at the string after interpolation, and if that's the same, the regex isn't executed:
$ cat xx
#!/usr/bin/perl
use strict;
use warnings;
foreach my $x (["foo", "bar"], ["fo", "obar"]) {
my ($foo, $bar) = @$x;
"" =~ /$foo/;
"" =~ /$foo$bar/;
}
__END__
$ perl -Dr xx 2>&1 |
perl -nle 'if (/EXECUTING/ .. eof) {print if/^Compiling/}'
Compiling REx `foo'
Compiling REx `foobar'
Compiling REx `fo'
Perl won't recompile the second regex, since while the values of both $foo and $bar have changed, the value of "$foo$bar" hasn't.
To load up keywords from a file you would need to do something like this:
my @patterns;
open( my $fh, "input-file.txt") or die "Cannot open input file:$!";
while( my $line = <$fh> ) {
chomp( $line );
push( @patterns, qr/\b$line\b/ );
}
close($fh);
Your regexes are already fairly speedy by using the qr// operator to precompile them. I use this same method to look for roughly 72,000 keywords in thousands of 5-20k full text documents at NewsCloud and it can process a single full text article in under a second.
One thing that will make your life easier is to start using foreach loops instead of for loops like you are. Much less typing and confusing. Remember, you aren't using C anymore :)
Frank Wiles
http://www.wiles.org
I'd recommend two minor changes --
* I know, that's not specifically true, as with your case, you can match each item in the list once, but I'm making a general assumption that your items look to be SQL statements, and if they're single statements, they're most likely mutually exclusive, so long as you don't have sub queries
I'd probably rewrite it something like:
my $pattern = qr/\b(?:create|drop|delete|update|insert)\b/i;
# open a database connection here (using Sybase::DBD) and
# create a statement string to execute
$sth=$dbh->prepare("@sqlstatement");
$sth->execute;
while ($data = $sth->fetchrow_arrayref()) {
next if($data->[10] =~ /tempdb/i);
if ($date->[13] =~ $pattern) {
print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
}
}
You can also use perl's foreach loops to deal with iterating through a list, when the actual index isn't important. (yes, I know, it can be called with 'for', but I always think of C's for loops when I use 'for')
There's no garantee that it will be faster - your regexes are already compiled, and it might be that re-visiting the database costs too much. It may even be slower.
But it's worth a shot.
my $pattern = qr/\b(create|delete|insert|update|drop)\b/;
See also Regex::PreSuf for a slightly fancier option that might be even faster still if your different words share parts.
One thing you could do to speed things up is to let SQL do what it is good for. For example, say your $data->[10] row is called 'database'. Then get SQL to remove everything that has 'tempdb' in it with a WHERE database NOT LIKE '%tempdb%'. And also, if your list of keywords to search for are never actually regexes, just a list of keywords, then you can add something like:
WHERE database NOT LIKE '%tempdb%'
AND (
keyword LIKE '%create%'
OR keyword LIKE '%delete%'
OR keyword LIKE '%insert%'
...
)
I would achieve that by some code like this probably.
my @patterns = qw/create delete insert update drop/;
my $query = "@sqlstatement"; # don't know what you already have in @sqlstatement though
$query .= qq{WHERE database NOT LIKE '%tempdb%';
$query .= qq{AND (};
$query .= join " OR ",
map { qq{keyword LIKE '%$_%'} }
@patterns;
$query .= qq{)};
sth=$dbh->prepare( $query );
$sth->execute;
while ($data = $sth->fetchrow_arrayref()) {
# the query did all of our checking, so we don't need to check anything.
print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
}
-Bryan
#!/usr/bin/perl
use warnings;
use strict;
use Benchmark;
our @list = ('create the world','blah blah',
'drop it already','foo schnerp',
'need to delete','flip schnitzel',
'send me that update!','a flibbertygibitz',
'mailing insert collection','grand central station');
our @wordlist = qw(create drop delete update insert);
our @relist = map { qr/\b$_\b/ } @wordlist;
our $bigre_t = '\b(?:'.join('|',@wordlist).')\b';
our $bigre = qr/$bigre_t/;
print "bigre: $bigre\n";
sub several_re
{
my $match = 0;
foreach my $s (@list) {
foreach my $re (@relist) {
if ($s =~ /$re/) {
$match++;
last;
}
}
}
$match;
}
sub one_re
{
my $match = 0;
foreach my $s (@list) {
if ($s =~ /$bigre/) {
$match++;
}
}
$match;
}
sub use_index
{
my $match = 0;
foreach my $s (@list) {
foreach my $word (@wordlist) {
if (index($s,$word) >= 0) {
$match++;
last;
}
}
}
$match;
}
print "several_re: ", several_re(),"\n";
print "one_re: ", one_re(),"\n";
print "use_index: ", use_index(),"\n";
timethese(100_000, {
'Several Regexp' => \&several_re,
'One Big Regexp' => \&one_re,
'With index()' => \&use_index,
});
In this benchmark, the one big regexp solution is fastest:
Benchmark: timing 100000 iterations... One Big Regexp: 7 wallclock secs (6.22 CPU) @ 16077.17/s (n=100000) Several Regexp: 12 wallclock secs (11.27 CPU) @ 8873.11/s (n=100000) With index(): 11 wallclock secs (8.71 CPU) @ 11481.06/s (n=100000)But the results you get running on your own data will be more useful.
I like the resuts from cmpthese much better than those from timethese. Here is the same benchmark using cmpthese
use warnings;
use strict;
use Benchmark qw(cmpthese);
our @list = ('create the world','blah blah',
'drop it already','foo schnerp',
'need to delete','flip schnitzel',
'send me that update!','a flibbertygibitz',
'mailing insert collection','grand central station');
our @wordlist = qw(create drop delete update insert);
our @relist = map { qr/\b$_\b/ } @wordlist;
our $bigre_t = '\b(?:'.join('|',@wordlist).')\b';
our $bigre = qr/$bigre_t/;
print "bigre: $bigre\n";
sub several_re
{
my $match = 0;
foreach my $s (@list) {
foreach my $re (@relist) {
if ($s =~ /$re/) {
$match++;
last;
}
}
}
$match;
}
sub one_re
{
my $match = 0;
foreach my $s (@list) {
if ($s =~ /$bigre/) {
$match++;
}
}
$match;
}
sub use_index
{
my $match = 0;
foreach my $s (@list) {
foreach my $word (@wordlist) {
if (index($s,$word) >= 0) {
$match++;
last;
}
}
}
$match;
}
print "several_re: ", several_re(),"\n";
print "one_re: ", one_re(),"\n";
print "use_index: ", use_index(),"\n";
cmpthese(-1, {
'Several Regexp' => \&several_re,
'One Big Regexp' => \&one_re,
'With index()' => \&use_index,
});
Prints:
bigre: (?-xism:\b(?:create|drop|delete|update|insert)\b)
several_re: 5
one_re: 5
use_index: 5
Rate Several Regexp With index() One Big Regexp
Several Regexp 24634/s -- -24% -50%
With index() 32367/s 31% -- -34%
One Big Regexp 49358/s 100% 52% --
perlmonks.org content © perlmonks.org and Anonymous Monk, Aristotle, Fletch, GrandFather, halley, ides, ikegami, jhourcle, kwaping, mrborisguy, nothingmuch, Perl Mouse, sgifford
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03