Hi Monks!
As I was asking in CB, I have a Perl script that reads a 25 MB file to $_ (undef'ing $/) , and does a lot of regex matches against it (like a state machine).
Now I know (thanks diotalevi) that regex matches are cloning my whole string... so I'd like to ask you all some suggestions for lowering the memory consumption :-)
Below is my script:
(disclaimer - this is my full code, if need I'll update the node cutting unneeded things)
(ps - the code reads all *.ms files in the directory, but currently i'm testing with only one file)
(ps2 - the examples and text were in portuguese, but i've translated them here)
(ps3 - the stoplist is just a (very tiny) list of so called "stopwords" - words like "de","do","da" (of), etc...)
# Unify name tags #da X S. # becomes # X use strict; use warnings; use diagnostics; use Storable; my $debug = 0; $" = ''; if (@ARGV < 1) { print < ; close SW; @stopwords = map { split /\s/ } @stopwords; my %stopword; @stopword{@stopwords} = undef; opendir DIR, $dir; my @files = map { $dir.'/'.$_ } grep { /\.ms$/ } sort readdir(DIR); closedir DIR; undef $/; my $caps = qr/[A-ZÄÅÆÇÈÒÉÜÓÊÝÔËðÕÌßÖÍÎØÏÙÐþÚÑÛÀÁÂÃ]/; my $texto = qr/[A-Za-zÄÅÆÇÈÒÉÜÓÊæÝÔËðçÞÕÌúñèßÖÍûòéàÎüóêáØÏýôëâÙÐþõìãÚÑÿöíäÛÀîåÁøïÂùÃ\s]+/; my @buffer; for my $file (@files) { @buffer = (); open IN, "< $file" or die "'$file' couldn't be opened"; $_ = ; close IN; open OUT, "> $file.new"; my $aux = select(OUT); $|=1; select($aux); print STDERR "Processando $file\n"; my $state = 'TEXT'; my $total_size = length($_); s/(<[^>]*)<([^<>]*>)/$1$2/g; # < baz > vira study $_; /^/gc; my $tick = time; until(pos($_) == $total_size) { print STDERR sprintf "\r%10d bytes",$total_size - pos($_) if time > $tick && ($tick = time); if ($debug) { print STDERR "[@buffer]"; print STDERR snippet(); print STDERR "\n"; } if ($state eq 'TEXT') { if (/\G(\s*<$caps[^<>]*>\s*)/gc) { $state = 'NAME'; push @buffer, $1; next; } if (/\G<([^<>]*)>/gc || /\G([^<>\s]*\s*)/gc) { print OUT $1; next; } die "STRANGE FOO ".snippet(); } if ($state eq 'NAME') { if (/\G(<$caps[^<>]*>\s*)/gc) { push @buffer, $1; next } if (/\G<([^<>]*)>/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; } if (/\G(\s*([A-Z]\s*\.|(?!\s)$texto(?\s]+\s*)/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; } die "STRANGE FOO ".snippet(); } die "STRANGE FOO ".snippet(); } close OUT; } print STDERR "\n"; exit 0; sub snippet { my $text = "{".substr($_,pos($_),42)."}"; $text =~ s/[\r\n]/|/g; return $text; } sub flush_name { my $count = 0; for my $token (@buffer) { if ( index($token,'<') >= 0 ) { if (++$count == 2) { my $buffer = "@buffer"; $buffer =~ s/>([^<>]*)$1/g; print OUT $buffer; @buffer = (); return; } } } for my $tk (@buffer) { $tk =~ s/[<>]//g; print OUT $tk; } @buffer = (); return; }
-Mark
In general, the recipe is to eliminate all capture groups that operate on your large string. Beyond that, you can try to cut things off as you process them. Perl keeps a marker about where a string begins so if you're contientious, you can convince perl to just advance that pointer.
This is wasteful. When it matches, it makes a copy of $_ to an internal buffer so $1 can refer back to it. Eliminate the capturing parentheses and use substr() with @- and @+ to refer back to what $1 would have contained. The documentation for @- is a good reference for you right now.
You'll notice how I used 4-arg substr to directly replace the first part of the string.
if (/\G<([^<>]*)>/gc) {
flush_name();
$state = 'TEXT';
print OUT $1;
next;
}Efficient.
if (/\G<[^<>]*>/gc) {
flush_name();
$state = 'TEXT';
print OUT substr $_, $-[0] + 1, $+[0] - $-[0] - 1;
substr $_, 0, $+[0], '';
next;
}
⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
perlmonks.org content © perlmonks.org and Articuno, diotalevi, duff, kvale
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03