Regexes eating too much RAM
Articuno
created: 2006-03-03 17:49:19

Hi Monks!

As I was asking in CB, I have a Perl script that reads a 25 MB file to $_ (undef'ing $/) , and does a lot of regex matches against it (like a state machine).

Now I know (thanks diotalevi) that regex matches are cloning my whole string... so I'd like to ask you all some suggestions for lowering the memory consumption :-)

Below is my script:

(disclaimer - this is my full code, if need I'll update the node cutting unneeded things)

(ps - the code reads all *.ms files in the directory, but currently i'm testing with only one file)

(ps2 - the examples and text were in portuguese, but i've translated them here)

(ps3 - the stoplist is just a (very tiny) list of so called "stopwords" - words like "de","do","da" (of), etc...)

# Unify name tags
#  da   X  S. 
# becomes
#  X 

use strict;
use warnings;
use diagnostics;
use Storable;

my $debug = 0;

$" = '';

if (@ARGV < 1) {
    print <;
close SW;
@stopwords = map { split /\s/ } @stopwords;
my %stopword;
@stopword{@stopwords} = undef;

opendir DIR, $dir;
my @files = map { $dir.'/'.$_ } grep { /\.ms$/ } sort readdir(DIR);
closedir DIR;

undef $/;

my $caps =  qr/[A-ZÄÅÆÇÈÒÉÜÓÊÝÔËðÕÌßÖÍÎØÏÙÐþÚÑÛÀÁÂÃ]/;
my $texto = qr/[A-Za-zÄÅÆÇÈÒÉÜÓÊæÝÔËðçÞÕÌúñèßÖÍûòéàÎüóêáØÏýôëâÙÐþõìãÚÑÿöíäÛÀîåÁøïÂùÃ\s]+/;
my @buffer;

for my $file (@files) {
    @buffer = ();
    open IN, "< $file" or die "'$file' couldn't be opened";
    $_ = ;
    close IN;
    open OUT, "> $file.new";
    my $aux = select(OUT);
    $|=1;
    select($aux);
    print STDERR "Processando $file\n";
    my $state = 'TEXT';
    my $total_size = length($_);
    s/(<[^>]*)<([^<>]*>)/$1$2/g; # <  baz > vira 
    study $_;
    /^/gc;
    my $tick = time;
    until(pos($_) == $total_size) {
	print STDERR sprintf "\r%10d bytes",$total_size - pos($_) if time > $tick && ($tick = time);
	if ($debug) {
	    print STDERR "[@buffer]";
	    print STDERR snippet(); print STDERR "\n";
	}
	if ($state eq 'TEXT') {
	    if (/\G(\s*<$caps[^<>]*>\s*)/gc) { $state = 'NAME'; push @buffer, $1; next; } 
	    if (/\G<([^<>]*)>/gc || /\G([^<>\s]*\s*)/gc)  { print OUT $1; next; }
	    die "STRANGE FOO ".snippet();
	} 
	if ($state eq 'NAME') {
	    if (/\G(<$caps[^<>]*>\s*)/gc) { push @buffer, $1; next } 
	    if (/\G<([^<>]*)>/gc)  { flush_name(); $state = 'TEXT'; print OUT $1; next; }
	    if (/\G(\s*([A-Z]\s*\.|(?!\s)$texto(?\s]+\s*)/gc) {
		flush_name(); $state = 'TEXT'; print OUT $1; next;
	    }
	    
	    die "STRANGE FOO ".snippet();
	}
	die "STRANGE FOO ".snippet();
    }
    close OUT;
}
print STDERR "\n";
exit 0;

sub snippet {
    my $text = "{".substr($_,pos($_),42)."}";
    $text =~ s/[\r\n]/|/g;
    return $text;
}


sub flush_name {
    my $count = 0;
    for my $token (@buffer) {
	if ( index($token,'<') >= 0 ) {
	    if (++$count == 2) {
		my $buffer = "@buffer";
		$buffer =~ s/>([^<>]*)]//g; 
	print OUT $tk;
    }
    @buffer = ();
    return;
}

-- 6x9=42
Re: Regexes eating too much RAM
created: 2006-03-03 23:33:19
The main way to save memory is to process your files a bit at a time. Is there any structure to the file - paragraphs, blank lines, etc. - that you can use as a self contained unit of translation? Maybe there is one or a couple of names per line? Then read line, translate line, write line, repeat.

-Mark

Re: Regexes eating too much RAM
created: 2006-03-04 01:08:16

In general, the recipe is to eliminate all capture groups that operate on your large string. Beyond that, you can try to cut things off as you process them. Perl keeps a marker about where a string begins so if you're contientious, you can convince perl to just advance that pointer.

This is wasteful. When it matches, it makes a copy of $_ to an internal buffer so $1 can refer back to it. Eliminate the capturing parentheses and use substr() with @- and @+ to refer back to what $1 would have contained. The documentation for @- is a good reference for you right now.

You'll notice how I used 4-arg substr to directly replace the first part of the string.

            if (/\G<([^<>]*)>/gc) {
                flush_name();
                $state = 'TEXT';
                print OUT $1;
                next;
            }

Efficient.

            if (/\G<[^<>]*>/gc) {
                flush_name();
                $state = 'TEXT';
                print OUT substr $_, $-[0] + 1, $+[0] - $-[0] - 1;
                substr $_, 0, $+[0], '';
                next;
            }

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Regexes eating too much RAM
created: 2006-03-06 11:21:26

Perl has this nifty routine called study that might be useful to you. I don't really know though, as everytime I thought it would be useful, it wasn't.

Just thought I'd mention it anyway, in case you found it useful,

perlmonks.org content © perlmonks.org and Articuno, diotalevi, duff, kvale

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03