Reducing memory usage while matching log entries
matt.tovey
created: 2006-02-01 09:09:57
An application-programmer colleague of mine, noting that I program in Perl, asked me for a script to help him analyse log files. How could I refuse?

Problem is, my solution needs the whole logfile in memory, and these things can be quite large (100MB), so I'm wondering if there's a better way. No, strike that, I'm wondering what the better way is. :)

The logfile consists of the locking and unlocking of various mutex's, where my colleague wants to disregard (i.e. remove) all references to locks which are later unlocked. I'm doing this so:

my @lines;                      # Array of lines to be kept for output.
push(@lines, '');               # Preload output array with null line 0 (for correct line counts).
my %locks;                      # Hash of currently open locks.

while () {
  my $count = push(@lines, $_) - 1;
  Log 2, "Analysing line $count";
  if ($_ =~ /Mutex\((.*)\)::(\w+)/) {                   # Regexp to obtain address and action info from line.
    Log 2, "Address and action parsed correctly.";
    my $address = $1;
    my $action = $2;

    if ($action eq 'locking') {
      Log 2, "Address $address locked at line $count";
      if (defined $locks{$address}) {
        Log 0, "ERROR: Address $address locked at line $count, but already locked at line $locks{$address}.";
      }
      $locks{$address} = $count;
    }

    if ($action eq 'unlocked') {
      Log 2, "Address $address unlocked at line $count";
      unless (defined $locks{$address}) {
        Log 0, "ERROR: Address $address not locked, but unlocked at line $count.";
      } else {
        $lines[$locks{$address}] = '';
        Log 1, "Found a match for address $address: locked $locks{$address}, unlocked $count. Removing from output.";
      }
      $locks{$address} = '';
      $lines[$count] = '';
    }
  }
}

foreach (@lines) {print}

Nearly all of the locks that are set will be cleared close to the time that they are set, so the memory-use could be significantly reduced by a solution that frees memory no longer required. I was (very optimistically!) hoping that setting array entries to '' would free the memory, but a bit of profiling shows me that it's not so.

So, what sould be a better solution? Here are some ideas that I had:

- splice lines no longer required out of @lines.
Con: I'd have to go through %locks, changing the line numbers each time I remove a line.

- grep backwards through @lines when I find an 'unlocked' message.
Cons: Probably much slower, and I'd no longer pick up on the 'Address already locked' error condition that I find in my solution.

Re: Reducing memory usage while matching log entries
created: 2006-02-01 09:47:47
Perhaps providing some sample data would help us help you.

Neil Watson
watson-wilson.ca

Re: Reducing memory usage while matching log entries
created: 2006-02-01 09:53:33
I'm wondering what the better way is. :)

Rather than storing the whole file in memory, use a 2 pass solution. In the first pass, gather the locking information. In the second pass, remove the appropriate lines. Each pass will consist of just a while loop that only holds one line at a time in memory.

Re^2: Reducing memory usage while matching log entries
created: 2006-02-01 11:07:53
I like this idea but I'm having trouble get a 2 pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to mung that list into some convenient form to use while reading the file again in the second pass...

Tie::File looks quite handy, but since I don't want to alter the original log file, I'd have to copy it first and then reduce it, which could be a problem if disk-space is tight.

So, so far storing the whole file in a hash so that I can properly delete the lines no longer required seems like a winner.

Salva: The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted valued won't always be 'locking' follwed by 'unlocked'. Plus the contents of the logfile need to remain in the correct order for analysis of the remaining contents... but thanks!

Re^3: Reducing memory usage while matching log entries
created: 2006-02-01 18:08:54
I like this idea but I'm having trouble get a 2 pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to mung that list into some convenient form to use while reading the file again in the second pass...

I don't understand your difficulty. Pass #1 records line numbers, pass #2 writes all the lines that haven't been recorded. Here's some code based on your original but with a few tweaks:

#!/usr/bin/perl

use strict; use warnings;
sub Log;

die "Usage:     $0 \n" unless @ARGV == 1;
my $logfile = shift;

# Pass #1 : gather line numbers to be deleted.
my %locks;                      # Hash of currently open locks.
my @unlock_lines;               # lines to rid ourselves of
open(LOGFILE,$logfile) or die "Can't read $logfile - $!\n";
while () {
  Log 2, "Analysing line $.";
  next unless /Mutex\((.*?)\)::(\w+)/;
  my ($address,$action) = ($1,$2);
  if ($action eq 'locking') {
    Log 2, "Address $address locked at line $.";
    if (defined $locks{$address}) {
      Log 0, "ERROR: Address $address locked at line $., but already locked at line $locks{$address}.";
    }
    $locks{$address} = $.;
  }

  if ($action eq 'unlocked') {
    Log 2, "Address $address unlocked at line $.";
    unless (defined $locks{$address}) {
       Log 0, "ERROR: Address $address not locked, but unlocked at line $..";
    } else {
        push @unlock_lines, $., delete $locks{$address};
    }
  }
}
close LOGFILE;

# Sort the lines numbers that we've accumulated because we put them in
# unordered. This allows us to make just one more pass through the file
# to remove the lines.
@unlock_lines = sort { $a <=> $b } @unlock_lines;

# Pass #2: output all but the lines we're not interested in.
my $rmline = shift @unlock_lines;
open(LOGFILE,$logfile) or die "Can't read $logfile - $!\n";
while () {
   if (defined $rmline && $. == $rmline) {
      $rmline = shift @unlock_lines;
      next;
   }
   print;
}
close LOGFILE;

[duff]

Re^4: Reducing memory usage while matching log entries
created: 2006-02-02 04:00:13
      I like this idea but I'm having trouble get a 2 pass solution to work nicely...

    I don't understand your difficulty...

Sorry, I had to leave work about the time of my last message yesterday (well, actually 10 minutes _before_ my last message!), but didn't want to disappear without writing back. And I'm not good at coding under stress (and not fantastic the rest of the time either!).

Anyway, thanks for taking the time to write out the code there. Strangely enough, I tried it this morning, and the memory usage of this is actually higher than the original, at least as Linux measures it! I'm processing a 65MB test file - after the first pass the script is consuming 100MB of memory, and after the sort then 200MB!
With this test file, @unlock_lines ends up with 800000 entries, but still I was surprised. My original script used (RSS & VSZ) 160MB, and using a hash to store the file in memory (so as to properly free the deleted lines) brings it down to 110MB...

Thanks also for '$.' - I didn't know about that one!

Re^3: Reducing memory usage while matching log entries
created: 2006-02-02 05:55:59
The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted valued won't always be 'locking' follwed by 'unlocked'.

That shouldn't be a problem as long as you use an stable sort implementation or alternatively the line number or a timestamp as the secondary sorting key.

Re: Reducing memory usage while matching log entries
created: 2006-02-01 09:54:46

Yeah, without sample data it's hard to give advice. But the correct way to remove an element from a hash is with delete, not by assigning an empty string. And you're doing something really weird with @lines; if you get to line 100,000 it's going to allocate a 100,000 element array. Considering it's probably sparse you should use a hash for that as well.

If you're still running into memory issues you might look at using a tied hash (DB_File or BerkeleyDB) or even DBD::SQLite to store stuff you've seen somewhere easily retrieved. Those wouldn't keep much of anything in memory but fetching to see the state of a given address should still be pretty quick.

Re^2: Reducing memory usage while matching log entries
created: 2006-02-01 10:49:50
And you're doing something really weird with @lines; if you get to line 100,000 it's going to allocate a 100,000 element array. Considering it's probably sparse you should use a hash for that as well.

Yes, that's the file being held in memory. :(

I just tried putting that into a hash (also using 'delete' now - thanks!). It runs only slightly slower, and consumes quite a bit less memory, so success there!

Re: Reducing memory usage while matching log entries
created: 2006-02-01 10:06:49

Take a look at Tie::File, this allows you to treat the file as an array without loading the entire file into memory.


There are ten types of people: those that understand binary and those that don't.
Re: Reducing memory usage while matching log entries
created: 2006-02-01 10:27:55
An easy way is to sort the file on disk using the lock id as the primary sorting key with some CPAN module as Sort::External, Sort::Key::Merger or even the sort command.

After that, assuming nested locks are not allowed, every lock entry should be followed by an unlock one, something very easy to parse!

Re: Reducing memory usage while matching log entries
created: 2006-02-01 13:15:48
BTW, you don't have to track the current line number yourself (or prepend an empty string to your array so it doesn't start with zero).
Just use $. (Yes, dollar sign followed by period) it will contain the current line number of the last read filehandle.
Re: Reducing memory usage while matching log entries
hv
created: 2006-02-01 13:28:22

I don't understand why you are using the @lines array at all, and without that your memory requirements go way down. The logic would be something like:

  • %locks records currently open locks, with the key being the mutex address and the value being the input file line number where the lock was acquired
  • if a 'locking' line is seen for a locked mutex, report it and overwrite the %locks entry with the newer lock
  • if an 'unlocked' line is seen for a locked mutex, remove the %locks entry
  • if an 'unlocked' line is seen for an unlocked mutex, report it
  • when the input file is exhausted, report all remaining entries in %locks as "never unlocked"

Your existing code is close to that: remove the references to @lines; replace the definition of $count with:

  my $count = $.;  # input line number
change the action for unlocking to:
  delete $locks{$address};
and replace the final foreach (@lines) {print} with:
for (sort keys %locks) {
  Log 0, "ERROR: address $_ locked at line $locks{$_} but never unlocked.";
}

Hope this helps,

Hugo

Re: Reducing memory usage while matching log entries
created: 2006-02-01 13:30:06
Try this:
use strict;
my %locks;

while (chomp(my $line = )) {
	next unless ($line =~ /Mutex(.*)::(.*)/);
	my ($mutex,$action) = ($1,$2);

	if ($action eq 'locking') {
		if ($locks{$mutex}) {
			printf ("Mutex %s already locked on line %d, locked again on line %d\n",
				$mutex, $locks{$mutex}, $.);
		} else {
			$locks{$mutex} = $.;
		}
	} elsif ($action eq 'unlocked') {
		if ($locks{$mutex}) {
			delete $locks{$mutex};
		} else {
			printf ("Mutex %s unlocked on line %d, but wasn't locked.\n",
				$mutex, $.);
		}
	}
}

foreach my $mutex (keys %locks) {
	printf ("Mutex %s locked on line %d, nut wasn't unlocked\n",
		$mutex, $locks{$mutex});
}

perlmonks.org content © perlmonks.org and duff, Fletch, hv, matt.tovey, meetraz, neilwatson, salva, tirwhan

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03