Problem is, my solution needs the whole logfile in memory, and these things can be quite large (100MB), so I'm wondering if there's a better way. No, strike that, I'm wondering what the better way is. :)
The logfile consists of the locking and unlocking of various mutex's, where my colleague wants to disregard (i.e. remove) all references to locks which are later unlocked. I'm doing this so:
my @lines; # Array of lines to be kept for output. push(@lines, ''); # Preload output array with null line 0 (for correct line counts). my %locks; # Hash of currently open locks. while () { my $count = push(@lines, $_) - 1; Log 2, "Analysing line $count"; if ($_ =~ /Mutex\((.*)\)::(\w+)/) { # Regexp to obtain address and action info from line. Log 2, "Address and action parsed correctly."; my $address = $1; my $action = $2; if ($action eq 'locking') { Log 2, "Address $address locked at line $count"; if (defined $locks{$address}) { Log 0, "ERROR: Address $address locked at line $count, but already locked at line $locks{$address}."; } $locks{$address} = $count; } if ($action eq 'unlocked') { Log 2, "Address $address unlocked at line $count"; unless (defined $locks{$address}) { Log 0, "ERROR: Address $address not locked, but unlocked at line $count."; } else { $lines[$locks{$address}] = ''; Log 1, "Found a match for address $address: locked $locks{$address}, unlocked $count. Removing from output."; } $locks{$address} = ''; $lines[$count] = ''; } } } foreach (@lines) {print}
Nearly all of the locks that are set will be cleared close to the time that they are set, so the memory-use could be significantly reduced by a solution that frees memory no longer required. I was (very optimistically!) hoping that setting array entries to '' would free the memory, but a bit of profiling shows me that it's not so.
So, what sould be a better solution? Here are some ideas that I had:
- splice lines no longer required out of @lines.
Con: I'd have to go through %locks, changing the line numbers each time I remove a line.
- grep backwards through @lines when I find an 'unlocked' message.
Cons: Probably much slower, and I'd no longer pick up on the 'Address already locked' error condition that I find in my solution.
Neil Watson
watson-wilson.ca
I'm wondering what the better way is. :)
Rather than storing the whole file in memory, use a 2 pass solution. In the first pass, gather the locking information. In the second pass, remove the appropriate lines. Each pass will consist of just a while loop that only holds one line at a time in memory.
Tie::File looks quite handy, but since I don't want to alter the original log file, I'd have to copy it first and then reduce it, which could be a problem if disk-space is tight.
So, so far storing the whole file in a hash so that I can properly delete the lines no longer required seems like a winner.
Salva: The 'sort' idea has the problem that a given lock ID can be locked and unlocked multiple times in the file, so the sorted valued won't always be 'locking' follwed by 'unlocked'. Plus the contents of the logfile need to remain in the correct order for analysis of the remaining contents... but thanks!
I like this idea but I'm having trouble get a 2 pass solution to work nicely. In the first pass I identify which line numbers need to be removed. But then I'll need to mung that list into some convenient form to use while reading the file again in the second pass...
I don't understand your difficulty. Pass #1 records line numbers, pass #2 writes all the lines that haven't been recorded. Here's some code based on your original but with a few tweaks:
#!/usr/bin/perl use strict; use warnings; sub Log; die "Usage: $0\n" unless @ARGV == 1; my $logfile = shift; # Pass #1 : gather line numbers to be deleted. my %locks; # Hash of currently open locks. my @unlock_lines; # lines to rid ourselves of open(LOGFILE,$logfile) or die "Can't read $logfile - $!\n"; while ( ) { Log 2, "Analysing line $."; next unless /Mutex\((.*?)\)::(\w+)/; my ($address,$action) = ($1,$2); if ($action eq 'locking') { Log 2, "Address $address locked at line $."; if (defined $locks{$address}) { Log 0, "ERROR: Address $address locked at line $., but already locked at line $locks{$address}."; } $locks{$address} = $.; } if ($action eq 'unlocked') { Log 2, "Address $address unlocked at line $."; unless (defined $locks{$address}) { Log 0, "ERROR: Address $address not locked, but unlocked at line $.."; } else { push @unlock_lines, $., delete $locks{$address}; } } } close LOGFILE; # Sort the lines numbers that we've accumulated because we put them in # unordered. This allows us to make just one more pass through the file # to remove the lines. @unlock_lines = sort { $a <=> $b } @unlock_lines; # Pass #2: output all but the lines we're not interested in. my $rmline = shift @unlock_lines; open(LOGFILE,$logfile) or die "Can't read $logfile - $!\n"; while ( ) { if (defined $rmline && $. == $rmline) { $rmline = shift @unlock_lines; next; } print; } close LOGFILE;
[duff]
I don't understand your difficulty...
Anyway, thanks for taking the time to write out the code there. Strangely enough, I tried it this morning, and the memory usage of this is actually higher than the original, at least as Linux measures it! I'm processing a 65MB test file - after the first pass the script is consuming 100MB of memory, and after the sort then 200MB!
With this test file, @unlock_lines ends up with 800000 entries, but still I was surprised. My original script used (RSS & VSZ) 160MB, and using a hash to store the file in memory (so as to properly free the deleted lines) brings it down to 110MB...
Thanks also for '$.' - I didn't know about that one!
That shouldn't be a problem as long as you use an stable sort implementation or alternatively the line number or a timestamp as the secondary sorting key.
Yeah, without sample data it's hard to give advice. But the correct way to remove an element from a hash is with delete, not by assigning an empty string. And you're doing something really weird with @lines; if you get to line 100,000 it's going to allocate a 100,000 element array. Considering it's probably sparse you should use a hash for that as well.
If you're still running into memory issues you might look at using a tied hash (DB_File or BerkeleyDB) or even DBD::SQLite to store stuff you've seen somewhere easily retrieved. Those wouldn't keep much of anything in memory but fetching to see the state of a given address should still be pretty quick.
Yes, that's the file being held in memory. :(
I just tried putting that into a hash (also using 'delete' now - thanks!). It runs only slightly slower, and consumes quite a bit less memory, so success there!
Take a look at Tie::File, this allows you to treat the file as an array without loading the entire file into memory.
After that, assuming nested locks are not allowed, every lock entry should be followed by an unlock one, something very easy to parse!
I don't understand why you are using the @lines array at all, and without that your memory requirements go way down. The logic would be something like:
Your existing code is close to that: remove the references to @lines; replace the definition of $count with:
my $count = $.; # input line numberchange the action for unlocking to:
delete $locks{$address};
and replace the final foreach (@lines) {print} with:
for (sort keys %locks) {
Log 0, "ERROR: address $_ locked at line $locks{$_} but never unlocked.";
}
Hope this helps,
Hugo
use strict; my %locks; while (chomp(my $line =)) { next unless ($line =~ /Mutex(.*)::(.*)/); my ($mutex,$action) = ($1,$2); if ($action eq 'locking') { if ($locks{$mutex}) { printf ("Mutex %s already locked on line %d, locked again on line %d\n", $mutex, $locks{$mutex}, $.); } else { $locks{$mutex} = $.; } } elsif ($action eq 'unlocked') { if ($locks{$mutex}) { delete $locks{$mutex}; } else { printf ("Mutex %s unlocked on line %d, but wasn't locked.\n", $mutex, $.); } } } foreach my $mutex (keys %locks) { printf ("Mutex %s locked on line %d, nut wasn't unlocked\n", $mutex, $locks{$mutex}); }
perlmonks.org content © perlmonks.org and duff, Fletch, hv, matt.tovey, meetraz, neilwatson, salva, tirwhan
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03