Howto Avoid Memory Problem in List::MoreUtils
Anonymous Monk
created: 2006-05-04 05:02:49
Hi,
Is there any way to get around the out of memory problem given by a method of List::MoreUtils?

Given this snippet within a subroutine

# within a function

  print "FINDNB_BEFORE_UNIQ: ",scalar(@very_big_array),"\n";
  my @uvery_big_array = uniq @very_big_array;
  print "FINDNB_AFTER: ",scalar(uvery_big_array),"\n"; 
    
# return \@uvery_big_array;


It breaks here:
FINDNB_BEFORE_UNIQ: 20905984
Out of memory!

Also I'm wondering, wether the fact that that snippet is located within the subroutine thus it cause the memory problem.
Re: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 05:33:48
The algorithm used by List::MoreUtils::uniq internally is fast (O(N)) but uses a lot of memory.

You can use an alternative algorithm as sort-and-remove-consecutive-duplicates that is slower (O(NlogN)) but that doesn't require additional memory:

@array = sort @array; # this is optimized by perl as a sort
                      # in place operation that doesn't
                      # require additional memory
my $last = $array[0];
my $i = 1;
while($i<@array) {
  if ($array[$i] eq $last) {
    splice @array, $i, 1
  }
  else {
    $last = $array[$i];
    $i++;
  }
} 
Re^2: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 05:48:20
well, the algorithm on my previous post is actually O(N^2) on the worst case because of the splice.

A better version is:

@array = sort @array;
my $j = 0;
my $last;
for my $e (@array) {
  if (!$j or $last ne $e) {
    $array[$j] = $last = $e;
    $j++;
  }
}
$#array = $j-1;
Re^3: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-09 01:59:13
Hi salva,
Thanks so much for your reply. Just a minor question. What's the meaning of this line:
$array[$j] = $last = $e;
Why not simply:
$array[$j] = $e;
Re^4: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-09 02:03:23
$last is used to remove dups. $last needs to be set to the $e of this pass to compare it to the $e of the next pass. I suppose you could get rid of $last and do
@array = sort @array;
my $j = -1;
for my $e (@array) {
  if ($j < 0 || $array[$j] ne $e) {
    $array[++$j] = $e;
  }
}
$#array = $j;
Re^2: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 06:31:20
this is optimized by perl as a sort in place operation

In which version of Perl did that become true?

Unless I'm doing it wrong, it doesn't seem to be true in 5.8.8?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 06:39:21
n which version of Perl did that become true?
Since 5.8.4.
$ perl583 -le '@a=(1,2); $r1 = \$a[0]; @a=sort@a; $r2 = \$a[0];print "yes"if $r1 == $r2'
$ perl584 -le '@a=(1,2); $r1 = \$a[0]; @a=sort@a; $r2 = \$a[0];print "yes"if $r1 == $r2'
yes
$

Dave.

Re^3: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 07:07:43
well, you are right, I forgot that mergesort needs additional memory to perform the merge, but yet @a = sort @a uses half the memory required by other sorting expressions, for instance:
use Memchmark 'cmpthese';

# use sort '_quicksort';

cmpthese( none => sub { my @a = map { int rand 100 } 1..1000000; },
          sort0 => sub { my @a = map { int rand 100 } 1..1000000; @a = sort { $a <=> $b } @a },
          sort1 => sub { my @a = map { int rand 100 } 1..1000000; @a = sort { $a <=> $b } @a, 1 } );

__END__

test: none, memory used: 3997696 bytes
test: sort0, memory used: 7999488 bytes
test: sort1, memory used: 12390400 bytes
And if the quicksort algorithm is selected, then the sentence on my previous post becomes really true:
use sort '_quicksort';
...

__END__

test: none, memory used: 3997696 bytes
test: sort0, memory used: 3997696 bytes
test: sort1, memory used: 12390400 bytes

Re^4: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 07:48:53

Hmm. I guess we have different understandings of the term [http://planetmath.org/encyclopedia/InPlaceSortingAlgorithm.html|In-place sort].

I was trying to sort (using the default algorithm), 20e6 elements which takes 400 MB, but it was consuming 1.4 GB gross (some memory has been return to the system by the time the second memcheck runs below).

C:\test>junk
Mem after building array: 404,556 kb
Mem after sorting array: 1,214,032 kb

Having read your post, I tried the _quickersort option. Things improve marginally, but:

C:\test>junk
Mem after building array: 404,812 kb
Mem after sorting array: 973,584 kb

But that still requires a gross memory usage of 1.3 GB. 2N additional memory is somewhat greater than logN.

A useful feature, and nice to know it's there, but not quite what I was expecting hoping for when you said in-place.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^5: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 08:13:01
I guess we have different understandings of the term In-place sort.

Yes, inside perl internals, "inplace" just means that instead of passing the elements on the stack, a reference to the target array is passed and the sorting is done directly over it.

... but if the quicksort algorithm is selected, @a = sort @a is also an in-place sort as described on the article pointed by you (if we ignore degenerated cases).

If you need 1.4GB to sort 400MB, it means that you are probably converting numbers to strings inside the sort. For instance:

use Memchmark 'cmpthese';

use sort '_quicksort';

cmpthese( none => sub {
             my @a = map { int rand 100 } 1..1000000;
          },
          sort_num => sub {
             my @a = map { int rand 100 } 1..1000000;
             @a = sort { $a<=> $b } @a
          },
          sort_str => sub {
             my @a = map { int rand 100 } 1..1000000;
             @a = sort { $a cmp $b } @a
          },
          sort_str_workaround => sub {
             my @a = map { int rand 100 } 1..1000000;
             my ($c, $d);
             @a = sort { ($c, $d) = ($a,$b); $c cmp $d } @a
          }
        );

__END__

test: none,                memory used:  3997696 bytes
test: sort_num,            memory used:  3997696 bytes
test: sort_str,            memory used: 68202496 bytes
test: sort_str_workaround, memory used:  3997696 bytes

... and as you can see, there is a simple workaround (though it would make the sorting much slower).
Re^6: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 08:30:02
If you need 1.4GB to sort 400MB that means that you are probably converting numbers to strings inside the sort.

Yep! That's the problem.

{ ($c, $d) = ($a,$b); $c cmp $d

I wonder if that optimisation could be worked into the internal sort?

I started to suggest that if both had the (same) IOK or NOK flags set, then use a numeric comparison, but I guess it would only take one string embeded at the wrong place for the entire sort to have to be done over.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^7: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 09:18:27
I started to suggest that if both had the (same) IOK or NOK flags set, then use a numeric comparison

but if you are sorting lexicograficaly, you can not compare numbers as numbers because they are not equivalent operations (i.e.: 2 < 10 but 2 gt 10).

It shouldn't be too difficult to write a sorting sub in XS that uses a comparison callback equivalent to the Perl {($c, $d) = ($a, $b); $c cmp $d }. But converting the numbers to strings is an expensive operation, even in C.

Also, you can apply a transformation to the numbers to make them sort lexicograficaly when compared as numbers. For instance, for positive integers with 8 or less digits (untested):

use Sort::Key 'ukeysort_inplace';
my $digits = 8;
ukeysort_inplace {
  my $a = int $_;
  length $a > $digits and die 'integer $a is too big';
  $a .= ' ' x ($digits - length $a);
  $a =~ tr/ 0123456789/0123456789a/;
  hex $a
} @array;

(usortkey_inplace should not use more than 8 additional bytes per value on a 32bits machine, update: + 4 bytes for the merge phase of the mergesort)

Re^8: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 09:42:33
but if you are sorting lexicograficaly,

True. One further question though.

Why does using sort{ $a <=> $b } @numbers also cause the memory growth. The same appears to be true for Sort::Key::nsort(_inplace)?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^9: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 10:07:37
how have you created the scalars inside @numbers?

If you have read them from a file and never used them as numbers before, comparing then numerically also forces their internal structure to grow to accomodate the NV and/or IV slot.

Also, in not too old versions of perl, there was a bug that caused numbers to be converted to strings inside sort even when numeric comparison was used.

Besides that, in general, perl mergesort uses the memory equivalent to two pointers per value, one to pass the value on the stack and the other for the merge. Sort::Key functions use the same plus the memory required for the key (for instance, 4 bytes for an integer key on a 32bits computer or 8 bytes for a floating point number).

Re^10: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-04 10:40:31
how have you created the scalars inside @numbers?

I am creating the array with [rand], so they should already be IVs:

#! perl -slw
use strict;
use sort '_quicksort';

sub mem {
    my( $usage ) =  `tasklist /NH /FI \"pid eq $$\" ` =~ m[ (\S+) \s+ K \s* $ ]x;
    return $usage;
}

$| = 1;

my @a; $#a = 20e6;
$a[ $_ ] = int rand 32767 for 0 .. 20e6;

printf "Mem after building array: %s kb\n", mem;

@a = sort{ $a <=> $b } @a;

printf "Mem after sorting array: %s kb\n", mem;
Also, in not too old versions of perl, there was a bug that caused numbers to be converted to strings inside sort even when numeric comparison was used.

That was it. I'm still using 5.8.6 as my general install. Running the above script under 5.8.8 fixes that problem. Many thanks for your patience.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Reaped: How can i access a session(CGI::Session) variable in other forms?
created: 2006-05-08 12:34:03
This node was taken out by the NodeReaper on 2006-05-08 13-29-40
Reason: [jdporter]: reap: non-root dupe of How can i access a session(CGI::Session) variable in other forms?

You may view the original node and the consideration vote tally.

Re: How can i access a session(CGI::Session) variable in other forms?
created: 2006-05-08 13:03:11
simy, your question is unrelated to this post. To ask new questions, go to the Seekers of Perl Wisdom section and use the form at the end.
Re: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-05 00:04:42
Just curious... would you get the "Out of memory!" result if you simply did this instead of the "uniq" call (other things being equal up to this point):
  my @uvery_big_array = @very_big_array;
In any case, since one way or another you'll need to sacrifice some runtime to economize on memory usage, a last resort might be to simply save @very_big_array to a disk file, run unix "sort -u" on that, and read it back in to the same array variable. (Obviously not an attractive option if portability is an issue, but if this is just an "in-shop" process, unix "sort" has been ported to every popular OS.)
Re^2: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-05 05:06:08
Sort::External could be a good option for that also.

Or, comming back to the hash aproach used by List::MoreUtils::uniq, using an in disk tree as provided by DB_File could be a faster solution specially if the number of duplicates is high.

Re^2: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-25 22:55:43
Hi graff,
Sorry for coming back to you again. I guess I have to resort to your way of doing it. As for your suggestion below:
save @very_big_array to a disk file
How could one implement it efficiently directly from the existing non-uniq array (as @very_big_array)?
Re^3: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-26 04:46:42
use DB_File;
tie my %uniq, "/tmp/uniq"; # you should use File::Temp here!
$uniq{$_} = 1 for @data;
my @uniq = keys %uniq;
Re^4: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-27 10:34:24
Thanks a lot salva, I have read your latest posting with File::Temp.
Re: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-05 02:27:24
Just a reminder. You can prevent the addition of duplicates with [mod://Array::Unique].
 use Array::Unique;
 tie @a, 'Array::Unique';

 @a = qw(a b c a d e f);
 push @a, qw(x b z);
 print "@a\n";          # a b c d e f x z
Re^2: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-05 05:00:18
Array::Unique uses the same approach as List::MoreUtils::uniq to eliminate duplicates, so it is not going to solve the out of memory problem.
Re: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-05 13:38:54
For crunching number... use PDL;
Reaped: Re: Howto Avoid Memory Problem in List::MoreUtils
created: 2006-05-07 21:13:33
This node was taken out by the NodeReaper on 2006-05-07 21-31-32
Reason: [blokhead]: reap: anonymous troll

You may view the original node and the consideration vote tally.

perlmonks.org content © perlmonks.org and Anonymous Monk, BrowserUk, codeacrobat, dave_the_m, explorer, graff, ikegami, NodeReaper, salva

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03