'grouping' substrings?
Anonymous Monk
created: 2006-02-01 10:51:35
Hi all!
I have a string :
$seq="IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIMMMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOMMMMMMMMI";

and I want to find all groups of MMMMMM. I don't want to find only every position in the string that has 'M', i.e. pos6, pos7, pos8, pos9 etc but I want to have something like:
1st group : pos 7-15
2nd group : pos 23-34
3rd group : pos 45-55
etc
How can this be done?
Re: 'grouping' substrings?
created: 2006-02-01 10:54:10
You might like Perl's index function.
Re^2: 'grouping' substrings?
created: 2006-02-01 11:50:08

Using [doc://index] would look something like the following:

sub using_index {
   our $seq; *seq = \$_[0];
   my @groups;

   my $pos = -1;
   my $start = -1;

   for (;;) {
      my $new_pos = index($seq, 'M', $pos);

      if ($new_pos < 0) {
         if (defined($start)) {
            push(@groups, [ $start, $pos ]);
         }
         last;
      }

      if ($start < 0) {
         $start = $new_pos;
      }
      elsif ($new_pos - $pos > 1) {
         push(@groups, [ $start, $pos ]);
         $start = $new_pos;
      }

      $pos = $new_pos + 1;
   }

   return @groups;
}

It would be simpler if there was a function that returned the next character which isn't 'M'.

As you can guess, it's much slower than the regexp approach. The regexp approach is 160% faster than (i.e. 2.6 times the speed of) the index method on the input you provided.

Benchmark code:

use strict;
use warnings;

use Benchmark qw( cmpthese );

sub using_index {
   our $seq; *seq = \$_[0];
   my @groups;

   my $pos = -1;
   my $start = -1;

   for (;;) {
      my $new_pos = index($seq, 'M', $pos);

      if ($new_pos < 0) {
         if (defined($start)) {
            push(@groups, [ $start, $pos ]);
         }
         last;
      }

      if ($start < 0) {
         $start = $new_pos;
      }
      elsif ($new_pos - $pos > 1) {
         push(@groups, [ $start, $pos ]);
         $start = $new_pos;
      }

      $pos = $new_pos + 1;
   }

   return @groups;
}

sub using_regexp {
   our $seq; *seq = \$_[0];
   my @groups;
   push(@groups, [ $-[0], $+[0] ]) while $seq =~ /M+/g;
   return @groups;
}

{
   my $seq = "IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIMMMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOMMMMMMMMI";

   print("using_index\n");
   print("-----------\n");
   printf("%d to %d\n", @$_) foreach using_index($seq);

   print("\n");

   print("using_regexp\n");
   print("------------\n");
   printf("%d to %d\n", @$_) foreach using_regexp($seq);

   print("\n");

   cmpthese(-3, {
      using_index  => sub { my @groups = using_index  $seq; 1; },
      using_regexp => sub { my @groups = using_regexp $seq; 1; },
   });
}

Benchmark results:

               Rate  using_index using_regexp
using_index  2295/s           --         -62%
using_regexp 5995/s         161%           --
Re: 'grouping' substrings?
created: 2006-02-01 10:59:19
I'd make use of Perl's @- and @+ arrays produces by regexes:
my $seq = "...";
my @groups;
push @groups, [$-[0], $+[0]] while $seq =~ /M+/g;
print "$_->[0] to $_->[1]\n" for @groups;
This gives me different values than you've shown, but I believe it's correct.

Jeff [japhy] Pinyan, [id://371157|P.L., P.M., P.O.D, X.S.]: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re^2: 'grouping' substrings?
created: 2006-02-01 11:15:15
Yes, don't mind what I wrote, it was just an example... will try your code ASAP. Will also check index function.. thanx to both of you!
Re^3: 'grouping' substrings?
created: 2006-02-01 11:21:56
Sorry to bother you again, but it doesn't seem to work. For example, the first group gives 5-16, while it should be 5-15, the second 32-45, while it should be 32-44 from what I can calculate... Are my maths poor??? Also, I can't understand what $-[0, $+0] mean... Any tips ? Sorry, I'm just beggining Perl...
Re^4: 'grouping' substrings?
created: 2006-02-01 11:25:21
I hope this isn't too obvious, but how about subtracting one from $+[0]? :) An elegant way to do this is --$+[0]. (hey, that kinda looks like a flower on it's side...)
Re^5: 'grouping' substrings?
created: 2006-02-01 12:02:50
--$+[0] doesn't work ("Modification of a read-only value attempted"), but $+[0] - 1 will.
Re^4: 'grouping' substrings?
created: 2006-02-01 12:04:47

[ ... ] means "Construct an array with '...' as content, and return a reference to it." Refer to perldata for examples.

$-[0] and $+[0] are elements of the special arrays @- and @+. Refer to perlvar for more information.

Re: 'grouping' substrings?
created: 2006-02-01 11:29:40

I have come up with this. I dont know whether using $& is effecient or not.

use strict;
use warnings;
my $seq="IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOO
+OOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIM
+MMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOO
+OOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMM
+MMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOO
+OMMMMMMMMI";
while ($seq=~/(M+)/g)  {
	my $l = pos($seq);
	print $l-length($&)+1," to ",$l,$/;
}

Regards,
Murugesan Kandasamy
use perl for(;;);

Re^2: 'grouping' substrings?
created: 2006-02-01 11:42:28
I dont know whether using $& is effecient or not.

Using $& is not efficient, and usually to be avoided. See the entry in perlvar for details.

Re^3: 'grouping' substrings?
created: 2006-02-01 11:52:44

That's not exactly true. $& is only inefficient if you have another regexp in your program which doesn't capture.

However, it's use is discouraged, since captures can perform the same task without the "effect at a distance" of $&.

Re: 'grouping' substrings?
created: 2006-02-01 19:30:17
Others seem to have interpreted this as "find all groups of one or more M's".

On the off chance that you actually meant 6 or more M's, try this modification of [japhy]'s solution:

my $seq = "...";
my @groups;
push @groups, [$-[0], $+[0]-1] while $seq =~ /M{6,}/g;
print "$_->[0] to $_->[1]\n" for @groups;
(where the displayed positions are 0-based.)
Re: 'grouping' substrings?
created: 2006-02-01 21:13:27
Having the luxury of time to consider it ;-) , here was my approach using index.
my @pos;

my $start = index($str, 'M');

while ($start != -1) {
	my $pos;
	my $i = 0;
	
	1 while ($pos = index($str, 'M', $start + $i)) == $start + $i++;
	
	push @pos, [$start, $start + $i-2];
	
	$start = $pos;
}
Re: 'grouping' substrings?
created: 2006-02-02 07:56:22
TIMTOWTDI:
$seq="IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMMMMMOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMMMIIIIIIMMMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOMMMMMMMMI";
$i=0;
$_= $seq;
s/([^M]*)(M*)/{ my $j=$i+length($1);  $i=$j+length($2); $j==$i ? "" : "pos $j-$i\n" }/ge;
print $seq, "\n", $_, "\n";

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

perlmonks.org content © perlmonks.org and Anonymous Monk, Cristoforo, ikegami, japhy, kwaping, murugu, revdiablo, Skeeve, ysth

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03