length: 1 digits: 0 quantity: 3
length: 1 digits: 1 quantity: 2
length: 1 digits: 2 quantity: 2
length: 1 digits: 3 quantity: 4
length: 1 digits: 5 quantity: 2
length: 1 digits: 9 quantity: 3
length: 2 digits: 30 quantity: 2
The program:
#!/usr/bin/perl -w
use strict;
my $line;
while ( defined ( $line = <> ) ) {
    chomp $line;
    for my $a ( 1 .. length ( $line ) ) {
        for my $b ( 1 .. ( length ( $line ) + 1 - $a ) ) {
            my $out = substr ( $line, $a - 1, $b );
            my $count = () = $line =~ /$out/g;
            print "length: $b digits: $out quantity: $count\n" if $count > 1;
        }
    }
}
How I run it:
$ ./test < textfile | sort | uniq
I took a chance on reading the web link you posted, but I gave up after 30 seconds, so I will offer only general advice here.
First of all, don't do this:
$ ./test < textfile | sort | uniq
Perl has more than enough tools to do the job that is done with the sort and uniq programs. System calls are expensive, and sometimes the speed of those programs doesn't pay for the cost of invoking them.
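A minimal sketch of what folding the sort | uniq step into the script might look like, assuming the report lines are collected in a hash instead of being printed right away. The sub name repeated_substrings and the sample input are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the sorted, de-duplicated report lines
# for one input string, so no external sort/uniq is needed.
sub repeated_substrings {
    my ($line) = @_;
    my %seen;    # report line => 1; hash keys remove duplicates
    for my $start ( 0 .. length($line) - 1 ) {
        for my $len ( 1 .. length($line) - $start ) {
            my $sub   = substr( $line, $start, $len );
            my $count = () = $line =~ /\Q$sub\E/g;
            $seen{"length: $len digits: $sub quantity: $count\n"} = 1
                if $count > 1;
        }
    }
    my @report = sort keys %seen;    # plain string sort
    return @report;
}

print repeated_substrings('1212');
```

Whether this beats the external pipeline is something to measure rather than assume.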
To read the file, use the open function. If the file is not that big, you can read it entirely into memory like this:
open(IN,"<$file") or die "Cannot read $file: $!\n";
my @content = <IN>;
close(IN);
This can be faster than reading line by line in a while block.
Use an array to keep the results from the string:
$results[0]++ if ( $digit == 0 );
You can even avoid using an if statement to do that. For the next string to process, do a @results = () to start again with zeros.
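One way to read "avoid the if" is to use the digit itself as the array index. A small sketch, assuming the string holds only the digits 0-9; the sub name digit_counts and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each digit 0-9 occurs in a string.
sub digit_counts {
    my ($sequence) = @_;
    my @results = (0) x 10;                    # one slot per digit, all zero
    $results[$_]++ for split //, $sequence;    # the digit itself is the index
    return @results;
}

my @results = digit_counts('3003');
print "digit $_: $results[$_]\n" for 0 .. 9;
```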
Try as much as you can to avoid using next loops with for. Look for the Schwartzian Transform to see how to improve your code. Try using @sequence = split( //, $sequence ) instead of another loop.
And last but not least, check the Benchmark module to test all those things.
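As a rough illustration of the Benchmark module, the sketch below compares two ways of walking over every digit of a string. The test data and the two sub names are made up; cmpthese with a negative count runs each candidate for at least that many CPU seconds.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sequence = join '', map { int rand 10 } 1 .. 1000;    # made-up test data

# Two ways to visit every digit; both return the digit sum so the
# results can be checked against each other.
sub with_split {
    my $sum = 0;
    $sum += $_ for split //, $sequence;
    return $sum;
}

sub with_substr {
    my $sum = 0;
    $sum += substr( $sequence, $_, 1 ) for 0 .. length($sequence) - 1;
    return $sum;
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese( -1, { split => \&with_split, substr => \&with_substr } );
```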
$ ./test < textfile | sort | uniq
Perl has more than enough tools to do the job that is done with the sort and uniq programs. System calls are expensive, and sometimes the speed of those programs doesn't pay for the cost of invoking them.
I think you are making the mistake of repeating what you have heard others say without really understanding it yourself. In this particular case, sort and uniq are likely compiled C programs optimized for a single task and are far superior to Perl. While system calls can be expensive - it is just not the case here.
open(IN,"<$file") or die "Cannot read $file: $!\n"; my @content = <IN>; close(IN);
This can be faster than reading line by line in a while block.
Well, it may speed things up at the expense of memory. I do not know how many lines are in the file, but if individual strings are 9 million characters, this may definitely be the wrong way to go. You still need to loop through the array, so it is not going to avoid the need to loop. The speed savings come from disk I/O.
Try as much as you can to avoid using next loops with for. Look for the Schwartzian Transform to see how to improve your code. Try using @sequence = split( //, $sequence ) instead of another loop.
I am not exactly sure why you think using next inside a for loop is a bad thing. If it is possible to eliminate those cases before entering the loop, then it is advantageous because you don't have a conditional on every iteration. That is seldom the case. The ST is used to speed up sorting routines when the comparison of two elements is expensive. It looks out of place in the context of the rest of what you said, so you should explain why what you are saying has relevance.
Finally, the real problem here is the numbers involved. Using a brute force algorithm, no matter how well it is tuned, to find the longest common substring of a 9 million digit number is going to be extremely slow. If you are interested in the math I will be happy to provide it.
Cheers - L~R
Here's my "regex for the sake of regex" solution:
m{
    (?{ [ {}, 0 ] })
    ^
    \d*?
    (\d+)
    (?(?{ $^R->[0]{$1}++ })(?!))
    (?{ [ $^R->[0], 1 ] })
    (?>
        (?: \d*? \1 (?{ [ $^R->[0], $^R->[1] + 1 ] }) )*
        (?{ print report($1, $^R->[1]) })
    )
    (?!)
}x;
sub report {
    my ($str, $count) = @_;
    return if $count == 1;
    sprintf "length: %d digits: %s quantity: %d\n",
        length($str), $str, $count;
}
Dissection will come later. For now, just breathe it in.
Second, the idiom you don't understand deserves to be in the "Perl Idioms Explained" category. In a nutshell, Perl will do what you mean (DWYM) when you give it proper context. A list in scalar context returns the number of items in the list.
Third - good luck. No matter what algorithm you use (LCS) and what language you use (C/Assembler), this is not going to be a fast answer.
Cheers - L~R
A list assignment in scalar context returns the number of items in the right-hand list.
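A small sketch of that rule applied to the counting idiom from the original program. The sample string is made up.

```perl
use strict;
use warnings;

my $line = '3003003';

# List context: the /g match returns every match as a list.
my @all = $line =~ /3/g;

# Assigning that list to the empty list () and evaluating the assignment
# in scalar context yields the number of elements on the right-hand side,
# so the matches are counted without being stored.
my $count = () = $line =~ /3/g;

print "matches: @all, count: $count\n";
```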
while ( @array ) { ... }
# or
print "Array is empty\n" if ! @array;
but then we could discuss the difference between lists and arrays and it wouldn't be worth it. I think I was able to get the OP to understand even if I wasn't completely accurate. I do think it would make for a good entry in "Perl Idioms Explained".
Cheers - L~R
For purposes of sanity, take a stab at some length of substring that you simply don't expect to be repeated. Maybe somewhere in the 100 to 1000 range. Make a file of all substrings of that length. (Unix) sort the lines of the file.
Now you just need to look at each line's neighbors to see how long their common prefix is. The longest repeated substring is guaranteed to be a common prefix of two neighboring lines.
To find how many times a substring is repeated, you'll have to keep checking subsequent lines until they don't have enough prefix in common to be interesting.
I haven't written any code for this, and probably won't. But I hope the description is helpful.
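The description above could be sketched roughly as follows. This in-memory version only suits small inputs; for a multi-megabyte string the substrings would go to a file and through the Unix sort instead. Both sub names and the $window parameter are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the common prefix of two strings.
sub common_prefix_length {
    my ( $x, $y ) = @_;
    my $max = length($x) < length($y) ? length($x) : length($y);
    my $i   = 0;
    $i++ while $i < $max && substr( $x, $i, 1 ) eq substr( $y, $i, 1 );
    return $i;
}

# Hypothetical helper: longest repeated substring, capped at $window
# characters (the length you do not expect to see repeated).
sub longest_repeat_by_sorting {
    my ( $string, $window ) = @_;

    # One substring per start position; those near the end are shorter.
    my @subs = map { substr( $string, $_, $window ) } 0 .. length($string) - 1;
    @subs = sort @subs;

    # The longest repeat is a common prefix of two neighbouring lines.
    my $best = '';
    for my $i ( 1 .. $#subs ) {
        my $len = common_prefix_length( $subs[ $i - 1 ], $subs[$i] );
        $best = substr( $subs[$i], 0, $len ) if $len > length $best;
    }
    return $best;
}

print longest_repeat_by_sorting( '3121213', 4 ), "\n";    # prints 121
```

Because every start position gets its own line, this sketch also reports repeats whose occurrences overlap.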
Here is a simple basic test for "is there an $n-digit repeated substring in $string?":
$string =~ /(.{$n}).*\1/;
That is: find an $n-digit substring, capture it as $1, then look forward in the string for another copy of $1. This is a reasonably efficient way to search for a repeated substring of a given length. However on a 9.2MB string that is still going to take a lot of work.
Note that this won't find an overlapping repeat, such as the repeated "121" in the string "3121213"; a more complex regexp could find such repeats, but would be much slower working on large strings.
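To make the overlap caveat concrete, here is a small sketch that wraps the test in a sub and sets it beside one possible overlap-aware variant. The variant is an assumption of this sketch, not the regexp alluded to above, and both sub names are hypothetical.

```perl
use strict;
use warnings;

# Non-overlapping test, using the regexp shown above.
sub has_repeat {
    my ( $string, $n ) = @_;
    return $string =~ /(.{$n}).*\1/ ? 1 : 0;
}

# Assumed overlap-aware variant: capture inside a lookahead so the first
# copy consumes nothing, then require the second copy to start at least
# one character later.
sub has_repeat_overlapping {
    my ( $string, $n ) = @_;
    return $string =~ /(?=(.{$n})).+\1/ ? 1 : 0;
}

for my $n ( 2, 3 ) {
    printf "n=%d plain=%d overlapping=%d\n", $n,
        has_repeat( '3121213', $n ),
        has_repeat_overlapping( '3121213', $n );
}
# prints:
# n=2 plain=1 overlapping=1
# n=3 plain=0 overlapping=1
```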
One approach to finding the longest repeat would be to iterate $n upwards from 1:
sub longestrepeat {
    my($string) = @_;
    for (my $n = 1; 1; ++$n) {
        return $n - 1 unless $string =~ /(.{$n}).*\1/;
    }
}
Other approaches to consider if that isn't fast enough are a) to do a binary chop (probably not useful in this case, since I think $n is likely to be smaller than log_2(digits)), or b) to start each subsequent search at the point the previous one succeeded (not too hard to code, but not likely to gain much).
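For reference, option (a) might look something like the sketch below. It relies on the fact that a repeat of length $n implies a repeat of every shorter length. The sub name and the $max cap are assumptions; the cap is there because Perl limits the size of a {n} quantifier, so the search cannot simply start at half the string length for a very long input.

```perl
use strict;
use warnings;

# Hypothetical binary-chop variant of longestrepeat().
sub longestrepeat_chop {
    my ( $string, $max ) = @_;
    $max = 1000 unless defined $max;    # assumed cap on the repeat length

    my $lo = 0;
    my $hi = int( length($string) / 2 );    # a non-overlapping repeat fits twice
    $hi = $max if $hi > $max;

    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi + 1 ) / 2 );    # round up so the loop ends
        if ( $string =~ /(.{$mid}).*\1/ ) {
            $lo = $mid;        # a repeat this long exists; try longer
        }
        else {
            $hi = $mid - 1;    # no repeat this long; try shorter
        }
    }
    return $lo;
}

print longestrepeat_chop('1234512399'), "\n";    # prints 3
```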
You can extend the regexp to search for 3 copies of the same substring:
$string =~ /(.{$n}).*\1.*\1/;
.. or more, by including more copies of the .*\1 construct.
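The pattern for an arbitrary number of copies can be built with the repetition operator. A sketch, with a hypothetical sub name and made-up sample input:

```perl
use strict;
use warnings;

# Hypothetical helper: does some $n-character substring occur at least
# $copies times, without overlap, in $string?
sub has_copies {
    my ( $string, $n, $copies ) = @_;

    # One capture group, followed by ($copies - 1) repetitions of ".*\1".
    my $pattern = "(.{$n})" . '.*\1' x ( $copies - 1 );
    return $string =~ /$pattern/ ? 1 : 0;
}

print has_copies( '12a12b12', 2, 3 ), "\n";    # prints 1
```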
In general, however, I consider it extremely unlikely that you'll discover any patterns special to the digits of Mersenne primes expressed in decimal, beyond the features that any Mersenne number would have (i.e. the terminating digit patterns implicit in a number of the form 2^n - 1). And of course doing the same thing in binary would be very boring. :)
Hugo
perlmonks.org content © perlmonks.org and glasswalk3r, hv, japhy, Limbic~Region, Roy Johnson, Yzzyx