For the math monks, a formula and description of the algorithm are available over there. Alas, I'm not able to fully grok the description and have been unable to tackle implementing it in perl *. I've come across a couple Fortran and VB implementations; although neither language is very perl-like, and thusly would not be well suited for translation. Would anyone be interested in taking up the challenge of writing an N-D or 1-D implementation in perl with a simple interface in perl? i.e; accept a reference to/list of the values to classify and the number of desired classes** and spit back the classified values or class-divisions.
happy hacking!
P.S. For an implementation reference see Milligan's. I cannot attest to the quality of the Fortran but the README can provide some interesting insights as well.
P.P.S. I inquired about this in the cb and discussed it with demerphq and atcroft, mentioning it in passing today Limbic~Region urged me to post it as a potentially interesting diversion for some.
* There is in fact a wrapper for a C implementation however it lacks documentation, seems to require lots of unusual extras and is oriented towards clustering 2-D data (K-means is a generalized** The number of classes can influence the interpretations of the resulting analysis however, at least in 1-D, there are relatively few meaningful values and so it is easy enough to test them by hand for bias. Typical values are 3-6, with many implementations defaulting to 5. There are many reasons for this:
- for 2 classes it'd be easier to use the mean
- larger numbers of classes are difficult to handle visually. If you insist on 8+ classes you are probably better off with an even gradient of divisions.
--
In Bob We Trust, All Others Bring Data.
#!/usr/bin/perl
use warnings;
use strict;
my $num_clust = 4; # number of clusters
my $tol = 0.001; # stopping tolerance
# my @data = map {rand} 1..100;
my @data = (0.1, 0.15, 0.3, 0.35, 0.5, 0.55, 0.7, 0.75);
# initialize by choosing random points the data
my @center = @data[ map {rand @data} 1..$num_clust ];
my $diff;
do {
$diff = 0;
# Assign points to nearest center
my @cluster;
foreach my $point (@data) {
my $closest = 0;
my $dist = abs $point - $center[ $closest ];
for my $idx (1..$#center) {
if (abs $point - $center[ $idx ] < $dist) {
$dist = abs $point - $center[ $idx ];
$closest = $idx;
}
}
push @cluster, [$point, $closest];
}
# compute new centers
foreach my $center_idx (0..$#center) {
my @members = grep {$_->[1] == $center_idx} @cluster;
my $sum = 0;
foreach my $member (@members) {
$sum += $member->[0];
}
my $new_center = @members ? $sum / @members : $center[ $center_idx ];
$diff += abs $center[ $center_idx ] - $new_center;
$center[ $center_idx ] = $new_center;
}
} while ($diff > $tol);
print "Centers are:\n";
foreach my $center_idx (0..$#center) {
print "$center_idx $center[ $center_idx ]\n";
}
-Mark
I don't know if this observation is that helpful, but to me it almost seems as if in this process you are trying to construct the smallest k n-dimensional "spheres" that encompass the data. (I say "spheres" loosely-a spheroid or ellipsoid might be more appropriate, upon thinking a little further about it.) When projected into one dimension, this would seem to appear as a range; in 2 dimensions, as a circle; and in 3+ dimensions, as a sphere. The "amorphous blob" effect you mentioned in the CB when I made this observation earlier could be a result of overlaps in the spheres, and choosing one to encompass a particular data point than another.
As I said, I don't know how helpful the observation may be, but hope it helps.
use Statistics::Descriptive;
$stat = Statistics::Descriptive::Full->new();
$stat->add_data( split /,/, );
%f = $stat->frequency_distribution(4);
$min = 0;
for ( sort { $a <=> $b } keys %f ) {
printf "[%d\t-%d\t] %d\n", $min, $_, $f{$_};
$min = $_ + 1;
}
__DATA__
0, 12, 25, 38, 50, 62, 75, 88, 100
__END__
# prints
[0 -25 ] 3
[26 -50 ] 2
[51 -75 ] 2
[76 -100 ] 2
[0 -296 ] 86 [297 -580 ] 8 [581 -864 ] 4 [865 -1148 ] 1 [1149 -1432 ] 2 [1433 -1716 ] 0 [1717 -2000 ] 1vs. kvale's
[12 -24 ] 33 [42 -76 ] 27 [80 -128 ] 14 [150 -250 ] 9 [280 -460 ] 9 [550 -950 ] 7 [1226 -2000 ] 312, 14, 16, 18, 18, 20, 20, 20, 20, 20, 20, 20, 22, 24, 24, 24, 24, 30, 30, 30, 32, 35, 35, 35, 35, 35, 35, 35, 36, 40, 42, 46, 48, 48, 50, 50, 50, 50, 54, 54, 55, 56, 56, 58, 58, 60, 60, 60, 60, 63, 64, 67, 67, 68, 70, 70, 76, 80, 86, 86, 86, 90, 90, 99, 100, 100, 100, 100, 104, 105, 128, 150, 150, 154, 169, 190, 200, 200, 200, 250, 280, 291, 291, 300, 325, 325, 330, 450, 460, 550, 566, 600, 700, 750, 770, 950, 1226, 1250, 2000, 15, 22, 24
--
In Bob We Trust, All Others Bring Data.
While I haven't been able to turn up any docs for Algorithm::Cluster, I have found some information about the underlying C clustering library by Michiel Jan Laurens de Hoon, that may be useful to you:
The C Clustering Library
Making and Using DNA Microarrays - This is the manual for Cluster 3.0
HTH,
--
In Bob We Trust, All Others Bring Data.
So here's a cluster finder which uses a [wp://Genetic Algorithm] based approach. I've tested it with trivial data point sets that have very clear (to human eyes) clusters. On these, it finds perfect clustering almost every time. I'd be interested to see how well it works on real-world data.
The parameters are not well isolated in this code. They are:
use Carp;
use List::Util qw( shuffle );
use strict;
use warnings;
# an individual represents a distribution of points among clusters.
# that is, it is a specific allocation of points to clusters.
# in the initial population, in each individual, the points are randomly assigned to clusters.
# each individual is an array.
# each element represents a point in the data set, and its value
# is the number of the cluster to which it has been assigned.
my @datapoints;
# The subs in Point:: need to be customized for the type/representation of a "point".
sub Point::set_metric; # "distance" or "area" or something like that. small values mean "close"
sub Point::as_string;
sub Point::ScalarNumber::set_metric
{
my $set = shift;
my @set = @datapoints[@$set];
@set == 0 and return 1;
@set == 1 and return 2;
# RMS
my $total = 0;
my $n = 0;
for my $i ( 1 .. $#set )
{
for my $j ( $i .. $#set )
{
my $dist = abs( $set[$i-1] - $set[$j] );
$total += $dist ** 2;
$n++;
}
}
sqrt( $total / $n )
}
sub Point::ScalarNumber::as_string
{
$_[0]
}
sub Point::NumberPair::set_metric
{
my $set = shift;
my @set = @datapoints[@$set];
@set == 0 and return 1;
@set == 1 and return 2;
# RMS
my $total = 0;
my $n = 0;
for my $i ( 1 .. $#set )
{
for my $j ( $i .. $#set )
{
my $dist2 = ( ( $set[$i-1][0] - $set[$j][0] ) ** 2 )
+ ( ( $set[$i-1][1] - $set[$j][1] ) ** 2 );
$total += $dist2;
$n++;
}
}
sqrt( $total / $n )
}
sub Point::NumberPair::as_string
{
"[$_[0][0],$_[0][1]]"
}
#######################################################################
my @clusters;
sub Ind::new_randomized
{
#@datapoints <= 0 and croak "No datapoints defined!\n";
#@datapoints < 1 and croak "Only one cluster defined!\n";
#@clusters <= 0 and croak "No clusters defined!\n";
#@clusters < 1 and croak "Only one cluster defined!\n";
[ map { int( rand @clusters ) } @datapoints ]
}
sub Ind::clone
{
my $ind = shift;
[ @$ind ]
}
# optional arg: number of points to move
sub Ind::mutate
{
my( $ind, $n ) = @_;
for my $i ( 0 .. ($n||1) )
{
my $j = int( rand @datapoints );
$ind->[$j] = int( rand @clusters );
}
$ind
}
sub Ind::_crossover_points
{
my $l = @datapoints;
my $seglen = 1 + int rand( $l - 1 );
my $start = int rand( $l - $seglen );
( $start .. ($start+$seglen-1) )
}
sub Ind::crossover
{
my( $ind1, $ind2 ) = @_;
my @xo = Ind::_crossover_points();
for my $i ( @xo )
{
( $ind1->[$i], $ind2->[$i] ) =
( $ind2->[$i], $ind1->[$i] )
}
}
sub Ind::fitness
{
my $ind = shift;
my @cluster_points = map {
my $cl = $_;
[ grep { $ind->[$_] eq $cl } 0 .. $#{$ind} ]
} 0 .. $#clusters;
my $total_metric = 0;
for my $ci ( 0 .. $#cluster_points )
{
my $val = Point::set_metric( $cluster_points[$ci] );
$total_metric += $val;
}
1000/$total_metric # convert it to "large = good"
}
sub Ind::display
{
my $ind = shift;
my @cluster_points = map {
my $cl = $_;
[ grep { $ind->[$_] eq $cl } 0 .. $#{$ind} ]
} 0 .. $#clusters;
my $total_metric = 0;
for my $ci ( 0 .. $#cluster_points )
{
my $val = Point::set_metric( $cluster_points[$ci] );
$total_metric += $val;
printf "$ci: Cluster $clusters[$ci]: %5.2f ( ", $val;
print join ' ', map { Point::as_string($_) }
@datapoints[@{$cluster_points[$ci]}];
print " )\n";
}
printf "Total metric: %.2f\n", $total_metric;
$ind
}
#######################################################################
if(0)
{
@datapoints = shuffle( 11..14, 21..24, 31..34, 41..44 );
*Point::set_metric = \&Point::ScalarNumber::set_metric;
*Point::as_string = \&Point::ScalarNumber::as_string;
}
else
{
@datapoints = shuffle(
[ 1, 2], [ 2, 1], [ 2, 3], [ 3, 2],
[ 1,12], [ 2,11], [ 2,13], [ 3,12],
[11, 2], [12, 1], [12, 3], [13, 2],
[11,12], [12,11], [12,13], [13,12],
);
*Point::set_metric = \&Point::NumberPair::set_metric;
*Point::as_string = \&Point::NumberPair::as_string;
}
@clusters = ( 1 .. 4 );
my @pop =
sort { $b->[0] <=> $a->[0] }
map { [ Ind::fitness($_), $_ ] }
map { Ind::new_randomized }
1 .. 100;
#print "Before:"; printf " %.1f", $_->[0] for @pop; print "\n";
# this clones an element of @pop
sub clone { [ $_[0]->[0], Ind::clone($_[0]->[1]) ] }
for my $iter ( 1 .. 200 )
{
# kill the bottom 30:
splice @pop, @pop-30, 30;
# make 10 new ones:
push @pop,
map { [ Ind::fitness($_), $_ ] }
map { Ind::new_randomized }
1 .. 10;
# clone the top 20:
push @pop, map clone($_), @pop[0 .. 19];
# mutate the top 20:
for my $e ( @pop[0 .. 19] )
{
my $n = 1;
unless ( int(rand 2) )
{
$n++;
unless ( int(rand 3) )
{
$n++;
unless ( int(rand 4) )
{
$n++;
}
}
}
#warn "mut $n\n";
Ind::mutate( $e->[1], $n );
$e->[0] = Ind::fitness( $e->[1] );
}
# sort by fitness again:
@pop = sort { $b->[0] <=> $a->[0] } @pop;
# print "Iter $iter: $pop[0][0]\n";
}
# print "\nAfter:"; printf " %.1f", $_->[0] for @pop; print "\n";
Ind::display( $pop[0][1] );
Note that, as it stands, I'm not doing any crossover, only mutation. I'm sure improvements could be made in this area.
1: Cluster 2: 19.21 ( 20 22 18 14 30 24 60 35 40 36 )
2: Cluster 3: 19.76 ( 20 20 35 56 35 15 35 )
3: Cluster 4: 35.09 ( 80 100 58 100 105 150 86 100 99 60 86 )
4: Cluster 5: 509.63 ( 22 169 566 150 16 50 56 76 100 24 20 55 24 70 60 291 200 325 700 200 35 58 90 460 30 1226 950 67 35 68 200 67 24 300 50 54 770 90 750 450 24 60 46 20 12 280 154 20 128 20 600 42 1250 18 86 291 330 325 250 190 550 104 2000 64 ) With 5000 generations, 50 cullings and 20 spawn it yields slightly more reasonable, though still not too helpful results:
0: Cluster 1: 20.00 ( 770 750 )
1: Cluster 2: 2.00 ( 950 )
2: Cluster 3: 198.66 ( 700 64 63 48 20 42 18 20 58 30 200 55 90 70 22 60 200 54 104 18 60 22 291 20 14 154 291 35 20 76 100 35 54 50 86 67 36 325 24 48 35 20 35 50 100 280 56 30 58 60 60 24 20 460 99 32 24 30 24 86 56 600 40 16 20 150 105 15 566 169 300 86 450 330 50 80 128 325 190 550 35 35 150 50 68 100 35 250 67 100 24 200 90 70 12 46 )
3: Cluster 4: 24.00 ( 1226 1250 )
4: Cluster 5: 2.00 ( 2000 ) Given another order of magnitude or more runtime it might reach palatable results ;-)
--
In Bob We Trust, All Others Bring Data.
Indeed, there were a number of other parameters that could be tweaked, and by doing so, I was able to get better results than that.
However, in the end, it turns out there are some special properties of your problem that allow much simpler and more effective solutions. Namely, the fact that your data points are one-dimensional. (I'm assuming they are.)
It means, for example, that (1 3),(2 4) is never an optimal clustering. Neither is (1 2),(1 2).
From these, we can define the following constraints on clusters:
In the following solution, I'm using variance as a measure of the "coherence" or "binding strength" or whatever you want to call it within each cluster. You could use other measures; I wouldn't be surprised if variance doesn't necessarily give the best results. I've seen discussions of clustering that talk about maximizing variance between clusters, in addition to minimizing it within clusters. I have my doubts that that would help in a simple one-dimensional problem like this one.
use Statistics::Lite;
use strict;
use warnings;
my @d =
# I assume the order as given in the OP is not important:
sort { $a <=> $b }
( 12, 14, 16, 18, 18, 20, 20, 20, 20, 20, 20, 20, 22, 24, 24, 24, 24, 30, 30, 30, 32, 35, 35, 35, 35, 35, 35, 35, 36, 40, 42, 46, 48, 48, 50, 50, 50, 50, 54, 54, 55, 56, 56, 58, 58, 60, 60, 60, 60, 63, 64, 67, 67, 68, 70, 70, 76, 80, 86, 86, 86, 90, 90, 99, 100, 100, 100, 100, 104, 105, 128, 150, 150, 154, 169, 190, 200, 200, 200, 250, 280, 291, 291, 300, 325, 325, 330, 450, 460, 550, 566, 600, 700, 750, 770, 950, 1226, 1250, 2000, 15, 22, 24 );
my %count;
$count{$_}++ for @d;
my @e = sort { $a <=> $b } keys %count;
# an individual is an array of arrays of nums.
# each element of the (top-level) array represents a cluster.
# the order of all the numbers, if you were to concat all the arrays,
# is strictly numeric ascending.
# in a 1-d space only, it never makes sense to cluster the
# numbers (1,2,3,4) as (1,3), (2,4).
# the only mutation possible is shifting a number off the
# beggining of one array and pushing it onto the previous array.
# (and the other way).
sub random_bipartition
{
my( $min_size, $ar ) = @_;
my $sel = @$ar - (2*$min_size);
my $p = $min_size + int rand( $sel );
$p > $#{$ar}
? ( [ @{$ar} ], [ ] )
: ( [ @{$ar}[0 .. $p] ], [ @{$ar}[$p+1 .. $#{$ar}] ] )
}
use Data::Dumper;
sub Ind::new_randomized
{
my $nc = shift;
my $n = $nc <= 2 ? 1 : $nc <= 4 ? 2 : $nc <= 8 ? 3 : 4; # 16 max
my @a = ( \@e );
@a = map { random_bipartition( 1<<$n, $_ ) } @a while $n--; # deep magic :-)
my $i = 0;
while ( @a > $nc )
{
unshift @{$a[$i+1]}, @{$a[$i]};
splice @a, $i, 1;
$i++;
}
\@a
}
sub Ind::clone
{
my $ind = shift;
[ map { [@$_] } @$ind ]
}
sub Ind::as_string
{
my $ind = shift;
my $varsum;
my $s;
for my $cl ( @$ind )
{
my $var = 0;
if ( @$cl )
{
my @d;
push @d, ($_) x $count{$_} for @$cl;
$var = int( Statistics::Lite::variance(@d)||0 );
$varsum += $var;
}
$s .= "$var ( @$cl )\n";
}
$s .= "Total variance: $varsum\n";
$s
}
sub Ind::fitness
{
my $ind = shift;
my $sum = 0;
my $empty = 0;
for my $cl ( @$ind )
{
if ( @$cl )
{
my @d;
push @d, ($_) x $count{$_} for @$cl;
$sum += (Statistics::Lite::variance(@d)||0);
}
else
{
$empty++;
}
}
# harshly penalize individuals with the wrong number (too few) of clusters:
$sum *= ( 1 + $empty / 10 );
1_000_000 - $sum # convert to Larger Is Better
}
sub Ind::mutate
{
my( $ind, $n ) = @_;
$n ||= 1;
my $lo = int rand( @$ind - 1);
if ( int rand 2 )
{
# up
my($to,$from) = ( $ind->[$lo+1], $ind->[$lo] );
unshift @$to, pop @$from
while $n-- && @$from
}
else
{
# down
my($to,$from) = ( $ind->[$lo], $ind->[$lo+1] );
push @$to, shift @$from
while $n-- && @$from
}
$ind
}
# this clones an element of @pop
sub clone { [ $_[0]->[0], Ind::clone($_[0]->[1]) ] }
sub do_run
{
my $n_clusters = shift;
my @pop =
sort { $b->[0] <=> $a->[0] }
map { [ Ind::fitness($_), $_ ] }
map { Ind::new_randomized($n_clusters) }
1 .. 60;
for my $gen ( 1 .. 200 )
{
# kill the bottom 30:
splice @pop, @pop-30, 30;
# make 10 new ones:
push @pop,
map { [ Ind::fitness($_), $_ ] }
map { Ind::new_randomized($n_clusters) }
1 .. 10;
# clone the top 20:
push @pop, map clone($_), @pop[0 .. 19];
# mutate the top 20:
for my $e ( @pop[0 .. 19] )
{
Ind::mutate( $e->[1], 1 + int rand 4 );
$e->[0] = Ind::fitness( $e->[1] );
}
# sort by fitness again:
@pop = sort { $b->[0] <=> $a->[0] } @pop;
}
$pop[0][1] # best individual
}
for my $nc ( 2 .. 8 )
{
print "\n$nc Clusters:\n";
my $winner = do_run( $nc );
print Ind::as_string( $winner ), "\n";
}
Anyway! The result is here, the kMeans module is part of the script, but you can use it as a general library for kMeans clustering too, if you'd want to. It's not even limited to 2D. :) All you have to do is copy the bottom half of the program into its own pm file.
Enjoy!
#!/usr/bin/perl -w
# k Means demo program
# Adapted from a VB program by Kardi Teknomo
# http://people.revoledu.com/kardi/tutorial/kMean/index.html
# Ported to Perl/Tk by bart @ Perlmonks
use strict;
# ----------------------- Tk interface -------------------------
use Tk;
my $mw = MainWindow->new( -height => 403, -width => 477,
-title => "k Means Clustering, adapted from tutorial by Kardi Teknomo" );
my $button_reset = $mw->Button(-text => "Clear Data", -relief => "raised", -command => \&reset_click);
$button_reset->place( -x => 176, -y => 24, -height => 25, -width => 65);
{
my $label = $mw->Label(
-text => "Click data in the picture box below. The program will automatically cluster the data by color code."
);
$label->place( -x => 0, -y => 0, -height => 16, -width => 473);
$label = $mw->Label(-text => "Number of clusters");
$label->place( -x => 10, -y => 28, -height => 18, -width => 95);
$label = $mw->Label(-text => "(X, Y)", -justify => 'right');
$label->place( -x => 280, -y => 24, -height => 13, -width => 40);
}
my $label_xy = $mw->Label( -text => "X, Y");
$label_xy->place( -x => 330, -y => 24, -height => 13, -width => 50);
my $clusters_entry = $mw->Entry(-relief => "sunken", -disabledforeground => 'darkgray');
$clusters_entry->place( -x => 112, -y => 24, -height => 24, -width => 24);
$clusters_entry->insert('end', '3');
my $canvas = $mw->Scrolled('Canvas', -scrollbars => '', -background => 'white');
$canvas->place( -x => 0, -y => 56, -height => 403-56, -width => 477);
$canvas->CanvasBind( "", [ \&canvas_click, Ev('x'), Ev('y')]);
$canvas->CanvasBind( "", [ \&canvas_mousemove, Ev('x'), Ev('y')]);
$canvas->CanvasBind( "", [ \&canvas_mouseleave ]);
#----------------------- Event Handlers ------------------------
my @color = qw(red yellow green cyan blue purple gray magenta pink
chartreuse coral darkolivegreen);
# If you want to be able to have more clusters, add more colours.
my(%point, @cluster, $dataset);
sub reset_click {
$canvas->delete('all');
(@cluster, %point, $dataset) = ();
$clusters_entry->configure(-state => 'normal');
}
sub canvas_mousemove {
my $canvas = shift;
my($x, $y) = @_;
$label_xy->configure(-text => "($x, $y)");
}
sub canvas_mouseleave {
my $canvas = shift;
$label_xy->configure(-text => "");
}
sub canvas_click {
my $canvas = shift;
my($x, $y) = @_;
if(!$dataset) {
my $clusters = $clusters_entry->get;
if($clusters !~ /^\d+$/ or $clusters == 0 or $clusters > @color) {
warn "Not a valid value for cluster count";
return;
}
$dataset = Data::Cluster::kMean->new(0+$clusters) or die "Failed to make object";
$clusters_entry->configure(-state => 'disabled');
}
$dataset->add(my $point = [ $x, $y ]);
my %record = ( data => $point, cluster => -1 );
$record{id} = $canvas->createLine($x, $y, $x, $y,
-fill => 'red', -width => 8,
-capstyle => 'round', -tags => ['dot'],
);
$point{$point} = \%record;
foreach my $cluster ($dataset->clusters) {
my $i = $cluster->index;
my $r = $cluster[$i] ||= { obj => $cluster, id =>
$canvas->createText($x, $y, -anchor => 'c', -width => 150, tag => 'label',
-text => 1+$i) };
my($x, $y) = @{$cluster->centroid};
$canvas->coords($r->{id}, $x, $y);
foreach my $p ($cluster->points) {
if($point{$p}{cluster} != $i) {
$point{$p}{cluster} = $i;
$canvas->itemconfigure($point{$p}{id}, -fill => $color[$i]);
}
}
}
$canvas->raise('label', 'dot');
}
#------------------------ Main Program -------------------------
MainLoop;
#------------------------ kMean module -------------------------
package Data::Cluster::kMean;
use List::Util qw(min sum);
sub new {
my $class = shift;
my($max_clusters) = @_;
return bless { max_clusters => $max_clusters, data => [], cluster => [], clusters => [] }, $class;
}
sub add {
# add data point(s) (array references) -- by reference, so make sure they're not reused for something else
my $self = shift;
return unless @_;
unless(ref $_[0] eq 'ARRAY') {
@_ = [ @_ ];
}
foreach my $p (@_) {
push @{$self->{data}}, $p;
push @{$self->{cluster}}, -1; # not in a cluster
if(@{$self->{clusters}} < $self->{max_clusters}) {
my $index = @{$self->{clusters}};
push @{$self->{clusters}}, Data::Cluster::kMean::Cluster->new($self, $index);
$self->{cluster}[-1] = $index;
} else {
my $c;
{
my $j = 0;
my $min_dist;
for my $cluster (@{$self->{clusters}}) {
my $dist = _dist($p, $cluster->centroid);
if(!defined $min_dist or $dist < $min_dist) {
$c = $j;
$min_dist = $dist;
}
} continue {
$j++;
}
}
$self->{clusters}[$c]->invalidate;
$self->{cluster}[-1] = $c;
my $is_still_moving = 1;
while($is_still_moving) {
# this loop will surely converge
my @centroid = map $_->centroid, @{$self->{clusters}};
# assign all data to the new centroids
$is_still_moving = 0;
my $i = 0;
for my $p (@{$self->{data}}) {
my $c;
{
my $min_dist;
for my $j (0 .. $#{$self->{clusters}}) {
my $dist = _dist($p, $centroid[$j]);
if(!defined $min_dist or $dist < $min_dist) {
$c = $j;
$min_dist = $dist;
}
}
}
if($c != $self->{cluster}[$i]) {
$self->{clusters}[$self->{cluster}[$i]]->invalidate;
$self->{clusters}[$c]->invalidate;
$self->{cluster}[$i] = $c;
$is_still_moving = 1;
}
} continue {
$i++;
}
}
}
}
}
sub clusters {
my $self = shift;
return @{$self->{clusters}};
}
sub _dist {
# function
return sqrt(sum map { my $d = $_[0][$_]-$_[1][$_]; $d*$d } 0 .. $#{$_[0]});
}
package Data::Cluster::kMean::Cluster;
use List::Util qw(sum);
sub new {
my $class = shift;
my($parent, $index) = @_;
bless { index => $index, data => $parent->{data}, cluster => $parent->{cluster}, centroid => undef}, $class;
}
sub points {
my $self = shift;
my $index = $self->{index};
my @point = @{$self->{data}}[grep $self->{cluster}[$_] == $index, 0 .. $#{$self->{data}}];
return @point;
}
sub centroid {
my $self = shift;
return $self->{centroid} ||= _centroid($self->points);
}
sub _centroid {
# function
return undef unless @_;
my $dim = @{$_[0]};
return [ map { my $i = $_; sum(map $_->[$i], @_) / @_ } 0 .. $dim-1 ];
}
sub invalidate {
# throw away cache
my $self = shift;
undef $self->{centroid};
}
sub index {
my $self = shift;
return $self->{index};
}
1;
perlmonks.org content © perlmonks.org and atcroft, bart, belg4mit, codeacrobat, jdporter, kvale, marto, moklevat, monkey_boy, planetscape, srdst13, Win
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03