Making sense of data: Clustering OR A coding challenge
belg4mit
created: 2006-04-03 14:38:02
Often it is helpful or more instructive to examine aggregated or otherwise summarized data en lieu of the raw data set. However, determining the best means of doing so is not always evident, and can strongly influence the outcome. For instance, given the rated maximum occupancies for a bunch of rooms, what would be the best way to divide the range of values into classes? Quantiles (equal number of members in each class)? Nice round or culturally meaningful numbers (12, 25, 50, 75, 100)? There are in fact several algorithmic means of addressing this problem, known as clustering. One of the more common/robust is K-means, also known as Jenks natural breaks (especially amongst cartographers). Outside of select circles K-means seems to be rather unheard of, which is surprising since it is so powerful and general.

For the math monks, a formula and description of the algorithm are available over there. Alas, I'm not able to fully grok the description and have been unable to tackle implementing it in perl *. I've come across a couple Fortran and VB implementations; although neither language is very perl-like, and thusly would not be well suited for translation. Would anyone be interested in taking up the challenge of writing an N-D or 1-D implementation in perl with a simple interface in perl? i.e; accept a reference to/list of the values to classify and the number of desired classes** and spit back the classified values or class-divisions.

happy hacking!

P.S. For an implementation reference see Milligan's. I cannot attest to the quality of the Fortran but the README can provide some interesting insights as well.

P.P.S. I inquired about this in the cb and discussed it with demerphq and atcroft, mentioning it in passing today Limbic~Region urged me to post it as a potentially interesting diversion for some.

* There is in fact a wrapper for a C implementation however it lacks documentation, seems to require lots of unusual extras and is oriented towards clustering 2-D data (K-means is a generalized

** The number of classes can influence the interpretations of the resulting analysis however, at least in 1-D, there are relatively few meaningful values and so it is easy enough to test them by hand for bias. Typical values are 3-6, with many implementations defaulting to 5. There are many reasons for this:

  1. for 2 classes it'd be easier to use the mean
  2. larger numbers of classes are difficult to handle visually. If you insist on 8+ classes you are probably better off with an even gradient of divisions.

--
In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-03 15:32:40
K-means is not esoteric within the clustering community; in fact, it is the first method people turn to because it is easy to implement. Unfortunately, it is also among the weakest of the adaptive clustering methods. It gets stuck in local minima easily, as this program will show if you run it multiple times:
#!/usr/bin/perl

use warnings;
use strict;

my $num_clust = 4; # number of clusters
my $tol = 0.001;    # stopping tolerance

# my @data = map {rand} 1..100;
my @data = (0.1, 0.15, 0.3, 0.35, 0.5, 0.55, 0.7, 0.75);


# initialize by choosing random points the data
my @center = @data[ map {rand @data} 1..$num_clust ];

my $diff;
do {
    $diff = 0;

    # Assign points to nearest center
    my @cluster;
    foreach my $point (@data) {
        my $closest = 0;
        my $dist = abs $point - $center[ $closest ];
        for my $idx (1..$#center) {
            if (abs $point - $center[ $idx ] < $dist) {
                $dist = abs $point - $center[ $idx ];
                $closest = $idx;
            }
        }
        push @cluster, [$point, $closest];
    }

    # compute new centers
    foreach my $center_idx (0..$#center) {
        my @members = grep {$_->[1] == $center_idx} @cluster;
        my $sum = 0;
        foreach my $member (@members) {
            $sum += $member->[0];
        }
        my $new_center = @members ? $sum / @members : $center[ $center_idx ];
        $diff += abs $center[ $center_idx ] - $new_center;
        $center[ $center_idx ] = $new_center;
    }

} while ($diff > $tol);

print "Centers are:\n";
foreach my $center_idx (0..$#center) {
    print "$center_idx    $center[ $center_idx ]\n";
}

-Mark

Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-03 15:34:24

I don't know if this observation is that helpful, but to me it almost seems as if in this process you are trying to construct the smallest k n-dimensional "spheres" that encompass the data. (I say "spheres" loosely-a spheroid or ellipsoid might be more appropriate, upon thinking a little further about it.) When projected into one dimension, this would seem to appear as a range; in 2 dimensions, as a circle; and in 3+ dimensions, as a sphere. The "amorphous blob" effect you mentioned in the CB when I made this observation earlier could be a result of overlaps in the spheres, and choosing one to encompass a particular data point than another.

As I said, I don't know how helpful the observation may be, but hope it helps.

Re^2: Making sense of data: Clustering OR A coding challenge
created: 2006-04-03 15:39:29
See the screen shot here.

--
In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-03 21:27:31
For most problems like this, I move over to using a programming language with more context-specific tools like R. At least a dozen clustering methods are available and there are facilities for estimating the "true" number of clusters. Using it is fairly simple from perl. Just write a little script that you call from the command line that reads a file of numbers that you supply. The output can be graphical or some numeric summary--whatever suits your needs. You can also use the perl module, Statistics::R (basically a bi-directional pipe) to interact without the intermediate running of an R script from the commmand line.

Sean
Re^2: Making sense of data: Clustering OR A coding challenge
Win
created: 2006-04-04 06:16:45
I am quite interested in this. Can you please offer Perl scripts that does this by way of example. Are the graphs that you get from R really much better than Excel? Is there a web page that demonstrates that graphical capabilities of R?
Re^3: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 06:25:43
"Is there a web page that demonstrates that graphical capabilities of R?"

Did you look at the website that srdst13 links to in his post Re: Making sense of data: Clustering OR A coding challenge?
There is screenshot link on the menu, which, oddly enough, displays graphical examples.

Martin
Re^4: Making sense of data: Clustering OR A coding challenge
Win
created: 2006-04-04 06:46:02
I can see that, when using macs, the graphics are very good. Almost as good as Matlab.

Is there anyway that R could produce geographical maps where a rate is indicated by colour and the confidence limits are indicated by something else. Not sure what. Any suggestions? I expect that graph theory hasn't stretched that far yet.
Re^3: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 21:06:29
For graphical examples using R, check out this website. If there are any questions as to whether R can generate better graphs than Excel, that website should answer them.

Sean
Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 02:04:13
Something similar to k-means can be archieved with Statistics::Descriptive and its frequency_distribution.
use Statistics::Descriptive;
$stat = Statistics::Descriptive::Full->new();
$stat->add_data( split /,/,  );
%f   = $stat->frequency_distribution(4);

$min = 0;
for ( sort { $a <=> $b } keys %f ) {
    printf "[%d\t-%d\t] %d\n", $min, $_, $f{$_};
    $min = $_ + 1;
}
__DATA__
0, 12, 25, 38, 50, 62, 75, 88, 100
__END__
# prints 
[0      -25     ] 3
[26     -50     ] 2
[51     -75     ] 2
[76     -100    ] 2
Re^2: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 11:24:50
Interesting, although it seems this method seems to favor outliers. With my sample dataset below it creates more bins with few or no items on the high end of the spectrum whereas nearly everything else continues to be lumped into the first category.
[0      -296    ] 86
[297    -580    ] 8
[581    -864    ] 4
[865    -1148   ] 1
[1149   -1432   ] 2
[1433   -1716   ] 0
[1717   -2000   ] 1
vs. kvale's
[12     -24     ] 33
[42     -76     ] 27
[80     -128    ] 14
[150    -250    ] 9
[280    -460    ] 9
[550    -950    ] 7
[1226   -2000   ] 3
12, 14, 16, 18, 18, 20, 20, 20, 20, 20, 20, 20, 22, 24, 24, 24, 24, 30, 30, 30, 32, 35, 35, 35, 35, 35, 35, 35, 36, 40, 42, 46, 48, 48, 50, 50, 50, 50, 54, 54, 55, 56, 56, 58, 58, 60, 60, 60, 60, 63, 64, 67, 67, 68, 70, 70, 76, 80, 86, 86, 86, 90, 90, 99, 100, 100, 100, 100, 104, 105, 128, 150, 150, 154, 169, 190, 200, 200, 200, 250, 280, 291, 291, 300, 325, 325, 330, 450, 460, 550, 566, 600, 700, 750, 770, 950, 1226, 1250, 2000, 15, 22, 24

--
In Bob We Trust, All Others Bring Data.

Re^3: Making sense of data: Clustering OR A coding challenge
created: 2006-04-05 02:21:04
Then you might be interested in the
$stat->frequency_distribution(\@bins);
notation and add more bins at lower values.
Re^4: Making sense of data: Clustering OR A coding challenge
created: 2006-04-05 10:19:43
Except that the point was to find the natural bins inherent in the data :-P

--
In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 06:23:51
Have you tried Algorithm::Cluster?, it does k-means clustering, specifically look at the kcluster method.


This is not a Signature...
Re^2: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 10:46:30
Yes, and as I mentioned in the OP it is totally undocumented, etc. etc.

--
In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 12:29:27

While I haven't been able to turn up any docs for Algorithm::Cluster, I have found some information about the underlying C clustering library by Michiel Jan Laurens de Hoon, that may be useful to you:


The C Clustering Library
Making and Using DNA Microarrays - This is the manual for Cluster 3.0

HTH,

planetscape
Re^2: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 12:54:37
General information (includning ppm installation information) about Algorithm::Cluster can be found here. Reasonably good documentation can be found in this pdf document. Unfortunately, as the OP noted, it does seem to be aimed at 2 variable problems.
Re^2: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 13:05:31
Thanks, that seems to agree with what I gleaned from reading the source. Although it does imply that weight and mask are optional. Alas, it provides no insight into 1-D klustering (I get no errors nor values back from kcluster).

--
In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 12:45:52

So here's a cluster finder which uses a [wp://Genetic Algorithm] based approach. I've tested it with trivial data point sets that have very clear (to human eyes) clusters. On these, it finds perfect clustering almost every time. I'd be interested to see how well it works on real-world data.

The parameters are not well isolated in this code. They are:

  • Maximum number of clusters (the algorithm is free to find fewer)
  • Number of generations to run
  • Number of individuals to kill, clone, and mutate in each generation
  • In a mutation, how many datapoints to re-cluster


use Carp;
use List::Util qw( shuffle );
use strict;
use warnings;

# an individual represents a distribution of points among clusters.
# that is, it is a specific allocation of points to clusters.
# in the initial population, in each individual, the points are randomly assigned to clusters.

# each individual is an array. 
# each element represents a point in the data set, and its value
# is the number of the cluster to which it has been assigned.

my @datapoints;

# The subs in Point:: need to be customized for the type/representation of a "point".

sub Point::set_metric; # "distance" or "area" or something like that. small values mean "close"
sub Point::as_string;

sub Point::ScalarNumber::set_metric
{
	my $set = shift;
	my @set = @datapoints[@$set];
	@set == 0 and return 1;
	@set == 1 and return 2;
	# RMS
	my $total = 0;
	my $n = 0;
	for my $i ( 1 .. $#set )
	{
		for my $j ( $i .. $#set )
		{
			my $dist = abs( $set[$i-1] - $set[$j] );
			$total += $dist ** 2;
			$n++;
		}
	}
	sqrt( $total / $n )
}
sub Point::ScalarNumber::as_string
{
	$_[0]
}

sub Point::NumberPair::set_metric
{
	my $set = shift;
	my @set = @datapoints[@$set];
	@set == 0 and return 1;
	@set == 1 and return 2;
	# RMS
	my $total = 0;
	my $n = 0;
	for my $i ( 1 .. $#set )
	{
		for my $j ( $i .. $#set )
		{
			my $dist2 = ( ( $set[$i-1][0] - $set[$j][0] ) ** 2 )
			          + ( ( $set[$i-1][1] - $set[$j][1] ) ** 2 );
			$total += $dist2;
			$n++;
		}
	}
	sqrt( $total / $n )
}
sub Point::NumberPair::as_string
{
	"[$_[0][0],$_[0][1]]"
}

#######################################################################

my @clusters;

sub Ind::new_randomized
{
	#@datapoints <= 0 and croak "No datapoints defined!\n";
	#@datapoints <  1 and croak "Only one cluster defined!\n";
	#@clusters   <= 0 and croak "No clusters defined!\n";
	#@clusters   <  1 and croak "Only one cluster defined!\n";
	[ map { int( rand @clusters ) } @datapoints ]
}

sub Ind::clone
{
	my $ind = shift;
	[ @$ind ]
}

# optional arg: number of points to move
sub Ind::mutate
{
	my( $ind, $n ) = @_;
	for my $i ( 0 .. ($n||1) )
	{
		my $j = int( rand @datapoints );
		$ind->[$j] = int( rand @clusters );
	}
	$ind
}

	sub Ind::_crossover_points
	{
		my $l = @datapoints;
		my $seglen = 1 + int rand( $l - 1 );
		my $start = int rand( $l - $seglen );
		( $start .. ($start+$seglen-1) )
	}

sub Ind::crossover
{
	my( $ind1, $ind2 ) = @_;
	my @xo = Ind::_crossover_points();
	for my $i ( @xo )
	{
		( $ind1->[$i], $ind2->[$i] ) =
		( $ind2->[$i], $ind1->[$i] )
	}
}

sub Ind::fitness
{
	my $ind = shift;
	my @cluster_points = map {
		my $cl = $_;
		[ grep { $ind->[$_] eq $cl } 0 .. $#{$ind} ]
	} 0 .. $#clusters;

	my $total_metric = 0;
	for my $ci ( 0 .. $#cluster_points )
	{
		my $val = Point::set_metric( $cluster_points[$ci] );
		$total_metric += $val;
	}

	1000/$total_metric # convert it to "large = good"
}

sub Ind::display
{
	my $ind = shift;
	my @cluster_points = map {
		my $cl = $_;
		[ grep { $ind->[$_] eq $cl } 0 .. $#{$ind} ]
	} 0 .. $#clusters;

	my $total_metric = 0;
	for my $ci ( 0 .. $#cluster_points )
	{
		my $val = Point::set_metric( $cluster_points[$ci] );
		$total_metric += $val;
		printf "$ci: Cluster $clusters[$ci]: %5.2f ( ", $val;
		print join ' ', map { Point::as_string($_) }
			@datapoints[@{$cluster_points[$ci]}];
		print " )\n";
	}
	printf "Total metric: %.2f\n", $total_metric;

	$ind
}

#######################################################################

if(0)
{
@datapoints = shuffle( 11..14, 21..24, 31..34, 41..44 );
*Point::set_metric = \&Point::ScalarNumber::set_metric;
*Point::as_string = \&Point::ScalarNumber::as_string;
}
else
{
@datapoints = shuffle(
	[ 1, 2], [ 2, 1], [ 2, 3], [ 3, 2],
	[ 1,12], [ 2,11], [ 2,13], [ 3,12],
	[11, 2], [12, 1], [12, 3], [13, 2],
	[11,12], [12,11], [12,13], [13,12],
);
*Point::set_metric = \&Point::NumberPair::set_metric;
*Point::as_string = \&Point::NumberPair::as_string;
}

@clusters = ( 1 .. 4 );

my @pop =
	sort { $b->[0] <=> $a->[0] }
	map { [ Ind::fitness($_), $_ ] }
	map { Ind::new_randomized }
	1 .. 100;

#print "Before:"; printf " %.1f", $_->[0] for @pop; print "\n";

	# this clones an element of @pop
	sub clone { [ $_[0]->[0], Ind::clone($_[0]->[1]) ] }

for my $iter ( 1 .. 200 )
{
	# kill the bottom 30:
	splice @pop, @pop-30, 30;

	# make 10 new ones:
	push @pop,
	map { [ Ind::fitness($_), $_ ] }
	map { Ind::new_randomized }
	1 .. 10;

	# clone the top 20:
	push @pop, map clone($_), @pop[0 .. 19];

	# mutate the top 20:
	for my $e ( @pop[0 .. 19] )
	{
		my $n = 1;
		unless ( int(rand 2) )
		{
			$n++;
			unless ( int(rand 3) )
			{
				$n++;
				unless ( int(rand 4) )
				{
					$n++;
				}
			}
		}
		#warn "mut $n\n";
		Ind::mutate( $e->[1], $n );
		$e->[0] = Ind::fitness( $e->[1] );
	}

	# sort by fitness again:
	@pop = sort { $b->[0] <=> $a->[0] } @pop;
	# print "Iter $iter: $pop[0][0]\n";
}

# print "\nAfter:"; printf " %.1f", $_->[0] for @pop; print "\n";

Ind::display( $pop[0][1] );

Note that, as it stands, I'm not doing any crossover, only mutation. I'm sure improvements could be made in this area.

We're building the house of the future together.
Re^2: Making sense of data: Clustering OR A coding challenge
created: 2006-04-04 14:54:32
With 5 clusters and the dataset in Re^2: Making sense of data: Clustering OR A coding challenge I get some nasty results like: 0: Cluster 1: 18.27 ( 48 54 32 48 30 50 35 50 70 63 )

1: Cluster 2: 19.21 ( 20 22 18 14 30 24 60 35 40 36 )

2: Cluster 3: 19.76 ( 20 20 35 56 35 15 35 )

3: Cluster 4: 35.09 ( 80 100 58 100 105 150 86 100 99 60 86 )

4: Cluster 5: 509.63 ( 22 169 566 150 16 50 56 76 100 24 20 55 24 70 60 291 200 325 700 200 35 58 90 460 30 1226 950 67 35 68 200 67 24 300 50 54 770 90 750 450 24 60 46 20 12 280 154 20 128 20 600 42 1250 18 86 291 330 325 250 190 550 104 2000 64 ) With 5000 generations, 50 cullings and 20 spawn it yields slightly more reasonable, though still not too helpful results:

0: Cluster 1: 20.00 ( 770 750 )

1: Cluster 2: 2.00 ( 950 )

2: Cluster 3: 198.66 ( 700 64 63 48 20 42 18 20 58 30 200 55 90 70 22 60 200 54 104 18 60 22 291 20 14 154 291 35 20 76 100 35 54 50 86 67 36 325 24 48 35 20 35 50 100 280 56 30 58 60 60 24 20 460 99 32 24 30 24 86 56 600 40 16 20 150 105 15 566 169 300 86 450 330 50 80 128 325 190 550 35 35 150 50 68 100 35 250 67 100 24 200 90 70 12 46 )

3: Cluster 4: 24.00 ( 1226 1250 )

4: Cluster 5: 2.00 ( 2000 ) Given another order of magnitude or more runtime it might reach palatable results ;-)

--
In Bob We Trust, All Others Bring Data.

Re^3: Making sense of data: Clustering OR A coding challenge
created: 2006-04-06 16:23:56

Indeed, there were a number of other parameters that could be tweaked, and by doing so, I was able to get better results than that.

However, in the end, it turns out there are some special properties of your problem that allow much simpler and more effective solutions. Namely, the fact that your data points are one-dimensional. (I'm assuming they are.)

It means, for example, that (1 3),(2 4) is never an optimal clustering. Neither is (1 2),(1 2).

From these, we can define the following constraints on clusters:

  1. clusters are simply segments within the ordered data set.
  2. all repetitions of a number must be kept together.

In the following solution, I'm using variance as a measure of the "coherence" or "binding strength" or whatever you want to call it within each cluster. You could use other measures; I wouldn't be surprised if variance doesn't necessarily give the best results. I've seen discussions of clustering that talk about maximizing variance between clusters, in addition to minimizing it within clusters. I have my doubts that that would help in a simple one-dimensional problem like this one.


use Statistics::Lite;
use strict;
use warnings;

my @d =
# I assume the order as given in the OP is not important:
sort { $a <=> $b }
( 12, 14, 16, 18, 18, 20, 20, 20, 20, 20, 20, 20, 22, 24, 24, 24, 24, 30, 30, 30, 32, 35, 35, 35, 35, 35, 35, 35, 36, 40, 42, 46, 48, 48, 50, 50, 50, 50, 54, 54, 55, 56, 56, 58, 58, 60, 60, 60, 60, 63, 64, 67, 67, 68, 70, 70, 76, 80, 86, 86, 86, 90, 90, 99, 100, 100, 100, 100, 104, 105, 128, 150, 150, 154, 169, 190, 200, 200, 200, 250, 280, 291, 291, 300, 325, 325, 330, 450, 460, 550, 566, 600, 700, 750, 770, 950, 1226, 1250, 2000, 15, 22, 24 );

my %count;
$count{$_}++ for @d;

my @e = sort { $a <=> $b } keys %count;

# an individual is an array of arrays of nums.
# each element of the (top-level) array represents a cluster.
# the order of all the numbers, if you were to concat all the arrays,
# is strictly numeric ascending.
# in a 1-d space only, it never makes sense to cluster the
# numbers (1,2,3,4) as (1,3), (2,4).
# the only mutation possible is shifting a number off the
# beggining of one array and pushing it onto the previous array.
# (and the other way).

	sub random_bipartition
	{
		my( $min_size, $ar ) = @_;
		my $sel = @$ar - (2*$min_size);
		my $p = $min_size + int rand( $sel );
		$p > $#{$ar} 
	       	? ( [ @{$ar} ], [ ] )
		: ( [ @{$ar}[0 .. $p] ], [ @{$ar}[$p+1 .. $#{$ar}] ] )
	}

use Data::Dumper;

sub Ind::new_randomized
{
	my $nc = shift;
	my $n = $nc <= 2 ? 1 : $nc <= 4 ? 2 : $nc <= 8 ? 3 : 4; # 16 max

	my @a = ( \@e );
	@a = map { random_bipartition( 1<<$n, $_ ) } @a while $n--; # deep magic :-)


	my $i = 0;
	while ( @a > $nc )
	{
		unshift @{$a[$i+1]}, @{$a[$i]};
		splice @a, $i, 1;
		$i++;
	}

	\@a
}

sub Ind::clone
{
	my $ind = shift;
	[ map { [@$_] } @$ind ]
}

sub Ind::as_string
{
	my $ind = shift;
	my $varsum;
	my $s;
	for my $cl ( @$ind )
	{
		my $var = 0;
		if ( @$cl )
		{
			my @d;
			push @d, ($_) x $count{$_} for @$cl;
			$var = int( Statistics::Lite::variance(@d)||0 );
			$varsum += $var;
		}
		$s .= "$var ( @$cl )\n";
	}
	$s .= "Total variance: $varsum\n";
	$s
}

sub Ind::fitness
{
	my $ind = shift;
	my $sum = 0;
	my $empty = 0;
	for my $cl ( @$ind )
	{
		if ( @$cl )
		{
			my @d;
			push @d, ($_) x $count{$_} for @$cl;
			$sum += (Statistics::Lite::variance(@d)||0);
		}
		else
		{
			$empty++;
		}
	}

	# harshly penalize individuals with the wrong number (too few) of clusters:
	$sum *= ( 1 + $empty / 10 );

	1_000_000 - $sum # convert to Larger Is Better
}

sub Ind::mutate
{
	my( $ind, $n ) = @_;
	$n ||= 1;
	my $lo = int rand( @$ind - 1);
	if ( int rand 2 )
	{
		# up
		my($to,$from) = ( $ind->[$lo+1], $ind->[$lo] );
		unshift @$to, pop @$from
			while $n-- && @$from
	}
	else
	{
		# down
		my($to,$from) = ( $ind->[$lo], $ind->[$lo+1] );
		push @$to, shift @$from
			while $n-- && @$from
	}
	$ind
}

	# this clones an element of @pop
	sub clone { [ $_[0]->[0], Ind::clone($_[0]->[1]) ] }

sub do_run
{
	my $n_clusters = shift;

	my @pop =
		sort { $b->[0] <=> $a->[0] }
		map { [ Ind::fitness($_), $_ ] }
		map { Ind::new_randomized($n_clusters) }
		1 .. 60;

	for my $gen ( 1 .. 200 )
	{
		# kill the bottom 30:
		splice @pop, @pop-30, 30;

		# make 10 new ones:
		push @pop,
		map { [ Ind::fitness($_), $_ ] }
		map { Ind::new_randomized($n_clusters) }
		1 .. 10;

		# clone the top 20:
		push @pop, map clone($_), @pop[0 .. 19];

		# mutate the top 20:
		for my $e ( @pop[0 .. 19] )
		{
			Ind::mutate( $e->[1], 1 + int rand 4 );
			$e->[0] = Ind::fitness( $e->[1] );
		}

		# sort by fitness again:
		@pop = sort { $b->[0] <=> $a->[0] } @pop;
	}

	$pop[0][1] # best individual
}

for my $nc ( 2 .. 8 )
{
	print "\n$nc Clusters:\n";
	my $winner = do_run( $nc );
	print Ind::as_string( $winner ), "\n";
}
We're building the house of the future together.
Re: Making sense of data: Clustering OR A coding challenge
created: 2006-04-08 18:45:49
You pointed to a [http://people.revoledu.com/kardi/tutorial/kMean/index.html|k-Means Clustering tutorial] by Kardi Teknomo, which includes a demo program in VB. I had promised you that I'd try to port the algorithm to Perl... I found the VB program so illustrative, that I ported the whole program to Perl/Tk, even though I've never actually used Tk for anything real. As a result, it took me a few days.

Anyway! The result is here, the kMeans module is part of the script, but you can use it as a general library for kMeans clustering too, if you'd want to. It's not even limited to 2D. :) All you have to do is copy the bottom half of the program into its own pm file.

Enjoy!

#!/usr/bin/perl -w
# k Means demo program
# Adapted from a VB program by Kardi Teknomo
# http://people.revoledu.com/kardi/tutorial/kMean/index.html
# Ported to Perl/Tk by bart @ Perlmonks

use strict;


# ----------------------- Tk interface -------------------------
use Tk;
my $mw = MainWindow->new( -height => 403, -width => 477,
  -title => "k Means Clustering, adapted from tutorial by Kardi Teknomo" );

my $button_reset = $mw->Button(-text => "Clear Data", -relief => "raised", -command => \&reset_click);
$button_reset->place( -x => 176, -y => 24, -height => 25, -width => 65);

{
    my $label = $mw->Label(
      -text => "Click data in the picture box below. The program will automatically cluster the data by color code."
    );
    $label->place( -x => 0, -y => 0, -height => 16, -width => 473);

    $label = $mw->Label(-text => "Number of clusters");
    $label->place( -x => 10, -y => 28, -height => 18, -width => 95);

    $label = $mw->Label(-text => "(X, Y)", -justify => 'right');
    $label->place( -x => 280, -y => 24, -height => 13, -width => 40);
}

my $label_xy = $mw->Label( -text => "X, Y");
$label_xy->place( -x => 330, -y => 24, -height => 13, -width => 50);

my $clusters_entry = $mw->Entry(-relief => "sunken", -disabledforeground => 'darkgray');
$clusters_entry->place( -x => 112, -y => 24, -height => 24, -width => 24);
$clusters_entry->insert('end', '3');

my $canvas = $mw->Scrolled('Canvas', -scrollbars => '', -background => 'white');
$canvas->place( -x => 0, -y => 56, -height => 403-56, -width => 477);
$canvas->CanvasBind( "", [ \&canvas_click, Ev('x'), Ev('y')]);
$canvas->CanvasBind( "", [ \&canvas_mousemove, Ev('x'), Ev('y')]);
$canvas->CanvasBind( "", [ \&canvas_mouseleave ]);


#----------------------- Event Handlers ------------------------

my @color = qw(red yellow green cyan blue purple gray magenta pink
  chartreuse coral darkolivegreen);
# If you want to be able to have more clusters, add more colours.


my(%point, @cluster, $dataset);

sub reset_click {
   $canvas->delete('all');
   (@cluster, %point, $dataset) = ();
   $clusters_entry->configure(-state => 'normal');
}

sub canvas_mousemove {
    my $canvas = shift;
    my($x, $y) = @_;
    $label_xy->configure(-text => "($x, $y)");
}

sub canvas_mouseleave {
    my $canvas = shift;
    $label_xy->configure(-text => "");
}

sub canvas_click {
    my $canvas = shift;
    my($x, $y) = @_;
    if(!$dataset) {
        my $clusters = $clusters_entry->get;
        if($clusters !~ /^\d+$/ or $clusters == 0 or $clusters > @color) {
            warn "Not a valid value for cluster count";
            return;
        }
        $dataset  = Data::Cluster::kMean->new(0+$clusters) or die "Failed to make object";
        $clusters_entry->configure(-state => 'disabled');
    }
    $dataset->add(my $point = [ $x, $y ]);
    my %record = ( data => $point, cluster => -1 );
    $record{id} = $canvas->createLine($x, $y, $x, $y,
      -fill => 'red', -width => 8,
      -capstyle => 'round', -tags => ['dot'],
    );
    $point{$point} = \%record;

    foreach my $cluster ($dataset->clusters) {
        my $i = $cluster->index;
        my $r = $cluster[$i] ||= { obj => $cluster, id =>
          $canvas->createText($x, $y, -anchor => 'c', -width => 150, tag => 'label',
            -text => 1+$i) };
        my($x, $y) = @{$cluster->centroid};
        $canvas->coords($r->{id}, $x, $y);
        foreach my $p ($cluster->points) {
            if($point{$p}{cluster} != $i) {
                $point{$p}{cluster} = $i;
                $canvas->itemconfigure($point{$p}{id}, -fill => $color[$i]);
            }
        }
    }
    $canvas->raise('label', 'dot');
}

#------------------------ Main Program -------------------------
MainLoop;


#------------------------ kMean module -------------------------

package Data::Cluster::kMean;
use List::Util qw(min sum);

sub new {
    my $class = shift;
    my($max_clusters) = @_;
    return bless { max_clusters => $max_clusters, data => [], cluster => [], clusters => [] }, $class;
}

sub add {
    # add data point(s) (array references) -- by reference, so make sure they're not reused for something else
    my $self = shift;
    return unless @_;
    unless(ref $_[0] eq 'ARRAY') {
        @_ = [ @_ ];
    }
    foreach my $p (@_) {
        push @{$self->{data}}, $p;
        push @{$self->{cluster}}, -1;   # not in a cluster
        if(@{$self->{clusters}} < $self->{max_clusters}) {
            my $index = @{$self->{clusters}};
            push @{$self->{clusters}}, Data::Cluster::kMean::Cluster->new($self, $index);
            $self->{cluster}[-1] = $index;
        } else {
            my $c;
            {
                my $j = 0;
                my $min_dist;
                for my $cluster (@{$self->{clusters}}) {
                    my $dist = _dist($p, $cluster->centroid);
                    if(!defined $min_dist or $dist < $min_dist) {
                        $c = $j;
                        $min_dist = $dist;
                    }
                } continue {
                    $j++;
                }
            }
            $self->{clusters}[$c]->invalidate;
            $self->{cluster}[-1] = $c;

            my $is_still_moving = 1;
            while($is_still_moving) {
                # this loop will surely converge

                my @centroid = map $_->centroid, @{$self->{clusters}};

                # assign all data to the new centroids
                $is_still_moving = 0;

                my $i = 0;
                for my $p (@{$self->{data}}) {
                    my $c;
                    {
                        my $min_dist;
                        for my $j (0 .. $#{$self->{clusters}}) {
                            my $dist = _dist($p, $centroid[$j]);
                            if(!defined $min_dist or $dist < $min_dist) {
                                $c = $j;
                                $min_dist = $dist;
                            }
                        }
                    }
                    if($c != $self->{cluster}[$i]) {
                        $self->{clusters}[$self->{cluster}[$i]]->invalidate;
                        $self->{clusters}[$c]->invalidate;
                        $self->{cluster}[$i] = $c;
                        $is_still_moving = 1;
                    }
                } continue {
                    $i++;
                }
            }
        }
    }
}

sub clusters {
    my $self = shift;
    return @{$self->{clusters}};
}

sub _dist {
    # function
    return sqrt(sum map { my $d = $_[0][$_]-$_[1][$_]; $d*$d } 0 .. $#{$_[0]});
}


package Data::Cluster::kMean::Cluster;
use List::Util qw(sum);

sub new {
    my $class = shift;
    my($parent, $index) = @_;
    bless { index => $index, data => $parent->{data}, cluster => $parent->{cluster}, centroid => undef}, $class;
}

sub points {
    my $self = shift;
    my $index = $self->{index};
    my @point = @{$self->{data}}[grep $self->{cluster}[$_] == $index, 0 .. $#{$self->{data}}];
    return @point;
}

sub centroid {
    my $self = shift;
    return $self->{centroid} ||= _centroid($self->points);
}

sub _centroid {
    # function
    return undef unless @_;
    my $dim = @{$_[0]};
    return [ map { my $i = $_; sum(map $_->[$i], @_) / @_ } 0 .. $dim-1 ];
}

sub invalidate {
    # throw away cache
    my $self = shift;
    undef $self->{centroid};
}

sub index {
    my $self = shift;
    return $self->{index};
}

1;

perlmonks.org content © perlmonks.org and atcroft, bart, belg4mit, codeacrobat, jdporter, kvale, marto, moklevat, monkey_boy, planetscape, srdst13, Win

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03