
Re: Cluster Analysis of Gridded Continuous Fields

To: "'Kim Elmore'" <Kim.Elmore@noaa.gov>, "'S-News Mail List'" <s-news@lists.biostat.wustl.edu>
Subject: Re: Cluster Analysis of Gridded Continuous Fields
From: "Alan Hochberg" <alan.hochberg@prosanos.com>
Date: Tue, 23 May 2006 09:28:39 -0400
In-reply-to: <7.0.1.0.2.20060522220517.02846f50@noaa.gov>

Kim,

One of the difficulties of diagnosing a problem with clustering is that you
have three sub-problems:

1) Choosing the right feature set
2) Choosing the right similarity measure
3) Choosing the right clustering algorithm (and implementing/invoking it
correctly)

Once you get SOMETHING to work, it's pretty easy to start tweaking, doing
controlled experiments where you change parameters for one of these three
things at a time, and seeing which alternative you like best.  (It sounds
like your criterion for "best" clustering is fairly subjective, which is
fine.  There are objective criteria like "silhouette coefficients" out there
if you need them.)  The problem is that if you put your data into a process
with all three of these components together, it's hard to see what's going
on and why it's not working.  So let's break it down.
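For reference, here is a minimal sketch of the silhouette coefficient mentioned above, computed from a precomputed dissimilarity matrix and a vector of cluster labels.  It is written in plain Python only to keep it self-contained; the function name and the toy data are made up for illustration, and the same arithmetic is easy to reproduce in S-Plus or R.

```python
def silhouette_scores(dist, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for every item.

    dist   -- square matrix (list of lists) of pairwise dissimilarities
    labels -- cluster label for each item, same order as dist
    a is the mean dissimilarity to the other members of the item's own
    cluster; b is the smallest mean dissimilarity to any other cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example: four points on a line, dissimilarity = absolute difference.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_scores(dist, [0, 0, 1, 1])   # natural grouping
bad = silhouette_scores(dist, [0, 1, 0, 1])    # deliberately poor grouping
```

Values near 1 indicate items that sit comfortably in their cluster; negative values indicate items that are closer, on average, to another cluster.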

Let's start at the beginning with feature sets and similarity measures.  If
I understand the data set, it consists of pressure as a function of height
over a grid of points.  So, yes, average difference or mean squared
difference in the height of a pressure contour would be a good feature to
start with.  Here is an experiment: First, pick out some representative
pairs of runs, some that you would call "quite similar" to one another, and
others that you would call "quite different" from one another.  Ten pairs
should do it, five of each case.  For each of these pairs, compute the
feature you are planning to use (squared difference in the height at 500 mb,
say) at each grid point, and plot a histogram of those values.  Ideally you
get a skewed but nicely shaped curve, with a peak at a small value for runs
that you judge as "similar" and a peak at a higher value for runs that you
judge as "different".  Or do you see heavy tails and lots of outliers, even
for runs that the eye judges as similar?  Or little or no shift in the peak
position, even when the runs look different to you?
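The histogram experiment can be sketched as follows.  This is an illustration only, in plain Python so that it stands alone; in S-Plus or R the built-in hist() would do the binning and plotting.  The two small grids are invented numbers, not real model output, and all function names are hypothetical.

```python
def squared_differences(field_a, field_b):
    """Per-grid-point squared difference between two equally shaped 2-D fields."""
    return [(a - b) ** 2
            for row_a, row_b in zip(field_a, field_b)
            for a, b in zip(row_a, row_b)]


def histogram(values, n_bins=10):
    """Return (bin_edges, counts) for equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def print_histogram(values, n_bins=10):
    """Crude text histogram: one row of '#' marks per bin."""
    edges, counts = histogram(values, n_bins)
    for left, right, count in zip(edges, edges[1:], counts):
        print(f"{left:9.1f} to {right:9.1f} | {'#' * count}")


# Made-up 3x4 height fields in metres, for illustration only (not real data).
run_a = [[5520.0, 5530.0, 5540.0, 5550.0],
         [5500.0, 5512.0, 5525.0, 5538.0],
         [5480.0, 5495.0, 5510.0, 5526.0]]
run_b = [[5522.0, 5531.0, 5537.0, 5552.0],
         [5499.0, 5515.0, 5524.0, 5540.0],
         [5483.0, 5494.0, 5512.0, 5521.0]]

print_histogram(squared_differences(run_a, run_b), n_bins=5)
```

Run this once for each of the ten pairs and compare where the peak sits for the "similar" pairs versus the "different" ones.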

You can transform your similarity measure: squaring it, say, to put more
emphasis on large differences, or taking a square root or even a log to
compress them, which controls noise and tames outliers.  If this doesn't
get you where you want to be, you may be using the wrong feature.  Maybe you
need to look at difference in pressure at a given height, say.  But until
the histograms make sense in terms of the way your eye judges similarity,
no clustering algorithm can help you.
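To see what such a transform does to the tail, here is a small sketch in plain Python.  The "tail ratio" (largest value divided by the median) is just a crude gauge I am assuming for illustration, not a standard statistic, and the log variant uses log(1 + d) so that a zero difference stays defined.  The numbers are invented.

```python
import math
import statistics

# Candidate transforms for a non-negative per-grid-point difference d.
TRANSFORMS = {
    "identity": lambda d: d,
    "square": lambda d: d * d,   # emphasizes large differences
    "sqrt": math.sqrt,           # compresses large differences
    "log": math.log1p,           # log(1 + d): compresses more strongly still
}


def transform(values, kind):
    """Apply one of the named transforms to every value."""
    return [TRANSFORMS[kind](v) for v in values]


def tail_ratio(values):
    """Largest value divided by the median: a rough measure of tail heaviness."""
    return max(values) / statistics.median(values)


# Invented differences with one large outlier.
diffs = [1.0, 4.0, 9.0, 400.0]
for kind in ("square", "identity", "sqrt", "log"):
    print(kind, round(tail_ratio(transform(diffs, kind)), 2))
```

A transform that shrinks this ratio is pulling the outliers in toward the bulk of the data; one that grows it is doing the opposite.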

Here is another experiment: For your selected ten pairs, plot whatever
feature difference you are planning to use, over your grid of points, as a
color or gray-scale plot, using the image() function of S-Plus (R has an
image() function too).  One possible reason that clustering isn't working
for you is that your grid has "interesting" regions where your eye picks up
on differences between runs, and "boring", relatively flat, featureless
regions that your eye ignores, yet when the computer calculates a
similarity matrix, differences in the "boring" regions swamp the
"interesting" ones.  The cure here is to weight the grid points by their
"interestingness", which may be some kind of gradient.  Looking at these
images will tell you whether the differences that your eye is responding to
are limited to certain regions of the grid space, or are evenly distributed
across it.
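One way the gradient weighting might look, as a sketch under stated assumptions: "interestingness" is taken to be the finite-difference gradient magnitude of a reference field, the grid spacing is taken as one unit in both directions, and the dissimilarity is a weighted mean squared difference.  None of these choices comes from your data; they are placeholders to experiment with.  Plain Python again, with made-up fields and hypothetical function names.

```python
import math


def gradient_magnitude(field):
    """Gradient magnitude at each grid point (unit grid spacing assumed).

    Central differences in the interior, one-sided differences at the edges.
    """
    n_rows, n_cols = len(field), len(field[0])
    grad = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            i0, i1 = max(i - 1, 0), min(i + 1, n_rows - 1)
            j0, j1 = max(j - 1, 0), min(j + 1, n_cols - 1)
            d_row = (field[i1][j] - field[i0][j]) / (i1 - i0) if i1 > i0 else 0.0
            d_col = (field[i][j1] - field[i][j0]) / (j1 - j0) if j1 > j0 else 0.0
            grad[i][j] = math.hypot(d_row, d_col)
    return grad


def weighted_msd(field_a, field_b, weights):
    """Weighted mean squared difference between two fields."""
    num = den = 0.0
    for row_a, row_b, row_w in zip(field_a, field_b, weights):
        for a, b, w in zip(row_a, row_b, row_w):
            num += w * (a - b) ** 2
            den += w
    if den == 0:
        raise ValueError("all weights are zero")
    return num / den


# Made-up reference field: flat on the left, sloping on the right.
ref = [[0.0, 0.0, 0.0, 1.0, 2.0],
       [0.0, 0.0, 0.0, 1.0, 2.0]]
weights = gradient_magnitude(ref)

# Same-sized perturbation, once in the flat region and once on the slope.
flat_change = [[3.0, 0.0, 0.0, 1.0, 2.0],
               [0.0, 0.0, 0.0, 1.0, 2.0]]
slope_change = [[0.0, 0.0, 0.0, 1.0, 5.0],
                [0.0, 0.0, 0.0, 1.0, 2.0]]

print(weighted_msd(ref, flat_change, weights))
print(weighted_msd(ref, slope_change, weights))
```

With pure gradient weights, a difference in a perfectly flat region counts for nothing at all, which may be too drastic; adding a small constant to every weight is one way to soften that.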

Once these images and histograms look the way you expect them to, THEN
you're ready to start clustering.

Obviously I know little about atmospheric science (though a local TV
meteorologist is a drinking buddy).  I have done a fair amount of
clustering.  The principle that I'm following here is that appropriate
visualization of the input data can help you debug problems with ANY complex
"black box" algorithm, whether it's a Cox regression, a neural net, or a
cluster algorithm.  Fortunately, we R/S-Plus folk have good visualization
tools handy.

Good luck and happy clustering.

Alan

Alan Hochberg
Vice President, Research
ProSanos Corp.
225 Market Street
Suite 502
Harrisburg, PA 17101
Tel. 717-635-2124
Fax 717-635-2575
alan.hochberg@prosanos.com
www.prosanos.com


