Why do you think your paper is highly cited?
"Clustering" is a commonly used exploratory data analysis technique
for attempting to discover patterns in large, complex data sets such
as gene expression microarray experiments. The explosion of such data
in recent years has spurred the invention of new clustering
algorithms and renewed interest in old ones. Which of these many
methods is best for a particular task or a particular kind of data?
It's easy enough to compare clustering algorithms on data sets where
the "ideal clustering" is known in advance, but these cases are
usually toy examples. Our paper provides one of the few methods
available for comparative evaluation of clustering algorithms on real
data where the "right answer" isn't known in advance.
Does it describe a new discovery or a new methodology that's useful to
others?
It's a new methodology that is applicable to comparing essentially any
clustering algorithms on any particular data set.
What were some of the circumstances that led you to do this research?
In the early work on microarrays, every research group seemed to have
its own favorite clustering method, and all clustering algorithms
find "clusters"; that's their job. But how could you tell whether
those clusters were good ones? We were looking for a more systematic,
data-driven way to evaluate the methods, so that researchers could
have more confidence in their results.
Could you summarize the significance of your paper in layman's terms?
"Clustering" is the process of dividing data points into groups or
clusters so that, hopefully, the points in each group are more
similar to each other than to points in other groups. With luck, this
will lead you to some useful hypotheses about the biological system
you're studying, e.g., the genes in cluster A are related to
such-and-such a function, while those in cluster B serve a different
purpose. Unfortunately, these divisions usually aren't clear-cut, and
different clustering algorithms make different choices. We proposed a
way to test the "quality" of the algorithms without knowing the best
clustering in advance.
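
To make the layman's definition above concrete, here is a minimal Python sketch. It is not the evaluation method from the paper; the data values, the two labelings, and the helper name `mean_pair_distances` are all made up for illustration. It compares the average distance between points in the same group with the average distance between points in different groups, for two plausible groupings of the same data.

```python
from itertools import combinations

def mean_pair_distances(points, labels):
    """Return (mean within-group distance, mean between-group distance)."""
    within, between = [], []
    # Visit every unordered pair of points exactly once.
    for (p, a), (q, b) in combinations(zip(points, labels), 2):
        (within if a == b else between).append(abs(p - q))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical 1-D measurements; the value 5.0 sits midway between two
# obvious groups, so it is not clear-cut which group it belongs to.
points = [1.0, 1.5, 2.0, 5.0, 8.0, 8.5, 9.0]
labels_a = ["A", "A", "A", "A", "B", "B", "B"]  # one algorithm's choice
labels_b = ["A", "A", "A", "B", "B", "B", "B"]  # another algorithm's choice

for name, labels in [("labels_a", labels_a), ("labels_b", labels_b)]:
    within, between = mean_pair_distances(points, labels)
    print(f"{name}: within={within:.2f} between={between:.2f}")
```

In both groupings the points within a group are, on average, closer to each other than to points in the other group, and because the made-up data are symmetric the two groupings score the same on this simple measure. That is the kind of ambiguity the answer describes: a naive similarity check cannot say which division is better.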