George C. Tseng & Wing H. Wong answer a
few questions about this month's fast breaking paper in
the field of Mathematics.
From: August 2006
Field: Mathematics
Article Title: Tight clustering: A resampling-based approach for identifying stable and tight patterns in data
Authors: Tseng, GC;Wong, WH
Journal: BIOMETRICS
Volume: 61
Issue: 1
Page: 10-16
Year: MAR 2005
* Univ Pittsburgh, Dept Biostat, Pittsburgh, PA 15261 USA.
* Univ Pittsburgh, Dept Human Genet, Pittsburgh, PA 15261 USA.
* Harvard Univ, Dept Stat, Cambridge, MA 02138 USA.
* Harvard Univ, Dept Biostat, Cambridge, MA 02138 USA.
Why do you think your paper is highly cited?
Many people have long been aware of the advantage of
re-sampling evaluation in cluster analysis. Such information
has been especially useful for post-clustering validation and for
estimating the number of clusters. However, thus far, no one
has directly utilized this information for cluster formation,
which naturally uncovers more meaningful information in the
data.
We think our paper is highly cited because it is the first
systematic attempt at this approach and it reveals a new concept
in cluster analysis. It has shown great success and feasibility
in high-throughput genomic data, especially in microarray
analysis. It is expected to also apply to other
high-dimensional, complex data sets.
Does it describe a new discovery, methodology, or synthesis of
knowledge?
Tight clustering is a new methodology for cluster analysis.
Its novelty comes from a systematic integration of information
from the repeated clustering of sub-samples. Its concept is very
different from traditional clustering methods, in that it
directly searches for tight clusters in a sequential manner,
making estimation of the number of clusters secondary. The
method automatically allows some of the objects not to be
assigned to any cluster and thus avoids dilution of information
from these "noise" objects.
Could you summarize the significance of your paper in layman’s terms?
Cluster analysis is an important tool for information
extraction in data mining. It groups similar objects together
according to their attributes or behavior.
Our major contribution in this paper is to improve clustering
performance by assessing the repeated clustering results of
multiple sub-samples. Multiple (for instance, ten) sub-samples
of the data are generated and each is clustered separately.
Objects that form a genuinely good, tight cluster should almost
always be clustered together in every sub-sample, so the
"random" cluster relationships are excluded.
Consequently, the most important patterns are better extracted
from the data. Our method works particularly well in large and
complex data sets.
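The sub-sampling idea described above can be sketched in a few lines of Python. This is only an illustrative simplification, not the authors' published algorithm: the function names (`kmeans`, `mean_comembership`), the use of plain k-means, and the sub-sample fraction of 0.7 are all assumptions made for the example. Each sub-sample is clustered, every object is then assigned to its nearest resulting centroid, and the fraction of sub-samples in which each pair of objects lands together is recorded.

```python
import numpy as np


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means (illustrative helper); returns k centroids."""
    # Fancy indexing copies the rows, so the input is not modified.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def mean_comembership(data, k, n_subsamples=10, fraction=0.7, seed=0):
    """Average, over sub-samples, of the 0/1 'clustered together' matrix.

    Hypothetical helper: entry (i, j) is the fraction of sub-sample
    clusterings in which objects i and j received the same label.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    total = np.zeros((n, n))
    for _ in range(n_subsamples):
        # Draw a sub-sample without replacement and cluster it.
        idx = rng.choice(n, size=int(fraction * n), replace=False)
        centroids = kmeans(data[idx], k, rng)
        # Classify ALL objects by their nearest sub-sample centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        total += labels[:, None] == labels[None, :]
    return total / n_subsamples


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two tight groups plus a few scattered "noise" objects (made-up data).
    data = np.vstack([
        rng.normal(0.0, 0.1, size=(20, 2)),
        rng.normal(5.0, 0.1, size=(20, 2)),
        rng.uniform(-3.0, 8.0, size=(5, 2)),
    ])
    comembership = mean_comembership(data, k=3)
    # Pairs that are together in (almost) every sub-sample are candidates
    # for a tight cluster; the 0.9 cutoff is an arbitrary example value.
    stable_pairs = comembership >= 0.9
    print(stable_pairs.sum(axis=1))
```

In this sketch, pairs with an average co-membership near 1 are the ones that "almost always" cluster together, while pairs with unstable, intermediate values correspond to the "random" relationships that get excluded.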
How did you become involved in this research, and were any problems
encountered along the way?
We became interested in this approach after reading Robert
Tibshirani’s paper on prediction strength for estimating the
number of clusters (Tibshirani et al., "Cluster Validation by
Prediction Strength," Technical Report, Statistics Department,
Stanford University, 2001). There, the authors applied a
re-sampling approach and a cross-validation idea to evaluate
which number of clusters gives the most stable clustering result.
The approach was quite interesting but was limited to
the estimation of the number of clusters, which is usually not
the most important task in the analysis of large complex
datasets, including the microarray data we were dealing with at
that time.
We then turned our attention to utilizing the re-sampling
information to directly produce good and tight clusters in a
sequential manner. As a result, our approach is more robust:
in contrast to other clustering methods, selecting the correct
number of clusters is secondary.
The major difficulties we encountered in this project were
developing a systematic mechanism to incorporate the complex
information from re-sampling, and adequately demonstrating the
improvement offered by our new method, since assessing the
performance of different clustering methods is a well-known,
difficult problem.
Are there any social or political implications for your research?
So far, our method has mostly been used on large datasets from
genomic or biological experiments. Cluster analysis is an
early step in data mining, used to extract genetic information
from the data for further validation or hypothesis generation.
The aim is to understand the underlying disease mechanism more
fully, which in turn will promote new drug discoveries or better
treatment selections.
Conceptually, tight clustering can be applied in any area
with a need for clustering large, complex data. These include
data from marketing research, internet information, social
network analysis, and image analysis. Recently, we have expanded
its use to an earthquake dataset in Taiwan.
George C. Tseng, Ph.D.
Assistant Professor
Department of Biostatistics
University of Pittsburgh
Pittsburgh, PA, USA
Wing H. Wong, Ph.D.
Professor
Department of Statistics
Stanford University
Stanford, CA, USA
ESI Special Topics,
August 2006
Citing URL - http://www.esi-topics.com/fbp/2006/august06-Tseng_Wong.html