
Fast Breaking Comments

By George C. Tseng & Wing H. Wong

ESI Special Topics, August 2006
Citing URL - http://www.esi-topics.com/fbp/2006/august06-Tseng_Wong.html

George C. Tseng & Wing H. Wong answer a few questions about this month's fast breaking paper in the field of Mathematics.



Field: Mathematics
Article Title: Tight clustering: A resampling-based approach for identifying stable and tight patterns in data
Authors: Tseng, GC;Wong, WH
Journal: BIOMETRICS
Volume: 61
Issue: 1
Page: 10-16
Year: MAR 2005
* Univ Pittsburgh, Dept Biostat, Pittsburgh, PA 15261 USA.
* Univ Pittsburgh, Dept Human Genet, Pittsburgh, PA 15261 USA.
* Harvard Univ, Dept Stat, Cambridge, MA 02138 USA.
* Harvard Univ, Dept Biostat, Cambridge, MA 02138 USA.

ST:  Why do you think your paper is highly cited?

Many researchers have long been aware of the advantages of re-sampling evaluation in cluster analysis. Such information has been especially useful for post-clustering validation and for estimating the number of clusters. Until now, however, no one had used this information directly to form the clusters themselves, an approach that naturally uncovers more meaningful information during data mining.

We think our paper is highly cited because it is the first systematic attempt at this approach and it reveals a new concept in cluster analysis. It has shown great success and feasibility in high-throughput genomic data, especially in microarray analysis. It is expected to also apply to other high-dimensional, complex data sets.

ST:  Does it describe a new discovery, methodology, or synthesis of knowledge?

Tight clustering is a new methodology for cluster analysis. Its novelty comes from a systematic integration of information from the repeated clustering of sub-samples. Its concept is very different from traditional clustering methods, in that it directly searches for tight clusters in a sequential manner, making estimation of the number of clusters secondary. The method automatically allows some of the objects not to be assigned to any cluster and thus avoids dilution of information from these "noise" objects.
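The sequential idea described above can be illustrated with a small Python sketch. It is not the authors' published algorithm, only a simplified greedy stand-in: it assumes a table of co-membership frequencies is already available (the numbers below are made up for illustration), and the function name, the 0.9 threshold, and the minimum cluster size are all hypothetical choices.

```python
def extract_tight_clusters(freq, threshold=0.9, min_size=2):
    """Peel off tight clusters one at a time; leftovers stay unassigned.

    freq[i][j] is the (assumed, precomputed) fraction of sub-sample
    clusterings in which objects i and j landed in the same cluster.
    """
    remaining = set(range(len(freq)))
    clusters = []
    while True:
        best = []
        for seed in sorted(remaining):
            group = [seed]
            for cand in sorted(remaining - {seed}):
                # admit a candidate only if it is stable with every member
                if all(freq[cand][m] >= threshold for m in group):
                    group.append(cand)
            if len(group) > len(best):
                best = group
        if len(best) < min_size:
            break  # nothing tight enough is left
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters, sorted(remaining)


# Illustrative, hand-made frequencies for six objects (not real results):
# objects 0-2 and 3-4 are stable groups, object 5 drifts between them.
toy_freq = [
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.5],
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.4],
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 1.0, 0.95, 0.5],
    [0.0, 0.0, 0.0, 0.95, 1.0, 0.4],
    [0.5, 0.4, 0.6, 0.5, 0.4, 1.0],
]
clusters, noise = extract_tight_clusters(toy_freq)
print(clusters, noise)  # [[0, 1, 2], [3, 4]] [5]
```

Note that the number of clusters is never supplied: the loop simply stops when no sufficiently stable group remains, and object 5 is left as "noise" rather than forced into a cluster.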

ST:  Could you summarize the significance of your paper in layman’s terms?

Cluster analysis is an important tool for information extraction in data mining. It groups similar objects together according to their attributes or behavior.

Our major contribution in this paper is to improve clustering performance by assessing the results of repeatedly clustering multiple sub-samples. Multiple (for instance, ten) sub-samples of the data are generated, and each is clustered separately. Objects that truly belong to a good, tight cluster should be grouped together in almost every sub-sample, whereas "random" cluster relationships do not recur and are thus excluded. Consequently, the most important patterns are better extracted from the data. Our method works particularly well in large and complex data sets.
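To make the sub-sampling idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the method from the paper: the data are one-dimensional toy values, the base clusterer is a tiny k-means written for this example, and the helper names, the 70% sub-sample fraction, and the seed are all hypothetical. For each pair of objects it records how often the two are placed in the same cluster across ten sub-sample clusterings.

```python
import random
from itertools import combinations


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means (assumed base clusterer); returns k centers."""
    ordered = sorted(values)
    # deterministic start: spread the initial centers over the sorted values
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # keep the old center if a group happens to be empty
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def comembership(data, k, n_subsamples=10, fraction=0.7, seed=0):
    """Fraction of sub-sample clusterings that put each pair together."""
    rng = random.Random(seed)
    n = len(data)
    together = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(n_subsamples):
        sample = rng.sample(range(n), int(fraction * n))
        centers = kmeans_1d([data[i] for i in sample], k)
        # label every object using the centers learned on the sub-sample
        labels = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in data]
        for i, j in together:
            if labels[i] == labels[j]:
                together[(i, j)] += 1
    return {pair: count / n_subsamples for pair, count in together.items()}


# Two tight groups (indices 0-2 and 3-5) plus one in-between object (index 6).
toy = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 5.05]
freq = comembership(toy, k=2)
print(freq[(0, 1)], freq[(3, 5)], freq[(0, 3)])  # 1.0 1.0 0.0
print(freq[(0, 6)], freq[(3, 6)])  # in-between object: depends on the draws
```

Pairs inside a tight group are clustered together in every sub-sample, pairs from different groups never are, and the in-between object has no guaranteed partner. Frequencies of this kind are what allow unstable relationships to be filtered out.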

ST:  How did you become involved in this research, and were any problems encountered along the way?

We became interested in this approach after reading Robert Tibshirani’s paper on prediction strength for estimating the number of clusters ("Cluster Validation by Prediction Strength," Technical Report, Tibshirani et al., Statistics Department, Stanford University, 2001). There, the authors applied a re-sampling approach and the idea of cross-validation to evaluate which number of clusters gives the most stable clustering result.

The approach was quite interesting but was limited only to the estimation of the number of clusters, which is usually not the most important task in the analysis of large complex datasets, including the microarray data we were dealing with at that time.

We then turned our attention to utilizing the re-sampling information to directly produce good and tight clusters in a sequential manner. As a result, our approach is more robust than other clustering methods, because selecting the correct number of clusters becomes secondary.

We encountered two major difficulties in this project. The first was developing a systematic mechanism to incorporate the complex information from re-sampling. The second was adequately demonstrating the improvement offered by our new method, since assessing the performance of different clustering methods is a well-known and difficult issue.

ST:  Are there any social or political implications for your research?

So far, our method has mostly been used on large datasets from genomic or biological experiments. Cluster analysis is an early step in data mining, used to extract genetic information from the data for further validation or hypothesis generation. The goal is to fully understand the underlying disease mechanism, which in turn will promote new drug discoveries or better treatment selections.

Conceptually, tight clustering can be applied in any area with a need for clustering large, complex data. These include data from marketing research, internet information, social network analysis, and image analysis. Recently, we have expanded its use to an earthquake dataset in Taiwan.

George C. Tseng, Ph.D.
Assistant Professor
Department of Biostatistics 
University of Pittsburgh
Pittsburgh, PA, USA

Wing H. Wong, Ph.D.
Professor
Department of Statistics
Stanford University
Stanford, CA, USA
