
Fast Moving Fronts Comments


ESI Special Topics, January 2004
Citing URL: http://www.esi-topics.com/fmf/2003/january04-KaYeeYeung.html


Ka Yee Yeung answers a few questions about this month's fast moving front in the field of Computer Science.

Field: Computer Science
Article: Model-based clustering and data transformations for gene expression data
Authors: Yeung, KY; Fraley, C; Murua, A; Raftery, AE; Ruzzo, WL
Journal: BIOINFORMATICS, 17: (10) 977-987, OCT 2001
Addresses: Univ Washington, Box 352350, Seattle, WA 98195 USA.
Univ Washington, Seattle, WA 98195 USA.
Insightful Corp, Seattle, WA 98109 USA.
  

This paper has also been named the New Hot Paper in Computer Science for January 2004.


ST:  Why do you think your paper is highly cited?

"Clustering" is a commonly used exploratory data analysis technique for pattern discovery in gene expression datasets. The goal is to find groups, or clusters, of genes that tend to vary together across experiments. The idea is that such genes may be functionally related, or co-regulated. While most previously used clustering algorithms were largely based on heuristics, clustering algorithms based on probability models offer a principled alternative, including an objective way to estimate the number of clusters. Our paper was possibly the first to apply model-based clustering methods to gene expression data, and demonstrated that clusters suggested by model-based methods are similar to those known to have biological significance.
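As a rough illustration of what it means for genes to "vary together across experiments", the sketch below computes the Pearson correlation between expression profiles. The gene names and expression values are made up for this example, and correlation is used here only as one common similarity measure; it is not presented as the method of the paper.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

# Hypothetical expression levels of three genes across five experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 2.0, 1.0],
    "geneB": [2.1, 4.0, 6.2, 3.9, 2.0],  # rises and falls with geneA
    "geneC": [3.0, 2.0, 1.0, 2.0, 3.0],  # moves opposite to geneA
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])
r_ac = pearson(profiles["geneA"], profiles["geneC"])
print(f"geneA vs geneB: {r_ab:.3f}")
print(f"geneA vs geneC: {r_ac:.3f}")
```

In this toy data, geneA and geneB would be natural candidates for the same cluster, while geneC follows the opposite pattern.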

ST:  Does it describe a new discovery or new methodology that's useful to others?



Our work presented a novel application of model-based clustering algorithms to gene expression data. We showed that the quality of clusters from model-based clustering methods is relatively high compared to leading heuristic-based methods. In addition, the probability framework allows us to come up with a reasonable estimate of the optimal number of clusters. We hope that our results will motivate biomedical researchers to use probability-based methods to analyze their data, and that our method will facilitate the data mining process in biomedical sciences.
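The idea of using a probability framework to estimate the number of clusters can be sketched on toy data. The example below fits one-dimensional Gaussian mixtures by expectation-maximization (EM) for several candidate cluster counts and picks the count with the best Bayesian Information Criterion (BIC). It is a minimal sketch under stated assumptions, not the software or models used in the paper: the data are synthetic, the quantile-based initialization and the variance floor are illustrative choices, and BIC is written in the "lower is better" convention.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mixture(data, k, iters=200, var_floor=1e-2):
    """EM for a k-component 1-D Gaussian mixture; returns (log-likelihood, means)."""
    xs = sorted(data)
    n = len(xs)
    grand_mean = sum(xs) / n
    # Assumed initialization: means at evenly spaced quantiles, one broad shared variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from those probabilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, var_floor)  # floor keeps components from collapsing
    loglik = sum(
        math.log(max(sum(w * normal_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)), 1e-300))
        for x in xs
    )
    return loglik, means

def bic(loglik, k, n):
    """BIC, lower is better: k means + k variances + (k - 1) free weights."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)

# Synthetic data (an assumption for illustration): three Gaussian-shaped groups.
z = [NormalDist().inv_cdf((i + 0.5) / 40) for i in range(40)]
data = [center + 0.5 * zi for center in (0.0, 5.0, 10.0) for zi in z]

bic_by_k = {k: bic(fit_mixture(data, k)[0], k, len(data)) for k in range(1, 6)}
best_k = min(bic_by_k, key=bic_by_k.get)
for k, score in bic_by_k.items():
    print(f"k={k}  BIC={score:.1f}")
print("estimated number of clusters:", best_k)
```

The penalty term in BIC grows with the number of parameters, so adding clusters only pays off when they improve the fit substantially. Heuristic methods, by contrast, typically require the number of clusters to be chosen by hand.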

ST:  How did you become involved in this research?

This paper is part of my dissertation project titled "Cluster Analysis of Gene Expression Data" in Computer Science at the University of Washington. Seeking more principled alternatives to the heuristic methods then prevalent in microarray cluster analysis, my dissertation advisor, Dr. Ruzzo, and I decided to study the performance of model-based clustering methods on gene expression datasets, and to collaborate with Drs. Raftery, Fraley, and Murua, who are experts in model-based clustering methods.

ST:  Could you summarize the significance of your paper in layman's terms?

Now that the Human Genome Project has provided a list of genes, the focus has turned to discovering what genes do and how they work. The analysis of gene expression data from microarrays, which simultaneously measure the activity levels of many different genes, is one of the most prevalent approaches to this task. Genes that vary similarly across experimental conditions may be functionally related, and finding groups of such genes is one goal of gene expression data analysis. Clustering, which is the automatic assignment of data points to meaningful groups, is particularly difficult in this setting because of natural variability and measurement error in the data.

Before our paper, the clustering methods used were largely heuristic. Our contribution was to apply model-based clustering methods to this task. Based on established statistical principles, these methods also provide a way of estimating the number of groups in the data, which was previously lacking. We showed that model-based methods produce clusters similar to those known to have biological significance, and subsequent research by others has confirmed the promise of this approach to gene expression data analysis.

Ka Yee Yeung
Bioinformatics Scientist
Department of Microbiology
University of Washington
Seattle, WA, USA

Co-authors:

Chris Fraley
Senior Research Scientist
Department of Statistics
University of Washington
Seattle, WA, USA

Alejandro Murua
Acting Assistant Professor
Department of Statistics
University of Washington
Seattle, WA, USA

Adrian E. Raftery
Professor
Department of Statistics
University of Washington
Seattle, WA, USA

Walter L. Ruzzo
Professor
Department of Computer Science
University of Washington
Seattle, WA, USA
  

