January 2004
Ka Yee Yeung answers
a few questions about this month's fast moving front in the
field of Computer Science.
Field: Computer Science
Article: Model-based clustering and data transformations for gene expression data
Authors: Yeung, KY; Fraley, C; Murua, A; Raftery, AE; Ruzzo, WL
Journal: BIOINFORMATICS, 17: (10) 977-987, OCT 2001
Addresses: Univ Washington, Box 352350, Seattle, WA 98195 USA.
Univ Washington, Seattle, WA 98195 USA.
Insightful Corp, Seattle, WA 98109 USA.
This paper has also been named the New Hot Paper in Computer Science for January 2004.

Why do you think your paper is highly cited?
"Clustering" is a commonly used exploratory data
analysis technique for pattern discovery in gene expression
datasets. The goal is to find groups, or clusters, of genes that
tend to vary together across experiments. The idea is that such
genes may be functionally related, or co-regulated. While most
previously used clustering algorithms were largely based on
heuristics, clustering algorithms based on probability models
offer a principled alternative, including an objective way to
estimate the number of clusters. Our paper was possibly the
first to apply model-based clustering methods to gene expression
data, and demonstrated that clusters suggested by model-based
methods are similar to those known to have biological
significance.
Does it describe a new discovery or new methodology that's useful to others?
Our work presented a novel application of model-based
clustering algorithms to gene expression data. We showed that
the quality of clusters from model-based clustering methods is
relatively high compared to leading heuristic-based methods. In
addition, the probability framework allows us to come up with a
reasonable estimate of the optimal number of clusters. We hope
that our results will motivate biomedical researchers to use
probability-based methods to analyze their data, and that our
method will facilitate the data mining process in biomedical
sciences.
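As an illustration of how a probability model can suggest the number of clusters, the following minimal Python sketch fits one-dimensional Gaussian mixtures with 1 to 4 components to synthetic data and compares them with the Bayesian Information Criterion (BIC). The synthetic data, the names `fit_gmm_1d` and `choose_k`, the variance floor, and the choice of BIC as the criterion are all assumptions made for this example. The interview does not describe the models or software actually used in the paper, and real expression data are multivariate rather than one-dimensional.

```python
import math
from statistics import NormalDist


def mixture_loglik(xs, weights, means, variances):
    """Log-likelihood of the data under a 1-D Gaussian mixture."""
    total = 0.0
    for x in xs:
        dens = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                   for w, m, v in zip(weights, means, variances))
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return total


def fit_gmm_1d(xs, k, iters=200, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM; return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: split the sorted data into k contiguous chunks.
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [1.0 / k] * k
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
                 for c, m in zip(chunks, means)]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)  # hypothetical guard against collapse
    return mixture_loglik(xs, weights, means, variances)


def choose_k(xs, max_k=4):
    """Score each candidate number of clusters by BIC (lower is better here)."""
    n = len(xs)
    scores = {}
    for k in range(1, max_k + 1):
        n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
        scores[k] = -2 * fit_gmm_1d(xs, k) + n_params * math.log(n)
    return min(scores, key=scores.get), scores


# Synthetic "expression levels": two bell-shaped groups centred at 0 and 6.
nd = NormalDist()
base = [nd.inv_cdf((i + 0.5) / 60) for i in range(60)]
data = base + [6.0 + z for z in base]

best_k, scores = choose_k(data)
print("BIC by number of clusters:", {k: round(s, 1) for k, s in scores.items()})
print("Selected number of clusters:", best_k)
```

The penalty term grows with the number of parameters, so an extra cluster is accepted only when it improves the fit enough to justify its added complexity. Heuristic methods typically need the number of clusters to be supplied in advance.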
How did you become involved in this research?
This paper is part of my dissertation project titled
"Cluster Analysis of Gene Expression Data" in Computer
Science at the University of Washington. Seeking more principled
alternatives to the heuristic methods then prevalent in
microarray cluster analysis, my dissertation advisor, Dr. Ruzzo,
and I decided to study the performance of model-based clustering
methods on gene expression datasets, and to collaborate with
Drs. Raftery, Fraley, and Murua, who are experts in model-based
clustering methods.
Could you summarize the significance of your paper in layman's terms?
Now that the Human Genome Project has provided a list of
genes, the focus has turned to discovering what genes do and how
they work. The analysis of gene expression data from microarrays,
which simultaneously measure the activity levels of many
different genes, is one of the most prevalent approaches to this
task. Genes that vary similarly across experimental conditions
may be functionally related, and finding groups of such genes is
one goal of gene expression data analysis. Clustering, which is
the automatic assignment of data points to meaningful groups, is
particularly difficult in this setting because of natural
variability and measurement error in the data.
Before our paper, the clustering methods used were largely
heuristic. Our contribution was to apply model-based clustering
methods to this task. Based on established statistical
principles, these methods also provide a way of estimating the
number of groups in the data, which was previously lacking. We
showed that model-based methods produce clusters similar to
those known to have biological significance, and subsequent
research by others has confirmed the promise of this approach to
gene expression data analysis.
Ka Yee Yeung
Bioinformatics Scientist
Department of Microbiology
University of Washington
Seattle, WA, USA
Co-authors:
Chris Fraley
Senior Research Scientist
Department of Statistics
University of Washington
Seattle, WA, USA
Alejandro Murua
Acting Assistant Professor
Department of Statistics
University of Washington
Seattle, WA, USA
Adrian E. Raftery
Professor
Department of Statistics
University of Washington
Seattle, WA, USA
Walter L. Ruzzo
Professor
Department of Computer Science
University of Washington
Seattle, WA, USA