Cluster Analysis for Large, High-Dimensional Datasets: Methodology and Applications

Ilies, Iulian

Cluster analysis is one of the most versatile methods in statistical science. It is used in the empirical sciences to summarize datasets into groups of similar objects, which facilitates the interpretation and further analysis of the data. Cluster analysis is of particular importance in the exploratory investigation of highly complex data, such as data derived from molecular biology or image databases. Consequently, recent work in the field has focused on designing algorithms that can provide meaningful solutions for data with high cardinality and/or dimensionality, under the natural restriction of limited resources. The present thesis aims to develop improved methods for the clustering of high-dimensional datasets, as well as further applications of such algorithms in practical settings. The first part of the thesis provides a detailed review of representative clustering algorithms designed for the analysis of very large or high-dimensional datasets. Subsequently, a newly developed method for this purpose is described and evaluated. The algorithm is based on the principles of projection pursuit and grid partitioning, and focuses on reducing computational requirements for large datasets without loss of performance. The second part of the thesis presents a novel method for generating synthetic datasets with variable structure and clustering difficulty, intended for the evaluation of clustering algorithms. The third part of the thesis investigates applications of cluster analysis to automatic image classification. A novel system for the semi-supervised annotation of images is described and evaluated. The system is based on a vocabulary of clusters of visual features extracted from images with known classification.

Metadata
Publishing Institution:	IRC-Library, Information Resource Center der Jacobs University Bremen
Granting Institution:	Jacobs Univ.
Author:	Iulian Ilies
Referee:	Adalbert Wilhelm, Lars Linsen, Patrick Groenen
Advisor:	Adalbert Wilhelm
Persistent Identifier (URN):	urn:nbn:de:101:1-201307119263
Document Type:	PhD Thesis
Language:	English
Date of Successful Oral Defense:	2010/12/01
Date of First Publication:	2010/12/08
PhD Degree:	Statistics
School:	SHSS School of Humanities and Social Sciences
Library of Congress Classification:	Q Science / QA Mathematics (incl. computer science) / QA273-280 Probabilities. Mathematical statistics / QA276 Mathematical statistics / QA276.3 Graphic methods. Data processing
Call No:	Thesis 2010/31
