Cluster Analysis for Large, High-Dimensional Datasets: Methodology and Applications
Cluster analysis represents one of the most versatile methods in statistical science. It is employed in empirical sciences for the summarization of datasets into groups of similar objects, with the purpose of facilitating the interpretation and further analysis of the data. Cluster analysis is of particular importance in the exploratory investigation of data of high complexity, such as that derived from molecular biology or image databases. Consequently, recent work in the field of cluster analysis has focused on designing algorithms that can provide meaningful solutions for data with high cardinality and/or dimensionality, under the natural restriction of limited resources. The present thesis aims to develop improved methods for the clustering of high-dimensional datasets, as well as further applications of such algorithms in practical settings.
In the first part of the thesis, a detailed review of representative clustering algorithms for the analysis of very large or high-dimensional datasets is provided. Subsequently, a newly developed method for this purpose is described and evaluated. The developed algorithm is based on the principles of projection pursuit and grid partitioning, and focuses on reducing computational requirements for large datasets without loss of performance. In the second part of the thesis, a novel method is presented for generating synthetic datasets with variable structure and clustering difficulty, intended for the evaluation of clustering algorithms. In the third part of the thesis, the applications of cluster analysis to automatic image classification are investigated. A novel system for the semi-supervised annotation of images is described and evaluated. The system is based on a vocabulary of clusters of visual features extracted from images with known classification.