Expanding the Scope of the k-Prototypes Algorithm - Addressing Issues in Cluster Analysis of Mixed-Type Data Arising from Real-World Applications
- Cluster analysis is a common part of data analysis. Its aim is to identify unknown structure in data and to determine a partition into groups of objects that are as similar as possible (so-called clusters). Although mixed-type data, involving numerical as well as categorical features, occur frequently in real-world applications, research tends to concentrate on data containing exclusively numerical features. There are comparatively few methods for clustering mixed-type data, with the k-prototypes algorithm being presumably the most widely recognized. The purpose of this cumulative dissertation is to expand the scope of this clustering algorithm. It addresses aspects that are not treated in Huang's original publication of the k-prototypes algorithm, including validation of the number of clusters, selection of the variables to be clustered, imputation of incomplete data, algorithm initialization, and the integration of an alternative distance measure into the algorithm routine. These issues are covered because they are prevalent when the k-prototypes algorithm is applied to real-world data. In such clustering tasks, the user lacks knowledge of the optimal number of clusters and of the variables most useful for determining the cluster partition. In addition, incomplete data often occur and need to be dealt with. The algorithm’s initialization, originally published with a random choice of initial prototypes, is analyzed in order to optimize the iterative routine. Additionally, the distance-based partitioning algorithm is extended to ordinal data by changing the distance measure it uses. The research is conducted by means of simulation studies on artificially generated data as well as exemplary analyses of real-world data.
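For orientation, the dissimilarity in Huang's original k-prototypes formulation combines the squared Euclidean distance on the numerical features with a weighted count of mismatches on the categorical features. The following is a minimal Python sketch of that measure and of the assignment step only. The function names, the default weight `gamma=1.0`, and the toy data are assumptions made for illustration; this is not code from the dissertation, and it omits prototype updates, initialization, and the extensions discussed above.

```python
def mixed_distance(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Distance between an object and a prototype on mixed-type features.

    Numerical part: squared Euclidean distance.
    Categorical part: number of mismatching categories, weighted by gamma.
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    categorical_part = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return numeric_part + gamma * categorical_part


def assign_to_prototypes(X_num, X_cat, P_num, P_cat, gamma=1.0):
    """Assign each object to the index of its closest prototype."""
    labels = []
    for x_num, x_cat in zip(X_num, X_cat):
        distances = [
            mixed_distance(x_num, x_cat, p_num, p_cat, gamma)
            for p_num, p_cat in zip(P_num, P_cat)
        ]
        # Index of the smallest distance; ties go to the first prototype.
        labels.append(min(range(len(distances)), key=distances.__getitem__))
    return labels


# Toy data (hypothetical): two numerical and one categorical feature.
X_num = [[0.0, 0.1], [5.0, 5.2], [0.2, 0.0]]
X_cat = [["a"], ["b"], ["a"]]
P_num = [[0.0, 0.0], [5.0, 5.0]]
P_cat = [["a"], ["b"]]

labels = assign_to_prototypes(X_num, X_cat, P_num, P_cat, gamma=0.5)
```

The weight `gamma` controls how strongly categorical mismatches count relative to numerical differences, so a suitable value depends on the scale of the numerical features.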