Expanding the Scope of the k-Prototypes Algorithm - Addressing Issues in Cluster Analysis of Mixed-Type Data Arising from Real-World Applications

  • Cluster analysis is a common part of data analysis. It aims to identify unknown structure in data and to determine a partition into groups of objects that are as similar as possible (so-called clusters). Although mixed-type data, involving numerical as well as categorical features, occur frequently in real-world applications, research tends to concentrate on data containing exclusively numerical features. There are comparatively few methods for clustering mixed-type data, with the k-prototypes algorithm being presumably the most widely recognized. The purpose of this cumulative dissertation is to expand the scope of this clustering algorithm. It addresses aspects that are not treated in Huang's original publication of the k-prototypes algorithm, including the validation of the number of clusters, the selection of variables to be clustered, the imputation of incomplete data, algorithm initialization, and the integration of an alternative distance measure into the algorithm routine. These issues are covered because they are prevalent when the k-prototypes algorithm is applied to real-world data. In such clustering tasks, the user lacks knowledge about the optimal number of clusters or the most useful variables for determining the cluster partition. In addition, incomplete data often occur and need to be dealt with. The algorithm’s initialization, originally published with a random choice of initial prototypes, is analyzed to optimize the iterative routine. Additionally, the distance-based partitioning algorithm is extended to ordinal data by changing the distance measure used in the algorithm. The research is conducted by means of simulation studies on artificially generated data as well as exemplary analyses of real-world data.
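
For readers unfamiliar with the method, the mixed-type dissimilarity commonly attributed to Huang's k-prototypes formulation can be sketched as follows: squared Euclidean distance on the numerical features plus a weighted count of mismatches on the categorical features. This is a minimal illustration, not the implementation studied in the dissertation; the function names `mixed_distance` and `assign`, the argument layout, and the default weight `gamma=1.0` are assumptions made here for the example.

```python
def mixed_distance(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Dissimilarity between one object and one prototype.

    x_num, p_num: numerical feature values (sequences of floats).
    x_cat, p_cat: categorical feature values (sequences of labels).
    gamma: hypothetical weight balancing the categorical part
           against the numerical part.
    """
    # Numerical part: squared Euclidean distance.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    # Categorical part: simple matching (number of differing labels).
    categorical_part = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric_part + gamma * categorical_part


def assign(x_num, x_cat, prototypes, gamma=1.0):
    """Return the index of the closest prototype.

    prototypes: list of (p_num, p_cat) pairs.
    """
    distances = [
        mixed_distance(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)
```

The weight `gamma` determines how strongly categorical mismatches count relative to numerical differences, so the same object can be assigned to different prototypes under different weights.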

Meta data
Publishing Institution: IRC-Library, Information Resource Center der Constructor University
Granting Institution: Constructor Univ.
Author: Rabea Aschenbruck
Referee: Adalbert F. X. Wilhelm, Mathias Bode, Gero Szepannek
Advisor: Adalbert F. X. Wilhelm
Persistent Identifier (URN): urn:nbn:de:gbv:579-opus-1012001
Document Type: PhD Thesis
Language: English
Date of Successful Oral Defense: 2024/04/16
Date of First Publication: 2024/06/12
PhD Degree: Statistics
Academic Department: School of Business, Social and Decision Sciences
Call No: 2024/5