Conceptual Clustering with Numeric-and-Nominal Mixed Data -- A New Similarity Based System

Cen Li and Gautam Biswas

Department of Computer Science, Vanderbilt University
Box 1679, Station B,
Nashville, TN 37235

IEEE Transactions on Knowledge and Data Engineering, (to appear, 2001)


This paper presents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. The similarity between pairs of objects is defined using a measure proposed by Goodall for biological taxonomy, which gives greater weight to uncommon feature-value matches and makes no assumptions about the underlying distributions of the feature values. An agglomerative algorithm is employed to construct a concept tree, and a simple distinctness heuristic is used to extract a partition of the data. The performance of SBAC has been studied on real and artificially generated data sets. Results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other schemes illustrate the superior performance of the algorithm.
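
The two ideas named above, weighting matches on uncommon values more heavily and merging objects agglomeratively, can be illustrated with a minimal Python sketch. It is not the SBAC algorithm itself: it handles nominal features only, scores a match with a simplified rarity weighting, combines features with a plain mean instead of the $\chi^2$ combination, and uses average linkage as an assumed merge rule. The function names `rarity_weighted_similarity` and `agglomerate` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rarity_weighted_similarity(data):
    """Pairwise similarity for nominal data (illustrative, not the paper's measure).

    A match on value v scores 1 minus the sum of squared relative
    frequencies of all values that are as rare as v or rarer, so a
    match on an uncommon value scores higher than one on a common value.
    A mismatch scores 0. Per-feature scores are averaged (an assumption).
    """
    n, d = len(data), len(data[0])
    # Relative frequency of every value, computed separately per feature.
    freqs = [
        {v: c / n for v, c in Counter(row[k] for row in data).items()}
        for k in range(d)
    ]

    def match_score(k, v):
        p_v = freqs[k][v]
        return 1.0 - sum(p * p for p in freqs[k].values() if p <= p_v)

    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        total = sum(
            match_score(k, data[i][k]) if data[i][k] == data[j][k] else 0.0
            for k in range(d)
        )
        sim[i][j] = sim[j][i] = total / d
    return sim


def agglomerate(sim):
    """Average-linkage agglomeration; returns the merge history (the tree)."""
    clusters = {i: [i] for i in range(len(sim))}
    next_id = len(sim)
    merges = []

    def avg_sim(a, b):
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the highest average similarity.
        a, b = max(combinations(sorted(clusters), 2), key=lambda ab: avg_sim(*ab))
        merges.append((clusters[a], clusters[b], avg_sim(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy data: "a" is common (3 of 5), "b" is less common (2 of 5).
data = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "w"), ("b", "v")]
sim = rarity_weighted_similarity(data)
merges = agglomerate(sim)
```

In this toy data set, objects 3 and 4 match on the less common value "b" and so end up more similar than objects 0 and 1, which match on the common value "a"; they are therefore merged first. Extracting a partition from the resulting tree, as the distinctness heuristic does, is not covered by the sketch.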

Index Terms: agglomerative clustering, conceptual clustering, feature weighting, interpretation, knowledge discovery, mixed numeric and nominal data, similarity measure, $\chi^2$ combination.

Full Paper (PDF 376832 bytes).