Unsupervised Learning with Mixed Numeric and Nominal Data - A New Similarity Measure

Cen Li and Gautam Biswas

KDD: Techniques and Applications - Proc. First Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract - This paper presents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure proposed by Goodall for biological taxonomy [15], which gives greater weight to uncommon feature value matches in similarity computations and makes no assumptions about the underlying distributions of the feature values, is adopted to define the similarity between pairs of objects. An agglomerative algorithm is employed to construct a dendrogram, and a simple distinctness heuristic is used to extract a partition of the data. The performance of SBAC has been studied on real and artificially generated data sets. Results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other schemes illustrate the superior performance of the algorithm.
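
The idea of giving greater weight to uncommon feature value matches can be illustrated with a minimal sketch for nominal features only. This is not the measure used in the paper: the scoring rule (a match on a value with relative frequency p scores 1 - p^2, a mismatch scores 0, and scores are averaged over features), the function name rarity_weighted_similarity, and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def rarity_weighted_similarity(data):
    """Return an n x n list of pairwise similarities for nominal records.

    Simplified illustration (an assumption, not the paper's measure):
    a match on a value with relative frequency p scores 1 - p**2, so
    matches on rare values score higher; a mismatch scores 0. The
    per-feature scores are averaged.
    """
    n = len(data)
    n_features = len(data[0])
    # Relative frequency of each observed value, computed per feature.
    freqs = [
        {v: c / n for v, c in Counter(row[k] for row in data).items()}
        for k in range(n_features)
    ]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n_features):
                if data[i][k] == data[j][k]:
                    total += 1.0 - freqs[k][data[i][k]] ** 2
            sim[i][j] = total / n_features
    return sim


# Hypothetical records: (colour, shape). "red" and "round" are common,
# "green" and "star" are rare.
records = [
    ("red", "round"),
    ("red", "round"),
    ("red", "square"),
    ("blue", "round"),
    ("blue", "round"),
    ("green", "star"),
    ("green", "star"),
    ("red", "round"),
]
sim = rarity_weighted_similarity(records)
```

Under these assumptions, the two ("green", "star") records come out more similar to each other than two ("red", "round") records do, because their shared values are rarer in the data set. A similarity matrix of this kind is what an agglomerative procedure would consume when building a dendrogram.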




World Scientific Publishers, Singapore, pages 35-48, February 1997