
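An initial centers selection method serving K-means

Application of Electronic Technique, 2023, No. 3

Li Qiuyun1, Liu Yanwu2

(1. Beijing Institute of Astronautical Systems Engineering, China Academy of Launch Vehicle Technology, Beijing 100076, China; 2. China Electronics Corporation, Shenzhen, Guangdong 518000, China)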

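Abstract: Clustering is one of the most important techniques in data mining, and K-means is the most frequently used and most influential clustering algorithm. However, the performance of K-means depends heavily on the initial centers: both how many initial centers are selected and which data points are chosen as initial centers matter greatly. To address this, an initial center selection method called DPCC (Density Peak Clustering Centers) is proposed. DPCC builds a selection decision graph from density and distance that makes all density peak points in the dataset stand out. These density peak points are the initial centers that DPCC supplies to K-means. Experiments show that DPCC not only supplies the number of initial centers for K-means, but also effectively improves the accuracy of K-means and reduces its running time.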

Keywords: clustering; initial centers; decision graph

CLC number: TP3-0 Document code: A DOI: 10.16157/j.issn.0258-7998.223066
Chinese citation format: 李秋云，刘燕武. 一种服务于K-means的初始中心选取方法[J]. 电子技术应用，2023，49(3)：134-138.
English citation format: Li Qiuyun, Liu Yanwu. An initial centers selection method serving K-means[J]. Application of Electronic Technique, 2023, 49(3): 134-138.

An initial centers selection method serving K-means

Li Qiuyun1, Liu Yanwu2

(1. Beijing Institute of Astronautical Systems Engineering, China Academy of Launch Vehicle Technology, Beijing 100076, China; 2. China Electronics Corporation, Shenzhen 518000, China)

Abstract: Clustering is one of the most important data mining technologies, and K-means is the most famous and commonly used clustering algorithm. However, the performance of K-means depends heavily on the initial centers: both the number of initial centers and the choice of data points that serve as initial centers are very important. Therefore, an initial centers selection method called DPCC (density peak clustering centers) is proposed. DPCC generates a selection decision graph based on density and distance, so as to highlight all density peak points in the dataset. These density peak points are the initial centers provided by DPCC for K-means. Experiments show that DPCC not only provides decision support for the number of initial centers, but also improves the accuracy of K-means and reduces its running time.

Key words: clustering; initial centers; decision graph

0 Introduction

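Clustering is an unsupervised analysis method whose goal is to identify all the clusters in a dataset and to treat the data points in each cluster as one class. Among the many clustering algorithms, K-means[1] is one of the most frequently used and most influential. K-means selects k data points from the dataset as initial cluster centers, assigns every other data point to the nearest of these k initial centers to obtain the initial clusters, and then takes the center of each initial cluster as the new cluster center. This process is repeated until the cluster centers no longer change. The principle of K-means is relatively simple, which is one reason for its wide popularity. However, the algorithm also has obvious shortcomings: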

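(1) The value of k must be specified before analysis. In K-means, k is the number of clusters: if k is set to 10, K-means will identify 10 clusters. But clustering is an unsupervised analysis task, and the number of clusters in a dataset cannot be known before clustering. Clearly, the mechanism of K-means conflicts with the original intent of clustering. In real analysis scenarios, k is often larger or smaller than the true number of clusters, which degrades clustering accuracy.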

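(2) Initial centers tend to clump together. K-means picks k data points at random as the initial cluster centers, so several centers can easily fall inside the same cluster, causing that cluster to be split into multiple classes.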

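(3) The number of iterations cannot be controlled. K-means must iterate until the cluster centers no longer change. In general, the cluster centers eventually move into dense regions. In other words, the farther the initial centers are from the density cores, the more iterations K-means needs and the longer it runs. Because the initial centers are selected at random, the running time of K-means cannot be controlled.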
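The K-means procedure described above, and its dependence on the initial centers, can be illustrated with a minimal sketch. This is a plain textbook K-means written in standard-library Python, not code from the paper; the function name `kmeans`, the `max_iter` cap, and the toy data are assumptions made for illustration. The update step uses the mean of each cluster's members as the new center.

```python
import math


def kmeans(points, init_centers, max_iter=100):
    """Plain K-means starting from the given initial centers.

    Returns (centers, labels, iterations_used).
    """
    centers = [tuple(c) for c in init_centers]
    labels = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:  # converged: centers no longer change
            return centers, labels, iteration
        centers = new_centers
    return centers, labels, max_iter


# Toy data (an assumption for illustration): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# One initial center per group versus both initial centers in the same group.
_, _, iters_spread = kmeans(points, [(0, 0), (10, 10)])
_, _, iters_clumped = kmeans(points, [(0, 0), (1, 1)])
print(iters_spread, iters_clumped)
```

On this toy data both runs recover the two groups, but the run whose initial centers start in the same group does not finish in fewer iterations than the run with one center per group, which is the effect described in shortcomings (2) and (3).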

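To address these problems, this paper proposes a method called DPCC (Density Peak Clustering Centers) that provides initial centers for K-means. DPCC is applied before K-means: it computes the density of each data point and its distance to the nearest data point of higher density, and from these two quantities generates a decision graph that makes all density peak points in the dataset stand out. These density peak points can then serve as the initial centers for K-means.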
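The two quantities behind the decision graph can be sketched as follows. This excerpt does not give DPCC's exact formulas, so the sketch rests on assumptions: density is taken as the number of neighbors within a hypothetical cutoff distance `d_c`, and a point counts as a density peak when its density and its distance to the nearest higher-density point exceed the hypothetical thresholds `rho_min` and `delta_min`. All function and parameter names are illustrative, not the authors'.

```python
import math


def decision_graph(points, d_c):
    """Return (rho, delta), the two coordinates of each point in the graph.

    rho[i]   : number of other points closer than d_c to point i (assumed density).
    delta[i] : distance from point i to the nearest point of higher density;
               for the densest point, its largest distance to any other point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Rank points by density; ties are broken by index so the order is strict.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])
    return rho, delta


def select_peaks(points, d_c, rho_min, delta_min):
    """Points that are both dense and far from any denser point."""
    rho, delta = decision_graph(points, d_c)
    return [p for p, r, d in zip(points, rho, delta)
            if r > rho_min and d > delta_min]


# Toy data (an assumption for illustration): two groups, each with a central point.
points = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0),
          (10, 10), (10, 11), (11, 10), (10, 9), (9, 10)]
peaks = select_peaks(points, d_c=1.5, rho_min=1, delta_min=5.0)
print(len(peaks), peaks)
```

Under these assumptions the number of selected peaks plays the role of k, and the peaks themselves could be passed to K-means as its initial centers in place of randomly chosen points.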

The full text of this article can be downloaded at: https://www.chinaaet.com/resource/share/2000005243
