An Efficient Distributed Clustering Algorithm Based on Peak Density

He Tong (何仝) 1,2; Xu Weihong (徐蔚鸿) 1,2; Ma Honghua (马红华) 2; Zeng Shuiling (曾水玲) 3

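Cite this article: He Tong, Xu Weihong, Ma Honghua, Zeng Shuiling. An efficient distributed clustering algorithm based on peak density [J]. Computing Technology and Automation (计算技术与自动化), 2019, (2): 64-71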

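Authors	Affiliations
He Tong (何仝) 1,2; Xu Weihong (徐蔚鸿) 1,2; Ma Honghua (马红华) 2; Zeng Shuiling (曾水玲) 3	(1. School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, Hunan 440114; 2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha, Hunan 440114; 3. Zixing Municipal Bureau of Science and Technology, Chenzhou, Hunan 423400)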

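Abstract (Chinese): Density peak clustering (DPC) is a recently proposed, efficient density-based clustering algorithm. It can cluster data with non-spherical distributions, has few parameters to tune, and clusters quickly. However, computing each data object's density and its distance to the nearest object of higher density requires pairwise distance measurements, with time complexity O(n^2). In the big-data era, and especially on massive high-dimensional data, this severely limits the algorithm's efficiency. To improve its efficiency and scalability, and drawing on Spark's strengths in in-memory and iterative computation, this paper proposes ELSDPC (an efficient distributed density peak clustering algorithm based on E2LSH partition with Spark), an efficient clustering algorithm based on E2LSH partitioning. The algorithm exploits the local nature of DPC and uses locality-sensitive hashing (LSH) to assign neighboring points to the same partition. Experiments show that the algorithm effectively improves the scalability and time efficiency of clustering while maintaining high accuracy.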
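
The two per-point quantities that DPC needs can be sketched as follows. This is a minimal single-machine illustration, not the paper's implementation: the function name `dpc_rho_delta`, the cutoff-kernel density, and the cutoff distance `d_c` are assumptions made for the example. It shows where the quadratic cost comes from, since every pair of points is compared.

```python
import math

def dpc_rho_delta(points, d_c):
    """Return (rho, delta) for each point, using a cutoff kernel.

    rho[i]   = number of other points closer than d_c to point i
    delta[i] = distance from i to the nearest point of higher density
               (for a point with no denser neighbour: its largest distance)
    """
    n = len(points)
    # All pairwise Euclidean distances: this is the O(n^2) step.
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Simplification: ties in density are not broken, so several
        # points may share the "no denser neighbour" case.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical toy data: a tight group of three points and a distant pair.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = dpc_rho_delta(points, d_c=1.5)
```

Points with both a large `rho` and a large `delta` are the candidates for cluster centers; the remaining points would then follow their nearest denser neighbour.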

Keywords (Chinese): clustering; density peak; big data; locality-sensitive hashing; Spark

An Efficient Distributed Clustering Algorithm Based on Peak Density

Abstract: Density peak clustering (DPC) is a recently proposed, efficient density-based clustering algorithm. It can cluster non-spherically distributed data, requires few parameters to tune, and clusters quickly. However, computing the density of each data object and its distance to the nearest object of higher density requires pairwise distance measurements, with time complexity O(n^2). On big data, and especially on high-dimensional data, the efficiency of the algorithm suffers greatly. To improve its efficiency and scalability, we take advantage of Spark's in-memory and iterative computation and propose ELSDPC, an efficient clustering algorithm based on E2LSH partitioning. Exploiting the local nature of DPC, it uses locality-sensitive hashing (LSH) to place neighboring points in the same partition. Experiments show that the algorithm effectively improves the scalability and time efficiency of clustering while maintaining high accuracy.
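
The partitioning idea can be illustrated with the standard E2LSH hash family, h(v) = floor((a · v + b) / w), where a is drawn from a Gaussian distribution and b uniformly from [0, w). The sketch below is an assumption-laden, single-machine illustration: the abstract does not give the paper's parameters or its Spark pipeline, so `make_e2lsh`, `partition`, `k`, `w`, and `seed` are hypothetical names chosen for the example.

```python
import math
import random
from collections import defaultdict

def make_e2lsh(dim, k, w, seed=0):
    """Build a signature function from k hashes h(v) = floor((a . v + b) / w)."""
    rng = random.Random(seed)
    # a: Gaussian (2-stable) projection vector; b: random offset in [0, w).
    params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
              for _ in range(k)]

    def signature(v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in params
        )
    return signature

def partition(points, signature):
    """Group point indices by LSH signature; nearby points tend to collide."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[signature(p)].append(idx)
    return dict(buckets)

# Hypothetical toy data; bucket contents depend on the random projections.
pts = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0), (100.0, -100.0)]
sig = make_e2lsh(dim=2, k=3, w=4.0, seed=1)
parts = partition(pts, sig)
```

In a distributed setting, the signature could serve as the key under which points are grouped, so that density and distance computations run per bucket instead of over all pairs. Close points can still land in different buckets, which is one reason such schemes trade some accuracy for speed.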

Keywords: clustering; density peak; big data; LSH; Spark
