An efficient distributed clustering algorithm based on density peaks

Submitted: 2018-05-27  Revised: 2018-08-14

Author	Affiliation	Postal code
He Tong (何仝)^*	Changsha University of Science and Technology	410114

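Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program); Hunan Provincial Science and Technology Service Platform Special Project (No. 2012TP1001); Key Project of the Education Department of Hunan Province (No. 17A007); Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation (2015TP1005); Changsha Science and Technology Program (No. KQ1703018, No. KQ1706064)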

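Chinese abstract (translated): Density peak clustering (DPC) is a recently proposed, efficient density-based clustering algorithm. It can cluster data with non-spherical distributions, has few parameters to tune, and runs quickly. However, computing each data object's density and its distance to the nearest object of higher density requires pairwise distance measurements, so the time complexity is O(n^2) for n data objects. In the big data era, and especially when processing massive high-dimensional data, this cost severely limits the algorithm's efficiency. To improve its efficiency and scalability, we exploit Spark's strengths in in-memory and iterative computation and propose ELSDPC (an efficient distributed density peak clustering algorithm based on E2LSH partition with Spark), an efficient clustering algorithm based on E2LSH partitioning. The algorithm exploits the local nature of DPC and uses LSH to assign neighboring points to the same partition. Experimental analysis shows that the algorithm effectively improves the scalability and time efficiency of clustering while maintaining relatively high accuracy.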

Chinese keywords (translated): clustering; DPC; big data; LSH; Spark
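The two per-point quantities the abstract refers to, the local density and the distance to the nearest point of higher density, can be illustrated with a minimal single-machine sketch. It assumes a cutoff kernel with a hypothetical cutoff distance `d_c` and Euclidean distance; the function name `dpc_quantities` and the sample points are illustrative and are not taken from the paper.

```python
import math


def dpc_quantities(points, d_c):
    """Return (rho, delta) for each point, using a cutoff kernel (assumed)."""
    n = len(points)
    # All pairwise Euclidean distances: this is the O(n^2) step.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points closer than the cutoff distance d_c.
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        # Distances from point i to every point with strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Points with no denser neighbor (including ties for the maximum)
        # get their largest distance to any point, by convention.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Illustrative data: a tight group of three points and a distant pair.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
rho, delta = dpc_quantities(points, d_c=1.5)
```

Points that combine a large `rho` with a large `delta` are the candidate cluster centers. The nested distance table is what makes the cost quadratic in the number of points, which is the bottleneck the abstract describes.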

An efficient distributed clustering algorithm based on density peaks

Abstract: The density peak clustering (DPC) algorithm is a recently proposed, efficient density-based clustering algorithm. It can cluster data with non-spherical distributions, needs few parameters to be tuned, and clusters quickly. However, calculating the density of each data object and its distance to the nearest object of higher density requires pairwise distance measurements, giving a time complexity of O(n^2) for n data objects. When dealing with big data, especially high-dimensional data, the efficiency of the algorithm is greatly affected. To improve the efficiency and scalability of the algorithm, we take advantage of Spark's in-memory and iterative computing and propose ELSDPC, an efficient clustering algorithm based on E2LSH partitioning. Exploiting the local characteristics of the DPC algorithm, ELSDPC uses LSH to assign neighboring points to the same partition. Experimental analysis shows that the algorithm effectively improves the scalability and time efficiency of clustering while maintaining relatively high accuracy.

Keywords: clustering; DPC; big data; LSH; Spark
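The idea of using LSH to place neighboring points in the same partition can be sketched with the standard E2LSH hash family, h(v) = floor((a·v + b) / w), where a is a Gaussian random vector and b is a random offset. The sketch below is a single-machine illustration under stated assumptions, not the ELSDPC implementation: the helper names `make_e2lsh` and `partition`, the number of hash functions `k`, and the bucket width `w` are hypothetical, and the abstract does not say how ELSDPC sets these or distributes the buckets on Spark.

```python
import math
import random
from collections import defaultdict


def make_e2lsh(dim, k, w, seed=0):
    """Build a signature function from k E2LSH hashes (illustrative helper)."""
    rng = random.Random(seed)
    # Each hash uses a Gaussian (2-stable) vector a and an offset b in [0, w].
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(k)
    ]

    def signature(v):
        # h(v) = floor((a . v + b) / w) for each of the k hash functions.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in params
        )

    return signature


def partition(points, signature):
    """Group point indices by signature; each bucket is one candidate partition."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[signature(p)].append(idx)
    return dict(buckets)
```

Nearby points are likely, though not guaranteed, to share a signature, so density and distance computations can be restricted to points within a bucket. A larger `w` or a smaller `k` yields fewer, larger buckets; the right trade-off depends on the data and is not specified here.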
