Research on Improvement of the TF-IDF Algorithm in Text Classification
Submitted: 2019-10-06  Revised: 2019-10-22
Author: 吴宗卓 (Wu Zongzhuo)*
Affiliation: 陕西国防工业职业技术学院, Shaanxi, postal code 710300
Abstract: A major issue in text categorization is how to improve classification accuracy. To this end, this paper proposes a new weighting method, TF-IDF-IF, based on TF-IDF. The method introduces a new parameter representing in-class characteristics, which is used to calculate the frequency of a term across the documents within a class. The experiments use the chi-square (CHI) statistical feature selection method to select 1000 features from the data set, and then apply the TF-IDF, TF-IDF-CF, LTC, and TFC weighting methods with several commonly used classifiers: Naïve Bayes, Bayesian networks, KNN, and SVM. The experimental results show that this method can achieve good results.
Keywords: text categorization  feature selection  chi-square statistic  TFIDF  classification accuracy
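The abstract names two ingredients without giving formulas: chi-square feature scoring and a TF-IDF weight extended with an in-class term. The following is a minimal sketch of both, not the paper's implementation. The chi-square score uses the standard 2x2 contingency form. The in-class factor (the share of documents in a class that contain the term) and the logarithmic TF and IDF forms are assumptions made for illustration, and the function names `chi_square` and `tfidf_class_weights` are hypothetical.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def tfidf_class_weights(docs, labels):
    """Weight each term by TF * IDF * (assumed) in-class document share.

    docs: list of token lists; labels: one class label per document.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter()          # number of documents containing each term
    class_df = Counter()    # (label, term) -> documents of that class containing term
    class_size = Counter(labels)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            class_df[(label, term)] += 1

    weighted = []
    for tokens, label in zip(docs, labels):
        weights = {}
        for term, count in Counter(tokens).items():
            tf = math.log(count + 1.0)                 # damped term frequency
            idf = math.log((n_docs + 1.0) / df[term])  # smoothed inverse document frequency
            in_class = class_df[(label, term)] / class_size[label]  # assumed in-class factor
            weights[term] = tf * idf * in_class
        weighted.append(weights)
    return weighted
```

In a pipeline like the one the abstract describes, `chi_square` would be used to rank terms per class and keep the top-scoring ones (1000 in the experiments), after which the retained terms would be weighted and passed to a classifier. Under the assumed in-class factor, a term that appears in most documents of a class is weighted up relative to one that appears in only a few of them.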