Research on Improvement of TF-IDF Algorithm in Text Classification

Wu Zongzhuo (吴宗卓)

Submitted: 2019-10-06    Revised: 2019-10-22

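Author	Affiliation	Postal code
Wu Zongzhuo^*	Shaanxi National Defense Industry Vocational and Technical College (陕西国防工业职业技术学院), Shaanxi 710300	710300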

Abstract: A central problem in text classification is how to improve classification accuracy. To this end, this paper proposes TF-IDF-IF, a new term-weighting method based on TF-IDF. The method introduces a new parameter that represents in-class characteristics by computing the frequency of a term across the documents of a class. In the experiments, the CHI (chi-square) statistic is used as the feature selection method to select 1000 features from the data set, and the TF-IDF, TF-IDF-CF, LTC, and TFC weighting methods are then each evaluated with several commonly used classifiers: naïve Bayes, Bayesian networks, KNN, and SVM. The experimental results show that the proposed method achieves good results.

Keywords: text classification; feature selection; CHI-square statistic; TF-IDF; classification accuracy
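
The pipeline outlined in the abstract (CHI feature selection followed by term weighting) can be illustrated with a minimal Python sketch. It shows only the standard TF-IDF baseline and the usual 2x2 chi-square statistic; the abstract does not give the formula for the proposed in-class parameter, so that part is not reproduced here. The function names, the toy corpus, and the choice of k = 3 (instead of the 1000 features used in the paper) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def chi_square(docs, labels, term, cls):
    """CHI statistic of `term` for class `cls` from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label == cls:
            a += 1          # term present, document in class
        elif present:
            b += 1          # term present, document not in class
        elif label == cls:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k):
    """Keep the k terms with the highest maximum CHI score over all classes."""
    vocab = {term for doc in docs for term in doc}
    classes = set(labels)
    score = {
        term: max(chi_square(docs, labels, term, cls) for cls in classes)
        for term in vocab
    }
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda term: (-score[term], term))[:k]


# Hypothetical toy corpus: two "sport" and two "finance" documents.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "price"],
    ["market", "stock", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

selected = select_features(docs, labels, 3)
weights = tf_idf(docs)
```

On this toy corpus the terms that occur in only one class ("goal", "market", "stock") receive the highest CHI scores, while "team", which appears in both classes, is ranked lower. A class-aware weighting scheme such as the one the abstract proposes would then adjust the TF-IDF weights of the selected terms using per-class statistics.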
