Research on an Automatic Proofreading Method for Chinese Real-Word Errors in the Oilfield Domain

王辉; Marius. Petrescu; 潘俊辉; 王浩畅; 张强; 张岩

Submitted: 2020-02-03 Revised: 2020-03-04

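Author	Affiliation	Postal code
王辉^*	School of Computer and Information Technology, Northeast Petroleum University	163318
Marius. Petrescu	Petroleum-Gas University of Ploiești
潘俊辉	School of Computer and Information Technology, Northeast Petroleum University
王浩畅	School of Computer and Information Technology, Northeast Petroleum University
张强	School of Computer and Information Technology, Northeast Petroleum University
张岩	School of Computer and Information Technology, Northeast Petroleum University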

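Funding: National Natural Science Foundation of China (61402099, 61702093); Heilongjiang Province Education Science "13th Five-Year Plan" Project (GJB1319025); Northeast Petroleum University Youth Science Foundation (2018QNL-49); Northeast Petroleum University key construction course "Data Mining"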

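Abstract (translated from the Chinese): Automatic proofreading of Chinese real-word errors is an important basic research topic in natural language understanding. During oilfield digitization, Chinese real-word errors introduced by image recognition and manual entry directly affect the accuracy of later comprehensive data analysis. This paper analyzes the causes of Chinese real-word errors and statistical language models, and proposes an automatic proofreading method for Chinese real-word errors in the oilfield domain. The method first builds confusion sets for the general domain and the oilfield domain, then introduces synonym sets to enrich the knowledge base. After the corpus is word-segmented, the method statistically analyzes the relationships between the target word and both its confusable words and the synonyms of the surrounding words, and automatically corrects real-word errors. Experiments show that the proposed method can effectively proofread Chinese real-word errors in the oilfield domain.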

Keywords (translated from the Chinese): real-word error; N-gram; automatic text proofreading; knowledge base construction

Research on Chinese Real-word Error Automatic Proofreading For Oilfield

Abstract: Automatic proofreading of Chinese real-word errors is an important basic research issue in NLP. In the process of oilfield digitization, Chinese real-word errors generated by image recognition and manual input directly affect the accuracy of later comprehensive data analysis. This paper analyzes the causes of Chinese real-word errors and statistical language models, and proposes an automatic proofreading method for Chinese real-word errors in the oilfield domain. First, confusion sets for the general domain and the oilfield domain are constructed; then the knowledge base is enriched by adding synonym sets. After word segmentation, the relationships between the target word and both its confusable words and the synonyms of the surrounding words are analyzed statistically, and real-word errors are corrected automatically. Experimental results show that the proposed method can effectively proofread Chinese real-word errors in the oilfield domain.

Keywords: real-word error; N-gram; automatic text proofreading; knowledge base construction
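
The pipeline outlined in the abstract (confusion sets, synonym-enriched knowledge base, N-gram statistics over a segmented corpus) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the bigram scoring rule, the `margin` threshold, the function names, and the toy corpus, confusion set, and synonym set are all assumptions made for the example, and input sentences are assumed to be word-segmented already.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs in a word-segmented corpus (list of token lists)."""
    counts = Counter()
    for sent in corpus:
        padded = ["<s>"] + list(sent) + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

def expand(word, synonyms):
    """Return the word together with its synonyms (hypothetical synonym table)."""
    return {word} | set(synonyms.get(word, ()))

def context_score(left, cand, right, counts, synonyms):
    """Sum bigram counts of cand with its neighbors and the neighbors' synonyms."""
    score = sum(counts[(l, cand)] for l in expand(left, synonyms))
    score += sum(counts[(cand, r)] for r in expand(right, synonyms))
    return score

def proofread(sent, confusion, synonyms, counts, margin=1):
    """Replace a word by a confusable word whose context score is higher by >= margin."""
    padded = ["<s>"] + list(sent) + ["</s>"]
    out = list(sent)
    for i in range(1, len(padded) - 1):
        target, left, right = padded[i], padded[i - 1], padded[i + 1]
        base = context_score(left, target, right, counts, synonyms)
        best, best_score = target, base
        for cand in sorted(confusion.get(target, ())):
            s = context_score(left, cand, right, counts, synonyms)
            if s - base >= margin and s > best_score:
                best, best_score = cand, s
        out[i - 1] = best
    return out

# Toy data, invented for illustration (assumption, not from the paper).
corpus = [["提高", "采油", "效率"], ["菜油", "价格", "上涨"]]
confusion = {"菜油": {"采油"}, "采油": {"菜油"}}  # homophones: oil production / rapeseed oil
synonyms = {"提升": {"提高"}}                     # "raise" ~ "improve"
counts = bigram_counts(corpus)

print(proofread(["提升", "菜油", "效率"], confusion, synonyms, counts))
# ['提升', '采油', '效率']
```

In this toy run, "提升 采油" never occurs in the corpus, but the synonym "提高" lets the candidate "采油" pick up context evidence, which is the role the abstract assigns to the synonym sets. A real system would need smoothing and a tuned decision threshold, neither of which is specified here.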
