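An Approach for Text Similarity Measurement of Science and Technology Projects Combining TF-IDF and Simhash - AET - Application of Electronic Technique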


Application of Electronic Technique

Sun Beining1,2, Lv Weixin3, Zeng Jun4, Xiao Heng4

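(1. Department of Science Technology and Data, Yunnan Power Grid Co., Ltd., Kunming 650011, Yunnan, China; 2. School of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, Yunnan, China; 3. Kunming Power Supply Bureau, Yunnan Power Grid Co., Ltd., Kunming 650011, Yunnan, China; 4. Yunnan Yundian Tongfang Technology Co., Ltd., Kunming 650214, Yunnan, China)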

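Abstract: To improve the accuracy and performance of text similarity measurement for science and technology projects, this paper combines TF-IDF and Simhash and proposes a new text similarity measurement method for such projects. First, the method preprocesses the project text to obtain a set of terms, uses TF-IDF to compute a weight for each term in the set, and selects the important terms with the higher weights. Second, it uses Simhash to map the important terms to fixed-length binary strings and sums them to obtain the Simhash signature of the text. Finally, it uses the Hamming distance to compute the similarity between two Simhash signatures. Experimental results show that the proposed method outperforms the traditional Simhash algorithm and the TF-IDF method in precision, recall, and F-measure.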

Keywords: science and technology project text; text similarity; TF-IDF; Simhash algorithm

CLC number: TP311
Document code: A
DOI: 10.16157/j.issn.0258-7998.223379
Chinese citation format: 孙北宁，吕维新，曾俊，等. 一种结合TF-IDF和Simhash的科技项目文本相似性度量方法[J]. 电子技术应用，2023，49(6)：89-93.
English citation format: Sun Beining, Lv Weixin, Zeng Jun, et al. An approach for text similarity measurement of science and technology projects combining TF-IDF and Simhash[J]. Application of Electronic Technique, 2023, 49(6): 89-93.

An approach for text similarity measurement of science and technology projects combining TF-IDF and Simhash

Sun Beining1,2, Lv Weixin3, Zeng Jun4, Xiao Heng4

(1. Department of Science Technology and Data, Yunnan Power Grid Co., Ltd., Kunming 650011, China; 2. School of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China; 3. Kunming Power Supply Bureau, Yunnan Power Grid Co., Ltd., Kunming 650011, China; 4. Yunnan Yundian Tongfang Technology Co., Ltd., Kunming 650214, China)

Abstract: To improve the accuracy and performance of text similarity measurement for science and technology projects, this paper proposes a new approach that combines TF-IDF and Simhash. First, the method uses natural language processing techniques to preprocess science and technology project texts into a term set, computes the TF-IDF value of each term in the set, and selects the important terms with the higher TF-IDF values. Second, the method uses the Simhash algorithm to map the selected important terms to fixed-length binary strings and obtains the Simhash signature of the text from them. Finally, the Hamming distance is used to calculate the similarity between two Simhash signatures. Experimental results show that, compared with the traditional Simhash and TF-IDF methods, the proposed method improves precision, recall, and F-measure.

Key words: science and technology project text; text similarity; TF-IDF; Simhash

0 Introduction

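With the state's heavy investment in science and technology funding, a small number of research institutions or individuals have submitted duplicate project applications in order to obtain more research funds. Text similarity measurement is regarded as one of the best methods for detecting duplicate text and can be used to automatically detect similarity and duplication among science and technology project texts.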

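TF-IDF is a classical text similarity measurement method. It treats a text as a set of terms, represents the text as a vector using term frequency information, and computes text similarity from these vectors. However, this method does not reduce the dimensionality of the text model. Because science and technology project texts contain a very large number of terms, a representation based on the term frequency vector model is high-dimensional and sparse, which leads to poor computational efficiency.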
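The vector representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tfidf_vectors` and `cosine`, the unsmoothed weighting tf × log(N/df), and the choice of cosine similarity to compare vectors are all assumptions made for illustration, since the text does not specify them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document.

    docs is a list of token lists (already preprocessed).
    Assumed weighting: relative term frequency * log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed comparison measure)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, for illustration only.
docs = [
    ["power", "grid", "fault", "detection"],
    ["power", "grid", "load", "forecast"],
    ["forest", "fire", "detection"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
```

Each vector has one potential dimension per distinct term in the corpus, which is why the representation becomes high-dimensional and sparse as the vocabulary grows.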

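Simhash is a locality-sensitive hashing method. It reduces high-dimensional data to a fixed-length binary string (the Simhash signature) and then compares texts by computing the similarity between these binary strings. This approach has excellent computational performance in high-dimensional data spaces. However, it does not take into account the importance of terms in science and technology project texts, and therefore suffers from low accuracy.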
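The signature construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `simhash` and `hamming_distance`, the use of MD5 as the per-term hash, and the 64-bit signature length are all choices made here. By default every term occurrence contributes the same weight, which matches the limitation noted above; the optional `weights` argument is a hypothetical hook showing where per-term importance could enter.

```python
import hashlib

def simhash(tokens, weights=None, bits=64):
    """Compute a Simhash signature as an integer of `bits` bits.

    bits must be a multiple of 8 and at most 128 (the MD5 digest size).
    weights is an optional {term: weight} dict; missing terms default to 1.0.
    """
    counts = [0.0] * bits
    for token in tokens:
        w = 1.0 if weights is None else weights.get(token, 1.0)
        # Hash the term to a fixed-length bit string (MD5 is an assumed choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Add the weight where the hash bit is 1, subtract it where it is 0.
        for i in range(bits):
            if (h >> i) & 1:
                counts[i] += w
            else:
                counts[i] -= w
    # Each signature bit is 1 where the accumulated sum is positive.
    signature = 0
    for i in range(bits):
        if counts[i] > 0:
            signature |= 1 << i
    return signature

def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

# Hypothetical token lists, for illustration only.
sig_a = simhash(["power", "grid", "fault", "detection"])
sig_b = simhash(["power", "grid", "fault", "diagnosis"])
print(hamming_distance(sig_a, sig_b))
```

Comparing two documents then costs a single XOR and bit count on fixed-length integers, regardless of vocabulary size.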

The full text of this article can be downloaded at: https://www.chinaaet.com/resource/share/2000005355


