An Adaptive Method for Extracting Structured Information from Web Pages

Application of Electronic Technique, 2020, No. 12

Huai Xiaoyong, Han Xiaodong, Gao Ruochen, Gao Huanxin

National Computer System Engineering Research Institute of China, Beijing 100083, China

Abstract: Internet information collection and mining applications face two problems: traditional whole-page collection of website information yields mixed content that cannot be used directly, while manual structured collection is costly and inefficient. To address them, this paper proposes an adaptive method for extracting structured information from web pages and implements a web page classification algorithm together with subtree-based algorithms for extracting structured title items and content items. The classification model is trained on a dataset of classified, annotated pages from typical websites, so it can adapt to the differences between websites. Pages are first classified, and the structured list-item information or content-item information is then extracted according to the page class. The technique plays an important role in raising the level of automation and the processing efficiency of structured website information collection.

Key words: information extraction; structured information; classification model; adaptive

CLC number: TN919.5; TP391.1
Document code: A
DOI: 10.16157/j.issn.0258-7998.200160
Citation format (Chinese): 淮晓永，韩晓东，高若辰，等. 一种自适应网页结构化信息提取方法[J].电子技术应用，2020，46(12)：97-102.
Citation format (English): Huai Xiaoyong, Han Xiaodong, Gao Ruochen, et al. An adaptive method for extracting structured information from web pages[J]. Application of Electronic Technique, 2020, 46(12): 97-102.

0 Introduction

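In the era of Internet big data, online information is growing explosively, and much of it is valuable and needs to be processed and put to use. Intelligent big-data mining makes it possible to analyze and track the direction of technological development and to quickly discover high-value scientific and technical information.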

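Automatically collecting newly published information from the Internet websites of interest, and extracting the structured information it contains, is the foundation for building an Internet big-data system. A web crawler can fetch large volumes of page data from all kinds of websites, but traditional whole-page collection produces mixed content that cannot be fed directly into big-data mining, while manually extracting structured text from web pages is costly and inefficient. Automated extraction of structured information from web pages is therefore a key preprocessing technique for Internet big-data mining.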

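To address these problems (whole-page collection yields mixed content that cannot be used directly, and manual structured collection is costly and inefficient), this paper develops a DOM-tree-based method for extracting structured information from web pages (DOM based Web-page Structured Information Extraction, DWSIE) and implements it as a web page structured information extraction service toolkit. The toolkit greatly improves the level of automation and the processing efficiency of structured website information collection.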
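The DWSIE algorithms themselves are described only in the full text, so the following is a minimal sketch of the general idea (classify a page, then extract from the most relevant DOM subtree), not the authors' implementation. It uses only the Python standard library. All names here (Node, TreeBuilder, classify_page, extract_list_items, extract_content) and the 0.5 threshold are hypothetical, and classify_page is a simple link-text-ratio heuristic standing in for the trained classification model the paper describes.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A minimal DOM node: tag name, attributes, children, and direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def walk(self):
        """Yield this node and all descendants in pre-order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def full_text(self):
        return " ".join(n.text.strip() for n in self.walk() if n.text.strip())


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; assumes reasonably well-formed markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in ("script", "style"):
            self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def classify_page(root, threshold=0.5):
    """Heuristic stand-in for a trained classifier: 'list' or 'content'."""
    total = len(root.full_text())
    linked = sum(len(n.full_text()) for n in root.walk() if n.tag == "a")
    ratio = linked / total if total else 0.0
    return "list" if ratio > threshold else "content"


def _first_link(node):
    return next((d for d in node.walk() if d.tag == "a" and "href" in d.attrs), None)


def extract_list_items(root):
    """Pick the subtree with the most link-bearing children; return its items."""
    def linked_children(node):
        return [c for c in node.children if _first_link(c) is not None]
    best = max(root.walk(), key=lambda n: len(linked_children(n)))
    links = [_first_link(c) for c in linked_children(best)]
    return [{"title": a.full_text(), "url": a.attrs["href"]} for a in links]


def extract_content(root):
    """Pick the subtree with the most text in its direct <p> children."""
    def paragraph_chars(node):
        return sum(len(c.full_text()) for c in node.children if c.tag == "p")
    best = max(root.walk(), key=paragraph_chars)
    heading = next((n for n in root.walk() if n.tag in ("h1", "h2")), None)
    body = "\n".join(c.full_text() for c in best.children if c.tag == "p")
    return {"title": heading.full_text() if heading else "", "body": body}


def extract(html):
    """Classify the page, then run the extractor that matches its class."""
    root = parse(html)
    kind = classify_page(root)
    data = extract_list_items(root) if kind == "list" else extract_content(root)
    return kind, data
```

Calling extract(html) returns the page class together with either a list of title/URL items or a title/body record. A real system would replace the ratio rule with a model trained on annotated pages, as the abstract describes, and would need more robust subtree scoring to cope with navigation bars and advertisements.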

The full text of this paper can be downloaded at: http://www.chinaaet.com/resource/share/2000003263
