Framework for Building a High-Quality Web Page Collection Considering Page Group Structure

Abstract

We propose a framework for building a high-quality web page collection that takes page group structure into account. It consists of two steps: rough filtering and accurate classification. In both steps, we apply the idea of local page group structure, which is represented by the relation between a target page and its surrounding pages, based on connection types and relative URL hierarchy. In this paper, we use researchers' homepages as an example target category. For rough filtering, we propose a method for comprehensively gathering potential researchers' homepages from the web with as few noise pages as possible, using property-based keyword lists according to four page group models (PGMs) based on the page group structure. Experimental results show that the method keeps the increase in the number of gathered pages to an allowable level while gathering a significant number of positive pages that could not be gathered with a single-page-based method. For accurate classification, we propose a textual feature set for a support vector machine (SVM). The surrounding pages are grouped based on the page group structure, an independent feature subset is generated from each group, and the feature set is composed by concatenating the feature subsets. An experiment shows an evident improvement in classification performance. We then combine a recall-assured classifier and a precision-assured classifier, each obtained by tuning the SVM with the proposed feature set, to build a three-way classifier that accurately selects the pages requiring manual assessment to assure the required quality. Its effectiveness is shown by the reduction in the number of pages that need manual assessment.
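
The three-way classification step can be illustrated with a minimal sketch. The abstract does not state how the two classifiers are combined, so the decision rule below is an assumed reading: pages accepted by the precision-assured classifier are taken as positive, pages rejected by the recall-assured classifier are taken as negative, and the remainder go to manual assessment. The function name `three_way_classify`, the bucket labels, and the toy scores and thresholds are all hypothetical and stand in for the tuned SVMs.

```python
from typing import Callable, Dict, Iterable, List

Page = str  # hypothetical: a page is identified by a short key or URL


def three_way_classify(
    pages: Iterable[Page],
    recall_assured: Callable[[Page], bool],
    precision_assured: Callable[[Page], bool],
) -> Dict[str, List[Page]]:
    """Split pages into positive, negative, and manual-assessment buckets.

    Assumed rule (not taken from the paper):
      - accepted by the precision-assured classifier -> positive
      - rejected by the recall-assured classifier    -> negative
      - everything else                              -> manual assessment
    """
    buckets: Dict[str, List[Page]] = {"positive": [], "negative": [], "manual": []}
    for page in pages:
        if precision_assured(page):
            buckets["positive"].append(page)
        elif not recall_assured(page):
            buckets["negative"].append(page)
        else:
            buckets["manual"].append(page)
    return buckets


# Toy stand-ins for the two tuned classifiers: thresholds on a made-up score.
scores = {"a": 0.9, "b": 0.1, "c": 0.5}
buckets = three_way_classify(
    scores,
    recall_assured=lambda p: scores[p] >= 0.3,     # loose threshold, few misses
    precision_assured=lambda p: scores[p] >= 0.8,  # strict threshold, few false hits
)
```

Under this reading, only the pages on which the two classifiers disagree reach a human assessor, which is where the reduction in manual assessment would come from.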
Paper from the Information Processing Society of Japan
2006-11-16
