Automatic Extraction of Academic Research Information from Higher Education Institution Websites Using Anchor Texts and Link Structures

Abstract

This study is part of a broader effort to develop a system that classifies and searches information on the Web to support university faculty and students in their teaching, research, and learning. Using the link structure of Web pages, academic research information, including pages useful for education such as research descriptions and lecture notes, was extracted automatically from university Web pages. A new extraction technique was applied: first, pages that are linked by anchor text from HTML pages containing a distinctive word are collected; then, groups of pages linked from those collected pages are gathered as well. Specifically, laboratory websites were extracted automatically from the websites of the University of Tsukuba with exceptionally high recall and precision. The link-structure method proved effective in two situations: where pages contain no high-frequency terms, so that automatic extraction by natural language processing is difficult, and where page structure lacks regularity.
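The two-step collection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the distinctive word is matched against the anchor text, that a group of linked pages is limited to pages under the same directory as the collected page, and that pages are already available as an in-memory mapping from URL to HTML. The names `AnchorCollector`, `extract_links`, `collect_sites`, and the `example.edu` URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(base_url, html):
    """Return absolute (url, anchor text) pairs for every link in html."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [(urljoin(base_url, href), text) for href, text in parser.links]


def collect_sites(pages, keyword, max_depth=1):
    """Group pages into candidate sites using anchor text and links.

    pages maps URL -> HTML (assumed to be crawled beforehand).
    """
    # Step 1: collect pages pointed to by anchor text containing the keyword.
    seeds = set()
    for url, html in pages.items():
        for target, text in extract_links(url, html):
            if keyword in text and target in pages:
                seeds.add(target)

    # Step 2: from each collected page, follow links up to max_depth hops.
    # Assumption: only pages under the seed's own directory join its group.
    groups = {}
    for seed in sorted(seeds):
        prefix = seed.rsplit("/", 1)[0] + "/"
        group = {seed}
        frontier = [seed]
        for _ in range(max_depth):
            next_frontier = []
            for url in frontier:
                for target, _anchor in extract_links(url, pages[url]):
                    if (target in pages and target.startswith(prefix)
                            and target not in group):
                        group.add(target)
                        next_frontier.append(target)
            frontier = next_frontier
        groups[seed] = group
    return groups


# Hypothetical miniature university site used only for illustration.
BASE = "http://example.edu/dept/"
pages = {
    BASE + "index.html": '<a href="lab-a/index.html">Sato Laboratory</a> '
                         '<a href="news.html">News</a>',
    BASE + "news.html": "<p>Department news</p>",
    BASE + "lab-a/index.html": '<a href="research.html">Research</a> '
                               '<a href="../news.html">News</a>',
    BASE + "lab-a/research.html": '<a href="lecture.html">Lecture notes</a>',
    BASE + "lab-a/lecture.html": "<p>Lecture notes</p>",
}
groups = collect_sites(pages, "Laboratory", max_depth=2)
```

In this toy data, only the link labelled "Sato Laboratory" matches the distinctive word, so `groups` holds a single entry keyed by the laboratory's top page, containing that page plus its research and lecture pages. The department news page is excluded because it lies outside the laboratory's directory. No term frequencies or page layout are consulted, which is the property the abstract highlights.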