Experiments on Automatic Web Page Categorization for Information Retrieval System (Special Issue on Multimedia Network System)
Abstract
Our goal is to embed a keyword-based categorization technique into information retrieval systems for Web pages to facilitate end-users' search tasks. To that end, search results must be categorized quickly while keeping accuracy high. Typical keyword-based categorization systems use a knowledge base (KB) to assign categories. The KB contains keywords with weights for each category and is generated automatically from training texts. With this keyword-based approach, the algorithms for extracting keywords and assigning weights to them must be chosen carefully, because they strongly affect accuracy and processing speed. Furthermore, we must take two characteristics of Web pages into account: (1) text length varies from page to page, which makes it harder to use statistics to calculate keyword weights, and (2) a very large number of distinct words is used, which makes the KB larger and therefore slows processing. We propose five methods for normalizing the word frequency distribution to improve accuracy, and three methods for filtering unimportant words out of the KB to speed up processing. We performed experiments to compare these methods in terms of accuracy and KB size. The results show that the accuracy improvement obtained by combining our normalization and filtering methods is statistically significant. The results also show that KBs with various accuracy values and sizes can be generated, and that end-users can select an appropriate KB according to their preferences for accuracy and speed.
- Published by the Information Processing Society of Japan (IPSJ)
- 2001-02-15
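The keyword-based approach described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not specify its five normalization methods or three filtering methods, so the sketch assumes relative-frequency normalization (word count divided by text length) and a simple minimum-weight filter. The names `tokenize`, `build_kb`, `categorize`, and `min_weight` are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization of lowercased text.
    return text.lower().split()


def build_kb(training_texts, min_weight=0.0):
    """Build {category: {keyword: weight}} from (category, text) pairs.

    Assumed normalization: each word count is divided by the text length,
    so long and short texts contribute on the same scale.
    Assumed filtering: keywords whose average weight is below min_weight
    are dropped, which shrinks the KB.
    """
    sums = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in training_texts:
        words = tokenize(text)
        if not words:
            continue
        doc_counts[category] += 1
        for word, count in Counter(words).items():
            sums[category][word] += count / len(words)

    kb = {}
    for category, totals in sums.items():
        n = doc_counts[category]
        kb[category] = {
            word: total / n
            for word, total in totals.items()
            if total / n >= min_weight
        }
    return kb


def categorize(kb, text):
    """Return the category with the highest summed keyword weight, or None."""
    words = tokenize(text)
    if not words or not kb:
        return None
    freqs = Counter(words)
    scores = {
        category: sum(
            weights.get(word, 0.0) * count / len(words)
            for word, count in freqs.items()
        )
        for category, weights in kb.items()
    }
    return max(scores, key=scores.get)


training = [
    ("sports", "football match goal goal"),
    ("tech", "software code compiler code"),
]
kb = build_kb(training)
kb_small = build_kb(training, min_weight=0.3)
print(categorize(kb, "goal football"))
print(sum(len(weights) for weights in kb_small.values()))
```

Raising `min_weight` trades accuracy for a smaller KB, which mirrors the accuracy-versus-speed choice the abstract leaves to end-users.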
Authors
- Tsuji Hiroshi, Systems Development Laboratory, Hitachi, Ltd.
- Mase Hisao, Systems Development Laboratory, Hitachi, Ltd.
Related papers
- Distribution of a Major Lysosomal Membrane Glycoprotein, LGP85/LIMP II, in Rat Tissues
- Identification and Characterization of a Major Lysosomal Membrane Glycoprotein, LGP85/LIMP II, in Mouse Liver
- ELISE: Office Procedures Automation Tool By State-Transition Model
- Chemiluminescence Emission of C_2, CH and OH Radicals from Opposed Jet Burner Flames
- Structure of Opposed Jet Flame and NO_x Formation
- Characteristics of Mixture Turbulence and Structure of Opposed Jet Burner Flames
- Expert System for Transferring Programming Knowhow from Skilled to Unskilled Programmers