Clustering Approaches to Text Categorization (Learning & Discovery) (<Special Issue> Doctorial Theses on Artificial Intelligence)
Abstract
The aim of this thesis is to improve the accuracy of text categorization, which is the basis for various applications such as e-mail classification and Web-page classification. Among the possible approaches to this aim, this thesis discusses two clustering approaches and an application of a new kernel (similarity function). Although clustering is usually regarded as an unsupervised learning method and categorization as a supervised one, we show that clustering can be used to improve the accuracy of text categorization. The first clustering approach is co-clustering of words and texts. A number of previous probabilistic approaches implicitly assume that texts in the same category have an identical distribution over words. We show empirically that this assumption is not accurate, and propose a new framework based on a co-clustering technique to alleviate the problem. In this method, training texts are clustered so that the assumption is more likely to hold, and at the same time features are clustered to tackle the data sparseness problem. The co-clustering method improved the accuracy of text categorization. The second approach is constructive induction based on clustering. Here, Support Vector Machines (SVMs) are combined with constructive induction using dimension reduction methods such as Latent Semantic Indexing (LSI): new features derived by dimension reduction are added to the feature space. This method improved the categorization performance of SVMs, especially when extra unlabeled examples, in addition to the given labeled examples, are used in the dimension reduction step. Lastly, we discuss the use of a kernel function based on probabilistic models. The TOP kernel is a kernel that allows discriminative classifiers to be used on the basis of a probabilistic model. We first view clustering-based constructive induction from the standpoint of the theory of the TOP kernel. We then propose a new TOP kernel based on hyperplanes generated by SVMs.
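The constructive induction step described in the abstract (adding features derived by dimension reduction to the original feature space, optionally fitting the reduction on extra unlabeled texts) can be illustrated roughly as follows. This is a minimal sketch, not the thesis's implementation: the function name `lsi_augment`, the toy count matrices, and the choice `k=2` are hypothetical, and a plain SVD of the pooled term-count matrix stands in for LSI without any term weighting. The abstract does not specify those details, and SVM training itself is left out.

```python
import numpy as np

def lsi_augment(X_labeled, X_unlabeled=None, k=2):
    """Append k SVD-derived (LSI-style) features to each labeled text.

    X_labeled:   (n_labeled, n_terms) term-count matrix.
    X_unlabeled: optional (n_unlabeled, n_terms) matrix, used only to fit
                 the reduced basis (assumption: same vocabulary order).
    """
    X_labeled = np.asarray(X_labeled, dtype=float)
    if X_unlabeled is None:
        pool = X_labeled
    else:
        pool = np.vstack([X_labeled, np.asarray(X_unlabeled, dtype=float)])

    # Rows of Vt are orthonormal directions in term space, ordered by
    # decreasing singular value; keep the first k as the reduced basis.
    _, _, Vt = np.linalg.svd(pool, full_matrices=False)
    basis = Vt[:k].T                      # shape (n_terms, k)

    # Project the labeled texts and append the projections as new features.
    new_features = X_labeled @ basis      # shape (n_labeled, k)
    return np.hstack([X_labeled, new_features])

# Toy data (hypothetical): 2 labeled and 3 unlabeled texts over 4 terms.
X_lab = [[2, 1, 0, 0],
         [0, 0, 1, 2]]
X_unl = [[1, 2, 0, 0],
         [0, 0, 2, 1],
         [1, 1, 0, 0]]

augmented = lsi_augment(X_lab, X_unl, k=2)
print(augmented.shape)  # (2, 6): 4 original features + 2 derived ones
```

The augmented matrix would then be passed to an SVM trainer in place of the original features. Fitting the basis on the pooled labeled and unlabeled texts mirrors the abstract's remark that unlabeled examples help in the dimension reduction step.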
- Paper published by the Japanese Society for Artificial Intelligence (JSAI)
- 2004-01-01
Authors
Related papers
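- 2000-NL-138-4 Quantification of Syntactic Information and Its Use in Language Comparison
- NLC2000-13 Quantification of Syntactic Information and Its Use in Language Comparison
- Word-Sense Disambiguation by Feature Space Restructuring
- Word-Sense Disambiguation by Feature Space Restructuring
- E-9 Active Learning for Text Categorization Using SVMs and Clustering (Text Categorization, E. Natural Language and Documents)
- Clustering Approaches to Text Categorization (Learning & Discovery) (Doctorial Theses on Artificial Intelligence)
- Text Categorization Using SVMs and Constructive Induction
- A Method for Zero Pronoun Identification by Machine Learning
- Co-clustering for Text Categorization (Natural Language) (Collaboration Art and Network Entertainment)
- Applying Two-Dimensional Clustering to Text Categorization
- Text Categorization Using Independent Component Analysis: Feature Space Restructuring for SVMs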