Clustering Approaches to Text Categorization (Learning & Discovery) (<Special Issue> Doctoral Theses on Artificial Intelligence)

Abstract

The aim of this thesis is to improve the accuracy of text categorization, which is the basis for various applications such as e-mail classification and Web-page classification. Among the various possible approaches to this aim, two clustering approaches and an application of a new kernel (similarity function) are discussed in this thesis. Although clustering is usually regarded as an unsupervised learning method and categorization as a supervised one, we show that clustering can be used to improve the accuracy of text categorization. The first clustering approach proposed is co-clustering of words and texts. In a number of previous probabilistic approaches, texts in the same category are implicitly assumed to have an identical distribution over words. We empirically show that this assumption is not accurate, and propose a new framework based on a co-clustering technique to alleviate this problem. In this method, training texts are clustered so that the assumption is more likely to hold, and at the same time, features are clustered in order to tackle the data sparseness problem. We succeeded in improving the accuracy of text categorization using the co-clustering method. The second approach is constructive induction based on clustering. In this approach, Support Vector Machines (SVMs) are combined with constructive induction using dimension reduction methods, such as Latent Semantic Indexing (LSI). New features derived by dimension reduction are added to the feature space. Using this method, we succeeded in improving the categorization performance of SVMs in text categorization, especially when a number of extra unlabeled examples, in addition to the given labeled examples, are used in the dimension reduction step. Lastly, we discuss the use of a kernel function based on probabilistic models. The TOP kernel is a kernel which can be used with discriminative classifiers on the basis of a probabilistic model. We first view clustering-based constructive induction from the theory of the TOP kernel. We then propose a new TOP kernel which is based on hyperplanes generated by SVMs.
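
The constructive-induction step described in the abstract (deriving new features by dimension reduction and adding them to the original feature space) can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy term-count matrices, the function names `top_directions` and `augment`, and the use of plain power iteration in place of a full LSI/SVD routine are all assumptions made for illustration, and the SVM training that would consume the augmented features is omitted.

```python
import math
import random


def matvec(rows, v):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def top_directions(X, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of X.

    Uses power iteration on X^T X and keeps each new vector
    orthogonal to the directions already found (hypothetical helper).
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    Xt = [list(col) for col in zip(*X)]  # transpose of X
    found = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        for _ in range(iters):
            w = matvec(Xt, matvec(X, v))  # apply X^T X
            for d in found:  # remove components along earlier directions
                dot = sum(a * b for a, b in zip(w, d))
                w = [a - dot * b for a, b in zip(w, d)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-12:  # no remaining variance in this subspace
                v = [0.0] * n_features
                break
            v = [a / norm for a in w]
        found.append(v)
    return found


def augment(X, dirs):
    """Append each row's latent coordinates to its original features."""
    return [list(row) + matvec(dirs, row) for row in X]


# Toy document-by-term count matrices (assumed data, for illustration only).
labeled = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
]
unlabeled = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# The reduction is fitted on labeled and unlabeled texts together;
# only the labeled rows are augmented for later classifier training.
directions = top_directions(labeled + unlabeled, k=2)
augmented = augment(labeled, directions)
```

Each row of `augmented` keeps its four original term counts and gains two latent coordinates. In this sketch the unlabeled rows influence only the derived directions, which mirrors the abstract's remark that extra unlabeled examples are used in the dimension reduction step.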
Published by the Japanese Society for Artificial Intelligence
2004-01-01
