Using Topic Keyword Clusters for Automatic Document Clustering (Document Clustering, Special Section: Document Image Understanding and Digital Documents)
Abstract
Data clustering is a technique for grouping similar data items together so that they are easier to understand. Conventional data clustering methods, including agglomerative hierarchical clustering and partitional clustering algorithms, often perform unsatisfactorily on large text collections because their computational complexity grows very quickly with the number of data items. Poor clustering results degrade intelligent applications such as event tracking and information extraction. This paper presents an unsupervised document clustering method that identifies topic keyword clusters in the text corpus. The proposed method adopts a multi-stage process. First, an aggressive data cleaning approach is employed to reduce noise in the free text and to identify the topic keywords in the documents. All extracted keywords are then grouped into topic keyword clusters using the k-nearest neighbor approach and the keyword clustering technique. Finally, all documents in the corpus are clustered based on the topic keyword clusters. The proposed method is assessed against conventional data clustering methods on a web news corpus. The experimental results show that the proposed method is an efficient and effective clustering approach.
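The multi-stage process described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the cleaning rules, the keyword similarity measure, the value of k, or how documents are matched to keyword clusters, so every such choice below is an assumption made for illustration: the stopword list, the `min_df` document-frequency filter, cosine similarity over document co-occurrence, linking each keyword to its k nearest neighbors and taking connected components, and assigning documents by keyword overlap. All function names are hypothetical, and this is not the authors' implementation.

```python
import math
import re
from collections import defaultdict

# Assumed stopword list; the paper's actual cleaning rules are not given in the abstract.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for", "is"}


def extract_keywords(docs, min_df=2):
    """Stage 1 (assumed): tokenize, drop stopwords, keep tokens seen in >= min_df documents."""
    postings = defaultdict(set)  # keyword -> ids of documents containing it
    for doc_id, text in enumerate(docs):
        for token in set(re.findall(r"[a-z]+", text.lower())):
            if token not in STOPWORDS:
                postings[token].add(doc_id)
    return {t: ids for t, ids in postings.items() if len(ids) >= min_df}


def cosine(a, b):
    """Cosine similarity between two keywords, each represented by its set of documents."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def cluster_keywords(postings, k=2, min_sim=0.5):
    """Stage 2 (assumed): link each keyword to its k nearest neighbors, then
    return the connected components of that graph as topic keyword clusters."""
    parent = {t: t for t in postings}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for t in postings:
        neighbors = sorted(
            ((cosine(postings[t], postings[u]), u) for u in postings if u != t),
            reverse=True,
        )
        for sim, u in neighbors[:k]:
            if sim >= min_sim:
                parent[find(t)] = find(u)

    components = defaultdict(set)
    for t in postings:
        components[find(t)].add(t)
    return sorted(sorted(c) for c in components.values())


def cluster_documents(docs, keyword_clusters):
    """Stage 3 (assumed): assign each document to the keyword cluster it overlaps most.
    Documents sharing no keyword with any cluster get the label -1."""
    labels = []
    for text in docs:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        overlaps = [len(tokens & set(c)) for c in keyword_clusters]
        if not overlaps or max(overlaps) == 0:
            labels.append(-1)
        else:
            labels.append(max(range(len(overlaps)), key=lambda i: overlaps[i]))
    return labels


if __name__ == "__main__":
    # Toy corpus invented for this example; it is not the web news corpus from the paper.
    docs = [
        "The election results surprised voters in the capital",
        "Voters question the election results",
        "Election campaign targets young voters",
        "The team won the football match",
        "Football fans celebrate the match result",
        "The football team lost the final match",
    ]
    clusters = cluster_keywords(extract_keywords(docs))
    print(clusters)
    print(cluster_documents(docs, clusters))
```

In this sketch, similarity is computed between keywords rather than between documents, so the pairwise step scales with the size of the keyword vocabulary after cleaning rather than with the number of documents. That is one plausible reading of why clustering keywords first could help on large corpora, but the abstract itself does not state the complexity of the proposed method.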
- Paper of the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2005-08-01
Authors
- Hsu Chiun-chieh
  Department of Information Management, National Taiwan University of Science and Technology
- Chang Hsi-cheng
  Department of Information Management, National Taiwan University of Science and Technology
Related papers
- Using Topic Keyword Clusters for Automatic Document Clustering (Document Clustering, Document Image Understanding and Digital Documents)
- Scheduling Real-Time Multi-Processor Systems with Distance-Constrained Tasks Using the Early-Release-Fair Model (Digital Signal Processing)