A Corpus-Based Approach for Automatic Thai Unknown Word Recognition Using Boosting Techniques
Abstract
Classification techniques can be applied to automatic unknown word recognition in a language without word boundaries, but they face the problem of unbalanced datasets, in which positive unknown word candidates are far fewer than negative ones. To solve this problem, this paper presents a corpus-based approach that introduces a group-based ranking evaluation technique into ensemble learning, generating a sequence of classification models that later collaborate to select the most probable unknown word from multiple candidates. Given a classification model, the group-based ranking evaluation (GRE) constructs a training dataset for learning the succeeding model by weighting each candidate according to its rank and correctness, where the candidates of one unknown word are treated as a group. A number of experiments were conducted on a large Thai medical text to evaluate the performance of the proposed group-based ranking evaluation approach, namely V-GRE, compared with the conventional naïve Bayes classifier and our vanilla version without ensemble learning. The proposed method achieves an accuracy of 90.93±0.50% when only the first-ranked candidate is selected and 97.26±0.26% when the top ten candidates are considered, an improvement of 8.45% and 6.79% over the conventional record-based naïve Bayes classifier and the vanilla version. A further experiment using only the best features shows 93.93±0.22% and up to 98.85±0.15% accuracy for top-1 and top-10, respectively, an improvement of 3.97% and 9.78% over naïve Bayes and the vanilla version. Finally, an error analysis is given.
- Paper published by the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2009-12-01
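The abstract reports accuracy per group of candidates: an unknown word counts as recognized at top-1 if its correct candidate is ranked first, and at top-10 if it appears among the ten best-ranked candidates. The following is a minimal sketch of that kind of group-based top-k measurement, not the authors' implementation. The function name `top_k_group_accuracy`, the `(group_id, score, is_correct)` record layout, and the toy data are all assumptions made for illustration. The rank-based weighting that GRE uses to build the next training set is not specified in the abstract, so it is not sketched here.

```python
from collections import defaultdict


def top_k_group_accuracy(candidates, k):
    """Fraction of groups whose correct candidate is ranked within the top k.

    `candidates` is an iterable of (group_id, score, is_correct) tuples.
    This record layout is hypothetical: one group holds all candidates
    generated for a single unknown word, and `score` is whatever
    confidence the classifier assigned to the candidate.
    """
    groups = defaultdict(list)
    for group_id, score, is_correct in candidates:
        groups[group_id].append((score, is_correct))

    if not groups:
        return 0.0

    hits = 0
    for members in groups.values():
        # Rank the candidates of this group by score, best first.
        ranked = sorted(members, key=lambda m: m[0], reverse=True)
        # The group is a hit if any of its top-k candidates is correct.
        if any(is_correct for _, is_correct in ranked[:k]):
            hits += 1
    return hits / len(groups)


# Made-up scores for two unknown words with three candidates each.
toy = [
    ("w1", 0.9, True), ("w1", 0.4, False), ("w1", 0.1, False),
    ("w2", 0.8, False), ("w2", 0.6, True), ("w2", 0.2, False),
]

print(top_k_group_accuracy(toy, 1))  # w1 is a hit, w2 is a miss
print(top_k_group_accuracy(toy, 2))  # both groups are hits
```

Counting hits per group rather than per candidate record keeps the measure independent of how many negative candidates each unknown word produces, which is the imbalance the abstract describes.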
Authors
- Nattee Cholwich
  Information Computer And Communication Technology School Sirindhorn International Institute Of Techn
- Theeramunkong Thanaruk
  Information And Computer Technology School Sirindhorn International Institute Of Technology Thammasa
- Theeramunkong Thanaruk
  Information Computer And Communication Technology School Sirindhorn International Institute Of Techn
- TECHO Jakkrit
  Information, Computer and Communication Technology School, Sirindhorn International Institute of Tec
- Techo Jakkrit
  Information Computer And Communication Technology School Sirindhorn International Institute Of Techn