Pattern-Based Features vs. Statistical-Based Features in Decision Trees for Word Segmentation (Natural Language Processing)
Abstract
This paper proposes two alternative approaches that do not use a dictionary but instead utilize different types of learned features to segment words in a language that has no explicit word boundaries. Both methods use decision trees, acquired from a training corpus, as the knowledge representation for the segmentation process. The first method, a language-dependent technique, applies a set of constructed feature patterns based on character types to generate a set of heuristic segmentation rules. It separates running text into a sequence of small chunks based on the given patterns and constructs a decision tree for word segmentation. The second method extracts statistics of character sequences from a training corpus and uses them as features for constructing a set of rules by decision tree induction. The latter needs no linguistic knowledge. In experiments on the Thai language, both methods achieve relatively high accuracy, but the latter performs much better.
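As a rough illustration of the second, statistics-based idea, the sketch below learns one statistic per adjacent character pair (how often a word boundary falls between the two characters in a segmented training corpus) and then fits a one-level decision tree (a stump) on that feature. The abstract does not say which statistics or which tree learner the authors use, so the boundary ratio, the stump, and every function name here are assumptions made for illustration only. A tiny Latin-alphabet corpus stands in for Thai text.

```python
from collections import Counter


def gap_examples(segmented_sentences):
    """Yield (left_char, right_char, is_boundary) for every gap between characters."""
    for words in segmented_sentences:
        text = "".join(words)
        boundaries = set()
        pos = 0
        for word in words[:-1]:
            pos += len(word)
            boundaries.add(pos)
        for i in range(1, len(text)):
            yield text[i - 1], text[i], i in boundaries


def boundary_ratios(segmented_sentences):
    """Assumed statistic: fraction of occurrences of a character pair that are split."""
    seen, split = Counter(), Counter()
    for left, right, is_boundary in gap_examples(segmented_sentences):
        seen[left + right] += 1
        if is_boundary:
            split[left + right] += 1
    return {pair: split[pair] / seen[pair] for pair in seen}


def fit_stump(segmented_sentences, ratios):
    """Pick the ratio threshold with the fewest training errors (one-level tree)."""
    examples = [(ratios[l + r], b) for l, r, b in gap_examples(segmented_sentences)]
    best_threshold, best_errors = 1.0, None
    for threshold in sorted({x for x, _ in examples}):
        errors = sum((x >= threshold) != b for x, b in examples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def segment(text, ratios, threshold):
    """Split text wherever the stump predicts a boundary; unseen pairs count as 0.0."""
    if not text:
        return []
    words, start = [], 0
    for i in range(1, len(text)):
        if ratios.get(text[i - 1:i + 1], 0.0) >= threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


# Toy segmented corpus (a stand-in for a Thai training corpus).
corpus = [["the", "cat"], ["the", "dog"], ["a", "cat"], ["a", "dog"]]
ratios = boundary_ratios(corpus)
threshold = fit_stump(corpus, ratios)
print(segment("thedog", ratios, threshold))  # ['the', 'dog']
```

A full decision tree would combine several such statistics (for example, counts over longer character sequences on each side of the gap) rather than thresholding a single one. The stump is used here only to keep the example short.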
- Paper of the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2004-05-01
Authors
-
Theeramunkong Thanaruk
Information And Computer Technology School Sirindhorn International Institute Of Technology Thammasa
-
Theeramunkong Thanaruk
Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University
-
Tanhermhong Thanasan
Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University
Related papers
- A Family-Based Evolutional Approach for Kernel Tree Selection in SVMs
- Kernel Trees for Support Vector Machines (Knowledge, Information and Creativity Support System)
- Pattern-Based Features vs. Statistical-Based Features in Decision Trees for Word Segmentation (Natural Language Processing)
- Speech Clarity Index (Ψ) : A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy
- A Corpus-Based Approach for Automatic Thai Unknown Word Recognition Using Boosting Techniques
- Effects of Term Distributions on Binary Classification (Knowledge, Information and Creativity Support System)