Construction of the Thai Tagged Corpus "ORCHID"
Abstract
ORCHID (Open linguistic Resources CHanelled toward InterDisciplinary research) is an initiative to build linguistic resources that support research in natural language processing and related fields. Following an open architecture design, the resources must be fully compatible with similar resources, and software tools must also be made available. This paper presents one result of the project: the construction of a Thai part-of-speech (POS) tagged corpus, a preliminary stage in the construction of a Thai speech corpus. The POS-tagged corpus is the result of collaborative research between the Communications Research Laboratory (CRL) in Japan and the National Electronics and Computer Technology Center (NECTEC) in Thailand, with technical support from the Electrotechnical Laboratory (ETL) in Japan. In this paper, we propose a new tagset based on the results of a prior multilingual machine translation project. The corpus is annotated on three levels: paragraph, sentence, and word. Text information is maintained in the form of text information lines and number lines, both of which are used in data retrieval. Both word segmentation and POS tagging were carried out with a probabilistic trigram model. Rules for syllable demarcation were additionally used to reduce the number of candidates when computing tagging probabilities. Some typical problems in POS assignment are also formalized to resolve ambiguity.
- Paper published by the Acoustical Society of Japan
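The abstract names a probabilistic trigram model for POS tagging but gives no formulas. As a rough illustration only, the following is a minimal sketch of a generic trigram tagger with Viterbi decoding, not the authors' implementation. The tags (`PRON`, `VERB`, `NOUN`), the English toy sentences, the boundary symbols, and the add-one smoothing are all assumptions introduced for this example. It tags words that are already segmented and does not model the word segmentation step or the syllable demarcation rules mentioned in the abstract.

```python
import math
from collections import Counter

START, STOP = "<s>", "</s>"  # hypothetical sentence-boundary symbols

def train(tagged_sentences):
    """Estimate add-one-smoothed trigram and emission log-probabilities."""
    tri, ctx, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        tags = [START, START] + [t for _, t in sent] + [STOP]
        for i in range(2, len(tags)):
            tri[tuple(tags[i - 2:i + 1])] += 1   # (t1, t2, t3) counts
            ctx[tuple(tags[i - 2:i])] += 1       # (t1, t2) context counts
        for word, t in sent:
            emit[(t, word)] += 1
            tag_count[t] += 1
            vocab.add(word)
    tagset = sorted(tag_count)
    n_next = len(tagset) + 1   # next symbol is any tag or STOP
    n_words = len(vocab) + 1   # known words plus one unknown-word slot

    def log_trans(t1, t2, t3):
        # log P(t3 | t1, t2) with add-one smoothing
        return math.log((tri[(t1, t2, t3)] + 1) / (ctx[(t1, t2)] + n_next))

    def log_emit(t, word):
        # log P(word | t) with add-one smoothing
        return math.log((emit[(t, word)] + 1) / (tag_count[t] + n_words))

    return tagset, log_trans, log_emit

def tag(words, model):
    """Return the most probable tag sequence under the trigram model."""
    tagset, log_trans, log_emit = model
    if not words:
        return []
    # best[(u, v)]: best log-probability of a prefix whose last two tags are u, v
    best = {(START, START): 0.0}
    back = []  # one backpointer table per word
    for word in words:
        new_best, ptr = {}, {}
        for (u, v), score in best.items():
            for t in tagset:
                s = score + log_trans(u, v, t) + log_emit(t, word)
                if s > new_best.get((v, t), -math.inf):
                    new_best[(v, t)] = s
                    ptr[(v, t)] = u
        best = new_best
        back.append(ptr)
    # Include the end-of-sentence transition when picking the final state.
    state = max(best, key=lambda k: best[k] + log_trans(k[0], k[1], STOP))
    result = []
    for ptr in reversed(back):
        result.append(state[1])
        state = (ptr[state], state[0])
    return result[::-1]

# Toy training data (hypothetical, not taken from ORCHID).
toy_corpus = [
    [("I", "PRON"), ("eat", "VERB"), ("rice", "NOUN")],
    [("you", "PRON"), ("eat", "VERB"), ("fish", "NOUN")],
    [("fish", "NOUN"), ("eat", "VERB"), ("rice", "NOUN")],
]
model = train(toy_corpus)
print(tag(["you", "eat", "rice"], model))  # ['PRON', 'VERB', 'NOUN']
```

Each Viterbi state is a pair of adjacent tags, so the search cost grows with the cube of the tagset size per word. Pruning the candidate tags for each word is one way to keep this manageable; the abstract describes using syllable demarcation rules to reduce the number of candidates.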
Authors
-
Sornlertlamvanich Virach
Department Of Computer Science Graduate School Of Information Science And Engineering Tokyo Institut
-
Takahashi Naoto
Electrotechnical Laboratory
-
Isahara Hitoshi
Intelligent Processing Section, Kansai Advanced Research Center, Communications Research Laboratory,
Related papers
- Incorporating Probabilistic Parsing into an LR Parser : LR Table Engineering (4)
- Construction of the Thai Tagged Corpus "ORCHID"
- Extracting Open Compounds from Text Corpora