A Web Corpus and Word Sketches for Japanese
Abstract
Among the major world languages, Japanese lags behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word's grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus, as loaded into the Sketch Engine, will be a useful resource for a wide range of Japanese researchers, learners, and NLP developers.
Authors
- ERJAVEC Irena (Tokyo Institute of Technology)
- ERJAVEC Tomaz (Jozef Stefan Institute)
- KILGARRIFF Adam (Lexical Computing Ltd.)
Related papers
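- Extracting distant co-occurrences of suppositional adverbs and modality forms using a web corpus and search system, and applying them to Japanese language education
- Towards a corpus-based vocabulary syllabus: focusing on the co-occurrence of suppositional adverbs and sentence-final modality
- The Japanese version of the corpus query tool Sketch Engine and how to use it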
- Synonyms according to situational types
- A Web Corpus and Word Sketches for Japanese