Counting documents that contain substrings more than k times.
Abstract
The statistic we compute is df_κ: the number of documents that contain a given string more than κ times. Keeping this statistic for every substring is impractical, because it requires O(N^2) space, where N is the size of the corpus. Yamamoto et al. show that a table for κ=1 can be produced in O(N) space using a suffix array and the concept of "class of string". However, this method cannot solve the problem for κ≥2. We present an algorithm that can be used for κ≥2; the statistics can then be computed from the table it produces. In this report, we explain df_κ and compare the proposed algorithm with simple methods. The algorithm takes O(N log N) time and O(N) space to produce the table, and O(log N) time to obtain statistics from the table.
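To make the definition of df_κ concrete, the following is a minimal sketch of a naive baseline, not the algorithm proposed in the report. It rescans every document for each query instead of building a table. The function names `count_occurrences` and `df_k` and the sample documents are hypothetical. The sketch also assumes that "more than κ times" means "at least κ times", so that κ=1 gives the ordinary document frequency, and that overlapping occurrences are counted separately; the abstract does not settle either point.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        raise ValueError("pattern must be a non-empty string")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return count


def df_k(documents, pattern, k):
    """Number of documents containing pattern at least k times.

    Assumption: "more than k times" is read as "at least k times".
    """
    return sum(1 for doc in documents if count_occurrences(doc, pattern) >= k)


# Hypothetical toy corpus, for illustration only.
docs = ["abab", "ab", "ba", "ababab"]
for k in (1, 2, 3):
    print(k, df_k(docs, "ab", k))
```

Each query in this baseline costs time proportional to the corpus size, and precomputing the answer for every substring would need one entry for each of the N(N+1)/2 substring positions of a corpus of size N. That is the O(N^2) space cost the abstract sets out to avoid.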
- Papers from the Association for Natural Language Processing
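- An Efficient Method for Determining Field-Associated Terms of Compound Words (複合語の分野連想語の効率的決定法)
- Constructing a Paraphrase Corpus with a Class-Oriented Example Collection Method (クラス指向事例収集手法による言い換えコーパスの構築)
- Adding Large-Scale Examples to a Verb Argument Structure Dictionary (動詞項構造辞書への大規模用例付与)
- Research Trends in Paraphrasing Technology (言い換え技術に関する研究動向)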
- Morpho-Syntactic Rules for Detecting Japanese Term Variation: Establishment and Evaluation