Counting documents that contain substrings more than k times.

Abstract

The statistic we compute is dfκ: the number of documents that contain a given string more than κ times. Storing this statistic for every substring is impractical, because it requires O(N^2) space, where N is the size of the corpus. Yamamoto et al. show that a table for κ = 1 can be produced in O(N) space using a suffix array and the concept of "class of string". However, this method cannot solve the problem for κ ≥ 2. We present an algorithm that handles κ ≥ 2; the statistics can then be computed from the table it produces. In this report, we explain dfκ and compare the proposed algorithm with simple methods. The algorithm takes O(N log N) time and O(N) space to produce the table, and O(log N) time to obtain a statistic from the table.
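To make the definition of dfκ concrete, here is a minimal brute-force sketch in Python. It is not the proposed algorithm and not one of the paper's baselines; it only illustrates the quantity being computed. Two points are assumptions rather than statements from the abstract: "more than κ times" is read as "at least κ times" (so that κ = 1 gives ordinary document frequency, the case attributed to Yamamoto et al.), and overlapping occurrences are counted separately. The names `count_overlapping` and `df_k` are hypothetical.

```python
def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (assumed convention)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return count


def df_k(documents: list[str], pattern: str, k: int) -> int:
    """Number of documents in which pattern occurs at least k times (assumed reading)."""
    return sum(1 for doc in documents if count_overlapping(doc, pattern) >= k)


docs = ["abracadabra", "abab", "cab"]
print(df_k(docs, "ab", 1))  # 3: every document contains "ab"
print(df_k(docs, "ab", 2))  # 2: "abracadabra" and "abab" contain it twice
print(df_k(docs, "ab", 3))  # 0
```

Each query in this sketch scans the whole corpus, which is the cost that a precomputed table with O(log N) lookups is meant to avoid.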
Papers of the Association for Natural Language Processing