Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method
Abstract
In this paper, we study content-based detection of spam that is generated by copying a seed document with random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is, like Kolmogorov complexity, an ideal measure, we substitute the estimated occurrence probability of each document for its complexity. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically.
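The idea of scoring each document by its estimated occurrence probability in the rest of the collection can be illustrated with a small sketch. This is not the paper's maximal overlap method and it does not reproduce the linear-time algorithm; it is a simplified stand-in that assumes character n-gram counts, leave-one-out estimation, and add-one smoothing. The names `ngrams` and `copy_scores` and the default `n=4` are hypothetical choices made for this illustration.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return all character n-grams of text (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def copy_scores(docs, n=4):
    """Average log-probability of each document's n-grams, estimated from
    the other documents only (leave-one-out). Higher means the document
    looks more like a copy of something else in the collection.

    Assumption: a plain n-gram model with add-one smoothing stands in for
    the paper's probability estimate.
    """
    per_doc = [Counter(ngrams(d, n)) for d in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    total_count = sum(total.values())
    vocab = len(total)

    scores = []
    for counts in per_doc:
        doc_count = sum(counts.values())
        if doc_count == 0:
            # Document too short to contain any n-gram.
            scores.append(float("-inf"))
            continue
        rest = total_count - doc_count  # n-gram occurrences outside this document
        logp = 0.0
        for gram, k in counts.items():
            # Occurrences of gram outside this document, add-one smoothed.
            p = (total[gram] - k + 1) / (rest + vocab)
            logp += k * math.log(p)
        scores.append(logp / doc_count)
    return scores
```

Documents whose score is unusually high relative to the rest of the collection would be the candidates for copied spam under this simplified model; how to set that threshold is left open here.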