Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method
Abstract
In this paper, we study content-based detection of spam that is generated by copying a seed document with random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is an idealized measure, like Kolmogorov complexity, we substitute an estimated occurrence probability of each document for its complexity. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically.
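The idea of scoring each document by an estimated occurrence probability can be illustrated with a small sketch. This is not the maximal overlap method of the paper and it does not run in linear time. It swaps in a much simpler leave-one-out character n-gram model with add-one smoothing, purely to show why near-copies of a seed document receive a high estimated probability (low surprisal) relative to the rest of the collection. The function names `ngrams` and `surprisal_scores`, the n-gram length `n=5`, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def surprisal_scores(docs, n=5):
    """Average surprisal (bits per n-gram) of each document.

    Each document is scored against n-gram counts taken from all
    OTHER documents in the collection (leave-one-out), with add-one
    smoothing. A low score means the document looks probable given
    the rest of the collection, i.e. similar documents exist.
    Illustrative stand-in only, not the paper's estimator.
    """
    per_doc = [Counter(ngrams(d, n)) for d in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    total_count = sum(total.values())
    vocab = len(total)

    scores = []
    for counts in per_doc:
        own = sum(counts.values())
        if own == 0:
            # Document shorter than n characters: no evidence at all.
            scores.append(float("inf"))
            continue
        rest_total = total_count - own
        bits = 0.0
        for gram, k in counts.items():
            # Occurrences of this n-gram outside the current document.
            outside = total[gram] - k
            p = (outside + 1) / (rest_total + vocab)
            bits -= k * math.log2(p)
        scores.append(bits / own)
    return scores


# Hypothetical collection: three perturbed copies of one seed,
# followed by two unrelated documents.
docs = [
    "buy cheap pills online now and save big money today",
    "buy cheap pills online now and save big money todax",
    "buy cheap pills onlinq now and save big money today",
    "the committee will meet on thursday to review budgets",
    "rain is expected along the northern coast this weekend",
]
scores = surprisal_scores(docs)
for doc, score in zip(docs, scores):
    print(f"{score:6.2f}  {doc}")
```

In this toy collection the three perturbed copies score lower than the two unrelated documents, so a threshold on the score would separate them. A real detector would need a principled threshold and an estimator that scales to large collections, which is what the paper's linear-time algorithm addresses.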