A Study of Link Farm Evolution Using a Time-series of Web Snapshots
Abstract
Web spamming has emerged as a way to deceive search engines and obtain higher rankings in search result lists, which bring more traffic and profit to web sites. A link farm is one of the major spamming techniques: it creates a large set of densely interlinked spam pages to deceive link-based ranking algorithms that regard incoming links to a page as endorsements of it. Link farms need to be eliminated when we search, analyze, and mine the Web, but they are also interesting social activities in cyberspace. Our purpose is to understand the dynamics of link farms, such as how much they grow or shrink and how their topics change over time. Such information is helpful for developing new spam detection techniques and for tracking spam sites to observe their topics. In particular, we are interested in where emerging spam sites can be found, which is useful for updating spam classifiers. In this paper, we study the overall size and topic distribution of link farms, and their evolution, in large-scale Japanese web archives that span three years and contain four million hosts and 83 million links. As far as we know, the overall characteristics of link farms in a time series of web snapshots of this scale have never been explored. We propose a method for extracting link farms and investigate their size distribution and topics. We observe the evolution of link farms from the perspective of size growth and change in topic distribution. We recursively decomposed host graphs into link farms and found that 4% to 7% of hosts were members of link farms. This implies that we can remove a substantial number of spam hosts without content analysis. We also found that the two dominant topics, "Adult" and "Travel", accounted for over 60% of spam hosts in link farms. The size evolution of link farms showed that many link farms were maintained for years, but most of them did not grow.
The distribution of topics in link farms did not change significantly, but the hosts and keywords related to each topic changed dynamically. These results suggest that we can observe topic changes in each link farm, but that we cannot efficiently find emerging spam sites by monitoring link farms. This implies that monitoring current link farms is not enough to detect newly created spam sites. Detecting sites that generate links to spam sites would be an effective approach.
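As a rough illustration of what extracting densely interlinked groups from a host graph can look like, the sketch below finds strongly connected components and treats the larger ones as candidate link farms. The abstract does not describe the actual decomposition procedure, so the use of strongly connected components, the `min_size` threshold, and all function names here are assumptions made for illustration only, not the authors' method.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (src, dst) pairs.

    Uses an iterative form of Kosaraju's algorithm. This is an assumed
    stand-in for the decomposition mentioned in the abstract.
    """
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record hosts in order of DFS finishing time.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All outgoing links explored: this host is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def candidate_link_farms(edges, min_size=3):
    """Keep only components with at least min_size hosts (hypothetical threshold)."""
    return [c for c in strongly_connected_components(edges) if len(c) >= min_size]


def farm_host_fraction(edges, min_size=3):
    """Fraction of all hosts that fall inside some candidate link farm."""
    hosts = {host for edge in edges for host in edge}
    in_farms = set().union(*candidate_link_farms(edges, min_size))
    return len(in_farms) / len(hosts) if hosts else 0.0


# Toy host graph: a, b, c link to each other in a cycle; d and e hang off it.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
farms = candidate_link_farms(toy_edges)
fraction = farm_host_fraction(toy_edges)
```

On a real host graph, a fraction computed this way would play the role of the "4% to 7% of hosts" figure, although the numbers depend entirely on how a link farm is defined and on the chosen threshold.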
Authors
- Toyoda Masashi, Institute of Industrial Science, The University of Tokyo
- Kitsuregawa Masaru, Institute of Industrial Science, The University of Tokyo
- Chung Young-joo, Institute of Industrial Science, The University of Tokyo