Efficient Distributed Web Crawling Utilizing Internet Resources
Abstract
Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. We also propose two optimizations that reduce the download time and waiting time of Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement in download time is achieved by shortening the crawler-crawlee RTTs. In order to estimate the RTTs accurately, a network coordinate system is combined with the underlying DHT. The improvement in waiting time is achieved by redirecting incoming crawling tasks to lightly loaded crawlers in order to keep the queues on all crawlers equally sized. We also propose a simple Web site partition method that splits a large Web site into smaller pieces in order to reduce the task granularity. All the proposed methods are evaluated through real Internet tests and simulations, which show satisfactory results.
- 2010-10-01
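The abstract names two ideas, estimating crawler-crawlee RTT from network coordinates and steering tasks toward lightly loaded crawlers. The sketch below shows one way they could be combined. It is an illustration only, not the authors' algorithm: this record gives no algorithmic detail, so the Euclidean coordinate model, the names `estimated_rtt` and `pick_crawler`, and the `max_queue_gap` parameter are all assumptions.

```python
import math


def estimated_rtt(coord_a, coord_b):
    """Assumed model: RTT is approximated by the Euclidean distance
    between two nodes' network coordinates."""
    return math.dist(coord_a, coord_b)


def pick_crawler(crawlers, site_coord, max_queue_gap=2):
    """Choose a crawler for a task targeting the site at `site_coord`.

    crawlers: dict mapping crawler name -> (coordinate, queue_length).
    Crawlers whose queue exceeds the shortest queue by more than
    `max_queue_gap` (a hypothetical threshold) are skipped, so queues
    stay roughly equal; among the rest, the lowest estimated RTT wins.
    """
    shortest = min(queue for _, queue in crawlers.values())
    candidates = {
        name: coord
        for name, (coord, queue) in crawlers.items()
        if queue - shortest <= max_queue_gap
    }
    return min(candidates, key=lambda name: estimated_rtt(candidates[name], site_coord))


# Hypothetical crawlers: name -> (2-D coordinate, pending tasks)
crawlers = {
    "a": ((0.0, 0.0), 5),
    "b": ((10.0, 0.0), 1),
    "c": ((3.0, 4.0), 2),
}
chosen = pick_crawler(crawlers, site_coord=(0.0, 0.0))
```

Here crawler `a` is closest to the site but is skipped because its queue is too long, so the task goes to the nearest of the remaining crawlers.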
Authors
-
Xu Xiao
Harbin Institute of Technology
-
ZHANG Weizhe
Harbin Institute of Technology
-
ZHANG Hongli
Harbin Institute of Technology
-
FANG Binxing
Harbin Institute of Technology
-
Zhang Weizhe
School of Computer Science and Technology, Harbin Institute of Technology
-
Zhang Weizhe
Harbin Inst. Technol. Harbin Chn
Related papers
- Exploring Web Partition in DHT-Based Distributed Web Crawling
- Efficient Distributed Web Crawling Utilizing Internet Resources
- Exploring Social Relations for Personalized Tag Recommendation in Social Tagging Systems
- A User-Habit Property: Haunting of Users in IP Networks(Networks)