A Distributed Crawler Framework Based on Coroutine Model

Yang Jiyun; Liu Jianxun; Jiang Lei; Peng Tao; Wen Yiping; Lu Ting

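Cite this article: Yang Jiyun, Liu Jianxun, Jiang Lei, Peng Tao, Wen Yiping, Lu Ting. A Distributed Crawler Framework Based on Coroutine Model[J]. Computing Technology and Automation (计算技术与自动化), 2014, (3): 126-133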

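Authors	Affiliation
Yang Jiyun, Liu Jianxun, Jiang Lei, Peng Tao, Wen Yiping, Lu Ting	(Key Laboratory of Knowledge Processing and Networked Manufacturing, a key laboratory of Hunan Province universities, School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201)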

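Abstract (translated from the Chinese): Web crawlers are limited mainly by network latency and by local running efficiency. Traditional multi-threaded crawler architectures are designed mainly to offset network latency and do not consider local running efficiency. Under high concurrency, a multi-threaded crawler runs less efficiently on the local machine because context-switching overhead grows, and its network utilization drops as well. How to reduce local system overhead while making the fullest use of network resources is therefore a problem that needs study. To address it, this paper proposes a distributed web crawler framework based on coroutines, compares the coroutine framework with the multi-threaded framework in terms of overhead, resource utilization, and network utilization, and implements a distributed web crawler based on coroutines. Experiments show that the framework has a fairly clear advantage over the multi-threaded framework in overhead, resource utilization, and network utilization.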

Keywords (translated from the Chinese): coroutine; distributed; high performance; crawler

A Distributed Crawler Framework Based on Coroutine Model

Abstract: Web crawlers are limited mainly by network latency and local resources. The traditional crawler framework, which is based on multiple threads, is designed mainly to hide network latency and fails to take the local resource limitation into account. Under high concurrency, a multi-threaded architecture runs inefficiently because of the increase in context switching. It is therefore necessary to study how to make maximum use of network resources while also respecting the local resource limitation. To solve these problems, this paper proposes a distributed crawler framework based on coroutines. We first analyze the overhead, resource utilization, and network utilization of coroutines and threads, and then implement a web crawler based on coroutines. Experiments show that our coroutine-based architecture for a distributed web crawler outperforms a thread-based web crawler.
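
As a rough illustration of the coroutine model the abstract describes, the following Python sketch runs many crawl tasks on a single thread with asyncio. It is not the authors' framework: the `fetch` function simulates network latency with `asyncio.sleep` instead of issuing real HTTP requests, and the names `fetch`, `worker`, `crawl`, `run_demo`, the `example.invalid` URLs, and the concurrency value are all assumptions made for this example. It is also single-process, so it does not show the distributed part.

```python
import asyncio


async def fetch(url: str, latency: float = 0.01) -> str:
    """Hypothetical stand-in for an HTTP request.

    Awaiting the sleep hands control back to the event loop, so other
    coroutines can run while this one "waits on the network".
    """
    await asyncio.sleep(latency)
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull URLs from the shared queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        finally:
            queue.task_done()


async def crawl(urls: list, concurrency: int = 100) -> dict:
    """Fetch all URLs with a fixed number of worker coroutines."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: dict = {}
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]

    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


def run_demo(n: int = 1000) -> dict:
    """Crawl n made-up URLs on one thread and return the fetched pages."""
    urls = [f"http://example.invalid/page/{i}" for i in range(n)]
    return asyncio.run(crawl(urls))
```

Calling `run_demo(1000)` schedules all requests on one OS thread. Switching between coroutines happens in user space at each `await`, which is the kind of overhead the abstract contrasts with thread context switches. A real crawler would replace `fetch` with an asynchronous HTTP client and add parsing, deduplication, and politeness limits.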

Keywords: coroutine; distributed; high performance; web crawler
