Scrapy-Redis Crawler

Last updated: September 27, 2020 (evening)

Info

Scrapy does not support distributed crawling on its own. To build a distributed crawler you need the Scrapy-Redis component.
This component takes advantage of the fact that Redis can be shared across machines, enabling Scrapy to crawl in a distributed fashion and improving crawling efficiency.

Advantages of distributed crawling
 It can make full use of the IP addresses, bandwidth, CPU, and other resources of multiple machines.
Problems distributed crawling must solve
 How to guarantee that the same page is not crawled more than once (see the sketch after this list).
 How to merge the scraped data together correctly.
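Conceptually, Scrapy-Redis answers the deduplication question by keeping a shared set of request fingerprints in Redis: a request is scheduled only if its fingerprint is not already in the set. Below is a minimal sketch of that idea using redis-py; the key name and the fingerprint function are illustrative, not the component's actual internals.

import hashlib
import redis

r = redis.Redis(host='localhost', port=6379)

def seen_before(url):
    """Return True if this URL's fingerprint is already in the shared set."""
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # SADD returns 1 if the member was newly added, 0 if it already existed.
    return r.sadd('demo:dupefilter', fp) == 0

print(seen_before('https://example.com/page/1'))  # False: first time seen
print(seen_before('https://example.com/page/1'))  # True: duplicate

Because every crawler machine checks the same Redis set, no two machines schedule the same request; merging the data is handled by having every machine push its items to the same Redis instance (see the Item Pipeline setting below).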

Run flow

  1. The Engine gets the first Requests to crawl from the Spider.
  2. The Engine puts the Requests into the Scheduler and asks for the next Request.
  3. To guarantee that nothing is crawled twice, the Scheduler sends the Requests to Redis.
  4. If Redis has no record of them having been crawled, it returns the Requests to the Scheduler.
  5. The Scheduler returns the Requests to the Engine (the Engine is responsible for task scheduling).
  6. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (this step performs the HTTP request and returns a Response).
  7. The Downloader fetches the resource (the HTML) through the Downloader Middleware; when the download finishes, the Downloader generates a Response and sends it to the Engine.
  8. The Engine receives the Response from the Downloader and sends it to the Spider for processing; the Response passes through the Spider Middleware.
  9. The Spider processes the Response and returns items and new Requests to the Engine; this part also goes through the Spider Middleware.
  10. The Engine receives the items from the Spider and sends them to the Item Pipeline for processing.
  11. The Item Pipeline sends the items to Redis to be stored.
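
To see how this flow looks from the spider's side, here is a minimal RedisSpider sketch; the spider name, redis_key, and parsing logic are placeholders, and it assumes the scrapy-redis settings shown later in this post.

from scrapy_redis.spiders import RedisSpider

class DemoSpider(RedisSpider):
    # Start URLs are read from the Redis list 'demo:start_urls'
    # instead of a hard-coded start_urls attribute.
    name = 'demo'
    redis_key = 'demo:start_urls'

    def parse(self, response):
        # Yielded items go through the Item Pipeline (steps 10-11) into Redis;
        # yielded Requests go back to the shared Redis scheduler (steps 2-5).
        yield {'url': response.url, 'title': response.css('title::text').get()}
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Every machine running this spider pulls start URLs and scheduled requests from the same Redis instance, which is what makes the crawl distributed.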

Typical deployment

Redis server:
 Needs plenty of memory; it is only used to store the scraped data and to deduplicate URLs.
Crawler servers:
 Run the spider code, do the actual crawling, and send the scraped data to the Redis server (see the settings sketch below).
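
In practice this means every crawler server points at the same Redis instance in its settings.py; a minimal sketch (the host address is a placeholder for your own Redis server):

# settings.py on every crawler server: point at the shared Redis instance.
REDIS_URL = 'redis://192.168.1.10:6379'
# or, equivalently:
# REDIS_HOST = '192.168.1.10'
# REDIS_PORT = 6379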

Documentation

GitHub: https://github.com/rmax/scrapy-redis
The documentation is extremely brief; it only gives the basic settings and one example project.

Basics

Installation

pip install scrapy-redis

Settings

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
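
With these settings in place, the usual workflow is to start the same spider on every crawler server and then seed the start URLs through Redis. A sketch of seeding the queue with redis-py follows; the spider name, key, host, and URL are placeholders and must match your own setup.

# On each crawler server, start the spider first, e.g.:
#   scrapy crawl demo
# Then push the first URL(s) into the shared start-URL list from any machine:
import redis

r = redis.Redis(host='192.168.1.10', port=6379)  # the shared Redis server
r.lpush('demo:start_urls', 'https://example.com')  # must match the spider's redis_key

As soon as a URL appears in the list, one of the idle spiders picks it up and the distributed crawl begins.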
