Crawler Practice: Deep-Crawling Douban Movie Reviews with Scrapy

Link Extractors

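Link extractors are objects whose only purpose is to extract, from web pages (scrapy.http.Response objects), the links that will eventually be followed.

Scrapy provides scrapy.linkextractors.LinkExtractor, but you can also create your own custom link extractor by implementing a simple interface.

The only public method every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. A link extractor is meant to be instantiated once, and its extract_links method is then called many times, once per response, to extract the links.

Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spider even if it does not subclass CrawlSpider, because their purpose is very simple: extracting links.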

Built-in link extractor reference

The link extractor classes that ship with Scrapy live in the scrapy.linkextractors module. The default link extractor is LinkExtractor, which is simply LxmlLinkExtractor:

from scrapy.linkextractors import LinkExtractor

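For example, to extract the link from this code:

    <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

you can use the following process_value function:

    import re

    def process_value(value):
        # Pull the real target path out of the javascript: pseudo-URL.
        m = re.search(r"javascript:goToPage\('(.*?)'", value)
        if m:
            return m.group(1)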

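In the regular expression above:

'.' matches any character except a newline

'*' matches the preceding character zero or more times

'?' matches the preceding character zero or one time; placed directly after '*', as in '.*?', it instead makes the match non-greedy, so the capture stops at the first closing quote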
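To see what the '?' in '.*?' changes, here is a small sketch using only the standard re module. The href value is a made-up variation of the example above, with a second quoted string added after the target path.

```python
import re

# Hypothetical href: a second quoted string follows the target path.
value = "javascript:goToPage('../other/page.html'); alert('done')"

# Greedy: '.*' grabs as much as it can, then backs off to the LAST quote.
greedy = re.search(r"goToPage\('(.*)'", value).group(1)

# Non-greedy: '.*?' grabs as little as it can, stopping at the FIRST quote.
lazy = re.search(r"goToPage\('(.*?)'", value).group(1)

print(greedy)
print(lazy)
```

Only the non-greedy form returns the bare path, which is why process_value uses '.*?'.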

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

allow (a regular expression, or a list of them): a single regular expression (or list of regular expressions) that the (absolute) URL must match in order to be extracted. If it is not given (or is empty), every link matches.
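The behaviour of allow can be approximated with the standard re module. This is only a sketch under the assumption that a URL passes when any pattern is found somewhere in it; the helper name is_allowed and the example.com URLs are invented for illustration, and the real class also applies deny, domain and extension filters.

```python
import re

def is_allowed(url, allow=()):
    """Sketch of the allow check: an empty allow matches every URL."""
    patterns = [allow] if isinstance(allow, str) else list(allow)
    if not patterns:
        return True
    return any(re.search(pattern, url) for pattern in patterns)

# Hypothetical absolute URLs, for illustration only.
urls = [
    "https://example.com/subject/26363254/reviews",
    "https://example.com/subject/26363254/photos",
    "https://example.com/review/8710000/",
]

for url in urls:
    print(url, is_allowed(url, r"/subject/\d+/reviews$"))
```

Note the trailing '$' in the pattern: without it, a longer URL that merely contains /subject/<id>/reviews would pass as well.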

First create a project. Its layout is as follows:

    $ tree
    .
    ├── douban_Music
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-36.pyc
    │   │   ├── items.cpython-36.pyc
    │   │   ├── pipelines.cpython-36.pyc
    │   │   └── settings.cpython-36.pyc
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       ├── __pycache__
    │       │   ├── __init__.cpython-36.pyc
    │       │   └── reviemspider.cpython-36.pyc
    │       └── reviemspider.py
    ├── Movie.txt    # the txt file that is finally generated
    ├── Music.txt    # from the music crawl; I later switched to movie reviews
    └── scrapy.cfg

    4 directories, 16 files

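Wolf Warrior 2 (《战狼2》) has been a hit lately, so that is the target: crawl its popular reviews at /subject/26363254/

One look at the number of reviews made my head spin, so as a first pass I decided to extract only the reviews rated above three stars.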

    $ cat items.py
    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # /en/latest/topics/items.html

    from scrapy import Item, Field


    # Movie review
    class MovieReviewItem(Item):
        review_movie = Field()    # movie the review belongs to
        review_title = Field()    # review title
        review_content = Field()  # review body
        review_author = Field()   # reviewer ID
        review_useful = Field()   # number of "useful" votes
        review_rating = Field()   # star rating
        review_time = Field()     # review time
        review_url = Field()      # review link

Same routine as always: next comes the settings file.

    BOT_NAME = 'douban_Music'

    SPIDER_MODULES = ['douban_Music.spiders']
    NEWSPIDER_MODULE = 'douban_Music.spiders'

    DOWNLOAD_DELAY = 3
    DEPTH_LIMIT = 4
    USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'

    ITEM_PIPELINES = {
        'douban_Music.pipelines.DoubanMusicPipeline': 300,
    }

One setting above is something I had never used before:

DEPTH_LIMIT = 4

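My understanding is that the full-review pages sit four link levels below /subject/26363254/, so the crawl follows links from that start page down to that depth. I was not sure whether this reading is right, but it matches how Scrapy defines the setting: DEPTH_LIMIT is the maximum depth the crawl may reach, where the start URL counts as depth 0, every followed link adds one, and requests deeper than the limit are dropped.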
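The effect of a depth limit can be sketched with a toy breadth-first walk. Everything here is an assumption for illustration: the LINKS graph is invented to mirror the shape of this crawl, crawl is a hypothetical helper rather than anything from Scrapy, and Scrapy's real scheduler does not necessarily visit pages in this order.

```python
from collections import deque

# Hypothetical link graph shaped like the crawl in this post.
LINKS = {
    "subject": ["reviews"],
    "reviews": ["reviews?sort=hotest"],
    "reviews?sort=hotest": ["reviews?sort=hotest&start=20"],
    "reviews?sort=hotest&start=20": ["review/1"],
    "review/1": ["review/2"],  # e.g. a link at the bottom of a review
    "review/2": [],
}

def crawl(start, depth_limit):
    """Breadth-first walk returning {page: depth}; deeper links are dropped."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in LINKS.get(page, []):
            depth = seen[page] + 1
            if nxt in seen or depth > depth_limit:
                continue
            seen[nxt] = depth
            queue.append(nxt)
    return seen

for page, depth in crawl("subject", depth_limit=4).items():
    print(depth, page)
```

With a limit of 4 the walk reaches review/1 at depth 4 and never requests review/2, which would be depth 5. The sketch does not model Scrapy's convention that DEPTH_LIMIT = 0 means no limit.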

    $ cat reviemspider.py
    #!/usr/bin/env python
    # coding=utf-8
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from douban_Music.items import MovieReviewItem


    class ReviewSpider(CrawlSpider):
        name = 'review'
        allowed_domains = ['']
        start_urls = ['/subject/26363254/']
        rules = (
            # Page after the start page: the review list
            Rule(LinkExtractor(allow=r"/subject/\d+/reviews$")),
            # Choose the "most popular" sort option
            Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=hotest$")),
            # Walk through the paginated list
            Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=hotest\&start=\d+$")),
            # Full-text review page
            Rule(LinkExtractor(allow=r"/review/\d+/$"), callback="parse_review", follow=True),
        )

        def parse_review(self, response):
            try:
                # Some reviews link to the author's other reviews at the bottom,
                # which the crawler would pick up by mistake, so check the movie name.
                movie_name = response.xpath('//*[@class="main-hd"]/a[2]/text()').extract()
                rating = response.xpath('//*[@property ="v:rating"]/text()').extract()
                name = "战狼2"
                print(movie_name[0])
                if movie_name[0] == name and int(rating[0]) > 3:
                    item = MovieReviewItem()
                    item['review_movie'] = "".join(movie_name)
                    item['review_title'] = "".join(response.xpath('//*[@property="v:summary"]/text()').extract())
                    # Only the first text node of the review body is taken here.
                    content = response.xpath('//*[@id="link-report"]/div[@property="v:description"]/text()').extract()[0]
                    item['review_rating'] = "".join(rating)
                    item['review_content'] = content.strip().replace("\n", " ")
                    item['review_author'] = "".join(response.xpath('//*[@property = "v:reviewer"]/text()').extract())
                    useful = "".join(response.xpath('//*[@class="main-panel-useful"]/button[1]/text()').extract())
                    item['review_useful'] = useful.strip().replace("\n", "")
                    item['review_time'] = "".join(response.xpath('//*[@property="v:dtreviewed"]/text()').extract())
                    item['review_url'] = response.url
                    yield item
                else:
                    print("Movie: {}\t Rating: {}".format(movie_name[0], rating[0]))
                    print("Review from a wrong link, skipping!")
            except Exception as error:
                # scrapy.log is a module, not a callable; use the spider's logger.
                self.logger.error(error)

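The relevant part of parse_review:

    # Some reviews link to the author's other reviews at the bottom,
    # which the crawler would pick up by mistake, so check the movie name.
    movie_name = response.xpath('//*[@class="main-hd"]/a[2]/text()').extract()
    rating = response.xpath('//*[@property ="v:rating"]/text()').extract()
    name = "战狼2"
    print(movie_name[0])
    if movie_name[0] == name and int(rating[0]) > 3: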

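Some reviews end with links to the same author's reviews of other films, so this check keeps the spider from saving reviews of other movies. Could the check be made before the page is even requested? I have not solved that yet.

Some reviews contain images or tags such as </pr>, and in those cases the review text is read incorrectly. Rather embarrassing. I will fill in these holes bit by bit.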
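One possible cause, offered as a guess rather than a diagnosis: the spider keeps only extract()[0], the first text node of the review body, so anything after an embedded image or line-break tag is lost. The sketch below shows the same effect with the standard library's ElementTree on an invented, well-formed snippet; it is an analogy, not Scrapy code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed review body with nested markup.
snippet = (
    '<div property="v:description">First paragraph.<br/>'
    'Second paragraph.<img src="poster.jpg"/> Closing remark.</div>'
)
node = ET.fromstring(snippet)

# Like extract()[0]: only the text before the first child element.
first_only = node.text

# Like joining every text node: nothing after <br/> or <img/> is lost.
full_text = "".join(node.itertext())

print(first_only)
print(full_text)
```

In the spider, the comparable change would be to join all of the strings returned by extract() for the review body instead of taking index 0, possibly selecting descendant text with //text() as well.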

    $ cat pipelines.py
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: /en/latest/topics/item-pipeline.html
    import os


    class DoubanMusicPipeline(object):
        def process_item(self, item, spider):
            # Append every review to Movie.txt in the current working directory.
            base_dir = os.getcwd()
            file_name = base_dir + '/Movie.txt'
            with open(file_name, 'a', encoding='utf-8') as f:
                f.write(item['review_movie'] + '\n')
                f.write(item['review_title'] + '\t')
                f.write(item['review_author'] + '\t')
                f.write(item['review_time'] + '\n')
                f.write(item['review_rating'] + '颗星\t')  # "颗星" means "stars"
                f.write(item['review_useful'] + '\n')
                # f.write(item['review_recommend'] + '\n')
                f.write(item['review_content'] + '\n')
                f.write(item['review_url'] + '\n\n')
            return item
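The pipeline above reopens Movie.txt for every item. A possible variant, sketched here under the assumption that the optional open_spider and close_spider pipeline hooks are used, opens the file once per crawl. The class name ReviewFilePipeline and its path argument are invented for illustration and are not part of the project.

```python
class ReviewFilePipeline:
    """Hypothetical variant: open the output file once per crawl."""

    def __init__(self, path="Movie.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # Same record layout as the original pipeline.
        self.file.write(item["review_movie"] + "\n")
        self.file.write(item["review_title"] + "\t")
        self.file.write(item["review_author"] + "\t")
        self.file.write(item["review_time"] + "\n")
        self.file.write(item["review_rating"] + "颗星\t")
        self.file.write(item["review_useful"] + "\n")
        self.file.write(item["review_content"] + "\n")
        self.file.write(item["review_url"] + "\n\n")
        return item
```

To try it, the class would have to be registered in ITEM_PIPELINES in place of DoubanMusicPipeline.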

That is the whole thing: a simple, bare-bones review crawl. I will add the remaining features over time.