Scrapy Framework

01. Scrapy links
02. The Scrapy crawling workflow
03. Getting started with Scrapy
04. Common settings in settings.py
    4.1. Using the logging module
    4.2. Common settings.py configuration in a Scrapy project (to be continued)
05. Qiushibaike spider example with Scrapy
06. scrapy.Request essentials
07. How the parse() method works
08. CrawlSpider: WeChat mini-program CrawlSpider example
09. Sending POST requests with Scrapy (Renren login example)
10. Douban login with Scrapy (captcha recognition), to be written
11. Downloading images and files in Scrapy (Autohome BMW 5 Series HD images)
12. Downloading images and files with CrawlSpider (Autohome BMW 5 Series HD images)
13. Downloader middleware: setting random request headers
14. [IP proxy middleware (Kuaidaili)](https://pan.baidu.com/s/1U6KnIFOYhS9NT7iXd4t84g)
15. Scrapy Shell
16. Tackling Boss Zhipin's anti-crawling measures (to be adjusted)
17. Crawling dynamic pages
    17.1. Installing Selenium
    17.2. Installing chromedriver
    17.3. A first small example
    17.4. Locating elements
    17.5. Operating form elements with Selenium
    17.6. Action chains
    17.7. Cookie operations
    17.8. Page waits
    17.9. Switching windows
    17.10. Using a proxy with Selenium; WebElement objects
18. Lagou spider with Selenium
19. Scrapy + Selenium: crawling the whole Jianshu site into MySQL
20. Setting a proxy and User-Agent in Selenium
21. [http://httpbin.org test endpoint explained](https://blog.csdn.net/chang995196962/article/details/91362364)

01. Scrapy links
Scrapy Chinese maintenance site | official Scrapy website
02. The Scrapy crawling workflow

- Scrapy Engine: the coordinator; passes data and signals between the other components (already implemented by Scrapy).
- Scheduler: a queue that stores the requests sent over by the engine (already implemented by Scrapy).
- Downloader: downloads the requests handed over by the engine and returns the responses to the engine (already implemented by Scrapy).
- Spider: processes the responses sent by the engine, extracts data and new URLs, and hands them back to the engine (you write this).
- Item Pipeline: processes the items passed along by the engine, for example storing them (you write this).
- Downloader Middlewares: customizable download extensions, such as setting proxies, request headers, or cookies.
- Spider Middlewares: customize requests and filter responses.
03. Getting started with Scrapy

- Install: `conda install scrapy`
- Create a Scrapy project: `scrapy startproject mySpider`
- Generate a spider: `scrapy genspider xiaofan xiaofan.com` (that is, `scrapy genspider <spider name> <allowed domain>`)
- Extract data: flesh out the spider using XPath and similar methods.
- Save data: persist the items in a pipeline.
- Run the spider from the command line: `scrapy crawl <spider name>`
- Run the spider from a script: create `start.py` in the project root and run it:

```python
from scrapy import cmdline

cmdline.execute("scrapy crawl qsbk_spider".split())
```

To run several Scrapy crawl jobs at once (reference post: "python爬虫scrapy之如何同时执行多个scrapy爬行任务"), a custom `crawlall` command can be invoked the same way (a sketch of the command module itself follows at the end of this section):

```python
from scrapy import cmdline

cmdline.execute("scrapy crawlall".split())
```

The simplest way for Scrapy to save scraped data is the `-o` option, which supports four output formats:

```bash
# JSON format, Unicode encoding by default
scrapy crawl itcast -o teachers.json
# JSON Lines format, Unicode encoding by default
scrapy crawl itcast -o teachers.jsonlines
# CSV, comma separated, can be opened with Excel
scrapy crawl itcast -o teachers.csv
# XML format
scrapy crawl itcast -o teachers.xml
```

Project structure screenshot and the roles of the main files (figure omitted).
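The `scrapy crawlall` command used above is not built in; it relies on a custom command module, registered later through `COMMANDS_MODULE` in settings.py. A minimal sketch of such a command, assuming a hypothetical `commands/crawlall.py` module inside the project package:

```python
# commands/crawlall.py -- hypothetical path; register the parent package via COMMANDS_MODULE
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return "Run every spider in the project"

    def run(self, args, opts):
        # schedule every registered spider, then start the crawler process once
        for spider_name in self.crawler_process.spider_loader.list():
            self.crawler_process.crawl(spider_name)
        self.crawler_process.start()
```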
04. Common settings in settings.py
4.1. Using the logging module
In a Scrapy project:

- Set `LOG_LEVEL = "WARNING"` in settings.py.
- Set `LOG_FILE = "./a.log"` in settings.py to choose where the log is saved; once set, the log output no longer appears in the terminal.
- `import logging`, instantiate a logger, and use that logger to write log output from any file.

In an ordinary (non-Scrapy) project:

- `import logging`
- `logging.basicConfig(...)`  # configure the log output format
- Instantiate a logger with `logging.getLogger(__name__)`.
- Call that logger from any .py file.
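As a small illustration of the pattern described above (a sketch, not from the original post; the spider name and URL are made up), a module-level logger inside a spider file:

```python
# any_spider.py -- minimal sketch of per-module logging in a Scrapy project
import logging

import scrapy

logger = logging.getLogger(__name__)


class DemoSpider(scrapy.Spider):
    name = "demo"                          # hypothetical spider name
    start_urls = ["http://example.com"]

    def parse(self, response):
        # with LOG_LEVEL = "WARNING", only warnings and above end up in the log
        logger.warning("parsed %s", response.url)
```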
4.2. Common settings.py configuration in a Scrapy project (to be continued)
```python
# 1. Imports
import logging
import datetime
import os

# 2. Project name  TODO: change per project
BOT_NAME = 'position_project'

# 3. Module names
SPIDER_MODULES = ['{}.spiders'.format(BOT_NAME)]
NEWSPIDER_MODULE = '{}.spiders'.format(BOT_NAME)

# 4. Obey the robots.txt rules (True by default)
ROBOTSTXT_OBEY = False

# 5. User agent (the browser type to impersonate)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 ' \
             'Safari/537.36'

# 6. Default request headers (USER_AGENT is configured separately)
DEFAULT_REQUEST_HEADERS = {
    'authority': 'www.zhipin.com',
    'method': 'GET',
    'path': '/c101010100/?query=python&page=1',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'cookie': '_uab_collina155192752626463196786582; lastCity101010100; _bl_uidnCk6U2X3qyL0knn41r97gqj6tbaI; __c1577356639; __g-; __ll%2Fwww.zhipin.com%2Fweb%2Fcommon%2Fsecurity-check.html%3Fseed%3D4xwicvOb7q2EkZGCt80nTLZ0vDg%252BzlibDrgh%252F8ybn%252BU%253D%26name%3D89ea5a4b%26ts%3D1577356638307%26callbackUrl%3D%252Fc101010100%252F%253Fquery%253Dpython%2526page%253D1%26srcReferer%3Drfriend_source0friend_source0; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a1577356640; toUrlhttps%3A%2F%2Fwww.zhipin.com%2Fc101010100%2F%3Fquery%3Dpython%26page%3D1%26ka%3Dpage-1; __a29781409.1551927520.1573210066.1577356639.145.7.53.84; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a1577413477; __zp_stoken__7afdOJ%2Bdzh7nyTlE0EwBT40ChjblHK0zWyGrgNKjNseeImeToJrFVjotrvwrJmc4SAz4ALJJLFiwM6VXR8%2FhRZvbdbnbdscb5I9tbPbE0vSsxADMIDYNDK7qJTzOfZJNR7%2BP',
    'referer': 'https://www.zhipin.com/c101010100/?query=python&page=1',
}

# 7. Log output format: one log file per minute
time_str = datetime.datetime.strftime(datetime.datetime.now(), '%Y-%m-%d %H-%M')
LOG_FILE = '{}\\{}\\logs\\{}.log'.format(os.getcwd(), BOT_NAME, time_str)
LOG_LEVEL = 'DEBUG'

# 8. Custom command module for running several spiders at once
COMMANDS_MODULE = '{}.commands'.format(BOT_NAME)

# 9. Show Chinese correctly in the JSON files Scrapy exports
# https://www.cnblogs.com/linkr/p/7995454.html
FEED_EXPORT_ENCODING = 'utf-8'

# 10. Pipelines: the smaller the value, the earlier the item passes through that pipeline  TODO: change per project
ITEM_PIPELINES = {
    '{}.pipelines.PositionProjectPipeline'.format(BOT_NAME): 300,
}

# 11. Throttle the crawl speed (in seconds)
DOWNLOAD_DELAY = 1

# 12. Downloader middlewares  TODO: change per project
DOWNLOADER_MIDDLEWARES = {
    '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
}

# 13. Disable cookies
COOKIES_ENABLED = False
```

05. Qiushibaike spider example with Scrapy
qsbk_spider.py
```python
# -*- coding: utf-8 -*-
import scrapy

from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = 'https://www.qiushibaike.com'

    def parse(self, response):
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        for duanzidiv in duanzidivs:
            author = duanzidiv.xpath('.//h2/text()').extract_first().strip()
            content = duanzidiv.xpath('.//div[@class="content"]//text()').extract()
            item = QsbkItem(author=author, content=content)
            yield item
        # crawl the next page
        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if not next_url:
            return
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)
```
items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
```
pipelines.py (basic version)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open('duanzi.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('spider started...')

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), indent=4, ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('spider finished...')
```
pipelines.py (advanced version 1, more memory-hungry)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8', indent=4)
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print('spider started...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print('spider finished...')
```
pipelines.py (advanced version 2)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('spider started...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('spider finished...')
```
Exporting to a CSV file
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import CsvItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open('qsbk.csv', 'wb')
        self.exporter = CsvItemExporter(self.fp, encoding='utf-8')

    def open_spider(self, spider):
        print('spider started...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('spider finished...')
        self.fp.close()
```

06. scrapy.Request essentials
Partial source of the Request class:
```python
# partial code
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta
```

The most commonly used parameters:
- url: the URL to request and process next.
- callback: the function that will handle the Response returned by this request.
- method: usually does not need to be specified (defaults to GET); may be set to GET, POST, PUT, etc., and must be an uppercase string.
- headers: the request headers. Usually not needed. They typically look like this:

      # anyone who has written a crawler will recognize these
      Host: media.readthedocs.org
      User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
      Accept: text/css,*/*;q=0.1
      Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
      Accept-Encoding: gzip, deflate
      Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
      Cookie: _ga=GA1.2.1612165614.1415584110;
      Connection: keep-alive
      If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
      Cache-Control: max-age=0

- meta: a dict, commonly used to pass data between different requests:

      request_with_cookies = Request(
          url="http://www.example.com",
          cookies={'currency': 'USD', 'country': 'UY'},
          meta={'dont_merge_cookies': True},
      )

- encoding: the default utf-8 is fine.
- dont_filter: tells the scheduler not to de-duplicate this request. Useful when you want to issue the same request several times and ignore the duplicates filter. Defaults to False.
- errback: the error-handling callback.
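Since meta is the usual way to carry data between callbacks, here is a small illustrative sketch (not from the original post; the spider name, item fields, and URLs are made up):

```python
import scrapy


class MetaDemoSpider(scrapy.Spider):
    name = "meta_demo"                        # hypothetical spider
    start_urls = ["http://example.com/list"]

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            # stash the list-page title so the detail callback can reuse it
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_detail,
                meta={"list_title": response.xpath('//title/text()').get()},
            )

    def parse_detail(self, response):
        # read back the value stored on the request
        yield {"list_title": response.meta["list_title"], "url": response.url}
```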
07. How the parse() method works

Because parse() uses yield rather than return, it behaves as a generator. Scrapy pulls the results produced in parse() one by one and checks their type: a Request is added to the crawl queue, an Item is handed to the pipelines, and anything else raises an error. When Scrapy takes the first batch of requests, it does not send them immediately; it puts them in the queue and keeps pulling from the generator until the requests of that first part are exhausted, then pulls the second part's items, and each item goes to the corresponding pipeline. parse() is assigned as the callback of a Request (scrapy.Request(url, callback=self.parse)), so it is the method designated to handle those requests. Once a Request has been scheduled and executed, it produces a scrapy.http.Response object that is sent back to parse(), and this repeats, recursively in effect, until the scheduler holds no more requests. When the generator is exhausted, parse() finishes, and the engine carries out the corresponding operations based on the queue and the pipelines. Before extracting the items of a given page, the program first works through all the requests already in the queue, and only then extracts the items. The Scrapy engine and scheduler take care of all of this.
08. CrawlSpider
Creation command: `scrapy genspider -t crawl <spider name> <domain>`
WeChat mini-program CrawlSpider example
```python
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2']

    rules = (
        # rule for the list pages: follow their detail links, no parsing needed here
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=1'), follow=True),
        # rule for the detail pages: parse them, do not follow further (avoids duplicates)
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath('//div[@class="cl"]/h1/text()').get()
        item = WxappItem(title=title)
        return item
```
Note: never set callback to 'parse'. To stress it again: CrawlSpider implements its own logic in the parse method, so if you override parse, the CrawlSpider will break.
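The spider above imports WxappItem from wxapp.items; the items file is not shown in the original, but inferred from the single title field used above it would look roughly like this (a sketch):

```python
# wxapp/items.py -- minimal sketch inferred from the spider above
import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
```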
09. Sending POST requests with Scrapy (Renren login example)
You can send a POST request with yield scrapy.FormRequest(url, formdata, callback). If you want the very first request to be a POST, override the spider's start_requests(self) method so that the URLs in start_urls are no longer used.
```python
# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        """Override start_requests to log in to Renren."""
        url = 'http://www.renren.com/PLogin.do'
        data = {'email': '594042358@qq.com', 'password': 'fanjianhaiabc123'}
        # POST requests are sent with FormRequest; here it simulates the login
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        """After a successful login, visit the personal profile page."""
        # GET request for the profile page
        request = scrapy.Request(url='http://www.renren.com/446858319/profile', callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open('profile.html', 'w', encoding='utf-8') as fp:
            fp.write(response.text)
```

10. Douban login with Scrapy (captcha recognition), to be written
11. Downloading images and files in Scrapy (Autohome BMW 5 Series HD images)
Approach 1: the traditional download method. bmw5_spider.py
```python
# -*- coding: utf-8 -*-
import scrapy

from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(scrapy.Spider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            print(category)
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = Bmw5Item(category=category, urls=urls)
            yield item
```
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request


class Bmw5Pipeline(object):
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
```
items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class Bmw5Item(scrapy.Item):
    category = scrapy.Field()
    urls = scrapy.Field()
```

Approach 2: the built-in Images Pipeline for downloading images:

- Define an Item with two fields, image_urls and images. image_urls holds the list of image URLs to download.
- When the download finishes, the download details (local path, source URL, image checksum, etc.) are stored in the item's images field.
- In settings.py, configure IMAGES_STORE, the directory the images are downloaded to.
- Enable the pipeline by adding 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES.
The Files Pipeline for downloading files works the same way (a settings sketch follows after this list):

- Define an Item with two fields, file_urls and files. file_urls holds the list of file URLs to download.
- When the download finishes, the download details (local path, source URL, checksum, etc.) are stored in the item's files field.
- In settings.py, configure FILES_STORE, the directory the files are downloaded to.
- Enable the pipeline by adding 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.
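Putting the two checklists above together, the relevant settings.py entries would look roughly like this (a sketch; the images/files directory layout is an assumption, not from the original post):

```python
# settings.py -- sketch of enabling the built-in media pipelines
import os

# directory where ImagesPipeline saves downloaded images
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
# directory where FilesPipeline saves downloaded files
FILES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'files')

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # 'scrapy.pipelines.files.FilesPipeline': 1,
}
```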
Customizing the Images Pipeline. bmw5_spider.py
```python
# -*- coding: utf-8 -*-
import scrapy

from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(scrapy.Spider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            print(category)
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = Bmw5Item(category=category, image_urls=urls)
            yield item
```
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os

from scrapy.pipelines.images import ImagesPipeline

from bmw5.settings import IMAGES_STORE


class BMWImagesPipeline(ImagesPipeline):
    """Custom image download pipeline."""

    def get_media_requests(self, item, info):
        # called before the download requests are sent;
        # this method is what actually issues the download requests
        request_objs = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_obj in request_objs:
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None):
        # called when the image is about to be stored, to decide its storage path
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        # fetch the category attached in get_media_requests
        category = request.item['category']
        image_store = IMAGES_STORE
        category_path = os.path.join(image_store, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        image_name = path.replace('full/', '')
        image_path = os.path.join(category_path, image_name)
        return image_path
```
items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class Bmw5Item(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
```

settings.py

```python
ITEM_PIPELINES = {
    # 'bmw5.pipelines.Bmw5Pipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'bmw5.pipelines.BMWImagesPipeline': 1,
}
```

12. Downloading images and files with CrawlSpider (Autohome BMW 5 Series HD images)
```python
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(CrawlSpider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    rules = {
        Rule(LinkExtractor(allow=r'https://car.autohome.com.cn/pic/series/65.+'), callback='parse_page', follow=True),
    }

    def parse_page(self, response):
        category = response.xpath('//div[@class="uibox"]/div/text()').get()
        srcs = response.xpath('//div[contains(@class, "uibox-con")]/ul/li//img/@src').getall()
        # swap the thumbnail path segment for the high-resolution one
        srcs = list(map(lambda url: response.urljoin(url.replace('240x180_0_q95_c42', '1024x0_1_q95')), srcs))
        item = Bmw5Item(category=category, image_urls=srcs)
        yield item
```
13. Downloader middleware: setting random request headers
Setting random request headers for Chrome, Firefox and Safari; see the User-Agent string list (reference link).
httpbin.py
```python
# -*- coding: utf-8 -*-
import scrapy
import json


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        useragent = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(useragent)
        print('=' * 30)
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
```
middlewares.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random


class UserAgentDownloadMiddleware(object):
    USER_AGENTS = [
        'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420 (KHTML, like Gecko)',
        'Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)',
        'Mozilla/4.0 (compatible; MSIE 7.0; America Online Browser 1.1; Windows NT 5.1; (R1 1.5); .NET CLR 2.0.50727; InfoPath.1)',
    ]

    def process_request(self, request, spider):
        """Called by the downloader before each request is sent.

        Typically used to set a random proxy IP, request headers, and so on.
        request: the Request object about to be sent.
        spider: the spider that issued the request.
        Return values:
        1. None: Scrapy keeps processing this request and runs the other middlewares.
        2. A Response object: Scrapy will not call the other process_request methods
           and returns this response directly; the process_response() methods of the
           activated middlewares are still called for it.
        3. A Request object: the original request is dropped and the returned one is
           downloaded instead.
        4. If this method raises an exception, process_exception is called.
        """
        useragent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = useragent
```
middlewares.py, improved version. Note: USER_AGENT_LIST has been pulled out into its own module; see the random request header setup below.
```python
import random

from position_project.conf.user_agent import USER_AGENT_LIST


class RandomUserAgent(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
```

14. IP proxy middleware (Kuaidaili)
Open proxies:

```python
class IPProxyDownloadMiddleware(object):
    """Open proxies (not free proxies)."""

    PROXIES = ['178.44.170.152:8080', '110.44.113.182:8000']

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
```

Dedicated (exclusive) proxy:
```python
import base64


class IPProxyDownloadMiddleware(object):
    """Dedicated (paid, exclusive) proxy."""

    def process_request(self, request, spider):
        proxy = '121.199.6.124:16816'
        user_password = '970138074:rcdj35ur'
        request.meta['proxy'] = proxy
        # must be bytes before base64-encoding
        b64_user_password = base64.b64encode(user_password.encode('utf-8'))
        request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_password.decode('utf-8')
```
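To actually use either proxy middleware, it still has to be registered in settings.py; a minimal sketch (the `myproject` module path is an assumption based on a default Scrapy project layout):

```python
# settings.py -- sketch of enabling the proxy middleware
DOWNLOADER_MIDDLEWARES = {
    # 'myproject' and the class name must match your own project/middleware
    'myproject.middlewares.IPProxyDownloadMiddleware': 100,
}
```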
15. Scrapy Shell

Launch command:

```bash
scrapy shell http://www.itcast.cn/channel/teacher.shtml
```

Inside the shell you can write XPath expressions against the response object to debug your selectors.
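For example, a short debugging session might look like this (illustrative, not from the original post; the second XPath assumes the teacher list on that page sits in div.li_txt blocks, which may have changed):

```python
# inside `scrapy shell <url>`, a `response` object is already available
>>> response.url
'http://www.itcast.cn/channel/teacher.shtml'
>>> response.xpath('//title/text()').get()
>>> response.xpath('//div[@class="li_txt"]/h3/text()').getall()
```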
16. Tackling Boss Zhipin's anti-crawling measures (to be adjusted)
spiders.py
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from boss.items import BossItem


class ZhipingSpider(CrawlSpider):
    name = 'zhipin'
    allowed_domains = ['zhipin.com']
    start_urls = ['http://www.zhipin.com/c101010100/?query=python&page=1']

    rules = (
        # rule for list pages, e.g. https://www.zhipin.com/c101010100/?query=python&page=1
        Rule(LinkExtractor(allow=r'.+\?query=python&page=\d'), follow=True),
        # rule for detail pages
        Rule(LinkExtractor(allow=r'.+job_detail/.+\.html'), callback='parse_job', follow=False),
    )

    def parse_job(self, response):
        print('*' * 100)
        name = response.xpath('//div[@class="name"]/h1/text()').get()
        salary = response.xpath('//div[@class="name"]/span[@class="salary"]/text()').get()
        job_info = response.xpath('//div[@class="job-sec"]//text()').getall()
        job_info = list(map(lambda x: x.strip(), job_info))
        job_info = ''.join(job_info)
        job_info = job_info.strip()
        print(job_info)
        item = BossItem(name=name, salary=salary, job_info=job_info)
        yield item
```
settings.py
```python
DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'cookie': '_uab_collina155192752626463196786582; lastCity101010100; __c1565492379; toUrl/; __zp_stoken__a32dy4M8VTtvU41ADf0l5K0oReZKFror7%2F2qFAGN5RbBdirT9P%2F2zhugmroLb2ZzmyLVH7BYC%2B3ELS5F05bZCcNIRA%3D%3D; sidsem; __gsem; __ll%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_titlerhttps%3A%2F%2Fsp0.baidu.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc.php%3Ft%3D06KL00c00fDIFkY0IWPB0KZEgsAN9DqI00000Kd7ZNC00000LI-XKC.THdBULP1doZA80K85yF9pywdpAqVuNqsusK15ynsmWIWry79nj0snynYPvD0IHY3rjm3nDcswWDzPHwaP1RYPRPAPjN7PRPafRfYwD77nsK95gTqFhdWpyfqn1czPjmsPjnYrausThqbpyfqnHm0uHdCIZwsT1CEQLILIz4lpA-spy38mvqVQ1q1pyfqTvNVgLKlgvFbTAPxuA71ULNxIA-YUAR0mLFW5Hb4rHf%26tpl%3Dtpl_11534_19713_15764%26l%3D1511867677%26attach%3Dlocation%253D%2526linkName%253D%2525E6%2525A0%252587%2525E5%252587%252586%2525E5%2525A4%2525B4%2525E9%252583%2525A8-%2525E6%2525A0%252587%2525E9%2525A2%252598-%2525E4%2525B8%2525BB%2525E6%2525A0%252587%2525E9%2525A2%252598%2526linkText%253DBoss%2525E7%25259B%2525B4%2525E8%252581%252598%2525E2%252580%252594%2525E2%252580%252594%2525E6%252589%2525BE%2525E5%2525B7%2525A5%2525E4%2525BD%25259C%2525EF%2525BC%25258C%2525E6%252588%252591%2525E8%2525A6%252581%2525E8%2525B7%25259F%2525E8%252580%252581%2525E6%25259D%2525BF%2525E8%2525B0%252588%2525EF%2525BC%252581%2526xp%253Did(%252522m3224604348_canvas%252522)%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FH2%25255B1%25255D%25252FA%25255B1%25255D%2526linkType%253D%2526checksum%253D8%26wd%3Dboss%25E7%259B%25B4%25E8%2581%2598%26issp%3D1%26f%3D8%26ie%3Dutf-8%26rqlang%3Dcn%26tn%3Dbaiduhome_pg%26inputT%3D3169g%2Fwww.zhipin.com%2Fuser%2Fsem7.html%3Fsid%3Dsem%26qudao%3Dbaidu3%26plan%3DPC-%25E9%2580%259A%25E7%2594%25A8%25E8%25AF%258D%26unit%3DPC-zhaopin-hexin%26keyword%3Dboss%25E7%259B%25B4%25E8%2581%2598%25E4%25BC%2581%25E4%25B8%259A%25E6%258B%259B%25E8%2581%2598; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a1565493077,1565494665,1565494677,1565504545; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a1565505516; __a29781409.1551927520.1553506739.1565492379.86.5.40.25',
}
```

17. Crawling dynamic pages
Two approaches: analyze the AJAX endpoints the page calls and request those endpoints directly from code, or use Selenium + chromedriver to simulate browser behaviour and take the rendered data. Common Selenium operations are covered below; see also the Selenium Python (Chinese) documentation link.
17.1. Installing Selenium
conda install selenium
17.2. Installing chromedriver
Download link: after downloading, put chromedriver in a pure-English directory that does not require special permissions. Note that the chromedriver version must match the browser version; a 64-bit browser can also use the 32-bit driver.
17.3. A first small example
```python
from selenium import webdriver
import time

driver_path = r'D:\chromedriver\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
driver.get('http://www.baidu.com')
# page_source gives you the rendered HTML of the page
print(driver.page_source)
time.sleep(3)
driver.close()
```

17.4. Locating elements
If you only want to parse data out of the page, hand the page source to lxml instead: lxml is implemented in C, so it parses faster. If you need to interact with elements, such as typing into an input box or clicking a button, you must use the element-lookup methods that Selenium provides.
```python
from selenium import webdriver
import time
from lxml import etree

driver_path = r'D:\chromedriver\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
driver.get('http://www.baidu.com')
# page_source gives you the rendered HTML of the page
print(driver.page_source)

# inputTag = driver.find_element_by_id('kw')
# inputTag = driver.find_element_by_name('wd')
# inputTag = driver.find_element_by_class_name('s_ipt')
inputTag = driver.find_element_by_xpath('//input[@class="s_ipt"]')
inputTag.send_keys('迪丽热巴')

htmlE = etree.HTML(driver.page_source)
print(htmlE)
time.sleep(3)
driver.close()
```
17.5. Operating form elements with Selenium
Working with text inputs:

```python
inputTag = driver.find_element_by_xpath('//input[@class="s_ipt"]')
inputTag.send_keys('迪丽热巴')
time.sleep(3)
inputTag.clear()
```

Working with checkboxes:

```python
inputTag = driver.find_element_by_name('remember')
inputTag.click()
```

Working with select drop-downs and buttons: see the sketch below.
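The original post leaves the select and button operations empty; a minimal sketch of both, assuming a page with a hypothetical select named 'jumpMenu' and reusing Baidu's 'su' submit button from the earlier examples:

```python
from selenium.webdriver.support.ui import Select

# a <select> element has to be wrapped in Select before it can be driven
selectTag = Select(driver.find_element_by_name('jumpMenu'))        # hypothetical element name
selectTag.select_by_index(1)                                       # select by position
selectTag.select_by_value('http://www.example.com')                # select by the option's value attribute
selectTag.select_by_visible_text('示例')                            # select by the visible text
selectTag.deselect_all()                                           # clear selections (multi-select only)

# buttons are simply clicked
submitBtn = driver.find_element_by_id('su')
submitBtn.click()
```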
17.6. Action chains
```python
from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import ActionChains

driver_path = r'D:\chromedriver\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
driver.get('http://www.baidu.com')

inputTag = driver.find_element_by_xpath('//input[@class="s_ipt"]')
submitBtn = driver.find_element_by_id('su')

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag, '黄渤')
actions.move_to_element(submitBtn)
actions.click(submitBtn)
actions.perform()

time.sleep(6)
inputTag.clear()
driver.close()
```
17.7. Cookie operations
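The original post leaves this section empty; as a rough sketch, these are the usual cookie calls on a Selenium driver (the 'PSTM' cookie name is just a hypothetical example):

```python
# assumes `driver` has already loaded a page, e.g. driver.get('http://www.baidu.com')
for cookie in driver.get_cookies():        # every cookie for the current domain
    print(cookie)

print(driver.get_cookie('PSTM'))           # one cookie by name (hypothetical name)
driver.delete_cookie('PSTM')               # delete a single cookie
driver.delete_all_cookies()                # delete them all
driver.add_cookie({'name': 'test', 'value': 'hello'})  # add a cookie for the current domain
```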
17.8. Page waits
Implicit wait:

```python
driver.implicitly_wait(10)
```

Explicit wait:
```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver_path = r'D:\chromedriver\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
driver.get('http://www.douban.com')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'app-title'))
)
print(element)
```
17.9. Switching windows
```python
from selenium import webdriver

driver_path = r'D:\chromedriver\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
driver.get('http://www.jd.com')

driver.execute_script("window.open('https://www.douban.com/')")
print(driver.window_handles)
driver.switch_to.window(driver.window_handles[1])
print(driver.current_url)
```
17.10. Using a proxy with Selenium
```python
from selenium import webdriver

driver_path = r'D:\chromedriver\chromedriver.exe'

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://60.17.239.207:31032')
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
driver.get('http://www.jd.com')
```

WebElement objects: see the sketch below.
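The WebElement section has no body in the original; as a short sketch, these are the attributes and methods you usually reach for on a located element (reusing Baidu's 'su' button from the earlier examples):

```python
# assumes `driver` has loaded http://www.baidu.com as in the examples above
submitBtn = driver.find_element_by_id('su')
print(submitBtn.get_attribute('value'))   # an attribute of the element, e.g. the button label
print(submitBtn.tag_name)                 # the tag name, e.g. 'input'
print(submitBtn.location)                 # position on the page
print(submitBtn.size)                     # width and height
driver.save_screenshot('baidu.png')       # take a screenshot of the whole page via the driver
```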
18. Lagou spider with Selenium
Traditional approach (plain requests):
```python
import requests
from lxml import etree
import time
import re

# request headers
HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Host': 'www.lagou.com',
}


def request_list_page():
    url1 = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    # `data` controls the pagination
    for page in range(1, 2):
        data = {
            'first': 'false',
            'pn': page,
            'kd': 'python'
        }
        s = requests.Session()  # open a session
        response = s.get(url=url1, headers=HEADERS, timeout=3)
        cookie = s.cookies  # grab the cookies
        respon = s.post(url=url, headers=HEADERS, data=data, cookies=cookie, timeout=3)
        time.sleep(7)
        result = respon.json()
        positions = result['content']['positionResult']['result']
        for position in positions:
            positionId = position['positionId']
            position_url = 'https://www.lagou.com/jobs/{}.html'.format(positionId)
            parse_position_detail(position_url, s)
            break


def parse_position_detail(url, s):
    response = s.get(url, headers=HEADERS)
    text = response.text
    htmlE = etree.HTML(text)
    position_name = htmlE.xpath('//div[@class="job-name"]/@title')[0]
    job_request_spans = htmlE.xpath('//dd[@class="job_request"]//span')
    salary = job_request_spans[0].xpath('./text()')[0].strip()
    education = job_request_spans[3].xpath('./text()')[0]
    education = re.sub(r'[/ \s]', '', education)
    print(education)
    job_detail = htmlE.xpath('//div[@class="job-detail"]//text()')
    job_detail = ''.join(job_detail).strip()
    print(job_detail)


if __name__ == '__main__':
    request_list_page()
```
Selenium + chromedriver approach:
```python
import re
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class LagouSpider(object):
    """Lagou spider driven by Selenium + chromedriver."""

    driver_path = r'D:\chromedriver\chromedriver.exe'

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=LagouSpider.driver_path)
        # this URL is not the real endpoint that serves the job data
        self.url = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
        # collected job information
        self.positions = []

    def run(self):
        self.driver.get(self.url)
        while True:
            WebDriverWait(self.driver, 10).until(
                # only an element can be awaited here, not one of its attributes
                EC.presence_of_element_located((By.XPATH, '//div[@class="pager_container"]/span[last()]'))
            )
            source = self.driver.page_source
            self.parse_list_page(source)
            next_btn = self.driver.find_element_by_xpath('//div[@class="pager_container"]/span[last()]')
            if 'pager_next_disabled' in next_btn.get_attribute('class'):
                break
            else:
                next_btn.click()

    def parse_list_page(self, source):
        htmlE = etree.HTML(source)
        links = htmlE.xpath('//a[@class="position_link"]/@href')
        for link in links:
            self.request_detail_page(link)
            time.sleep(1)

    def request_detail_page(self, url):
        # self.driver.get(url)
        self.driver.execute_script("window.open('{}')".format(url))
        self.driver.switch_to.window(self.driver.window_handles[1])
        WebDriverWait(self.driver, 10).until(
            # EC.presence_of_element_located((By.XPATH, '//div[@class="job-name"]/@title'))
            # again: only an element can be awaited, not an attribute of it
            EC.presence_of_element_located((By.XPATH, '//div[@class="job-name"]'))
        )
        page_srouce = self.driver.page_source
        self.parse_detail_page(page_srouce)
        # close the detail tab
        self.driver.close()
        # switch back to the job list tab
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        htmlE = etree.HTML(source)
        position_name = htmlE.xpath('//div[@class="job-name"]/h2/text()')[0]
        company = htmlE.xpath('//div[@class="job-name"]/h4/text()')[0]
        job_request_spans = htmlE.xpath('//dd[@class="job_request"]//span')
        salary = job_request_spans[0].xpath('./text()')[0].strip()
        salary = re.sub(r'[/ \s]', '', salary)
        city = job_request_spans[1].xpath('./text()')[0].strip()
        city = re.sub(r'[/ \s]', '', city)
        experience = job_request_spans[2].xpath('./text()')[0].strip()
        experience = re.sub(r'[/ \s]', '', experience)
        education = job_request_spans[3].xpath('./text()')[0]
        education = re.sub(r'[/ \s]', '', education)
        type = job_request_spans[4].xpath('./text()')[0]
        type = re.sub(r'[/ \s]', '', type)
        job_detail = htmlE.xpath('//div[@class="job-detail"]//text()')
        job_detail = ''.join(job_detail).strip()
        print('position: %s' % position_name)
        print('company: %s' % company)
        print()
        print(salary + ' / ' + city + ' / ' + experience + ' / ' + education + ' / ' + type)
        print()
        print(job_detail)
        position = {
            'name': position_name,
            'company': company,
            'salary': salary,
            'city': city,
            'experience': experience,
            'education': education,
            'desc': job_detail
        }
        self.positions.append(position)
        # print(position)
        print('=' * 100)


if __name__ == '__main__':
    spider = LagouSpider()
    spider.run()
```
19. Scrapy + Selenium: crawling the whole Jianshu site into MySQL
Note: this site currently uses CSS obfuscation.
items.py (target field class)
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class JianshuProjectItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
```
settings.py
```python
# 1. Imports
import logging
import datetime
import os

# 2. Project name  TODO: change per project
BOT_NAME = 'jianshu_project'

# 3. Module names
SPIDER_MODULES = ['{}.spiders'.format(BOT_NAME)]
NEWSPIDER_MODULE = '{}.spiders'.format(BOT_NAME)

# 4. Obey the robots.txt rules (True by default)
ROBOTSTXT_OBEY = False

# 5. User agent (the browser type to impersonate)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 ' \
             'Safari/537.36'

# 6. Default request headers (USER_AGENT is configured separately)
DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
}

# 7. Log output format: one log file per minute
time_str = datetime.datetime.strftime(datetime.datetime.now(), '%Y-%m-%d %H-%M')
LOG_FILE = '{}\\{}\\logs\\{}.log'.format(os.getcwd(), BOT_NAME, time_str)
LOG_LEVEL = 'DEBUG'

# 8. Custom command module for running several spiders at once
COMMANDS_MODULE = '{}.commands'.format(BOT_NAME)

# 9. Show Chinese correctly in the JSON files Scrapy exports
# https://www.cnblogs.com/linkr/p/7995454.html
FEED_EXPORT_ENCODING = 'utf-8'

# 10. Pipelines: the smaller the value, the earlier the item passes through that pipeline  TODO: change per project
ITEM_PIPELINES = {
    # '{}.pipelines.JianshuProjectPipeline'.format(BOT_NAME): 300,
    '{}.pipelines.JianshuTwistedPipeline'.format(BOT_NAME): 300,
}

# 11. Throttle the crawl speed (in seconds)
DOWNLOAD_DELAY = 1

# 12. Downloader middlewares  TODO: change per project
DOWNLOADER_MIDDLEWARES = {
    '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
    '{}.middlewares.SeleniumDownloadMiddleware'.format(BOT_NAME): 2,
}

# 13. Disable cookies
COOKIES_ENABLED = False
```

The spider class:
```python
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu_project.items import JianshuProjectItem


class JianshuSpider(CrawlSpider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath('//div[@id="__next"]/div[1]/div/div/section[1]/h1/text()').extract_first()
        avatar = response.xpath('//div[@class="_2mYfmT"]//a[@class="_1OhGeD"]/img/@src').extract_first()
        author = response.xpath('//span[@class="FxYr8x"]/a/text()').extract_first()
        pub_time = response.xpath('//div[@id="__next"]/div[1]/div/div/section[1]/div[1]/div/div/div[2]/time/text()').extract_first()
        url = response.url
        url1 = url.split('?')[0]
        article_id = url1.split('/')[-1]
        content = response.xpath('//article[@class="_2rhmJa"]').extract_first()
        item = JianshuProjectItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=url,
            article_id=article_id,
            content=content
        )
        yield item
```
pipelines.py
```python
import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi


class JianshuProjectPipeline(object):
    """Synchronous inserts."""

    def __init__(self):
        dbparams = {
            'host': 'mini1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'db_jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        print('*' * 300)
        print(item)
        self.cursor.execute(self.sql, (item['title'], item['content'],
                                       item['author'], item['avatar'],
                                       item['pub_time'], item['article_id'],
                                       item['origin_url']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """insert into tb_article (id,title,content,author,avatar,pub_time,article_id,origin_url)
                           values(null,%s,%s,%s,%s,%s,%s,%s)"""
            return self._sql
        return self._sql


class JianshuTwistedPipeline(object):
    """Asynchronous inserts through the Twisted connection pool."""

    def __init__(self):
        dbparams = {
            'host': 'mini1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'db_jianshu',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    @property
    def sql(self):
        if not self._sql:
            self._sql = """insert into tb_article (id,title,content,author,avatar,pub_time,article_id,origin_url)
                           values(null,%s,%s,%s,%s,%s,%s,%s)"""
            return self._sql
        return self._sql

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)
        return item

    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (item['title'], item['content'],
                                  item['author'], item['avatar'],
                                  item['pub_time'], item['article_id'],
                                  item['origin_url']))

    def handle_error(self, error, item, spider):
        print('*' * 100)
        print('error:', error)
        print('*' * 100)
```
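Both pipelines above assume a tb_article table already exists in db_jianshu. Its exact schema is not shown in the original post; a plausible minimal version, created here through pymysql so the example stays in Python (column names mirror the insert statement, column types are assumptions):

```python
import pymysql

# connection parameters mirror the dbparams used in the pipelines above
conn = pymysql.connect(host='mini1', port=3306, user='root',
                       password='123456', database='db_jianshu', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        create table if not exists tb_article (
            id int primary key auto_increment,
            title varchar(255),
            content longtext,
            author varchar(255),
            avatar varchar(255),
            pub_time varchar(64),
            article_id varchar(32),
            origin_url varchar(255)
        ) default charset=utf8
    """)
conn.commit()
conn.close()
```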
middlewares.py
```python
import random
import time

from jianshu_project.conf.user_agent import USER_AGENT_LIST
from scrapy.http.response.html import HtmlResponse
from selenium import webdriver


class RandomUserAgent(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'D:\chromedriver\chromedriver.exe')

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(1)
        try:
            while True:
                # keep clicking "load more" until it disappears
                loadMore = self.driver.find_element_by_class_name('load-more')
                loadMore.click()
                time.sleep(0.3)
                if not loadMore:
                    break
        except Exception as e:
            pass
        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response
```
The remaining missing pieces are all covered in the earlier sections.
20. Setting a proxy and User-Agent in Selenium
```python
import random
import time

from useragent_demo.conf.user_agent import USER_AGENT_LIST
from scrapy.http.response.html import HtmlResponse
from selenium import webdriver


class RandomUserAgent(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


class SeleniumDownloadMiddleware1(object):
    def process_request(self, request, spider):
        options = webdriver.ChromeOptions()
        options.add_argument('user-agent={}'.format(request.headers['User-Agent']))  # set a random request header
        # options.add_argument('--proxy-server={}'.format(request.headers['proxy']))  # set a proxy
        driver = webdriver.Chrome(chrome_options=options, executable_path=r'D:\chromedriver\chromedriver.exe')
        driver.get(request.url)
        source = driver.page_source
        response = HtmlResponse(url=driver.current_url, body=source, request=request, encoding='utf-8')
        driver.close()
        return response


class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'D:\chromedriver\chromedriver.exe')
        self.options = self.driver.create_options()

    def process_request(self, request, spider):
        # note: arguments added after the driver has started do not take effect;
        # prefer the version above, which builds the options before creating the driver
        self.options.add_argument('user-agent={}'.format(request.headers['User-Agent']))
        self.driver.get(request.url)
        time.sleep(1)
        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response
```
21. [http://httpbin.org test endpoint explained](https://blog.csdn.net/chang995196962/article/details/91362364)