一木.溪桥 Learns Web Scraping, Part 03: The urllib Request Module (urllib.request, urllib.parse.urlencode, urllib.parse.quote(str), urllib.parse.unquote())

一木.溪桥 is learning web scraping with Jerry at Logic Education

Session 07: Python Web Scraping

Date: January 26

Learning objectives:

The urllib request module

urllib.request

urllib.parse.urlencode

urllib.parse.quote(str)

parse.unquote()

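urllib POST example

Course content:

Request modules for web scraping

urllib

Why learn urllib?

Some older scraping projects are built on urllib.

When writing a scraper, we sometimes need to use requests and urllib together.

It is a built-in Python module, so nothing has to be installed.

urllib is still very capable in some respects.

Quick start with urllib

e.g. Download an image from the web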

    # Method 1: open() and close()
    import requests

    url = '/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=1603365312,' \
          '3218205429&fm=26&gp=0.jpg'
    req = requests.get(url)
    fn = open('code.png', 'wb')  # save as code.png; 'wb' writes binary data
    fn.write(req.content)
    # content holds the raw bytes (an image is binary data), whereas text holds
    # the same content decoded to a str by requests, using the encoding it guesses.
    fn.close()

    # Method 2: with open(), so no explicit close() is needed
    import requests

    url = '/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=1603365312,' \
          '3218205429&fm=26&gp=0.jpg'
    req = requests.get(url)
    with open('code2.png', 'wb') as file_obj:
        file_obj.write(req.content)

    # Method 3: urlretrieve() from the built-in urllib.request module
    from urllib import request

    url = '/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=1603365312,' \
          '3218205429&fm=26&gp=0.jpg'
    request.urlretrieve(url, 'code3.jpg')  # arguments: the URL, then the file name code3.jpg
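
To see what urlretrieve() does without depending on a live image address, here is a minimal sketch that points it at a local file through a file:// URL. The temporary directory, the file names source.bin and copy.bin, and the fake image bytes are all assumptions made for illustration; they are not part of the lesson.

```python
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote image: a small local file with made-up bytes.
    source = Path(tmp) / 'source.bin'
    source.write_bytes(b'\x89PNG fake image bytes')

    target = Path(tmp) / 'copy.bin'
    # urlretrieve(url, filename) copies the resource to filename and
    # returns a (filename, headers) tuple.
    saved_path, headers = request.urlretrieve(source.as_uri(), str(target))

    copied = target.read_bytes()

print(saved_path)
print(len(copied), 'bytes copied')
```

The same call shape applies to an http:// address; only the first argument changes.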

The urllib.request module

Versions

Python 2: urllib2 and urllib

Python 3: urllib and urllib2 were merged into a single urllib package

Common methods:

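urllib.request.urlopen("URL"): sends a request to the site and returns the response

bytes = response.read()

string = response.read().decode("utf-8")

urllib.request.Request("URL", headers=dict): needed because urlopen() called with a plain URL string gives no way to set a custom User-Agent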
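
A Request object can be inspected before anything is sent, which makes it easy to check that the custom header is in place. The sketch below assumes a placeholder address (http://example.com/) and a shortened User-Agent string; neither comes from the lesson.

```python
import urllib.request

url = 'http://example.com/'  # placeholder address, an assumption
headers = {'User-Agent': 'Mozilla/5.0 (example)'}  # shortened for the sketch

req = urllib.request.Request(url, headers=headers)

# Request normalises header names with str.capitalize(),
# so 'User-Agent' is stored as 'User-agent'.
user_agent = req.get_header('User-agent')
method = req.get_method()  # no data argument was given, so this is a GET

print(req.full_url)
print(method)
print(user_agent)
```

Without a custom header, urllib identifies itself as Python-urllib followed by the Python version, which many sites reject.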

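Response object

read(): reads the content of the server's response

getcode(): returns the HTTP status code

geturl(): returns the URL the data actually came from (useful for detecting redirects)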

    import urllib.request

    url = '/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }  # the User-Agent imitates a browser

    # 1. Create the request object with urllib.request.Request()
    #    (setting a User-Agent gets past basic anti-scraping checks)
    req = urllib.request.Request(url, headers=headers)
    # 2. Get the response object with urllib.request.urlopen()
    res = urllib.request.urlopen(req)
    # 3. Read the response content: read().decode('utf-8') turns bytes into str
    html = res.read().decode('utf-8')  # the raw page data
    print(html)           # print the raw data
    print(res.getcode())  # the status code
    print(res.geturl())   # the URL actually fetched (useful for detecting redirects)

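**Summary:** how to use urllib.request

1 Create the request object with urllib.request.Request(), setting the User-Agent

2 Send the request and get the response object with urllib.request.urlopen()

3 Read the response content with read().decode('utf-8') (bytes --> str)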
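
The three steps can be tried end to end without touching a real site. The sketch below starts a throwaway HTTP server on the local machine that replies with whatever User-Agent it receives; the EchoHandler class and the demo-agent string are hypothetical names invented for this illustration. It uses an opener with an empty ProxyHandler so that a system proxy cannot intercept the loopback request; opener.open() otherwise behaves like urlopen().

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get('User-Agent', '').encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


server = HTTPServer(('127.0.0.1', 0), EchoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# 1. Create the request object with a custom User-Agent
req = urllib.request.Request(url, headers={'User-Agent': 'demo-agent'})
# 2. Send it and get the response object (no proxies for the local address)
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(req, timeout=5) as res:
    # 3. Read the content: bytes --> str
    html = res.read().decode('utf-8')
    status = res.getcode()
    final_url = res.geturl()

server.shutdown()
server.server_close()

print(html, status, final_url)
```

Because the server echoes the header back, the body of the response confirms that the User-Agent set on the Request is what actually went over the wire.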

The urllib.parse module

Common methods

urlencode(dict)

quote(str) (the argument is a string)

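Request methods

GET: the query parameters are visible in the URL

POST: add the data argument when building the Request: urllib.request.Request(url, data=data, headers=headers)

data: the form data, which must be submitted as bytes and cannot be a str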
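
The difference between the two request methods shows up directly on the Request object: supplying data switches it from GET to POST. This is a small sketch with a placeholder address and a two-field form, both assumptions chosen for illustration; nothing is sent over the network.

```python
import urllib.parse
import urllib.request

url = 'http://example.com/translate'  # placeholder address, an assumption
form = {'i': 'hello', 'doctype': 'json'}  # illustrative form fields

# urlencode() returns a str; encode it to get the bytes that data requires.
data = urllib.parse.urlencode(form).encode('utf-8')

get_req = urllib.request.Request(url)              # no data: GET
post_req = urllib.request.Request(url, data=data)  # with data: POST

get_method = get_req.get_method()
post_method = post_req.get_method()

print(get_method, post_method)
print(data)
```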

urllib.parse.urlencode

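Handling Chinese characters in a request: methods 1 to 3

Method 1: first convert the parameters to percent-encoded (% + hexadecimal) form with urllib.parse.urlencode(dict), then append the result to the URL.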

    import urllib.request
    import urllib.parse

    url = '/s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B'  # wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B
    url2 = '/s?wd=海贼王'  # each Chinese character becomes three %XX groups
    # If the URL we request contains Chinese characters, the idea is to
    # convert them to % + hexadecimal form.
    # res = urllib.request.urlopen(url2)  # raises an error

    # Approach 1: urllib.parse.urlencode(dict) converts to % + hexadecimal form
    r = {'wd': '海贼王'}  # a dict
    result = urllib.parse.urlencode(r)
    print(result)        # wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B
    print(type(result))  # <class 'str'>
    # Concatenate:
    url3 = '/s?' + result
    print(url3)  # /s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B
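
urlencode() also handles several parameters at once, joining them with & and converting non-string values, so manual string concatenation is often unnecessary. A short sketch, using the kw and pn parameter names that appear later in the exercises; parse_qs() is included only to show the reverse direction.

```python
from urllib.parse import parse_qs, urlencode

params = {'kw': '学生', 'pn': 50}

# Keys and values are percent-encoded and joined as key=value&key=value.
query = urlencode(params)
print(query)  # kw=%E5%AD%A6%E7%94%9F&pn=50

# parse_qs() reverses the encoding; every value comes back as a list of str.
decoded = parse_qs(query)
print(decoded)  # {'kw': ['学生'], 'pn': ['50']}

# urlencode() encodes a space as '+', the convention for form data.
spaced = urlencode({'q': 'a b'})
print(spaced)  # q=a+b
```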

urllib.parse.quote(str)

Method 2: urllib.parse.quote(str)

    import urllib.request
    import urllib.parse

    url = '/s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B'  # %E6%B5%B7%E8%B4%BC%E7%8E%8B
    url2 = '/s?wd=海贼王'

    # Approach 2: urllib.parse.quote(str)
    r = '海贼王'
    result = urllib.parse.quote(r)
    print(result)  # %E6%B5%B7%E8%B4%BC%E7%8E%8B
    url4 = '/s?wd=' + result  # concatenate
    print(url4)  # /s?wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B
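
One detail worth knowing about quote() is its safe parameter: by default the slash is left untouched, which suits URL paths but not query values. The sketch below compares quote() with its sibling quote_plus(); the sample string 'a b/c' is made up for illustration.

```python
from urllib.parse import quote, quote_plus

text = 'a b/c'  # illustrative input with a space and a slash

default_quoted = quote(text)           # '/' is safe by default
strict_quoted = quote(text, safe='')   # nothing is exempt from encoding
plus_quoted = quote_plus(text)         # space becomes '+', '/' is encoded

print(default_quoted)  # a%20b/c
print(strict_quoted)   # a%20b%2Fc
print(plus_quoted)     # a+b%2Fc

chinese_quoted = quote('海贼王')
print(chinese_quoted)  # %E6%B5%B7%E8%B4%BC%E7%8E%8B
```

urlencode() applies quote_plus() to each key and value, which is why the two approaches produce the same result for a single Chinese search term.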

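URLs containing percent-encoded (% + hexadecimal) data: parse.unquote()

Downloading Honor of Kings (王者荣耀) wallpapers

A small anti-scraping detail

You may run into a URL whose content is itself percent-encoded (% + hexadecimal), for example with the colon and slashes after http encoded as well. Such a URL cannot be requested as it stands. The fix is to decode it first with parse.unquote().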

    from urllib import parse
    from urllib import request

    img = parse.unquote('http%3A%2F%2Fshp%2Eqpic%2Ecn%2Fishow%2F2735012617%2F1611652313%5F84828260%5F14368%5FsProdImgNo%5F2%2Ejpg%2F200')
    print(img)  # the decoded image address, /ishow/2735012617/1611652313_84828260_14368_sProdImgNo_2.jpg/200
    request.urlretrieve(img, 'code3.jpg')  # download the image: img is the URL, code3.jpg the file name
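
unquote() is the exact inverse of quote(), which a quick round trip makes visible. The sketch also shows unquote_plus(), the variant to reach for when the text came from form-style encoding where a space is written as +; the sample string 'a+b%20c' is made up for illustration.

```python
from urllib.parse import quote, unquote, unquote_plus

encoded = '%E6%B5%B7%E8%B4%BC%E7%8E%8B'

decoded = unquote(encoded)    # percent-encoded text back to characters
round_trip = quote(decoded)   # and back again

print(decoded)     # 海贼王
print(round_trip)  # %E6%B5%B7%E8%B4%BC%E7%8E%8B

# unquote() leaves '+' alone; unquote_plus() turns it into a space.
plain = unquote('a+b%20c')
form_style = unquote_plus('a+b%20c')
print(plain)       # a+b c
print(form_style)  # a b c
```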

Exercise 1: Type a search term into Baidu (for example 美女) and save the result as an HTML file

    # Task: search Baidu for a term typed by the user (for example 美女)
    # and save the result as an HTML file.
    import urllib.request
    import urllib.parse

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/87.0.4280.141 Safari/537.36'
    }
    # Typical format: /s?wd=%E5%A6%B9%E5%AD%90
    key = input('Enter your search term: ')
    base_url = '/s?'
    wd = {'wd': key}
    result = urllib.parse.urlencode(wd)  # convert Chinese to % + hexadecimal form
    url = base_url + result  # build the URL
    # print(url)

    # Build the request object
    req = urllib.request.Request(url, headers=headers)
    # Get the response object
    res = urllib.request.urlopen(req)
    # Read the response data
    html = res.read().decode('utf-8')
    # Save the data
    with open('search.html', 'w', encoding='utf-8') as file_obj:
        file_obj.write(html)

Exercise 2: Scrape a chosen topic from Tieba

    # Scrape a chosen topic from Tieba
    import urllib.request
    import urllib.parse

    # /f?kw=%E5%AD%A6%E7%94%9F&pn=0
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                      '87.0.4280.141 Safari/537.36'
    }
    # The Tieba topic
    name = input('Enter the Tieba topic to scrape: ')
    # First and last page to scrape
    begin = int(input('Enter the first page: '))
    end = int(input('Enter the last page: '))
    # Encode the topic name
    kw = {'kw': name}
    result = urllib.parse.urlencode(kw)

    # Build the target URL. kw=%E5%AD%A6%E7%94%9F is replaced dynamically,
    # and pn is (page - 1) * 50.
    # range(5), range(0, 5) and range(0, 5, 1) all yield 0 1 2 3 4,
    # so end + 1 is needed to include the last page.
    for i in range(begin, end + 1):
        pn = (i - 1) * 50
        base_url = '/f?'
        url = base_url + result + '&pn=' + str(pn)
        # Send the request and get the response
        req = urllib.request.Request(url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('utf-8')
        # Write to a file
        filename = 'page_' + str(i) + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            print('Scraping page %d' % i)
            f.write(html)

RUN:

    Enter the Tieba topic to scrape: 美女
    Enter the first page: 1
    Enter the last page: 3
    Scraping page 1
    Scraping page 2
    Scraping page 3
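
The page arithmetic is easier to check when URL construction is separated from the network calls. The sketch below isolates it in a hypothetical helper, build_page_urls(); the base address http://example.com/f? is a placeholder, and the page size of 50 follows the pn = (page - 1) * 50 rule used in the exercise.

```python
from urllib.parse import urlencode

BASE_URL = 'http://example.com/f?'  # placeholder address, an assumption
PAGE_SIZE = 50                      # pn advances by 50 per page


def build_page_urls(name, begin, end):
    """Return one URL per page from begin to end, inclusive."""
    urls = []
    for page in range(begin, end + 1):  # end + 1 so the last page is included
        pn = (page - 1) * PAGE_SIZE
        # urlencode() handles both parameters, so no manual '&pn=' is needed.
        urls.append(BASE_URL + urlencode({'kw': name, 'pn': pn}))
    return urls


urls = build_page_urls('学生', 1, 3)
for u in urls:
    print(u)
```

Pages 1 to 3 map to pn values 0, 50 and 100, which matches the /f?kw=%E5%AD%A6%E7%94%9F&pn=0 pattern noted in the exercise for the first page.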

Exercise 3: Scrape a chosen topic from Tieba, written with functions

    import urllib.request
    import urllib.parse


    # Read a page
    def readPage(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                          '87.0.4280.141 Safari/537.36'
        }
        req = urllib.request.Request(url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('utf-8')
        return html


    # Write to a file
    def writePage(filename, html):
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(html)
            print('Written successfully')


    # Main function: 1. calls the two functions above; 2. holds the remaining logic
    def main():
        name = input('Enter the Tieba topic to scrape: ')
        begin = int(input('Enter the first page: '))
        end = int(input('Enter the last page: '))
        kw = {'kw': name}
        result = urllib.parse.urlencode(kw)
        for i in range(begin, end + 1):
            pn = (i - 1) * 50
            base_url = '/f?'
            url = base_url + result + '&pn=' + str(pn)
            # Call the functions
            html = readPage(url)
            filename = 'page_' + str(i) + '.html'
            writePage(filename, html)


    if __name__ == '__main__':
        main()

RUN:

    Enter the Tieba topic to scrape: 美女
    Enter the first page: 1
    Enter the last page: 4
    Written successfully
    Written successfully
    Written successfully
    Written successfully

Exercise 4: Scrape a chosen topic from Tieba, written in object-oriented style

    import urllib.request
    import urllib.parse


    class BaiduSpider:
        def __init__(self):
            self.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                              '87.0.4280.141 Safari/537.36'
            }
            self.base_url = '/f?'

        def readPage(self, url):
            req = urllib.request.Request(url, headers=self.headers)
            res = urllib.request.urlopen(req)
            html = res.read().decode('utf-8')
            return html

        def writePage(self, filename, html):
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(html)
                print('Written successfully')

        def main(self):
            name = input('Enter the Tieba topic to scrape: ')
            begin = int(input('Enter the first page: '))
            end = int(input('Enter the last page: '))
            kw = {'kw': name}
            result = urllib.parse.urlencode(kw)
            for i in range(begin, end + 1):
                pn = (i - 1) * 50
                url = self.base_url + result + '&pn=' + str(pn)
                # Call the methods
                html = self.readPage(url)
                filename = 'page_' + str(i) + '.html'
                self.writePage(filename, html)


    if __name__ == '__main__':
        spider = BaiduSpider()
        spider.main()

RUN:

    Enter the Tieba topic to scrape: 美女
    Enter the first page: 1
    Enter the last page: 2
    Written successfully
    Written successfully

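urllib POST example

Requirement:

Use Youdao online translation to build a small translator

The request sent to the URL must carry data: the text we want translated

Request Method: POST

Form data (POST request)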

    import urllib.request
    import urllib.parse
    import json  # json.loads() converts a JSON string into a Python dict

    # Ask for the text to translate
    content = input('Enter the text to translate: ')

    # Form data: copy every field from the Form Data panel in the browser's inspector
    data = {
        'i': content,  # the text to translate
        'from': 'AUTO',
        'to': 'AUTO',
        'smartresult': 'dict',
        'client': 'fanyideskweb',
        'salt': '15880623642174',
        'sign': 'c6c2e897040e6cbde00cd04589e71d4e',
        'lts': '1588062364217',
        'bv': '42160534cfa82a6884077598362bbc9d',
        'doctype': 'json',
        'version': '2.1',
        'keyfrom': 'fanyi.web',
        'action': 'FY_BY_CLICKBUTTION'
    }
    data = urllib.parse.urlencode(data)
    # print(type(data))  # str
    # Convert str to bytes. The encoding argument is required; without it:
    # TypeError: string argument without an encoding
    data = bytes(data, 'utf-8')

    # Target URL: drop the _o from translate_o to avoid the {"errorCode":50} error
    url = '/translate?smartresult=dict&smartresult=rule'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    req = urllib.request.Request(url, data=data, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    # print(html)  # a JSON string

    # Convert the JSON string into a Python dict
    r_dict = json.loads(html)
    # Parse the data: take the value under the key translateResult
    r = r_dict['translateResult']  # [[{"src":"hello","tgt":"你好"}]]
    # Take the value from the dict inside the list inside the list
    result = r[0][0]['tgt']  # [{"src":"hello","tgt":"你好"}] -> {"src":"hello","tgt":"你好"} -> "你好"
    print(result)
    '''
    {"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"hello","tgt":"你好"}]]}
    '''

Tips:

Note 1

    data = urllib.parse.urlencode(data)
    # print(type(data))
    data = bytes(data, 'utf-8')  # without the encoding: TypeError: string argument without an encoding

Note 2

    # Target URL: drop the _o
    url = '/translate?smartresult=dict&smartresult=rule'

Note 3

    r_dict = json.loads(html)
    # Parse the data
    r = r_dict['translateResult']  # [[{"src":"hello","tgt":"你好"}]]
    result = r[0][0]['tgt']  # [{"src":"hello","tgt":"你好"}] -> {"src":"hello","tgt":"你好"} -> "你好"
    print(result)
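
The parsing step from Note 3 can be practised offline against the sample response shown in the lesson. The sketch below wraps it in a hypothetical helper, extract_translation(), which also checks errorCode before indexing; treating any non-zero errorCode as a failure is an assumption based on the {"errorCode":50} case mentioned above.

```python
import json

# Sample response text copied from the lesson.
sample = ('{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":1,'
          '"translateResult":[[{"src":"hello","tgt":"你好"}]]}')


def extract_translation(response_text):
    """Return the translated text from a JSON response string."""
    r_dict = json.loads(response_text)  # JSON string -> Python dict
    if r_dict.get('errorCode') != 0:    # assumption: non-zero means failure
        raise ValueError('translation failed: %r' % r_dict.get('errorCode'))
    # translateResult is a list of lists of dicts: [[{"src": ..., "tgt": ...}]]
    return r_dict['translateResult'][0][0]['tgt']


result = extract_translation(sample)
print(result)  # 你好

# bytes(text, 'utf-8') and text.encode('utf-8') produce the same bytes.
same_bytes = bytes('i=hello', 'utf-8') == 'i=hello'.encode('utf-8')
print(same_bytes)  # True
```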

End!

Best wishes to you!