
Scraping historical data from the weather site 天气网 with Python 3.7, and getting past its anti-scraping check

Date: 2021-01-04 16:58:20


Disclaimer: data scraped with this code may be used for research and study only, not for commercial purposes; any commercial dispute arising from such use is the user's own responsibility.

I recently needed historical PM2.5 data for every provincial capital in China, and the weather site 天气网 (its URL is elided in the source) happens to offer historical-data queries, so I searched online for Python code to scrape it. I mainly referred to this post: /haha_point/article/details/77197230#commentsedit, which has two problems:

1. It is written for Python 2.7, and Python 3 differs from Python 2 in quite a few ways. For this scraper the main one is urllib: in 2.7 a plain import urllib is enough, while in 3.7 you need import urllib.request.
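For reference, here is the same minimal fetch in both versions; example.com is a placeholder, since the real domain is elided throughout this post:

    # Python 2.7: urlopen lives directly in urllib
    #   import urllib
    #   html = urllib.urlopen("http://example.com/beijing/07.html").read()

    # Python 3.7: urlopen moved to urllib.request
    import urllib.request
    html = urllib.request.urlopen("http://example.com/beijing/07.html").read()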

2. The site has added an anti-scraping measure. Because we are fetching historical data, the pages actually used follow the pattern /(city)/(date).html; Beijing's data for month 07, for example, is at /beijing/07.html. But if you open that link directly in a fresh browser, you get a block page instead (the original post includes a screenshot of that page here).

This is why many of the Python scrapers found online can no longer fetch the data. The cause: you have to visit the site's homepage first, and only then does the page above load normally. My guess is that the front end writes a cookie when the homepage is visited. So open the page in a browser, press F12, and look at the cookie in the request headers (the cookie portion is marked in red in the original screenshot). Then attach those headers directly when making the request, via req = urllib.request.Request(url=url, headers=my_headers), where my_headers is:

my_headers = {"Host": "","Connection": "keep-alive","Upgrade-Insecure-Requests": "1","User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6","Referer": "/Accept-Encoding: gzip, deflate","Cookie": "cityPy=xianqu; cityPy_expire=1565422933; UM_distinctid=16c566dd356244-05e0d9cb0c361-3f385c06-1fa400-16c566dd357642; Hm_lvt_ab6a683aa97a52202eab5b3a9042a8d2=1564818134; CNZZDATA1275796416=927309794-1564814113-%7C1564814113; Hm_lpvt_ab6a683aa97a52202eab5b3a9042a8d2=1564818280"},具体代码见下面代码

The full code follows:

    import socket
    import urllib.error
    import urllib.request

    from bs4 import BeautifulSoup

    socket.setdefaulttimeout(30.0)


    def parseTianqi(url):
        # Headers captured from the browser session, including the cookie
        # written after visiting the homepage (see above).
        my_headers = {
            "Host": "",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Referer": "/",
            # Accept-Encoding is deliberately left out: urllib does not
            # decompress gzip bodies, and the page is decoded as gbk below.
            "Cookie": "cityPy=xianqu; cityPy_expire=1565422933; UM_distinctid=16c566dd356244-05e0d9cb0c361-3f385c06-1fa400-16c566dd357642; Hm_lvt_ab6a683aa97a52202eab5b3a9042a8d2=1564818134; CNZZDATA1275796416=927309794-1564814113-%7C1564814113; Hm_lpvt_ab6a683aa97a52202eab5b3a9042a8d2=1564818280"
        }
        req = urllib.request.Request(url=url, headers=my_headers)
        fails = 0
        while fails < 3:  # retry up to three times on network errors
            try:
                response = urllib.request.urlopen(req)
                return response.read().decode('gbk')  # the site serves gbk pages
            except urllib.error.URLError:
                fails += 1
                print('Network problem, retrying:', fails)
        return None


    def witeCsv(data, file_name):
        # Each div.tqtongji2 block holds one month's table: the first ul is
        # the header row, every following ul is one day's readings.
        file = open(file_name, 'w', encoding='utf-8')
        soup = BeautifulSoup(data, 'html.parser')
        weather_list = soup.select('div[class="tqtongji2"]')
        for weather in weather_list:
            ul_list = weather.select('ul')
            i = 0
            for ul in ul_list:
                row = ""
                for li in ul.select('li'):
                    row += (li.string or '') + ','
                if i != 0:  # skip the header row
                    file.write(row + '\n')
                i += 1
        file.close()


    if __name__ == "__main__":
        data = parseTianqi("/beijing/07.html")
        if data:
            witeCsv(data, "beijing_07")
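Since the goal stated at the top is data for every provincial capital, a driver loop over city slugs and months is the natural extension. A sketch using the two functions above; the slugs and months listed are illustrative, and the paths follow the /(city)/(date).html scheme described earlier:

    # Hypothetical driver loop; extend the lists to cover all capitals/months.
    cities = ["beijing", "shanghai", "guangzhou"]
    months = ["01", "02", "07"]
    for city in cities:
        for month in months:
            page = parseTianqi("/%s/%s.html" % (city, month))
            if page:
                witeCsv(page, "%s_%s" % (city, month))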
