Python中使用PhantomJS抓取Javascript网页数据

有些网页不是静态加载的，而是通过javascirpt函数动态加载网页，比如下面这个网页，表格中的看涨合约和看跌合约的数据都是通过javascirpt函数从后台加载。仅仅使用beautifulsoup并不能抓到这个表格中的数据。

查找资料，发现可以使用PhantomJS来抓取这类网页的数据。但PhantomJS主要用于Java，若要在python中使用，则要通过Selenium在python中调用PhantomJS。写代码时主要参考了这个网页：Is there a way to use PhantomJS in Python?

Selenium是一个浏览器虚拟器，可以通过Selenium在各种浏览器上模拟各种行为。python中通过Selenium使用PhantomJS抓取动态网页数据时需要安装以下库：

1. Beautifulsoup，用于解析网页内容

2. Node.js

3. 安装好Node.js之后通过Node.js安装PhantomJS。在Mac终端中输入npm -g install phantomjs即可（Windows下的cmd也是一样）

4. 安装Selenium

完成上述四个步骤后即可在python中使用PhantomJS。

代码如下：

# -*- coding: utf-8 -*-from bs4 import BeautifulSoupfrom selenium import webdriverfrom selenium.webdriver.support.ui import WebDriverWaitfrom mon.by import Byfrom selenium.webdriver.support import expected_conditions as ECimport urllib2import timebaseUrl = "http://stock./option/quotes.html"csvPath = "FinanceData.csv"csvFile = open(csvPath, 'w')def is_chinese(uchar):# 判断一个unicode是否是汉字if uchar >= u'\u4e00' and uchar<=u'\u9fa5':return Trueelse:return Falsedef readPage(url):webURL = urllib2.urlopen(baseUrl)content = webURL.read()soup = BeautifulSoup(content)return soupdef getFinance(soup, tableName):divs = soup.findAll('div', attrs={'class': tableName}) #看涨合约在这个div中if len(divs) < 0 or len(divs) == 0:print "No div class named " + str(tableName)returntbs = divs[0].findChildren('tbody') # 獲取tbody內容，在這個標籤下只有一個tbodyprint tbs[0]trs = tbs[0].findChildren('tr') # tr就是table中的每一行for tr in trs: tds = tr.findChildren('td') # td是表格中的內容content = list()string = ""print trindex = 0 # 判斷漢字出現的位置for td in tds:temp = td.textprint tempif index == 7 or index == 0:temp2 = ""for d in temp:if not is_chinese(d): # 去除漢字temp2 += dtemp = temp2string = string + tempstring = string + ","index += 1print stringcsvFile.write(string)csvFile.write('\n')tableName = "table_down fr" # 表格名稱driver = webdriver.PhantomJS(executable_path='/Users/Pan/node_modules/phantomjs/lib/phantom/bin/phantomjs')driver.get(baseUrl)data = driver.page_source # 获取整个页面的内容driver.quit#soup = readPage(loadUrl)soup = BeautifulSoup(data)getFinance(soup, tableName)print "Finished!"csvFile.close()

但是以上代码有问题！以上代码有时可以抓取到数据，有时抓取不到。原因时上述代码执行时有些数据还没有被加载。因此需要判断网页何时加载了想要的数据。为了解决这个问题，Selenium提供了Waits机制，可以等待一段时间再读取网页。waits机制分为Explicit Waits和Implicit Waits。Waits方法可以和ExpectedCondition结合，这样在抓取数据时可以等待一段时间，若这段时间满足ExpectedCondition中指定的条件则执行后面的代码；若超过时间后仍未满足指定的条件，则抛出异常。下面是使用Explicit Waits的一段示例代码：

from selenium import webdriverfrom mon.by import Byfrom selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0ff = webdriver.Firefox()ff.get("http://somedomain/url_that_delays_loading")try:# 等待10s，若10s内能获取到ID为myDynamicElement的内容，则执行后面的代码；否则抛出异常element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))finally:ff.quit()

本例中，看涨合约每次都可以抓取到，看跌合约则不稳定，大概网页是先执行看涨合约的函数，再执行看跌合约的函数（猜的，没有看代码），因此看跌合约数据的加载要慢一点。

修改后的代码如下：

# -*- coding: utf-8 -*-from bs4 import BeautifulSoupfrom selenium import webdriverfrom selenium.webdriver.support.ui import WebDriverWaitfrom mon.by import Byfrom selenium.webdriver.support import expected_conditions as ECimport urllib2import timebaseUrl = "http://stock./option/quotes.html"csvPath = "FinanceData.csv"csvFile = open(csvPath, 'w')def is_chinese(uchar):# 判断一个unicode是否是汉字if uchar >= u'\u4e00' and uchar<=u'\u9fa5':return Trueelse:return Falsedef readPage(url):webURL = urllib2.urlopen(baseUrl)content = webURL.read()soup = BeautifulSoup(content)return soupdef getFinance(soup, tableName):divs = soup.findAll('div', attrs={'class': tableName}) #看涨合约在这个div中if len(divs) < 0 or len(divs) == 0:print "No div class named " + str(tableName)returntbs = divs[0].findChildren('tbody') # 獲取tbody內容，在這個標籤下只有一個tbodyprint tbs[0]trs = tbs[0].findChildren('tr') # tr就是table中的每一行for tr in trs: tds = tr.findChildren('td') # td是表格中的內容content = list()string = ""print trindex = 0 # 判斷漢字出現的位置for td in tds:temp = td.textprint tempif index == 7 or index == 0:temp2 = ""for d in temp:if not is_chinese(d): # 去除漢字temp2 += dtemp = temp2string = string + tempstring = string + ","index += 1#content.append(td.text)#print content#string = string[:-1]print stringcsvFile.write(string)csvFile.write('\n')tableName = "table_down fr" # 表格名稱driver = webdriver.PhantomJS(executable_path='/Users/Pan/node_modules/phantomjs/lib/phantom/bin/phantomjs')driver.get(baseUrl)##################################################### wait 10s until the specified table name presentstry:# 看涨合约和看跌合约的表格是一个class，要用CLASS_NAME指定element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, tableName)))except Exception, e:print efinally:data = driver.page_source # 取到加載js後的頁面contentdriver.quit#####################################################soup = readPage(loadUrl)soup = BeautifulSoup(data)getFinance(soup, tableName)print "Finished!"csvFile.close()