Scraping Taobao Products

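I. Project Requirements

1. Taobao's pages are loaded entirely through Ajax, and the requests carry encrypted parameters, so this project uses Selenium to drive a real browser and scrape the product information.

2. Scrape the Taobao search results for the keyword iPad and store the data in MongoDB.

3. Each record should include the product's image, name, price, number of buyers, shop name, and shop location.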

II. Project Analysis

The entry point for the crawl is Taobao's search page, at the URL https://s.taobao.com/search?q=iPad.
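The keyword travels in the q query parameter, so it needs percent-encoding whenever it contains spaces or non-ASCII characters. A minimal sketch using urllib.parse.quote from the standard library; the helper name build_search_url is hypothetical, and only the q parameter is taken from the URL above.

```python
from urllib.parse import quote

SEARCH_URL = 'https://s.taobao.com/search?q='  # entry URL from the text above


def build_search_url(keyword):
    """Return the search URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes spaces and non-ASCII characters (as UTF-8 bytes)
    return SEARCH_URL + quote(keyword)


print(build_search_url('iPad'))      # plain ASCII keywords pass through unchanged
print(build_search_url('iPad Pro'))  # the space is percent-encoded
```

Encoding the keyword this way keeps the URL valid for Chinese keywords as well, which is likely the common case on Taobao.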

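The bottom of the page has a pagination bar. It contains links to the first five pages, a link to the next page, and an input box for jumping to an arbitrary page. The search returns 100 pages of results, so the page count is fixed and fetching everything only requires iterating over page numbers 1 through 100. To reach a given page, type its number into the jump box and click the confirm button.

Why not simply click "next page"? If the crawl exits on an exception, say at page 50, clicking "next page" offers no quick way to get back to where it stopped. The crawler would also have to track the current page number itself, and whenever a page failed to load after a click it would have to detect which page the browser had actually reached. That flow is comparatively complex, so the blunt approach is used here: locate the input box, type the page number, and click the button to jump. The Selenium code for this is: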

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote


browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)  # explicit waits give up after 10 seconds
KEYWORD = 'iPad'


def index_page(page):
    """Fetch one page of search results."""
    print('Fetching page', page)  # page is an int, so do not concatenate it with str
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Page-number box and confirm button in the pagination bar
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(str(page))  # type the target page number
            submit.click()
        # The highlighted page number must match the requested page
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # At least one product item must be present before parsing
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        # Retry the same page when a wait times out
        index_page(page)
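The wait.until(...) calls in index_page() are explicit waits: a condition is checked repeatedly until it returns something truthy or the time limit passes. The sketch below imitates that polling loop with the standard library only, to illustrate the idea. It is a simplification and not Selenium's implementation; wait_until, WaitTimeout, and the fake page_loaded condition are all hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (stand-in for TimeoutException)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # return as soon as the condition holds
        if time.monotonic() >= deadline:
            raise WaitTimeout('condition not met within %.1f s' % timeout)
        time.sleep(poll)


# Fake condition that "loads" on the third check, like an element appearing late
checks = {'count': 0}


def page_loaded():
    checks['count'] += 1
    return 'item' if checks['count'] >= 3 else None


result = wait_until(page_loaded, timeout=2.0, poll=0.01)
print(result, checks['count'])
```

The point of returning early is that a fast page costs almost no waiting, while a slow page still gets the full time budget before the timeout is raised.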

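The index_page() listing first constructs a WebDriver object and sets the keyword to 'iPad', then defines index_page() to fetch a page of results. The function opens the search URL and checks the page number: if it is greater than 1 it performs the page jump, otherwise it simply waits for the page to finish loading.

Waiting is handled by a WebDriverWait object, which takes a wait condition and a maximum wait time, here 10 seconds. If the condition is met within that time, meaning the element has loaded, the result is returned immediately and execution continues. If the element has still not loaded when the time runs out, a timeout exception is raised.

For the page jump, the code gets the page-number input box and assigns it to input, then gets the confirm button and assigns it to submit. It clears the input box, calls send_keys() to type the page number, and clicks the confirm button.

How do we know the jump landed on the right page? The current page number is highlighted in the pagination bar, so it is enough to check that the highlighted number equals the requested one. This uses another wait condition, text_to_be_present_in_element, which succeeds once the given text appears inside the given node. Passing it the CSS selector of the highlighted page node and the target page number checks that the highlighted node shows the page we asked for; if it does, the jump succeeded. Next, get_products() can be implemented to parse the products: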

from pyquery import PyQuery as pq


def get_products():
    """Parse every product on the current page and save it."""
    html = browser.page_source
    document = pq(html)
    items = document('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            # Product name is listed in the requirements; the '.title' selector is an assumption
            'title': item.find('.title').text(),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text(),
        }
        print(product)
        save_to_mongo(product)

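get_products() first reads the page's HTML from the page_source attribute and builds a PyQuery object from it. It then extracts the product list with the CSS selector #mainsrp-itemlist .items .item, which matches every product on the page. The match yields multiple results, so a for loop iterates over them and binds each one to item. Each item is itself a PyQuery object, so calling find() on it with a CSS selector returns a specific piece of that product's content. The last step is to save the data we need to MongoDB.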
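Every value returned by text() is a string, so it can help to normalize the numeric fields before saving. The sketch below assumes the price text looks like '¥3299.00' and the deal text like '1500人付款'. These formats are assumptions about the page, not something the parsing code guarantees, and parse_price, parse_deal_count, and clean_product are hypothetical helpers.

```python
import re


def parse_price(text):
    """Extract the first decimal number from a price string, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


def parse_deal_count(text):
    """Extract the first integer from a deal string, or None."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


def clean_product(product):
    """Return a copy of product with numeric price and deal fields."""
    cleaned = dict(product)
    cleaned['price'] = parse_price(product.get('price') or '')
    cleaned['deal'] = parse_deal_count(product.get('deal') or '')
    return cleaned


# Made-up sample values in the assumed formats
sample = {'price': '¥3299.00', 'deal': '1500人付款', 'shop': 'demo shop'}
print(clean_product(sample))
```

Abbreviated counts (for example a value written with 万) would need extra handling that this sketch does not attempt.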

III. Project Source Code

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote
from pyquery import PyQuery as pq
import pymongo


browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)  # explicit waits give up after 10 seconds
KEYWORD = 'iPad'
MAX_PAGE = 100

MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def index_page(page):
    """Fetch one page of search results and parse it."""
    print('Fetching page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Page-number box and confirm button in the pagination bar
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(str(page))
            submit.click()
        # The highlighted page number must match the requested page
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # At least one product item must be present before parsing
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        # Retry the same page when a wait times out
        index_page(page)


def get_products():
    """Parse every product on the current page and save it."""
    html = browser.page_source
    document = pq(html)
    items = document('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            # Product name is listed in the requirements; the '.title' selector is an assumption
            'title': item.find('.title').text(),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text(),
        }
        print(product)
        save_to_mongo(product)


def save_to_mongo(result):
    """Insert one product document into MongoDB."""
    try:
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        db[MONGO_COLLECTION].insert_one(result)
        print('success')
    except Exception:
        print('fail')


def main():
    # Visit every result page from 1 to MAX_PAGE
    for i in range(1, MAX_PAGE + 1):
        index_page(i)


if __name__ == '__main__':
    main()
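main() always starts at page 1, and index_page() retries without limit when a wait times out. Jumping by page number was chosen so that a crawl can pick up where it stopped, so one option is to record the last finished page and cap the retries. The following is a sketch under those assumptions: the checkpoint file, crawl_pages, and the crawl_page callback (standing in for a variant of index_page() that raises on failure instead of calling itself again) are hypothetical.

```python
import os
import tempfile


def load_checkpoint(path):
    """Return the last completed page, or 0 if no usable checkpoint exists."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0


def save_checkpoint(path, page):
    """Record page as the last completed page."""
    with open(path, 'w') as f:
        f.write(str(page))


def crawl_pages(crawl_page, max_page, path, max_retries=3, errors=(Exception,)):
    """Crawl the pages after the checkpoint, retrying each page up to max_retries times."""
    for page in range(load_checkpoint(path) + 1, max_page + 1):
        for attempt in range(1, max_retries + 1):
            try:
                crawl_page(page)
                break
            except errors:
                if attempt == max_retries:
                    raise  # the checkpoint still points at the last good page
        save_checkpoint(path, page)


# Demo with a fake crawler that fails once on page 3
checkpoint = os.path.join(tempfile.mkdtemp(), 'last_page.txt')
seen, failed = [], set()


def fake_crawl(page):
    if page == 3 and page not in failed:
        failed.add(page)
        raise TimeoutError('simulated timeout')
    seen.append(page)


crawl_pages(fake_crawl, 5, checkpoint, errors=(TimeoutError,))
print(seen, load_checkpoint(checkpoint))
```

With Selenium, errors=(TimeoutException,) would be the natural argument, and a second run with the same checkpoint file would continue from the page after the last one saved.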

Original article: https://www.cnblogs.com/jonas-von/p/9209981.html