
Python Web Scraping Notes (3): Scraping Food Delivery Data

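It has been almost half a year since my last blog post, which I am rather ashamed of. Without further ado, today's task is to scrape food delivery information, and the platform I picked is Ele.me.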

Step 1: Open the Ele.me website and set the location to Zhongnanhai.

Step 2: Click into any restaurant.

For each food item we want to scrape the name, monthly sales, rating, and number of reviews.

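Step 3: Viewing the page source shows none of the elements we need, so this is clearly a dynamically rendered page. Instead, we can watch the requests the page makes: press F12 to open the developer tools, then F5 to reload and capture the traffic.

One of the captured requests returns exactly the data we are after. With the entry point found, here is the code: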

# -*- coding: utf-8 -*-
# @Time    : 2017/12/10 13:43
# @Author  : Ricky
# @FileName: elm.py
# @Software: New_start
# @Blog    : http://www.cnblogs.com/Beyond-Ricky/

import json

import requests

# Menu endpoint found in the captured requests; the number is the restaurant ID.
restaurant_url = 'https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=147207648'
web_data = requests.get(restaurant_url)
content = web_data.text
json_obj = json.loads(content)  # a list of menu categories
for item in json_obj:
    # 'foods' may be missing for a category, so fall back to an empty list.
    for food in item.get('foods') or []:
        print(food.get('name'))
        print(food.get('tips'))    # summary text (presumably monthly sales / review count)
        print(food.get('rating'))
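
To experiment with the parsing logic without hitting the network, the same loop can be run against a small hand-written payload. The sample below is only an assumption about the response shape (a list of categories, each with a `foods` list whose entries carry `name`, `tips`, and `rating`), inferred from the keys the script reads; the real API may return more fields or a different layout. `extract_foods` is a hypothetical helper name, not something from the original script.

```python
import json

# Hand-written sample, NOT real API output: the shape is inferred
# from the keys the script reads ('foods', 'name', 'tips', 'rating').
SAMPLE_MENU = '''
[
  {"name": "Category A", "foods": [
    {"name": "Food 1", "tips": "sample tips text", "rating": 4.5},
    {"name": "Food 2", "tips": "sample tips text", "rating": 4.0}
  ]},
  {"name": "Category B"}
]
'''


def extract_foods(menu):
    """Flatten a parsed menu into (name, tips, rating) tuples."""
    rows = []
    for category in menu:
        # A category with no 'foods' (or a null value) yields no rows
        # instead of raising a TypeError.
        for food in category.get('foods') or []:
            rows.append((food.get('name'), food.get('tips'), food.get('rating')))
    return rows


rows = extract_foods(json.loads(SAMPLE_MENU))
for name, tips, rating in rows:
    print(name, tips, rating)
```

Returning tuples rather than printing inside the loop keeps the parsing step separate from the output step, which makes it easier to check against saved responses.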

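Step 4: Our goal is to scrape the delivery information for every restaurant near Zhongnanhai, and fetching them one at a time by hand would be a waste of time. Going back to the previous page and opening a few more restaurants shows that their URLs differ only in the trailing number, which turns out to be the restaurant ID. So once we have every restaurant's ID, we can fetch every restaurant's menu. Collecting the IDs works much like the previous page (capture the requests and find the one that returns the restaurant list), so I won't explain it again. Here is the complete code: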

# -*- coding: utf-8 -*-
# @Time    : 2017/12/10 15:35
# @Author  : Ricky
# @FileName: final_version.py
# @Software: New_start
# @Blog    : http://www.cnblogs.com/Beyond-Ricky/

import json
import time

import requests

id_list = []       # restaurant IDs
name_list = []     # restaurant names
address_list = []  # restaurant addresses


def get_all_id():
    """Page through the restaurant listing, 24 entries per request."""
    for offset in range(0, 985, 24):
        url = 'https://www.ele.me/restapi/shopping/restaurants?extras%5B%5D=activities&geohash=wx4g06hu38n&latitude=39.91406&limit=24&longitude=116.38477&offset={}&terminal=web'.format(offset)
        web_data = requests.get(url)
        # The response body is already JSON, so no HTML parser is needed.
        json_obj = json.loads(web_data.text)
        for item in json_obj:
            address_list.append(item.get('address'))
            name_list.append(item.get('name'))
            id_list.append(item.get('id'))
    return name_list, address_list, id_list


get_all_id()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
n = 0  # running count of food records
# zip keeps each ID aligned with its own name and address; m counts shops from 1.
for m, (restaurant_id, name, address) in enumerate(zip(id_list, name_list, address_list), start=1):
    restaurant_url = 'https://mainsite-restapi.ele.me/shopping/v2/menu?restaurant_id=' + str(restaurant_id)
    print('*************************shop separator******shop #{}*********************************************'.format(m))

    print(name)
    print(address)

    web_data = requests.get(restaurant_url, headers=headers)
    # time.sleep(3)  # uncomment to pause between requests
    try:
        json_obj = json.loads(web_data.text)
        for item in json_obj:
            for food in item.get('foods') or []:
                n += 1
                print('Record %d:' % n)
                print(food.get('name'), food.get('tips'), 'rating', food.get('rating'))
    except (AttributeError, ValueError):
        # Skip shops whose response is not the expected list of categories.
        continue
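
The paging in `get_all_id` relies on `offset` advancing in steps equal to `limit`. As a minimal sketch, the listing URL can also be assembled from its individual query parameters, which makes the page size and coordinates easier to change. The parameter values are copied from the URL in the post; `page_url`, `BASE_URL`, and `PAGE_SIZE` are hypothetical names introduced here, and the upper bound of 985 is simply the value the original script uses.

```python
from urllib.parse import urlencode

# Endpoint and query values copied from the listing URL in the post.
BASE_URL = 'https://www.ele.me/restapi/shopping/restaurants'
PAGE_SIZE = 24


def page_url(offset, limit=PAGE_SIZE):
    """Build the listing URL for one page of results."""
    params = [
        ('extras[]', 'activities'),   # urlencode escapes the brackets as %5B%5D
        ('geohash', 'wx4g06hu38n'),
        ('latitude', '39.91406'),
        ('limit', limit),
        ('longitude', '116.38477'),
        ('offset', offset),
        ('terminal', 'web'),
    ]
    return BASE_URL + '?' + urlencode(params)


# Same offsets as range(0, 985, 24) in the script: 0, 24, 48, ...
offsets = list(range(0, 985, PAGE_SIZE))
print(len(offsets), offsets[:3], offsets[-1])
print(page_url(offsets[1]))
```

Building the query with `urlencode` yields the same percent-encoded string as the hard-coded URL, so it can be dropped into `requests.get` unchanged.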

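That completes the task! Corrections are welcome wherever the write-up falls short. More posts in this scraping series will follow. Thanks, everyone!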


Original post: https://www.cnblogs.com/Beyond-Ricky/p/8075740.html