
Parsing iQIYI movie page data with XPath

import requests
from lxml import etree

url = 'https://list.iqiyi.com/www/1/-------------11-1-1-iqiyi--.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/90.0.4430.93 Safari/537.36'
}
# Request the iQIYI movie listing page
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
# Full HTML of the page
page_text = response.text
# Alternative: parse with BeautifulSoup (needs `from bs4 import BeautifulSoup`)
# html = BeautifulSoup(page_text, "lxml")
# bs_li = html.find_all('li', class_="qy-mod-li")
# print(bs_li)
# Print the page HTML
# print(page_text)
# Build the lxml element tree used for the XPath queries
etree_ = etree.HTML(page_text)
# Select the <li> element of every movie
ul_list = etree_.xpath('//ul[@class="qy-mod-ul"]/li')
# print(ul_list[0])
temp_list = []  # holds all fields of a single movie
dataRes = []    # holds every movie
# Parsing the playback status with BeautifulSoup (needs `import re`)
# findState = re.compile(r'"<img src="(.*?)"')
# for li in bs_li:
#     words = str(li)
#     print(words)
#     temp_state = re.findall(findState, words)
#     print(temp_state)

for li in ul_list:
    name = li.xpath('./div/div[2]/p[1]/a/@title')      # movie title
    score = li.xpath('./div/div[2]/p[1]/span/text()')  # movie score
    link = li.xpath('./div/div[2]/p[1]/a/@href')       # movie link
    # xpath() returns a list: take the first match, or a placeholder if empty
    score = score[0] if score else "No score yet"
    # The href starts with two slashes ("//"), so strip them; use "#" if missing
    link = link[0].replace("//", "") if link else "#"
    # Playback status image
    # //*[@id="block-D"]/ul/li[5]/div/div[1]/a/div[2]/img
    state = li.xpath('./div/div[1]/a/div[2]/img/@src')
    # print(state)
    temp_list.append(name[0])
    temp_list.append(score)
    temp_list.append(link)

    # print(temp_list)
    dataRes.append(temp_list)  # store each scraped movie in the overall list
    temp_list = []             # reset the per-movie list
print(dataRes)
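
The extraction loop can be tried without touching the network. The following is a minimal sketch, not the author's code: `SAMPLE_HTML` is hypothetical markup that only mimics the structure the XPath expressions above expect (the real iQIYI page may differ), `first_or` and `parse_movies` are illustrative helper names, and `urljoin` from the standard library is swapped in for the string replacement so that the protocol-relative `//` links become absolute URLs.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical sample markup that mimics the structure the XPath expressions
# expect; the real iQIYI page may differ.
SAMPLE_HTML = """
<ul class="qy-mod-ul">
  <li class="qy-mod-li"><div>
    <div><a href="//www.iqiyi.com/v_1.html"></a></div>
    <div><p><a title="Movie A" href="//www.iqiyi.com/v_1.html">Movie A</a><span>8.5</span></p></div>
  </div></li>
  <li class="qy-mod-li"><div>
    <div><a href="//www.iqiyi.com/v_2.html"></a></div>
    <div><p><a title="Movie B" href="//www.iqiyi.com/v_2.html">Movie B</a></p></div>
  </div></li>
</ul>
"""


def first_or(values, default):
    """Return the first XPath match, or a default when the list is empty."""
    return values[0] if values else default


def parse_movies(page_text, base_url="https://list.iqiyi.com/"):
    """Extract [title, score, link] rows from listing HTML."""
    tree = etree.HTML(page_text)
    rows = []
    for li in tree.xpath('//ul[@class="qy-mod-ul"]/li'):
        name = first_or(li.xpath('./div/div[2]/p[1]/a/@title'), "")
        score = first_or(li.xpath('./div/div[2]/p[1]/span/text()'), "No score yet")
        href = first_or(li.xpath('./div/div[2]/p[1]/a/@href'), None)
        # urljoin resolves a protocol-relative "//host/path" against the
        # scheme of base_url; a missing href falls back to "#".
        link = urljoin(base_url, href) if href else "#"
        rows.append([name, score, link])
    return rows


movies = parse_movies(SAMPLE_HTML)
print(movies)
```

Keeping the parsing in a function that takes the HTML as a string means the same code can be fed `response.text` from the live request or a saved page when checking the XPath expressions.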


Original article: https://www.cnblogs.com/rainbow-1/p/14726192.html