  • Python crawler: scraping movies from Youku

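    I have been learning web scraping recently, using the BeautifulSoup4 library. The idea is to scrape the names and links of the movies on Youku and save them to a text file. It is a fairly simple requirement, and this is the first crawler I have written. The code is posted below for reference: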

    # coding:utf-8

    import os
    import re
    import time

    import requests
    from bs4 import BeautifulSoup

    '''Scrape movies from the Youku site: http://www.youku.com/ '''

    url = "http://list.youku.com/category/show/c_96_s_1_d_1_u_1.html"
    h = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"}


    # Append the results to a text file in the "movie" folder
    def write_movie(list_a):
        currentPath = os.path.dirname(os.path.realpath(__file__))
        # os.path.join picks the correct path separator for the platform
        movieDir = os.path.join(currentPath, "movie")
        os.makedirs(movieDir, exist_ok=True)  # create the folder if it is missing
        moviePath = os.path.join(movieDir, "youku_movie_address.text")

        with open(moviePath, encoding="utf-8", mode="a") as fp:
            for x in list_a:
                # Only anchors without text are written, using their title attribute
                if x.get_text() == "":
                    try:
                        fp.write(x["title"] + ":    " + x["href"] + "\n")
                    except IOError as msg:
                        print(msg)

            fp.write("-------------------------------over-----------------------------" + "\n")


    # First page
    res = requests.get(url, headers=h)
    print(res.url)
    soup = BeautifulSoup(res.content, 'html.parser')
    list_a = soup.find_all(href=re.compile(r"==\.html"), target="_blank")
    write_movie(list_a)

    for num in range(2, 1000):
        # Get the href attribute of the "next page" link
        fanye_a = soup.find(charset="-4-1-999")
        if fanye_a is None:
            # No "next page" link was found, so stop paging
            break
        fanye_href = fanye_a["href"]
        print(fanye_href)
        # Request the page; the href has no scheme, so "http:" is prepended
        ee = requests.get("http:" + fanye_href, headers=h)
        time.sleep(3)
        print(ee.url)

        soup = BeautifulSoup(ee.content, 'html.parser')
        list_a = soup.find_all(href=re.compile(r"==\.html"), target="_blank")

        # Call the write function
        write_movie(list_a)
        time.sleep(6)
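
    The two lookups the script depends on (selecting the movie anchors and finding the "next page" link) can be tried offline before sending any requests. The following is only a minimal sketch under stated assumptions: the HTML snippet and the next-page URL in it are made up for illustration and may not match the real Youku markup, and the helper names extract_movies and next_page_url are hypothetical, not part of the original script. It also swaps the manual "http:" + href concatenation for urllib.parse.urljoin from the standard library, which resolves a scheme-less href against the current page URL.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real Youku list page may differ.
SAMPLE_HTML = """
<ul>
  <li>
    <a href="//v.youku.com/v_show/id_AAAA==.html" target="_blank" title="Movie A"></a>
    <a href="//v.youku.com/v_show/id_AAAA==.html" target="_blank">Movie A</a>
  </li>
  <li>
    <a href="//v.youku.com/v_show/id_BBBB==.html" target="_blank" title="Movie B"></a>
  </li>
  <li><a href="//list.youku.com/about.html" target="_blank" title="About"></a></li>
</ul>
<a href="//list.youku.com/category/show/c_96_s_1_d_1_p_2.html" charset="-4-1-999">Next</a>
"""

PAGE_URL = "http://list.youku.com/category/show/c_96_s_1_d_1_u_1.html"


def extract_movies(soup):
    """Return (title, href) pairs for text-less anchors whose href contains '==.html'."""
    anchors = soup.find_all("a", href=re.compile(r"==\.html"), target="_blank")
    return [
        (a["title"], a["href"])
        for a in anchors
        if a.get_text() == "" and a.has_attr("title")
    ]


def next_page_url(soup, current_url):
    """Resolve the next-page link against the current URL, or return None if absent."""
    link = soup.find(charset="-4-1-999")
    if link is None or not link.has_attr("href"):
        return None
    return urljoin(current_url, link["href"])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movies = extract_movies(soup)
next_url = next_page_url(soup, PAGE_URL)
print(movies)
print(next_url)
```

    Returning None when the link is missing gives the paging loop a clear stop condition, so it does not fail on the last page.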

    Contents of the text file after the script runs:

  • Original article: https://www.cnblogs.com/fukun/p/8651366.html