zoukankan      html  css  js  c++  java
  • 豆瓣即将上映电影爬虫作业

    https://study.163.com/course/courseLearn.htm?courseId=1005913008#/learn/text?lessonId=1053258283&courseId=1005913008

    把豆瓣“即将上映”的电影爬出来。

    很简陋的代码,现在的知识只能写这样了,做个记录吧。以后有能力再完善。

     1 # coding=gbk
     2 #  豆瓣即将上映的电影
     3 # https://movie.douban.com/cinema/later/changsha/爬取这个页面的信息吧
     4 
     5 import requests
     6 from bs4 import BeautifulSoup
     7 import json
     8 
     9 
    10 def get_page():
    11     url = 'https://movie.douban.com/cinema/later/changsha/'
    12     headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
    13     response = requests.get(url, headers, verify=False)
    14     text = response.text
    15     return text
    16 
    17 
    18 def get_data(text):
    19     movies = []
    20     soup = BeautifulSoup(text, 'lxml')
    21     # class="item mod " class="item mod odd"有两个都是的。
    22     # 第一种思路,直接分开成两个来处理
    23     divList1 = soup.find_all('div', attrs={'class': "item mod "})
    24     divList2 = soup.find_all('div', attrs={'class': "item mod odd"})
    25     # 合并两个list
    26     divList = divList1+divList2
    27     '''
    28     <div class="item mod ">
    29 <a class="thumb" href="https://movie.douban.com/subject/26394152/">
    30 <img class="" src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2541662397.jpg"/>
    31 </a>
    32 <div class="intro">
    33 <h3>
    34 <a class="" href="https://movie.douban.com/subject/26394152/">大黄蜂</a>
    35 <span class="icon"></span>
    36 </h3>
    37 <li class="dt">01月04日</li>
    38 <li class="dt">动作 / 科幻 / 冒险</li>
    39 <li class="dt">美国</li>
    40 </div>'''
    41     for div in divList:
    42         movie = {}
    43         thumb = div.find('img')['src']
    44         # print(thumb)
    45         # 获取所有a标签,其中第二个a标签里是电影名
    46         title = div.find_all('a')[1].text
    47         # print(title)
    48         # 有些电影里没有类型这列,所有要考虑下,不能直接赋值类型
    49         li_tag = div.find_all('li')
    50         if len(li_tag)==4:
    51             date = li_tag[0].text
    52             type = li_tag[1].text
    53             area = li_tag[2].text
    54             want = li_tag[3].text
    55         else:
    56             date = li_tag[0].text
    57             type = ''
    58             area = li_tag[1].text
    59             want = li_tag[2].text
    60 
    61         # print(date)
    62         movie['title'] = title
    63         movie['thumb'] = thumb
    64         movie['date'] = date
    65         movie['type'] = type
    66         movie['area'] = area
    67         movie['want'] = want
    68         movies.append(movie)
    69     return movies
    70 
    71 
    72 def save_text(data):
    73     with open('douban2.json', 'w', encoding='utf-8') as fp:
    74         json.dump(data, fp, ensure_ascii=False)
    75 
    76 
    77 if __name__ == '__main__':
    78     text = get_page()
    79     data = get_data(text)
    80     save_text(data)

    两个li标签的区别

    或者以后可以参考https://www.jianshu.com/p/c64fe2a20bc9,这个代码写的可以借鉴。

  • 相关阅读:
    no module name cx_oracle 的解决方法
    开通博客
    普通用户启动Hadoop格式化namenode出现无法创建目录的问题
    改写文件权限时出现问题___2
    suse添加普通用户赋予root所有权限时出现问题___1
    suse系统vim未正常退出产生的问题(can't write viminfo file /home/zhaoy/.viminfo)
    intellij idea根据mvn仓库添加或改变scala-sdk
    git拉项目和上传项目时遇到的一些问题
    简单的clone项目fromGitHub
    初始机器学习
  • 原文地址:https://www.cnblogs.com/weiwei2016/p/10162660.html
Copyright © 2011-2022 走看看