  • Scraping Dynamically Loaded Data (Ajax)

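    Characteristics:

    1) Right-click -> View Page Source shows none of the actual data.
    2) The data is loaded when you scroll the mouse wheel or perform some other action on the page.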

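    How to capture it:

    1) Press F12 to open the developer tools, then perform the page action and watch the Network panel for the requests it triggers.
    2) Find the URL of the request that returns the JSON data.
    # XHR filter in the Network panel: the asynchronously loaded requests
    # XHR -> Query String Parameters: the query parameters sent with the request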
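
Once a request URL has been copied out of the Network panel, it can help to split it into the base URL and its query parameters before writing any scraping code. The following is a minimal sketch using only the standard library; the function name `split_request_url` and the example.com URL are hypothetical and serve only as illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def split_request_url(request_url):
    """Split a captured request URL into (base_url, params)."""
    parts = urlsplit(request_url)
    # Everything before the '?' is the base URL
    base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'
    # keep_blank_values=True preserves parameters that are sent empty
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return base_url, params


if __name__ == '__main__':
    # Hypothetical URL standing in for one copied from the XHR tab
    captured = 'https://example.com/api/list?page=1&size=20&keyword='
    base_url, params = split_request_url(captured)
    print(base_url)
    print(params)
```

The resulting dictionary has the same shape as the Query String Parameters view in the developer tools, so it can be compared against that view directly.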

    Example: scraping Douban movie data

    1. Goal

    1) Page: Douban Movies - Rankings - Drama (豆瓣电影 - 排行榜 - 剧情)
    2) Data to extract: movie title and movie rating

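    2. Capture the request with F12 (XHR)

    1) Request URL (base URL): https://movie.douban.com/j/chart/top_list?
    2) Query String (query parameters)
    # The captured query parameters:
    type: 13   # movie genre ID
    interval_id: 100:90
    action: ''
    start: 0   # starting index of the movies loaded by each request
    limit: 20  # number of movies loaded by each request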
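
Because `start` is an offset and `limit` is the page size, paging through the ranking means increasing `start` by `limit` on each request. The sketch below builds the per-page parameters and the resulting URLs without sending any request. The helper names `page_params` and `page_urls`, and the total of 45 movies, are assumptions made for illustration; the base URL and parameter names are the ones captured above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/chart/top_list'


def page_params(type_id, total, limit=20):
    """Yield one query-parameter dict per page of `limit` movies."""
    for start in range(0, total, limit):
        yield {
            'type': str(type_id),
            'interval_id': '100:90',
            'action': '',
            'start': str(start),
            'limit': str(limit),
        }


def page_urls(type_id, total, limit=20):
    """Return the full request URL for every page."""
    return [BASE_URL + '?' + urlencode(p)
            for p in page_params(type_id, total, limit)]


if __name__ == '__main__':
    # 45 is a made-up total used only to show three pages
    for url in page_urls(13, 45):
        print(url)
```

Note that `urlencode` percent-encodes the colon in `interval_id`, so `100:90` appears as `100%3A90` in the final URL.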

    3. Implementation

    import time

    import requests
    from fake_useragent import UserAgent


    class DoubanSpider:
        def __init__(self):
            self.base_url = 'https://movie.douban.com/j/chart/top_list'
            self.count_url = 'https://movie.douban.com/j/chart/top_list_count'
            self.ua = UserAgent()
            self.i = 0  # number of movies scraped so far

        def get_html(self, params):
            # Send a random User-Agent with every request
            headers = {'User-Agent': self.ua.random}
            res = requests.get(
                url=self.base_url,
                params=params,
                headers=headers
            )
            res.encoding = 'utf-8'
            html = res.json()
            # Pass the decoded JSON straight to the parser
            self.parse_html(html)

        def parse_html(self, html):
            # html: [{movie 1 info}, {movie 2 info}, ...]
            for one in html:
                # Build a fresh dict for each movie instead of reusing one
                item = {
                    'name': one['title'],
                    'score': one['score'],
                    'time': one['release_date'],
                }
                print(item)
                self.i += 1

        def get_total(self, typ):
            # Ask the count endpoint how many movies this genre has
            params = {'type': typ, 'interval_id': '100:90'}
            headers = {'User-Agent': self.ua.random}
            html = requests.get(url=self.count_url, params=params, headers=headers).json()
            return html['total']

        def main(self):
            ty = input('Enter a movie genre (drama|comedy|action): ')
            # Genre name -> Douban type ID
            typ_dict = {'drama': '11', 'comedy': '24', 'action': '5'}
            if ty not in typ_dict:
                print('Unknown genre:', ty)
                return
            typ = typ_dict[ty]
            total = self.get_total(typ)
            # Request 20 movies at a time until the total is reached
            for page in range(0, int(total), 20):
                params = {
                    'type': typ,
                    'interval_id': '100:90',
                    'action': '',
                    'start': str(page),
                    'limit': '20'
                }
                self.get_html(params)
                time.sleep(1)  # pause between requests
            print(self.i)


    if __name__ == '__main__':
        spider = DoubanSpider()
        spider.main()

     

  • Original article: https://www.cnblogs.com/maplethefox/p/11352329.html