  • Python: scraping Douban (豆瓣)

    ...

    import urllib.request
    import time
    from bs4 import BeautifulSoup
    
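    def url_open(url):
        # Assumption: Douban may reject urllib's default User-Agent, so send a browser-like one.
        request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        return urllib.request.urlopen(request)

    def parse_html(response):
        """Print each movie on the page; return the next page's URL, or None when finished."""
        html_content = response.read()
        html_soup = BeautifulSoup(html_content, 'html.parser', from_encoding='utf-8')
        for li in html_soup.find_all('li'):
            em = li.find('em')
            title = li.find_all('span', class_='title')
            # other = li.find_all('span', class_='other')
            rating = li.find('span', class_='rating_num')
            # Skip <li> elements that are not movie entries (navigation items and the like).
            if not title or em is None or rating is None:
                continue
            rank = em.get_text()
            print("Rank: " + rank + "------Rating: " + rating.get_text() + "-------" + title[0].get_text())
            # rank is a string, so convert it before comparing with a number.
            if int(rank) == 250:
                return None
            # 25 movies per page: the last rank on a page is the next page's start offset.
            if int(rank) % 25 == 0:
                return "https://movie.douban.com/top250?start=" + rank + "&filter="
        # No further page was found on this page.
        return None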
    
    url = "https://movie.douban.com/top250?start=0&filter="
    if __name__ == '__main__':
        response = url_open(url)
        start_time = time.time()
        print("Start: " + str(start_time))
        # Follow the next-page URLs until parse_html reports there are none left.
        while True:
            url = parse_html(response)
            if url is None:
                break
            response = url_open(url)
        end_time = time.time()
        print("End: " + str(end_time))
        print("Total time: " + str(end_time - start_time) + " seconds")
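
The script above finds each next page by reading the last rank on the current one. As a rough alternative sketch, assuming the list stays at 250 entries with 25 per page (both figures are taken from the script, not confirmed against the site), all page URLs can be built up front. The names `BASE_URL`, `PAGE_SIZE`, `TOTAL` and `page_urls` are hypothetical and do not appear in the original post.

```python
# Sketch only: build every Top 250 page URL in advance instead of
# chaining from the last rank seen on each page.
# Assumptions: 250 entries in total, 25 entries per page.
BASE_URL = "https://movie.douban.com/top250?start={start}&filter="
PAGE_SIZE = 25
TOTAL = 250


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page, with start offsets 0, 25, 50, ..."""
    return [BASE_URL.format(start=start) for start in range(0, total, page_size)]
```

Calling `page_urls()` gives ten URLs, which could each be passed to `url_open` in a plain `for` loop. This makes the stopping condition independent of what the parser finds on the page.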
  • Original post: https://www.cnblogs.com/mysterious-killer/p/10156985.html