  • Python: Writing a Douban popular-books scraper with BeautifulSoup

      Anaconda3 ships with the bs4 package, which saved me from installing it myself.

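      Lately I've come to feel that writing code in a modular way makes it clearer and easier to read. As the code grows, it also makes bugs easier to track down. (Not that I can write that much code yet.) Modular code carries a tool-minded, borrow-what-works philosophy too, and using tools is, after all, the privilege of humans and a few other intelligent animals. Going forward I also want to practice the [try - except] pattern more, since it makes errors plain to see.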
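      As a rough illustration of the [try - except] idea, here is a minimal sketch that catches a failed request, reports it, and returns a fallback instead of crashing. The function name `fetch_text` and the 10-second `timeout` are my own assumptions for illustration, not part of the scraper in this post.

```python
import requests

def fetch_text(url, timeout=10):
    """Return the response body of `url`, or None if the request fails.

    `fetch_text` and the default timeout are hypothetical choices
    made for this sketch.
    """
    try:
        resp = requests.get(url, timeout=timeout)
        # Turn HTTP error statuses (4xx/5xx) into exceptions as well
        resp.raise_for_status()
    except requests.RequestException as exc:
        # All requests errors derive from RequestException,
        # so one handler covers bad URLs, timeouts and HTTP errors
        print(f"request failed: {exc!r}")
        return None
    return resp.text
```

      Catching `requests.RequestException` rather than a bare `Exception` keeps unrelated bugs (a typo, a wrong variable name) visible instead of silently swallowing them.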

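      I'm new to web scraping, and so far I can only scrape clean, well-structured static pages like Douban's. Complex dynamic pages driven by JS are still beyond me.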

      # -*- coding: utf-8 -*-
      """
      Created on Tue Jan  2 17:44:30 2018

      @author: xglc
      Fetch the contents of the [New Book Express] (新书速递) section of Douban Books
      """
      import requests
      from bs4 import BeautifulSoup

      def _gethtml():
          # Download the Douban Books home page and return its HTML in a list
          try:
              req = requests.get('https://book.douban.com/')
              data1 = []
              data1.append(req.text)
          except requests.RequestException as e:
              # Show what went wrong, then re-raise with the original traceback
              print('request failed:', e)
              raise
          return data1

      def _getdata(html):
          # Collect the title and author of every book in the list
          title = []
          author = []
          data2 = {}
          soup = BeautifulSoup(html,'html.parser')
          for li in soup.find('ul',attrs={'class':'list-col list-col5 list-express slide-item'}).find_all("li"):
              info = li.find('div',class_='info')
              title.append(info.find('div',class_='title').text)
              author.append(info.find('div',class_='author').text)
          data2['title'] = title
          data2['author'] = author
      #    print (data2)
          return data2

      def _txt(data3):
          # Write one title per line; the with block closes the file for us
          with open('f://book.txt','w',encoding='utf-8') as f:
              for title in data3['title']:
                  f.write(title.strip() + '\n')

      if __name__ == '__main__':
          htmls = _gethtml()
          data = _getdata(htmls[0])
          _txt(data)
      #    print (data['title'])
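      The parsing step can be tried without touching the network. The sketch below runs the same idea against a small inline HTML sample using CSS selectors. The sample markup, the book names, and the function name `parse_books` are all made up for illustration; only the class names (`list-express`, `info`, `title`, `author`) come from the scraper above, and the real Douban page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the structure the scraper expects
SAMPLE_HTML = """
<ul class="list-col list-col5 list-express slide-item">
  <li><div class="info">
    <div class="title"><a href="#">Book One</a></div>
    <div class="author">Author A</div>
  </div></li>
  <li><div class="info">
    <div class="title"><a href="#">Book Two</a></div>
  </div></li>
</ul>
"""

def parse_books(html):
    """Return a list of {'title', 'author'} dicts found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for li in soup.select('ul.list-express li'):
        title = li.select_one('div.info div.title')
        author = li.select_one('div.info div.author')
        if title is None:
            # Skip list items that do not describe a book
            continue
        books.append({
            'title': title.get_text(strip=True),
            # A missing author becomes an empty string instead of an error
            'author': author.get_text(strip=True) if author else '',
        })
    return books

books = parse_books(SAMPLE_HTML)
for book in books:
    print(book['title'], '|', book['author'])
```

      Checking for `None` before reading the text avoids the `AttributeError` that chained `find(...).text` calls raise when one element is missing.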
  • Original article: https://www.cnblogs.com/aubucuo/p/doubanbook.html