  • Web crawler practice: scraping a web novel

    A practice script that uses requests and BeautifulSoup to collect every chapter link from a novel's catalog page, then download each chapter's text into its own file.

    import requests
    import bs4

    # Fetch the HTML source of a page
    def gethtml(url):
        try:
            response = requests.get(url)
            response.raise_for_status()
            # Use the detected encoding so the Chinese text decodes correctly
            response.encoding = response.apparent_encoding
            return response.text
        except requests.RequestException:
            # Return an empty page on failure so the caller simply finds no chapters
            return ""

    # Extract the text of one chapter page and save it to <name>.txt
    def chapters(url, name):
        html = gethtml("http://www.bjkgjlu.com" + url)
        soup = bs4.BeautifulSoup(html, "html.parser")
        for i in soup.find_all("div", attrs={"class": "chapter_content"}):
            with open(name + ".txt", "wb") as f:
                # Drop everything after the first "&lt" artifact, then write as UTF-8
                f.write(i.text.split("&lt")[0].encode("utf-8"))
                print(name + " crawled and saved to file")

    if __name__ == "__main__":
        url = "http://www.bjkgjlu.com/303618kyi/catalog"
        chapter_name_list = []
        chapter_url_list = []
        html = gethtml(url)
        soup = bs4.BeautifulSoup(html, "html.parser")

        # Each chapter link sits inside a grid cell with these Bootstrap-style classes
        for i in soup.find_all("div", attrs={"class": "col-xs-120 col-sm-60 col-md-40 col-lg-30"}):
            for j in i.children:
                chapter_name_list.append(j.text)
                chapter_url_list.append(j.get("href"))
        print(chapter_name_list)
        for j in range(len(chapter_name_list)):
            chapters(chapter_url_list[j], chapter_name_list[j])
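
    A small refinement, not part of the original script: the main loop fetches the catalog and then every chapter page back to back, so if the site throttles rapid requests, the fetch step can send a browser-like User-Agent header and pause briefly between downloads. The sketch below keeps the same return contract as gethtml above; the header value, timeout, and delay are illustrative assumptions, not taken from the original post.

    import time
    import requests

    # Hypothetical polite-fetch variant of gethtml: same inputs and outputs,
    # but with a browser-like User-Agent and a short pause after each request.
    HEADERS = {"User-Agent": "Mozilla/5.0 (crawler practice script)"}

    def gethtml_polite(url, delay=1.0):
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()
            response.encoding = response.apparent_encoding
            return response.text
        except requests.RequestException:
            return ""
        finally:
            # Sleep even on failure so repeated retries do not hammer the server
            time.sleep(delay)

    Swapping gethtml_polite in for gethtml inside chapters and the main block would leave the rest of the script unchanged.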
                  