  • bs4 in practice: scraping Romance of the Three Kingdoms


    # Goal: scrape the chapter titles and chapter text of the novel
    # "Romance of the Three Kingdoms" from
    # http://www.shicimingju.com/book/sanguoyanyi.html
    import requests
    from bs4 import BeautifulSoup

    if __name__ == "__main__":
        # Spoof a browser User-Agent so the request looks like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        }
        # Fetch the index page
        url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
        page_text = requests.get(url=url, headers=headers).text

        # 1. Instantiate a BeautifulSoup object and load the page source into it
        soup = BeautifulSoup(page_text, 'lxml')
        # 2. Parse each chapter's title and detail-page URL from the index page
        li_list = soup.select('.book-mulu > ul > li')

        # The with-block closes the output file even if a request fails midway
        with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
            for li in li_list:
                title = li.a.string  # todo
                detail_url = 'http://www.shicimingju.com' + li.a['href']
                # Request the detail page and parse out the chapter text
                detail_page_text = requests.get(url=detail_url, headers=headers).text
                detail_soup = BeautifulSoup(detail_page_text, 'lxml')
                div_tag = detail_soup.find('div', class_='chapter_content')
                # .text is a property, not a method, so it takes no parentheses
                content = div_tag.text
                # Write one chapter per entry, separated by a newline
                fp.write(title + ':' + content + '\n')
                print(title, 'scraped successfully')
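
The parsing step can be tried offline before any requests are sent. The sketch below runs the same `.book-mulu > ul > li` selector against a small inline HTML string. The sample markup, the chapter titles, and the helper name `parse_chapters` are hypothetical placeholders, and the real page structure may differ. It also uses `urllib.parse.urljoin` instead of string concatenation to build the detail URL, and the built-in `html.parser` so that `lxml` is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the index page; the real markup may differ.
SAMPLE_INDEX = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
    <li>no link here</li>
  </ul>
</div>
"""

BASE_URL = 'http://www.shicimingju.com/book/sanguoyanyi.html'


def parse_chapters(html, base_url):
    """Return (title, absolute_url) pairs for every list item that has a link."""
    soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no lxml needed
    chapters = []
    for li in soup.select('.book-mulu > ul > li'):
        link = li.find('a', href=True)
        if link is None:
            continue  # skip list items without a usable link
        title = link.get_text(strip=True)
        # urljoin resolves the href against the page URL
        chapters.append((title, urljoin(base_url, link['href'])))
    return chapters


chapters = parse_chapters(SAMPLE_INDEX, BASE_URL)
for title, url in chapters:
    print(title, url)
```

Skipping items without an `<a>` tag avoids the `AttributeError` that `li.a.string` would raise on such an item, and `get_text(strip=True)` still returns the title when the link contains nested tags, a case where `.string` would be `None`.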


  • Original article: https://www.cnblogs.com/huahuawang/p/12692354.html