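Warning: the code in this post is for learning purposes only. Illegal or commercial use is prohibited.
I'm using the popular novel 《黎明之剑》 as an example here. If you enjoy 《黎明之剑》, please support the official release.
The same code works for most other books on the 笔趣阁 site with little change, and most other pirate-novel sites can be handled with much the same approach.
With a few tweaks it could also download a whole book by walking the novel's table-of-contents page. I didn't bother doing that here, but interested readers can give it a try.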
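As a rough idea of what the whole-book tweak could look like, here is a minimal sketch of the first step: collecting chapter links from a table-of-contents page. The markup in `SAMPLE_TOC` (chapter links as `<a>` tags inside `<dd>` elements) and the chapter paths are assumptions made up for illustration, so check the real page before relying on it. To keep the sketch dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup, and the names `ChapterLinkParser` and `extract_chapter_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect (title, absolute_url) pairs from <a> tags inside <dd> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # collected (title, url) pairs
        self._in_dd = False   # True while inside a <dd> element
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == 'dd':
            self._in_dd = True
        elif tag == 'a' and self._in_dd:
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._text).strip()
            # Resolve relative links against the table-of-contents URL
            self.links.append((title, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == 'dd':
            self._in_dd = False


def extract_chapter_links(html, base_url):
    parser = ChapterLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical table-of-contents markup; the real page layout may differ.
SAMPLE_TOC = """
<div class="listmain"><dl>
  <dd><a href="/book/36438/1.html">Chapter 1</a></dd>
  <dd><a href="/book/36438/2.html">Chapter 2</a></dd>
</dl></div>
<a href="/about.html">About</a>
"""

links = extract_chapter_links(SAMPLE_TOC, 'https://www.biqiuge.com/book/36438/')
for title, url in links:
    print(title, url)
```

Each collected URL could then be fetched and cleaned up the same way this post's script handles the latest chapter, with the results appended to a single file.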
# -*- coding:UTF-8 -*-
# Author's blog: https://www.cnblogs.com/Raine/
# 2019-06-20

import requests
from bs4 import BeautifulSoup


class TheLatest(object):
    # Test: scrape the latest chapter of 《黎明之剑》 from 笔趣阁

    def __init__(self):
        self.url_dir = 'https://www.biqiuge.com/book/36438/'
        self.bookname = ""     # book title
        self.chaptername = ""  # chapter title
        self.url_latest = ""   # link to the latest chapter
        self.get_download_url()

    def get_download_url(self):
        # Read what we need directly from the <meta> tags in the page's <head>
        r1 = requests.get(self.url_dir)
        # The page is GBK-encoded, so set the encoding before reading .text
        r1.encoding = 'GBK'
        html_1 = r1.text
        bs_div = BeautifulSoup(html_1, 'lxml')
        # Find the tags we need, then extract their "content" attribute
        _bookname = bs_div.find('meta', property='og:novel:book_name')
        self.bookname = _bookname.get('content')
        _chaptername = bs_div.find('meta', property='og:novel:latest_chapter_name')
        self.chaptername = _chaptername.get('content')
        _url_latest = bs_div.find('meta', property='og:novel:latest_chapter_url')
        self.url_latest = _url_latest.get('content')

    def get_content(self):
        r2 = requests.get(self.url_latest)
        r2.encoding = 'GBK'
        html_content = r2.text
        bs_div = BeautifulSoup(html_content, 'lxml')
        # The chapter body lives in <div class="showtxt">
        txt = bs_div.find('div', 'showtxt')
        # Tidy up the text: turn non-breaking spaces into ordinary spaces
        txt = txt.text.replace('\xa0', ' ')
        # The middle dot comes through as the garbled sequence U+FFFD '6' U+FFFD '1'
        txt = txt.replace('\ufffd6\ufffd1', '·')
        # Drop the trailing text that starts at the chapter's own URL
        out_content = txt.split(self.url_latest)[0]
        return out_content


if __name__ == '__main__':
    txt_content = TheLatest()
    filename = txt_content.bookname + txt_content.chaptername + '.txt'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(txt_content.get_content())
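The script sets the response encoding to GBK before reading the text. The small standalone sketch below, which uses only the standard library and no network access, illustrates why that matters: the same bytes decode cleanly as GBK but turn into garbage, including the replacement character U+FFFD, when decoded with the wrong codec. The sample string and variable names are just for illustration. Characters that do not decode cleanly are presumably also where the U+FFFD sequence handled in the script's cleanup step comes from.

```python
title = '黎明之剑'
raw = title.encode('gbk')  # the bytes a GBK-encoded page would send for this text

# Decoding with the right codec restores the original text
decoded_ok = raw.decode('gbk')

# Decoding with the wrong codec yields mojibake; invalid bytes become U+FFFD
decoded_bad = raw.decode('utf-8', errors='replace')

print(decoded_ok)
print(repr(decoded_bad))
```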
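References:
Python3 Web Scraping: A Quick-Start Hands-On Walkthrough: https://cuijiahua.com/blog/2017/10/spider_tutorial_1.html
Python: Web Scraping [Setting Request Headers with Requests]: https://blog.csdn.net/ysblogs/article/details/88530124
Python 3.x Web Scraping Tutorial: Scraping Pages, Scraping Images, Automatic Login: https://blog.csdn.net/Evankaka/article/details/46849095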