Python Web Scraping: Scraping a Novel

Without further ado, let's get straight to the point.

The site I'm scraping today is Qidian (起点中文网), and the content is a novel.

First, import the libraries:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
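Then open the URL and parse the response:
    # URL of the novel's first chapter
    html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")
    # Create the BeautifulSoup object; naming the parser explicitly avoids a warning
    bsObj = BeautifulSoup(html, "html.parser")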
First, try scraping the novel text on this page:
    # find is a method of the BeautifulSoup object; it returns the first matching tag
    firstChapter = bsObj.find("div", {"class": "read-content"})
    print(firstChapter.get_text())
The find method can also be combined with regular expressions, which is common when scraping resources such as images and videos.
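As a minimal sketch of pairing find with a regular expression, the snippet below runs against a made-up HTML string rather than a real page. The image paths and the variable names (`sample_html`, `jpg_pattern`) are assumptions for illustration only, not anything taken from Qidian.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page (hypothetical file names)
sample_html = """
<div>
  <img src="/static/cover.jpg">
  <img src="/static/logo.png">
  <img src="/static/banner.jpg">
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A compiled pattern can be used as an attribute value;
# Beautiful Soup keeps tags whose src matches pattern.search()
jpg_pattern = re.compile(r"\.jpg$")

first_jpg = soup.find("img", {"src": jpg_pattern})     # first match only
all_jpgs = soup.find_all("img", {"src": jpg_pattern})  # every match

print(first_jpg["src"])
print([img["src"] for img in all_jpgs])
```

The same pattern works for any attribute, so a video link could be matched on `href` in the same way.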

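Because all the content scraped here sits in a single box whose class attribute is read-content, the find method is enough. If the text on a page were spread across several boxes, findAll should be used instead; it returns a collection (a list-like result set), so the results need to be printed in a loop.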
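To illustrate the multi-box case, here is a small sketch on an invented HTML string where the text is split across two read-content boxes. It uses `find_all`, the current spelling of `findAll` in Beautiful Soup 4; the sample markup and the `sidebar` class are assumptions.

```python
from bs4 import BeautifulSoup

# Invented markup: the chapter text is split across two boxes
sample_html = """
<div class="read-content"><p>First block of text.</p></div>
<div class="sidebar">Not part of the chapter.</div>
<div class="read-content"><p>Second block of text.</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag, in document order
blocks = soup.find_all("div", {"class": "read-content"})

# Loop over the result set and collect the text of each box
texts = [block.get_text(strip=True) for block in blocks]
for text in texts:
    print(text)
```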

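Putting the code together and running it shows that the chapter text can be scraped. That covers one chapter of the novel, so how should the following chapters be scraped?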

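From the steps above, scraping further chapters only requires the URL of the next chapter. First, wrap the printing logic in a function, so that each new address can be passed in to print the corresponding text:
    def writeNovel(html):
        bsObj = BeautifulSoup(html, "html.parser")
        chapter = bsObj.find("div", {"class": "read-content"})
        print(chapter.get_text())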
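The remaining question is how to get the URL of the next chapter. Inspecting the page structure shows that the "next chapter" button is actually an a tag whose id is j_chapterNext, so the next chapter's URL can be read from that tag's href attribute.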
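The lookup described above can be tried offline with a short sketch. It assumes the href is protocol-relative (starting with `//`) and uses a hypothetical book path. `urljoin` from the standard library fills in the scheme from the current page's URL. Note that this sketch does not append a `.html` suffix, so the exact URL shape a real site expects still needs checking.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical URL of the page currently being read
current_url = "http://read.qidian.com/chapter/example-book/chapter-1.html"

# Invented markup for the "next chapter" button (assumed protocol-relative href)
sample_html = (
    '<a id="j_chapterNext" '
    'href="//read.qidian.com/chapter/example-book/chapter-2">Next</a>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# Look the link up by tag name and id
next_link = soup.find("a", {"id": "j_chapterNext"})

# Resolve the href against the current page; None if the link is missing
next_url = urljoin(current_url, next_link.get("href")) if next_link else None
print(next_url)
```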

Repackaging the function and tidying up gives:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def writeNovel(html):
        # Parse the page and print the chapter text
        bsObj = BeautifulSoup(html, "html.parser")
        chapter = bsObj.find("div", {"class": "read-content"})
        print(chapter.get_text())
        # The "next chapter" button is an a tag with id j_chapterNext
        bsoup = bsObj.find("a", {"id": "j_chapterNext"})
        html2 = "http:" + bsoup.get("href") + ".html"
        return urlopen(html2)

    html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")

    # Print nine chapters, following the next-chapter link each time
    i = 1
    while i < 10:
        html = writeNovel(html)
        i = i + 1
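A fixed loop count assumes every page has a next-chapter link. As a hedged sketch of one way to stop cleanly when it does not, the example below follows links until the link or the content box is missing. To stay runnable offline it reads from an invented in-memory dictionary (`fake_pages`) instead of calling urlopen; the `crawl` function and its `fetch` and `max_chapters` parameters are hypothetical names, not part of the original script.

```python
from bs4 import BeautifulSoup

# Invented in-memory "site": the last page has no next-chapter link
fake_pages = {
    "/ch1": '<div class="read-content">Chapter one.</div>'
            '<a id="j_chapterNext" href="/ch2">next</a>',
    "/ch2": '<div class="read-content">Chapter two.</div>'
            '<a id="j_chapterNext" href="/ch3">next</a>',
    "/ch3": '<div class="read-content">Chapter three.</div>',
}

def crawl(start, fetch, max_chapters=10):
    """Collect chapter texts, stopping at max_chapters or a dead end."""
    chapters = []
    url = start
    while url is not None and len(chapters) < max_chapters:
        soup = BeautifulSoup(fetch(url), "html.parser")
        content = soup.find("div", {"class": "read-content"})
        if content is None:
            break  # page layout differs from what we expect
        chapters.append(content.get_text())
        next_link = soup.find("a", {"id": "j_chapterNext"})
        # No link means this was the last chapter
        url = next_link.get("href") if next_link else None
    return chapters

chapters = crawl("/ch1", fake_pages.get)
print(chapters)
```

Swapping `fake_pages.get` for a function that downloads a page would turn this into a live crawler, though that part is left out here.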

Write the text to a text file:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def writeNovel(html):
        bsObj = BeautifulSoup(html, "html.parser")
        chapter = bsObj.find("div", {"class": "read-content"})
        print(chapter.get_text())
        # Append the chapter to the file; the with block closes it automatically
        with open("novel.text", "a", encoding="utf-8") as fo:
            fo.write(chapter.get_text())
        bsoup = bsObj.find("a", {"id": "j_chapterNext"})
        html2 = "http:" + bsoup.get("href") + ".html"
        return urlopen(html2)

    html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")

    # Save seven chapters
    i = 1
    while i < 8:
        html = writeNovel(html)
        i = i + 1

Original article: https://www.cnblogs.com/puffmoff/p/7147613.html