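Web Scraping: Downloading a Novel

This post uses the Biquge (笔趣阁) site as an example and scrapes the novel 一念永恒 (A Will Eternal).

Link: http://www.biqukan.com/1_1094

The full code is as follows: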

from bs4 import BeautifulSoup
import requests
import sys


def Down_this_chapter(chapter_url, name):  # download a single chapter
    r = requests.get(chapter_url, timeout=30)  # time out so a slow response cannot hang the crawler
    r.raise_for_status()  # raise an exception if the server returned a 4xx/5xx status
    r.encoding = r.apparent_encoding  # use the encoding detected from the response body
    demo = r.text  # page text
    soup = BeautifulSoup(demo, 'lxml')  # parse the page
    text = soup.find_all(id='content', class_='showtxt')  # the div that holds the chapter body
    soup_text = BeautifulSoup(str(text), 'lxml')  # re-parse just that fragment
    demo1 = soup_text.div.text.replace('\xa0', '')  # strip non-breaking spaces
    print(name)
    # Append the chapter to a file on drive D:
    with open(r'D:\一念永恒.txt', 'a', encoding='utf-8') as f:
        f.write('\t' * 10 + name + '\n')  # indent the chapter title
        f.write(demo1)
        f.write('\n\n')
    # No explicit f.close() is needed: the with block closes the file.


def Novel_url(novel_url):  # collect the chapter links and download each chapter
    r = requests.get(novel_url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    demo = r.text
    soup = BeautifulSoup(demo, 'lxml')
    text = soup.find_all('div', class_='listmain')
    soup_url = BeautifulSoup(str(text), 'lxml')
    flag = False
    numbers = len(soup_url.dl.contents) - 1  # rough total, used only for the progress display
    index = 1
    for child in soup_url.dl.children:  # walk the chapter list
        if child != '\n':  # skip whitespace nodes
            if child.string == "《一念永恒》正文卷":  # heading that starts the main text
                flag = True  # from here on, links are chapters we want
            if flag and child.a is not None:  # only entries that contain a link
                download_url = "http://www.biqukan.com" + child.a.get('href')  # absolute chapter URL
                name = child.string
                Down_this_chapter(download_url, name)
                # index / numbers is a fraction between 0 and 1; it is printed as-is
                sys.stdout.write("Downloaded: %.3f%%" % float(index / numbers) + '\n')
                sys.stdout.flush()
                index += 1


def main():
    novel_url = 'http://www.biqukan.com/1_1094/'  # index page of the novel on Biquge
    Novel_url(novel_url)  # download every chapter linked from it
    print("Novel downloaded successfully; see the file on drive D:")


if __name__ == '__main__':
    main()

"""Part of the output:
外传1 柯父。
Downloaded: 0.000%
外传2 楚玉嫣。
Downloaded: 0.001%
外传3 鹦鹉与皮冻。
Downloaded: 0.001%
第一章 他叫白小纯
Downloaded: 0.002%
第二章 火灶房
Downloaded: 0.002%
第三章 六句真言
Downloaded: 0.002%
第四章 炼灵
Downloaded: 0.003%
第五章 万一丢了小命咋办
Downloaded: 0.003%
第六章 灵气上头
Downloaded: 0.003%
第七章 龟纹认主
Downloaded: 0.004%
第八章 我和你拼了！
Downloaded: 0.004%
第九章 延年益寿丹
Downloaded: 0.005%
第十章 师兄别走
Downloaded: 0.005%
第十一章 侯小妹
Downloaded: 0.005%

"""
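
The progress line in the script formats `index/numbers` directly. That value is a fraction between 0 and 1, which is why the sample output still reads 0.005% after more than a dozen chapters. A minimal sketch of a scaled version follows. The helper names `progress_percent` and `format_progress` and the counts in the demo are hypothetical and not part of the original script. Note also that the script's total (`len(soup_url.dl.contents) - 1`) counts nodes that are not chapter links, so an exact percentage would need the number of chapter links as the total.

```python
def progress_percent(done, total):
    """Return completed work as a percentage in the range 0 to 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * done / total


def format_progress(done, total):
    """Format a progress line with three decimal places."""
    return "Downloaded: %.3f%%" % progress_percent(done, total)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formatting.
    for done in (1, 2, 4):
        print(format_progress(done, 4))
```

In the crawler, the call `format_progress(index, numbers)` could replace the inline format expression.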

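Summary: study the source of the page you want to scrape thoroughly before writing any code; otherwise you may not get the results you want.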
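
One way to do that analysis is to print an outline of a page's tags with their `id` and `class` attributes, then pick selectors from it. The sketch below is only an illustration. It uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages. The `SAMPLE_HTML` markup, its `href` values, and the names `TagOutline` and `outline` are invented for the example; the markup is only loosely modelled on the `listmain` and `showtxt` selectors used in the script.

```python
from html.parser import HTMLParser

# Invented markup for illustration; the real page will differ.
SAMPLE_HTML = """
<div class="listmain">
  <dl>
    <dt>latest chapters</dt>
    <dd><a href="/example/9.html">newest</a></dd>
    <dt>《一念永恒》正文卷</dt>
    <dd><a href="/example/1.html">第一章 他叫白小纯</a></dd>
  </dl>
</div>
<div id="content" class="showtxt">chapter text<br>more text</div>
"""

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagOutline(HTMLParser):
    """Record every opening tag with its nesting depth, id and class."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.rows.append((self.depth, tag, attrs.get("id"), attrs.get("class")))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def outline(html):
    """Return a list of (depth, tag, id, class) tuples for the given HTML."""
    parser = TagOutline()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for depth, tag, id_, cls in outline(SAMPLE_HTML):
        print("  " * depth + tag, "id=%s" % id_, "class=%s" % cls)
```

Reading such an outline makes it easier to see which container holds the chapter list and which holds the chapter text before committing to `find_all` arguments.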

If you spot any shortcomings or mistakes, corrections are welcome. Let's learn and improve together!

Original post: https://www.cnblogs.com/yinbiao/p/8215174.html