Learning Progress 10: Python Web Scraper

The first case study in learning web scraping is a novel scraper.

A novel scraper starts by parsing the source of the novel's index page, where the link to each chapter's content can be found.
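
The parsing step can be tried offline before touching the network. The following is a minimal sketch, assuming the index page wraps its chapter list in the same `<dl>`/`<dd>` markup the scraper's regular expressions expect. The `sample_html` string and the `parse_chapter_links` helper are hypothetical names introduced here for illustration and are not part of the original post.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.92kshu.cc/69509/'


def parse_chapter_links(index_html, base_url=BASE_URL):
    """Return a list of (absolute_url, title) pairs found in the index HTML."""
    # re.S lets '.' match newlines, in case the <dl> block spans several lines.
    blocks = re.findall(
        r'<dl><dt><i class="icon"></i>正文</dt>(.*?)</dl>', index_html, re.S
    )
    if not blocks:
        # The expected chapter list was not found; return nothing instead of raising.
        return []
    pairs = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', blocks[0])
    # urljoin turns site-relative hrefs such as /69509/1.html into full URLs.
    return [(urljoin(base_url, href), title) for href, title in pairs]


# Hand-written sample (an assumption about the page layout, not real site data).
sample_html = (
    '<dl><dt><i class="icon"></i>正文</dt>'
    '<dd><a href="/69509/1.html">Chapter 1</a></dd>'
    '<dd><a href="/69509/2.html">Chapter 2</a></dd>'
    '</dl>'
)

for url, title in parse_chapter_links(sample_html):
    print(title, url)
```

Returning an empty list when the block is missing avoids the `IndexError` that indexing `[0]` on an empty `re.findall` result would raise if the site changes its markup.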

The scraper code:

import requests
import re

url = 'http://www.92kshu.cc/69509/'
response = requests.get(url)
response.encoding = 'gbk'
html = response.text
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
fb = open('%s.txt' % title, 'w', encoding='utf-8')
# Get the content of each chapter
# print(html)
dl = re.findall(r'<dl><dt><i class="icon"></i>正文</dt>(.*?)</dl>', html)[0]
print(dl)
chapter_info_list = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', dl)
print(chapter_info_list)
for chapter_info in chapter_info_list:
    chapter_url, chapter_title = chapter_info
    chapter_url = "http://www.92kshu.cc%s" % chapter_url
    # print(chapter_url)
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = 'gbk'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div class="chapter">(.*?)><br>', chapter_html)[0]
    # print(chapter_content)
    chapter_content = chapter_content.replace('<p>', '')
    chapter_content = chapter_content.replace('</p>', '')
    fb.write(chapter_title)
    fb.write(chapter_content)
    fb.write('\n')
    print(chapter_url)

# Close the output file so that buffered text is flushed to disk
fb.close()
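
The clean-up and file-writing steps can also be separated from the download loop, which makes them easy to check without network access. This is a sketch under stated assumptions rather than the author's method: `clean_chapter` and `save_novel` are hypothetical helper names, the tag stripping uses `re.sub` instead of the chained `replace` calls above, and the demo fragment is invented.

```python
import os
import re
import tempfile


def clean_chapter(fragment):
    """Strip <p>, </p> and <br> tags, leaving one paragraph per line."""
    # Treat closing paragraph tags and line breaks as line boundaries.
    text = re.sub(r'</p>|<br\s*/?>', '\n', fragment)
    text = text.replace('<p>', '')
    # Drop surrounding whitespace and empty lines.
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


def save_novel(path, chapters):
    """Write (title, raw_html_fragment) pairs to a UTF-8 text file."""
    # The with-statement closes the file even if a write fails partway through.
    with open(path, 'w', encoding='utf-8') as fb:
        for title, fragment in chapters:
            fb.write(title + '\n')
            fb.write(clean_chapter(fragment) + '\n\n')


# Demo with an invented fragment, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')
save_novel(demo_path, [('Chapter 1', '<p>Line one</p><p>Line two</p>')])
with open(demo_path, encoding='utf-8') as f:
    print(f.read())
```

In the real scraper, each `(chapter_title, chapter_content)` pair produced inside the loop could be passed to `save_novel` in place of the direct `fb.write` calls.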

Scraper result:

Original post: https://www.cnblogs.com/zhaoxinhui/p/12291944.html