zoukankan html css js c++ java

python爬虫第二天

# 爬取糗事百科项目由于该网站正在维护我只是按照已经能够爬取出来的html代码手动观察修改正则表达式的内容

# 代码

# coding=utf-8
# 导入相应库文件
import requests
# from bs4 import BeautifulSoup
import re

# 设置请求头(不知道为什么我这个请求头访问不了搜狐浏览器)
# headers = { "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/
#            75.0.3770.100 Safari/537.36"
#           }
infoLists = []


def get_info(url):
    res = requests.get(url)
    # print(res.content.decode("utf-8")) # 输出已经爬取得网页html代码
    ids = re.findall("<title>(.*?)</title>", res.content.decode("utf-8"), re.S)
    # print(ids) # 查看爬取结果
    levels = re.findall("<h1>(.*?)</h1>", res.content.decode("utf-8"), re.S)
    # print(levels) # 查看爬取结果
    for id, level in zip(ids, levels):
        infos = {
            "id":id,
            "level":level
        }
        infoLists.append(infos)
    # 注意ids和levels不能为空列表


if __name__ == '__main__':
    urls = ["http://www.qiushibaike.com/text/page/{}/".format(i)for i in range(2, 3)]
    for url in urls:
        get_info(url)
    for infoList in infoLists:
        with open("qiushi.txt","a") as fileObject:
            try:
                fileObject.write(infoList["id"]+"
")
                fileObject.write(infoList["level"])
                print("写入成功")
            except Exception as e:
                print(e)
                print("输出出现错误")
            finally:
                pass

# 函数解析

re.findall(pattern, string, flags=0) 通过使用该函数可以匹配所有符合查询规律的内容通过查询API文档可以发现 pattern是需要查询的格式类型，string

是需要进行查询的字符串信息，flags是标志位可以对查询进行控制。

re.S 使匹配包括换行符在内的所有字符通过使用re.S可以进行多行匹配

查看全文

相关阅读:
vue点击实现箭头的向上与向下
 ES6中箭头函数加不加大括号的区别
 angular图片的两种表达方式
 通过添加类名实现样式的变化
 angular中路由跳转并传值四种方式
 Sublime Text 3 设置文件详解(settings文件)
Second：eclipse配置Tomcat
First：安装配置JDK and 部署Tomcat
本地环境代码一码云一服务器代码部署
 2.sublime设置本地远程代码同步

原文地址：https://www.cnblogs.com/walxt/p/11759164.html