25-2 Regex crawler example

1. Import libraries

import re
from urllib.request import urlopen    # standard library; fetches a page (read() returns bytes, decode them to get a str)

Use urlopen to fetch a page's source code as a string:

res = urlopen('https://www.cnblogs.com/zhuangdd/p/12644081.html')
print(res.read().decode('utf-8'))

——————————————————————————————
<!DOCTYPE html>
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="referrer" content="origin" />
    <meta property="og:description" content="帮助学习的工具 http://tool.chinaz.com/regex/ 字符组 []在一个字符的位置上能出现的内容[1bc] 是一个范围[0-9][A-Z][a-z] 匹配三个字符[abc0-9]" />
    <meta http-equiv="Cache-Control" content="no-transform" />
    <meta http-equiv="Cache-Control" content="no-siteapp" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <title>25 -1  正则    re模块 （find
... (remaining output omitted)
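Some sites reject requests that carry urllib's default User-Agent, so it can help to send a browser-like header and to decode with the charset the server declares. The following is only a sketch under those assumptions: build_request and fetch are hypothetical helper names, not part of the original post, and "Mozilla/5.0" is a placeholder value.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Hypothetical helper: wrap the URL in a Request with a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Hypothetical helper: fetch the page and return its source as a str."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Building the request does not touch the network; only fetch() does.
req = build_request("https://www.cnblogs.com/zhuangdd/p/12644081.html")
print(req.get_header("User-agent"))  # Request stores header names capitalized this way
```

Calling fetch(url) in place of urlopen(url).read().decode('utf-8') should give the same kind of string, with the response closed automatically.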

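The flags argument accepts many optional values:

re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
re.M (MULTILINE): multi-line mode; ^ and $ also match at the start and end of each line
re.S (DOTALL): the dot matches any character, including newlines
re.L (LOCALE): locale-aware matching; makes \w, \W, \b, \B and case-insensitive matching depend on the current locale; in Python 3 it works only with bytes patterns; not recommended
re.U (UNICODE): \w, \W, \s, \S, \d, \D follow the character properties defined by Unicode; this is the default for str patterns in Python 3
re.X (VERBOSE): verbose mode; the pattern string can span multiple lines, whitespace is ignored, and comments can be added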

flags
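To see what the most common flags change in practice, here is a small sketch. The sample strings are made up for illustration; only the re functions and flags themselves come from the standard library.

```python
import re

text = "Hello\nworld"

# re.I: match regardless of case
ignore_case = re.findall(r"hello", text, re.I)

# re.M: ^ matches at the start of every line, not only the start of the string
without_m = re.findall(r"^\w+", text)
with_m = re.findall(r"^\w+", text, re.M)

# re.S: the dot also matches the newline between the two words
no_dotall = re.search(r"Hello.world", text)
dotall = re.search(r"Hello.world", text, re.S)

# re.X: whitespace in the pattern is ignored and comments are allowed
date_pattern = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
""", re.X)
date_match = date_pattern.search("released 2020-04")

# Flags can be combined with |
combined = re.findall(r"^hello.world$", text, re.I | re.S)

print(ignore_case)                 # ['Hello']
print(without_m, with_m)           # ['Hello'] ['Hello', 'world']
print(no_dotall, dotall.group() == text)   # None True
print(date_match.group("year"), date_match.group("month"))  # 2020 04
print(combined == [text])          # True
```

The crawler below passes re.S for the same reason as the dotall case here: the HTML it matches spans many lines, so .*? has to be able to cross newlines.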

def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):   # s: the page source
    ret = com.finditer(s)
    for i in ret:
        # use a separate name so the iterator `ret` is not overwritten inside the loop
        item = {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num")
        }
        yield item

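def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num  # num is the start offset: 0, 25, 50, ...
    response_html = getPage(url)   # response_html is the page source as a str
    ret = parsePage(response_html) # a generator
    print(ret)                     # prints the generator object itself, not the results
    # the with block closes the file even if an error occurs while writing
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = str(obj)
            f.write(data + "\n")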

# raw strings keep the backslash in \d+ (one or more digits) intact
com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
count = 0
for i in range(10):
    main(count)  # count = 0
    count += 25

Douban Top 250 code
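The pattern can be checked offline before it is pointed at the live site. The sketch below assumes the id group is meant to capture digits with \d+, and it runs against a hand-written HTML fragment that only imitates the structure the pattern expects. SAMPLE_HTML, the movie names, the numbers, and the parse_items helper are all made up for illustration, not real Douban data.

```python
import re

# Same structure as the crawler's pattern, written with raw strings.
PATTERN = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,
)

# Hypothetical fragment shaped like two list entries (invented values).
SAMPLE_HTML = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <div class="info">
    <span class="title">Sample Movie A</span>
    <span class="rating_num" property="v:average">9.7</span>
    <span>12345人评价</span>
  </div>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <div class="info">
    <span class="title">Sample Movie B</span>
    <span class="rating_num" property="v:average">9.6</span>
    <span>6789人评价</span>
  </div>
</div>
"""


def parse_items(html):
    """Yield one dict of named groups per matched entry."""
    for match in PATTERN.finditer(html):
        yield match.groupdict()


items = list(parse_items(SAMPLE_HTML))
for item in items:
    print(item)
```

Note that comment_num keeps the trailing 人, because the pattern only stops at 评价. Stripping it and converting to int would be a natural next step if the counts need to be compared.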


Original source: https://www.cnblogs.com/zhuangdd/p/12644200.html