Web Scraping Study Notes: Regular Expressions


1.

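Python supports regular expressions through the re module.

The usual steps for using re are:

Step 1: Compile the string form of the regular expression into a Pattern instance.

Step 2: Use the Pattern instance to process the text and obtain the result (a Match instance, or None if nothing matches).

Step 3: Use the Match instance to extract information and carry out any further operations.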

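# -*- coding: utf-8 -*-
# A simple re example: match the string "hello" at the start of a text

# Import the re module
import re

# Compile the regular expression into a Pattern object.
# The r prefix in front of 'hello' marks a raw string.
pattern = re.compile(r'hello')

# Match the text with the Pattern; match() returns None when there is no match
match1 = pattern.match('hello world!')
match2 = pattern.match('helloo world!')
match3 = pattern.match('helllo world!')

# If match1 succeeded
if match1:
    # Use the Match object to get the matched text
    print(match1.group())
else:
    print('match1 failed!')

# If match2 succeeded
if match2:
    # Use the Match object to get the matched text
    print(match2.group())
else:
    print('match2 failed!')

# If match3 succeeded
if match3:
    # Use the Match object to get the matched text
    print(match3.group())
else:
    print('match3 failed!')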

2.
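- re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
- re.M (full name: MULTILINE): multi-line mode; '^' and '$' also match at the start and end of each line
- re.S (full name: DOTALL): dot-matches-all mode; '.' also matches a newline
- re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3, usable only with bytes patterns)
- re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on Unicode character properties (already the default for str patterns in Python 3)
- re.X (full name: VERBOSE): verbose mode. The regular expression may span multiple lines, whitespace is ignored, and comments may be added.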
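
The effect of the three most commonly used flags can be seen in a small sketch. The sample strings below are made up for illustration and are not from the original post.

```python
import re

text = 'first line\nsecond line'   # hypothetical two-line sample text

# re.I: the lowercase pattern also matches uppercase text
ci = re.match(r'hello', 'HELLO world', re.I)
print(ci.group())                                  # HELLO

# Without re.M, '^' matches only at the very start of the string
print(re.findall(r'^\w+', text))                   # ['first']
# With re.M, '^' also matches right after each newline
print(re.findall(r'^\w+', text, re.M))             # ['first', 'second']

# Without re.S, '.' stops at the newline
print(repr(re.match(r'.+', text).group()))         # 'first line'
# With re.S, '.' matches the newline as well
print(repr(re.match(r'.+', text, re.S).group()))   # 'first line\nsecond line'
```

Flags can be combined with the | operator, for example re.I | re.M.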
3.

# -*- coding: utf-8 -*-
# Two equivalent patterns that match a decimal number
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")

match11 = a.match('3.1415')
match12 = a.match('33.')
match21 = b.match('3.1415')
match22 = b.match('.33')

if match11:
    # Use the Match object to get the matched text
    print(match11.group())
else:
    print('match11 is not a decimal')

if match12:
    # Use the Match object to get the matched text
    print(match12.group())
else:
    print('match12 is not a decimal')

if match21:
    # Use the Match object to get the matched text
    print(match21.group())
else:
    print('match21 is not a decimal')

if match22:
    # Use the Match object to get the matched text
    print(match22.group())
else:
    print('match22 is not a decimal')

The two patterns above are equivalent; the first one simply uses the re.X (VERBOSE) flag described in section 2.
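
One way to convince yourself of the equivalence is to run both patterns over the same inputs and compare the results. This is only a spot check, assuming the intended patterns are \d+\.\d* in verbose and compact form; the list of samples is hypothetical.

```python
import re

verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

samples = ['3.1415', '33.', '.33', 'abc', '42']   # hypothetical inputs

results = []
for s in samples:
    m1 = verbose.match(s)
    m2 = compact.match(s)
    # Record the matched text, or None when the pattern does not match
    r1 = m1.group() if m1 else None
    r2 = m2.group() if m2 else None
    results.append((s, r1, r2))
    print(repr(s), r1, r2)
```

For every sample, both columns agree: the first two inputs match in full, and the last three do not match because the pattern requires at least one digit followed by a decimal point.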

4.

The program in section 1 can be shortened to:

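# -*- coding: utf-8 -*-
# A simple re example: match the string "hello" at the start of a text
import re

# re.match() compiles the pattern and matches in one call
m = re.match(r'hello', 'hello world!')
print(m.group())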

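The re module also provides the function escape(string), which returns string with an escape character (a backslash) inserted before regular expression metacharacters such as *, +, and ?.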
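
A short sketch of why escaping matters when the text to search for contains metacharacters. The strings raw and text are hypothetical examples.

```python
import re

raw = 'a+b*c?'             # literal text that happens to contain metacharacters
text = 'expr: a+b*c?'      # hypothetical text to search in

escaped = re.escape(raw)
print(escaped)                              # a\+b\*c\?

# Unescaped, '+', '*' and '?' act as quantifiers, so only 'a' is matched
print(re.search(raw, text).group())         # a

# Escaped, the pattern matches the literal text
print(re.search(escaped, text).group())     # a+b*c?
```

This is useful when part of a pattern comes from user input or from data rather than from a hand-written expression.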

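5. The Match object in detail

A Match object is the result of one match and carries a good deal of information about it. This information is available through the read-only attributes and the methods that Match provides.

Attributes:
1. string: the text used for the match.
2. re: the Pattern object used for the match.
3. pos: the index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name passed to Pattern.match() or Pattern.search().
4. endpos: the index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name passed to Pattern.match() or Pattern.search().
5. lastindex: the index (group number) of the last captured group. None if no group was captured.
6. lastgroup: the name of the last captured group. None if that group has no name or if no group was captured.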
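
The attributes listed above can be inspected directly. The pattern and group names below (word, sign) are made up for this sketch.

```python
import re

# Hypothetical pattern: a named group, an unnamed group, and a named group
pattern = re.compile(r'(?P<word>\w+) (\w+)(?P<sign>!)')
m = pattern.match('hello world!')

print(m.string)          # hello world!
print(m.re is pattern)   # True
print(m.pos)             # 0
print(m.endpos)          # 12
print(m.lastindex)       # 3
print(m.lastgroup)       # sign

# pos and endpos reflect the optional arguments given to match()
m2 = pattern.match('say hello world! bye', 4, 16)
print(m2.pos, m2.endpos)   # 4 16
print(m2.group())          # hello world!
```

When pos and endpos are not passed, they default to 0 and the length of the text.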
Methods:


Original source: https://www.cnblogs.com/my-time/p/4505065.html