Regular Expressions

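Regular expressions provide the foundation for advanced text pattern matching and for search-and-replace operations. A regular expression is a string made up of ordinary characters and special symbols. It describes those characters and how they repeat, so a single pattern can match a whole set of strings that share similar features.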

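Python offers two main ways to do pattern matching: searching and matching. Searching looks for the pattern anywhere in the string. Matching checks whether the string matches the pattern starting from its beginning, either in full or in part. Searching is done with the search() function or method, and matching with the match() function or method.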

1. Common special symbols and characters in regular expressions:

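Notation	Description	Example regex
literal	Matches the literal string value	foo
re1|re2	Matches regular expression re1 or re2	foo|bar
.	Matches any character (except newline)	b.b
^	Matches the start of the string	^Dear
$	Matches the end of the string	/bin/*sh$
*	Matches zero or more occurrences of the preceding regex	[A-Za-z0-9]*
+	Matches one or more occurrences of the preceding regex	[a-z]+\.com
?	Matches zero or one occurrence of the preceding regex	goo?
{N}	Matches exactly N occurrences of the preceding regex	[0-9]{3}
{M,N}	Matches from M to N occurrences of the preceding regex	[0-9]{5,9}
[...]	Matches any single character in the character class	[aeiou]
[..x-y..]	Matches any single character in the range from x to y	[0-9],[A-Za-z]
[^...]	Matches any single character NOT in the character class, including any ranges listed in it	[^aeiou]
(*|+|?|{})?	Applies the non-greedy version of the repetition symbols above (*, +, ?, {})	.*?[a-z]
(...)	Matches the enclosed regex and saves it as a subgroup	f(oo|u)bar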
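
The repetition and anchor symbols in the table are easier to judge on a concrete string. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<.*?>", text).group()

# ^ anchors the pattern to the start of the string.
starts_with_dear = re.search(r"^Dear", "Dear John") is not None
not_at_start = re.search(r"^Dear", "My Dear John") is None

# {N} requires exactly N repetitions of the preceding item.
three_digits = re.match(r"[0-9]{3}", "555-1234").group()

print(greedy)        # <b>bold</b> and <i>italic</i>
print(lazy)          # <b>
print(three_digits)  # 555
```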

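Special characters
\d	Matches any decimal digit; same as [0-9] for ASCII text	data\d+\.txt
\D	Opposite of \d: matches any non-digit character
\w	Matches any alphanumeric character or underscore; same as [A-Za-z0-9_] for ASCII text	[A-Za-z]\w+
\W	Opposite of \w
\s	Matches any whitespace character; same as [ \n\t\r\v\f]	of\sthe
\S	Opposite of \s
\b	Matches a word boundary	\bThe\b
\B	Opposite of \b
\N	Matches the saved subgroup numbered N	price: \16
\c	Matches the special character c literally (that is, cancels its special meaning)	\. , \\ , \*
\A (\Z)	Matches the start (end) of the string	\ADear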
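
Each special sequence is written with a leading backslash, so in Python it is safest to put patterns in raw strings (r"..."). The sketch below tries a few of them on made-up sample strings; the variable names are only illustrative.

```python
import re

# Raw strings keep the backslashes intact for the regex engine.
digits = re.findall(r"\d+", "data12.txt and data345.txt")
words = re.findall(r"\w+", "foo_bar baz-42")
spaced = re.search(r"of\sthe", "king of the hill").group()

# \b matches only at a word boundary, so the 'The' inside 'Then' is skipped.
whole_word = re.findall(r"\bThe\b", "Then The end")

# \1 refers back to whatever group 1 matched (here: a doubled word).
doubled = re.search(r"(\w+) \1", "it was was fine").group(1)

# A backslash before a metacharacter makes it literal.
dot_is_literal = re.match(r"a\.b", "axb") is None

print(digits)      # ['12', '345']
print(words)       # ['foo_bar', 'baz', '42']
print(whole_word)  # ['The']
print(doubled)     # was
```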

2. The re module

Commonly used functions and methods of the re module:

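Function/method	Description
Module function only
compile(pattern, flags=0)	Compiles the regex pattern, with optional flags, and returns a regex object
Functions of the re module and methods of regex objects
match(pattern, string, flags=0)	Tries to match pattern at the start of string, with optional flags; returns a match object on success, otherwise None
search(pattern, string, flags=0)	Searches string for the first occurrence of pattern, with optional flags; returns a match object on success, otherwise None
findall(pattern, string[, flags])	Finds all non-overlapping occurrences of pattern in string; returns a list of the matched strings
finditer(pattern, string[, flags])	Same as findall(), but returns an iterator instead of a list; the iterator yields one match object per match
split(pattern, string, maxsplit=0)	Splits string into a list, using the matches of pattern as delimiters; performs at most maxsplit splits (by default it splits at every match)
sub(pattern, repl, string, count=0)	Replaces the matches of pattern in string with repl; every match is replaced unless count is given
Match object methods
group(num=0)	Returns the entire match (or the subgroup numbered num)
groups()	Returns a tuple of all matched subgroups (an empty tuple if the pattern contains no subgroups)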
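
compile() and finditer() from the table can be tried together. This is a small sketch with an invented pattern and sample string; it shows that a compiled regex object exposes the same methods as the module-level functions.

```python
import re

# Compile once and reuse; flags are passed at compile time.
word_re = re.compile(r"[a-z]+", re.IGNORECASE)

first = word_re.match("Hello big world").group()
all_words = word_re.findall("Hello big world")   # a list of strings

# finditer() yields match objects, which also carry position information.
spans = [(m.group(), m.start()) for m in word_re.finditer("Hello big world")]

print(first)      # Hello
print(all_words)  # ['Hello', 'big', 'world']
print(spans)      # [('Hello', 0), ('big', 6), ('world', 10)]
```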

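a. Matching a string with match()

match() tries to match the pattern starting at the beginning of the string. It returns a match object on success and None on failure. The match object's group() method returns the text that matched.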

>>> import re
>>> m = re.match('foo','foot')
>>> if m is not None:
...     m.group()
...
'foo'
>>>
>>> m = re.match('foo','bar')
>>> if m is not None:
...     m.group()
...
>>>
>>> re.match('foo','foot').group()
'foo'
>>> re.match('foo','bar').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>

Note: in real code it is usually best to check the result with if first, to avoid an AttributeError.
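
One way to package that check is a small helper. The function below, matched_text, is a hypothetical name introduced only for this sketch; it is not part of the re module.

```python
import re

def matched_text(pattern, text, default=None):
    """Return the text matched at the start of `text`, or `default` if there is no match."""
    m = re.match(pattern, text)
    # re.match() returns None on failure, so guard before calling group().
    return m.group() if m is not None else default

print(matched_text("foo", "foot"))       # foo
print(matched_text("foo", "bar"))        # None
print(matched_text("foo", "bar", ""))    # prints an empty line
```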

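b. Finding a pattern inside a string with search()

When the pattern is more likely to appear in the middle of a string than at its beginning, search() is the more convenient choice. search() works just like match(), except that it looks for a match of the given pattern at any position in the string argument. It returns a match object if the search succeeds and None otherwise.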

>>> m = re.match('foo','seafood')
>>> if m is not None:
...     m.group()
...
>>>
>>> m = re.search('foo','seafood')
>>> if m is not None:
...     m.group()
...
'foo'
>>>

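Note: match() tries to match the pattern at the start of the string, whereas search() looks for the first occurrence of the pattern anywhere in the string. Strictly speaking, search() scans from left to right.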
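
The left-to-right behaviour can be checked through the positions stored on the match object. A minimal sketch, reusing the 'seafood' string from the example; span() and start() are standard match object methods.

```python
import re

s = "seafood"

at_start = re.match("foo", s)      # None: 'foo' does not begin at index 0
anywhere = re.search("foo", s)     # match object covering s[3:6]
position = anywhere.span()

# With a ^ anchor, search() is pinned to the start, much like match().
anchored = re.search("^foo", s)

# Only the leftmost occurrence is reported.
leftmost_o = re.search("o", s).start()

print(at_start)    # None
print(position)    # (3, 6)
print(anchored)    # None
print(leftmost_o)  # 4
```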

c. Matching more than one string (|)

Use the pipe symbol | to match any one of several patterns.

>>> bt = 'bat|bet|bit'
>>> m = re.match(bt, 'bat')
>>> if m is not None:
...     m.group()
...
'bat'
>>>
>>> m = re.match(bt, 'He bit me')
>>> if m is not None:
...     m.group()
...
>>> m = re.search(bt, 'He bit me')
>>> if m is not None:
...     m.group()
...
'bit'
>>>
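
One detail worth knowing is that | applies to everything on each side of it unless parentheses limit its reach. The sketch below uses made-up patterns, and the non-capturing group (?:...) is standard re syntax that this article does not otherwise cover.

```python
import re

# Without parentheses, | splits the whole pattern: 'gra' OR 'ey'.
loose = re.findall(r"gra|ey", "gray grey")

# A group restricts the alternation to a single position in the pattern.
tight = re.findall(r"gr(?:a|e)y", "gray grey")

print(loose)  # ['gra', 'ey']
print(tight)  # ['gray', 'grey']
```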

d. Matching any single character (.)

>>> anyend = '.end'
>>> m = re.match(anyend, 'bend')
>>> if m is not None:
...     m.group()
...
'bend'
>>>
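
A few edge cases of the dot, sketched with invented inputs: it must consume exactly one character, it skips newlines unless the re.DOTALL flag is set, and a literal period has to be escaped.

```python
import re

needs_one_char = re.match(r".end", "end")            # None: nothing for . to consume
skips_newline = re.match(r".end", "\nend")           # None: . does not match '\n'
with_dotall = re.match(r".end", "\nend", re.DOTALL)  # matches once DOTALL is set

# An unescaped dot matches any character; \. matches only a real period.
too_permissive = re.search(r"3.14", "3014") is not None
literal_only = re.search(r"3\.14", "3014") is None

print(needs_one_char)              # None
print(skips_newline)               # None
print(with_dotall is not None)     # True
print(too_permissive, literal_only)  # True True
```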

e. Creating character classes ([ ])

>>> m = re.match('[cr][23][dp][o2]','c3po')
>>> if m is not None:
...     m.group()
...
'c3po'
>>>
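
Each bracketed class matches one character independently of the others, so the pattern above accepts more than 'c3po'. The sketch below also tries a range and a negated class; the sample strings are invented.

```python
import re

droid = r"[cr][23][dp][o2]"

# Every position is chosen independently, so mixed combinations also match.
matches_r2d2 = re.match(droid, "r2d2") is not None
matches_c2do = re.match(droid, "c2do") is not None

# Ranges inside a class: two hexadecimal digits.
hex_pair = re.match(r"[0-9A-Fa-f]{2}", "7f").group()

# Negated class: every character that is neither a vowel nor whitespace.
consonants = re.findall(r"[^aeiou\s]", "regex rules")

print(matches_r2d2, matches_c2do)  # True True
print(hex_pair)                    # 7f
print(consonants)                  # ['r', 'g', 'x', 'r', 'l', 's']
```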

f. Repetition, special characters, and subgroups

>>> pat1 = r'\w+@\w+\.com'
>>> re.match(pat1, 'nobody@xxx.com').group()
'nobody@xxx.com'
>>> pat2 = r'\w+@(\w+\.)?\w+\.com'
>>> re.match(pat2, 'nobody@xxx.com').group()
'nobody@xxx.com'
>>> re.match(pat2, 'nobody@www.xxx.com').group()
'nobody@www.xxx.com'
>>> pat3 = r'\w+@(\w+\.)*\w+\.com'
>>> re.match(pat3, 'nobody@www.xxx.yyy.zzz.com').group()
'nobody@www.xxx.yyy.zzz.com'
>>>

Use group() to access individual subgroups and groups() to get a tuple containing all matched subgroups.

>>> m = re.match(r'(\w\w\w)-(\d\d\d)','abc-123')
>>> m.group()
'abc-123'
>>> m.group(1)
'abc'
>>> m.group(2)
'123'
>>> m.groups()
('abc', '123')
>>>
>>> m = re.match('ab','ab')
>>> m.group()
'ab'
>>> m.groups()
()
>>>
>>> m = re.match('(ab)','ab')
>>> m.group()
'ab'
>>> m.group(1)
'ab'
>>> m.groups()
('ab',)
>>>
>>> m = re.match('(a)(b)','ab')
>>> m.group()
'ab'
>>> m.group(1)
'a'
>>> m.group(2)
'b'
>>> m.groups()
('a', 'b')
>>>
>>> m = re.match('(a(b))','ab')
>>> m.group()
'ab'
>>> m.group(1)
'ab'
>>> m.group(2)
'b'
>>> m.groups()
('ab', 'b')
>>>
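
When a subgroup is optional and does not take part in the match, groups() reports None in its slot. A minimal sketch, assuming an e-mail style pattern with three subgroups written for this illustration:

```python
import re

pat = r"(\w+)@(\w+\.)?(\w+)\.com"

# The optional middle group is skipped here, so its slot holds None.
short = re.match(pat, "nobody@xxx.com").groups()

# With a host prefix present, all three groups are filled.
m = re.match(pat, "nobody@www.xxx.com")
full = m.groups()

# group() accepts several group numbers and returns them as a tuple.
user_and_domain = m.group(1, 3)

print(short)            # ('nobody', None, 'xxx')
print(full)             # ('nobody', 'www.', 'xxx')
print(user_and_domain)  # ('nobody', 'xxx')
```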

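g. Finding every occurrence with findall()

findall() searches a string for all non-overlapping occurrences of a regular expression pattern. Like search(), it scans the whole string; unlike match() and search(), it always returns a list. If nothing matches, the list is empty; otherwise it contains every matching part.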

>>> re.findall('car', 'carry the barcard to the car')
['car', 'car', 'car']
>>>
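
The shape of the returned list depends on the subgroups in the pattern. The sketch below, with invented patterns, covers the main cases: no groups, one group, several groups, no match, and the non-overlapping rule.

```python
import re

text = "carry the barcard to the car"

plain = re.findall(r"car", text)             # no groups: whole matches
suffixes = re.findall(r"car(\w*)", text)     # one group: that group's text
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")  # several groups: tuples
no_hits = re.findall(r"bus", text)           # nothing found: empty list

# Matches never overlap: 'aaaa' contains only two separate 'aa' matches.
non_overlapping = re.findall(r"aa", "aaaa")

print(plain)            # ['car', 'car', 'car']
print(suffixes)         # ['ry', 'd', '']
print(pairs)            # [('a', '1'), ('b', '22')]
print(no_hits)          # []
print(non_overlapping)  # ['aa', 'aa']
```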

h. Searching and replacing with sub() (and subn())
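
No example is given for this heading, so the following is only a sketch with made-up input. sub() returns the new string, while subn() returns a tuple of the new string and the number of replacements made. The count keyword limits how many matches are replaced, and \1, \2, ... in the replacement refer to subgroups.

```python
import re

greeting = "attn: X\n\nDear X,\n"

replaced = re.sub(r"X", "Mr. Smith", greeting)
replaced_n, how_many = re.subn(r"X", "Mr. Smith", greeting)

# count limits the number of replacements.
first_only = re.sub(r"[ae]", "X", "abcdef", count=1)

# Backreferences in the replacement reorder the captured subgroups.
reordered = re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", "2/20/1991")

print(repr(replaced))  # 'attn: Mr. Smith\n\nDear Mr. Smith,\n'
print(how_many)        # 2
print(first_only)      # Xbcdef
print(reordered)       # 1991-2-20
```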

i. Splitting with split()
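
No example is given for this heading either, so here is a brief sketch on invented strings. re.split() behaves like str.split() for a fixed delimiter but also accepts a pattern. maxsplit caps the number of splits, and a capturing group in the pattern keeps the delimiters in the result.

```python
import re

# With a plain delimiter, the result equals str.split().
fields = re.split(r":", "str1:str2:str3")

# A pattern can treat several different separators as one delimiter.
tokens = re.split(r"[,;\s]+", "a, b;c   d")

# maxsplit limits how many splits are performed.
limited = re.split(r",", "a,b,c,d", maxsplit=1)

# A capturing group keeps the delimiters in the output list.
with_seps = re.split(r"(,)", "a,b")

print(fields)     # ['str1', 'str2', 'str3']
print(tokens)     # ['a', 'b', 'c', 'd']
print(limited)    # ['a', 'b,c,d']
print(with_seps)  # ['a', ',', 'b']
```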