1. Special Symbols and Characters
1.1 single regex 1
. ,Match any character (except \n)
^ ,Match start of string
$ ,Match end of string
* ,Match 0 or more occurrences of the preceding regex
+ ,Match 1 or more occurrences of the preceding regex
? ,Match 0 or 1 occurrence of the preceding regex
{N} ,Match N occurrences of the preceding regex
{M,N} ,Match from M to N occurrences of the preceding regex
[...] ,Match any single character from character class
[..x-y..] ,Match any single character in the range from x to y; e.g. ["-a] matches, in an ASCII system, all characters that fall between '"' and 'a', that is, between ordinals 34 and 97.
[^...] ,Do not match any character from character class ,including any ranges ,if present
(*|+|?|{})? ,Apply "non-greedy" versions of the above occurrence/repetition symbols; by default *, +, ? and {} are greedy, and appending '?' to any of them makes it non-greedy.
(...) ,Match enclosed regex and save as subgroup .
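The symbols in this table can be tried out directly with the re module. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original notes.

```python
import re

# '.' matches any character except a newline
dot = re.findall(r'a.c', 'abc a\nc a-c')

# {M,N} bounds how many times the preceding regex may repeat
counts = re.findall(r'\b[0-9]{2,3}\b', '1 22 333 4444')

# [^...] matches any single character NOT in the class
non_vowels = re.findall(r'[^aeiou ]', 'regex rule')

# Appending '?' to a quantifier makes it non-greedy
greedy = re.search(r'<.+>', '<a><b>').group()
lazy = re.search(r'<.+?>', '<a><b>').group()

# (...) saves the enclosed match as a subgroup
groups = re.match(r'(\w+)-(\d+)', 'abc-123').groups()

print(dot, counts, non_vowels, greedy, lazy, groups)
```

The greedy pattern swallows both tags, while the non-greedy one stops at the first '>'.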
1.2 single regex 2
\d ,Match any decimal digit, same as [0-9] (\D is the inverse of \d: do not match any numeric digit)
\w ,Match any alphanumeric character or underscore, same as [A-Za-z0-9_] (\W is the inverse of \w)
\s ,Match any whitespace character, same as [ \n\t\r\v\f] (\S is the inverse of \s)
\b ,Match any word boundary (\B is the inverse of \b)
\N ,Match saved subgroup N (see (...) above); e.g. \1, \3, \16
\c ,Match the special character c literally, without its special meaning; e.g. \., \\, \*
\A (\Z) ,Match start (end) of string (also see ^ and $ above)
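A short sketch of the backslash sequences above in use. The sample text and variable names are invented for this example; every pattern is written as a raw string so that the backslashes reach the regex engine unchanged.

```python
import re

text = 'Call 555-1234 or 555-9876; the the end'

digits = re.findall(r'\d+', text)                 # runs of decimal digits
words = re.findall(r'\w+', 'snake_case x1')       # \w also matches '_'
whole = re.findall(r'\bthe\b', 'the other then the')  # whole word only
doubled = re.search(r'\b(\w+) \1\b', text).group(1)   # \1 repeats subgroup 1
dots = re.findall(r'\.', 'a.b.c')                 # escaped '.' is a literal dot

# With MULTILINE, ^ matches at every line start, but \A only at the string start
lines = 'line1\nline2'
caret = re.findall(r'^line\d', lines, re.M)
start_only = re.findall(r'\Aline\d', lines, re.M)

print(digits, words, whole, doubled, dots, caret, start_only)
```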
1.3 complex regex
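(?=...) ,Positive lookahead assertion. Succeeds if the contained regex (written here as ...) matches at the current position, and fails otherwise. The assertion consumes no characters: once it has been tried, the rest of the pattern continues matching from the position where the assertion started. Example: love(?=FishC) matches love only when it is immediately followed by FishC.
(?!...) ,Negative lookahead assertion. The opposite of positive lookahead (succeeds if the contained regex does not match, fails if it does). Example: FishC(?!\.com) matches FishC only when it is not followed by .com.
(?<=...) ,Positive lookbehind assertion. Same as positive lookahead, but in the opposite direction. Example: (?<=love)FishC matches FishC only when it is immediately preceded by love.
(?<!...) ,Negative lookbehind assertion. Same as negative lookahead, but in the opposite direction. Example: (?<!FishC)\.com matches .com only when it is not preceded by FishC.
(?:...) ,Non-capturing group: the string matched by this subgroup cannot be retrieved afterwards.
(?(id/name)yes-pattern|no-pattern) ,1. If the subgroup with the given number or name took part in the match, try yes-pattern; otherwise try no-pattern;
2. no-pattern is optional
Example: (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a regex for matching email addresses. It matches '<user@fishc.com>' and 'user@fishc.com', but not '<user@fishc.com' or 'user@fishc.com>'.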
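The lookaround assertions and the non-capturing group can be checked with a few one-liners. This is only an illustrative sketch: the test strings are invented, and match positions are collected with finditer to show that the assertions themselves consume no characters.

```python
import re

# (?=...): 'love' only when FishC follows; FishC itself is not consumed
ahead = re.findall(r'love(?=FishC)', 'loveFishC loveYou')

# (?!...): FishC only when '.com' does not follow
not_ahead = [m.start() for m in
             re.finditer(r'FishC(?!\.com)', 'FishC.com FishC.org')]

# (?<=...): FishC only when preceded by 'love'
behind = [m.start() for m in
          re.finditer(r'(?<=love)FishC', 'loveFishC hateFishC')]

# (?<!...): '.com' only when not preceded by FishC
not_behind = [m.start() for m in
              re.finditer(r'(?<!FishC)\.com', 'FishC.com python.com')]

# (?:...): groups without capturing, so only one subgroup is saved
groups = re.match(r'(?:\w+)@(\w+)', 'user@fishc').groups()

print(ahead, not_ahead, behind, not_behind, groups)
```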
1.4 Example: matching email addresses
import re
data = 'z843248880@163.com'
data1 = '<z843248880@163.com>'
data2 = '<z843248880@163.com'
data3 = 'z843248880@163.com>'
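# Raw strings keep the backslashes intact for the regex engine
p1 = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)'
p2 = r'\w+@\w+\.\w+'
p3 = r'(<)?\w+@\w+\.\w+(?(1)>|$)'
m1 = re.match(p3, data3)  # None: data3 has a closing '>' without an opening '<'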
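To see how the conditional pattern treats each of the four sample strings, one option is to run them all through re.match and collect the outcomes. The sketch below reuses the strings from the example above; the names samples, pattern and results are illustrative.

```python
import re

samples = [
    'z843248880@163.com',    # no brackets
    '<z843248880@163.com>',  # both brackets
    '<z843248880@163.com',   # opening bracket only
    'z843248880@163.com>',   # closing bracket only
]

# If group 1 ('<') matched, require '>'; otherwise require end of string
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

results = {s: bool(pattern.match(s)) for s in samples}
for s, ok in results.items():
    print(f'{s!r:25} -> {ok}')

# search() is not anchored, so it can skip the unmatched '<'
skipped = pattern.search('<z843248880@163.com')
print(skipped.group())
```

Only the first two strings match. Note that re.search would still find a match in the third string by starting at index 1, so using re.match (or a leading ^) matters for this kind of validation.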
1.5 The re Module: Core Functions and Methods
match(pattern,string,flags=0) ,Attempt to match pattern to string with optional flags; return a match object on success, None on failure. The match is attempted only at the start of the string.
search(pattern,string,flags=0) ,Search for the first occurrence of pattern within string with optional flags; return a match object on success, None on failure. Unlike match(), the pattern may match anywhere in the string.
findall(pattern,string[,flags=0]) ,Look for all occurrences of pattern in string; return a list of matches.
finditer(pattern,string[,flags=0]) ,Same as findall(), except it returns an iterator instead of a list; for each match, the iterator returns a match object.
split(pattern,string,maxsplit=0) ,Split string into a list using the regex pattern as delimiter and return the list of pieces, splitting at most maxsplit times (splitting at all occurrences is the default).
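The difference between these functions is easiest to see when they are all applied to the same string. A minimal sketch follows; the sample line and variable names are made up for this example.

```python
import re

line = 'id=42, id=7, id=1999'

at_start = re.match(r'\d+', line)       # None: the string starts with 'i'
anywhere = re.search(r'\d+', line)      # first occurrence anywhere
all_ids = re.findall(r'\d+', line)      # list of matched strings
spans = [m.span() for m in re.finditer(r'\d+', line)]  # match objects carry positions
parts = re.split(r',\s*', line, maxsplit=1)            # split only once

print(at_start, anywhere.group(), all_ids, spans, parts)
```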
1.6 the usage of "?i" and "?m"
>>> import re
>>> re.findall(r'(?i)yes','yes Yes YES')
['yes', 'Yes', 'YES']
>>> re.findall(r'(?i)th\w+','The quickest way is through to this tunnel.')
['The', 'through', 'this']
>>> re.findall(r'(?im)(^th[\w ]+)','''
... this line is the first,
... another line,
... that line,it's the best.
... ''')
['this line is the first', 'that line']
>>> re.findall(r'(?i)(^th[\w ]+)','''
... this line is the first,
... another line,
... that line ,it's the best.
... ''')
[]
>>> re.findall(r'(?i)(^th[w
... this line is th,
... anonjkl line,
... that line,it the best.
... ''')
By using "multiline" we can perform the search across multiple lines of the target string rather than treating the entire string as a single entity.
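The inline (?i) and (?m) markers have flag equivalents, re.I and re.M, that can be passed as the flags argument. The sketch below compares the two spellings on a small multi-line string; the text and variable names are illustrative.

```python
import re

text = '\nthis line is the first,\nanother line,\nthat line, it is the best.\n'

# Inline flags written inside the pattern
inline = re.findall(r'(?im)^th[\w ]+', text)

# The same flags passed as an argument
flagged = re.findall(r'^th[\w ]+', text, re.I | re.M)

# Without multiline, ^ matches only at the very start of the string,
# and this string starts with a newline, so nothing is found
single = re.findall(r'(?i)^th[\w ]+', text)

print(inline, flagged, single)
```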
1.7 the usage of split
re.split(r'\s\s+',eachline) ,split on at least two whitespace characters.
re.split(r'\s\s+|\t',eachline.rstrip()) ,split on at least two whitespace characters or a single tab; rstrip() removes the trailing '\n'.
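A small sketch of the two split patterns applied to one line of column-aligned text. The line stored in eachline is invented sample data, not output from a real command.

```python
import re

# Hypothetical sample: columns separated by runs of spaces or by a tab
eachline = 'alice   pts/0\t2024-01-01 10:00  (host)\n'

# Two or more whitespace characters only: the single tab is not a delimiter
spaces_only = re.split(r'\s\s+', eachline.rstrip())

# Two or more whitespace characters, or one tab
fields = re.split(r'\s\s+|\t', eachline.rstrip())

print(spaces_only)
print(fields)
```

The single space inside the date-time field is kept because one space alone never satisfies \s\s+.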
1.8 one example
from random import randrange, choice
from string import ascii_lowercase as lc
from time import ctime

tlds = ('com', 'org', 'net', 'gov', 'edu')

# Generate between 5 and 10 lines of random test data
for i in range(randrange(5, 11)):
    dtint = randrange(1469880872)        # random timestamp
    dstr = ctime(dtint)                  # timestamp as a date string
    llen = randrange(4, 8)               # login is 4 to 7 characters
    login = ''.join(choice(lc) for j in range(llen))
    dlen = randrange(llen, 13)           # domain is at least as long as login
    dom = ''.join(choice(lc) for j in range(dlen))
    print('%s::%s@%s.%s::%d-%d-%d' % (dstr, login, dom, choice(tlds), dtint, llen, dlen))
Sat Nov 7 01:09:06 1998::hbtua@yzhnjyjanwuq.gov::910372146-5-12
Sat Oct 17 09:27:56 2015::djbljsf@uidicjppd.gov::1445045276-7-9
Sun Nov 18 06:10:07 1979::fkobvlf@zlnlyjej.org::311724607-7-8
Wed Jul 23 17:23:03 1986::hovwgi@wiidgvnng.net::522490983-6-9
Tue Feb 24 02:15:27 1998::xnuab@sgahgahv.gov::888257727-5-8
Thu Jun 1 14:20:55 1989::rdwqhu@xzazufffut.net::612681655-6-10
Mon Mar 6 14:36:59 1978::qabkezi@sehnxqcuxexf.net::258014219-7-12
Sun Apr 11 15:01:56 1982::agzp@sygikhagdasq.gov::387356516-4-12
1.9 Matching a string
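import re
data = 'Wed Jul 22 08:42:15 2015::qaolc@ombddhysxuv.com::1437525736-347-28'
#pat_old = '^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun'
pat = '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
m = re.match(pat, data)
pat2 = r'^(\w{3})'
m2 = re.match(pat2, data)
pa3 = r'.+(\d+-\d+-\d+)'        # greedy '.+' leaves only one digit for the first \d+
m3 = re.search(pa3, data)       # group(1) is '6-347-28'
m4 = re.match(pa3, data)
pa4 = r'.+?(\d+-\d+-\d+)'       # non-greedy '.+?' stops as early as possible
m5 = re.match(pa4, data)        # group(1) is '1437525736-347-28'
pa5 = r'.+::(\d+-\d+-\d+)'      # anchoring on '::' also gives the full number
m6 = re.match(pa5, data)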
<class '_sre.SRE_Match' at 0x89df00>
<class '_sre.SRE_Match' at 0x89df00>
<class '_sre.SRE_Match' at 0x89df00>
Wed Jul 22 08:42:15 2015::qaolc@ombddhysxuv.com::1437525736-347-28
6-347-28 //greedy: '.+' takes as much as it can and leaves only the last digit for the first \d+
1437525736-347-28 //non-greedy, because of the '?' after '.+' (see 1.1 above)
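Building on the patterns above, one possible way to pull every field out of such a line in a single match is sketched below. The combined pattern and the variable names are assumptions made for illustration; they are not taken from the original example.

```python
import re

data = 'Wed Jul 22 08:42:15 2015::qaolc@ombddhysxuv.com::1437525736-347-28'

# weekday, then (non-greedily) the rest of the date, then login@domain.tld,
# then the three hyphen-separated integers
pat = (r'^(Mon|Tue|Wed|Thu|Fri|Sat|Sun).+?::'
       r'(\w+)@(\w+)\.(\w+)::'
       r'(\d+)-(\d+)-(\d+)$')

m = re.match(pat, data)
day, login, dom, tld, ts, llen, dlen = m.groups()
print(day, login, dom, tld, ts, llen, dlen)
```

The non-greedy '.+?' stops at the first '::', which is safe here because the time field only contains single colons.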
1.10 greedy and non-greedy
'.+' is greedy; '.+?' is not greedy.
import re
text = 'python1班'
m = re.search(r'(\w+)(\d)', text)
print(m.group(0))  # the entire match
print(m.group(1))  # what the first parenthesized group matched
print(m.group(2))  # what the second parenthesized group matched
Result:
python1
python
1