
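Regular Expressions: Essential Knowledge for Web Scraping

As far as libraries go, I consider urllib, requests, re, BeautifulSoup, and concurrent.futures the essentials for web scraping. This post summarizes how to use re, Python's regular expression module.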

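1. Regular expression concepts

A regular expression is a kind of logical formula for operating on strings: a set of predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic to apply to other strings.

Many programming languages support regular expressions for string operations; they are not unique to Python. In Python, support is provided by the re module.

Regular expressions are a deep topic, so what follows only summarizes the points I have found most important in day-to-day use: common matching patterns, generic matching, greedy matching, group matching (exp), and the re library functions.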

2. Common Python regex matching patterns

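\w      Matches a letter, digit, or underscore
\W      Matches any character that is not a letter, digit, or underscore
\s      Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S      Matches any non-whitespace character
\d      Matches any digit
\D      Matches any non-digit
\A      Matches the start of the string
\Z      Matches the end of the string; in Python's re it matches only at the very end (unlike Perl's \Z, it does not match before a trailing newline)
\z      Matches the end of the string in other flavors such as Perl; in Python use \Z
\G      Matches the position where the previous match ended (other flavors only; not supported by Python's re)
\n      Matches a newline
\t      Matches a tab
^       Matches the start of the string
$       Matches the end of the string, or just before a trailing newline
.       Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
[...]   A set of characters, listed individually: [amk] matches a, m, or k
[^...]  Any character not in the set: [^abc] matches any character other than a, b, or c
*       Matches 0 or more repetitions of the preceding expression
+       Matches 1 or more repetitions of the preceding expression
?       Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
{n}     Matches exactly n repetitions of the preceding expression
{n,m}   Matches n to m repetitions of the preceding expression, greedily
a|b     Matches a or b
()      Matches the expression inside the parentheses and captures it as a group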
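
A few of the patterns in the table can be tried out directly. The snippet below is only a small sketch; the sample string text is made up for illustration and is not from the original post.

```python
import re

# Hypothetical sample string, ending in a newline on purpose
text = "order_42 shipped on 2024-01-15\n"

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r"\w+", text)

# {n} requires an exact number of repetitions of \d
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

# $ also matches just before a trailing newline; \Z only at the very end
dollar = re.search(r"15$", text) is not None
strict_end = re.search(r"15\Z", text) is not None

print(words)
print(date, dollar, strict_end)
```

The last two checks show the difference between $ and \Z in Python when the string ends with a newline.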

3. Using the re library

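(1) The match function

Function prototype: def match(pattern, string, flags=0):

Tries to match the pattern at the start of the string; if the start of the string does not match, it returns None.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())  # the matched text
print(result.span())   # (start, end) index range of the match

Output: the match object itself, then the matched text hello 123 4567 World_This is a regex Demo, then the span (0, 41).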
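
Because match returns None when the start of the string does not match, calling group() on the result without checking can fail. A minimal sketch of guarding against that follows; the helper name leading_word is hypothetical and only serves the illustration.

```python
import re

def leading_word(text):
    """Return the word at the very start of text, or None if there is none."""
    result = re.match(r'\w+', text)
    if result is None:       # nothing matched at position 0
        return None
    return result.group()

print(leading_word("hello 123"))    # the string starts with a word
print(leading_word("  hello 123"))  # leading spaces, so match() returns None
```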

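(2) Generic matching

The regular expression above is overly complicated; it can be simplified as follows:

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

The output is the same, but the pattern is much more concise: it starts with hello, matches any character zero or more times in the middle, and ends with Demo.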

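(3) Group matching

To extract a specific target from the string, wrap that part of the pattern in () to make it a group.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+).*Demo$', content)
print(result.group())   # the whole match
print(result.group(1))  # the first group

Output: the full string, followed by 123.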

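(4) Named group matching

(?P<name>exp): matches exp and captures the text into a group called name. Some other regex flavors write this as (?<name>exp) or (?'name'exp), but Python's re requires the (?P<name>exp) form.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(?P<num>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

Output: the full string, then 123, then {'num': '123'}.

With a named group, the matched text can be retrieved through the key 'num'.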
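
Looking a group up by its name can be sketched as follows. The group names first and second are chosen here for illustration only; they do not appear in the original example.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
# Two named groups (names are illustrative)
result = re.match(r'^hello\s(?P<first>\d+)\s(?P<second>\d+)', content)

first = result.group('first')            # look a group up by name
second = result.groupdict()['second']    # or go through the dict of named groups
same = result.group(1) == first          # named groups are still numbered

print(first, second, same)
```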

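(5) Greedy matching

Greedy means the pattern keeps consuming characters until it can match no more.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

Output: the full string, then 7, then {'name': '7'}.

The group captures only 7. The preceding .* consumed as many characters as it could, leaving a single digit for \d+; this is greedy matching.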

For non-greedy matching, add a question mark (?) after the quantifier:

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*?(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

Now the group matches 123.
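
The difference between greedy and non-greedy quantifiers matters a lot when scraping markup. A small sketch on a made-up HTML fragment (the string html is an assumption for illustration):

```python
import re

# Hypothetical fragment with two tags on one line
html = "<b>first</b> and <b>second</b>"

greedy = re.findall(r'<b>(.*)</b>', html)   # .* runs on to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', html)    # .*? stops at the nearest </b>

print(greedy)
print(lazy)
```

The greedy version returns a single item spanning both tags, while the non-greedy version returns each tag's content separately.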

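(6) Passing matching flags to the functions

The third parameter of match(pattern, string, flags=0), flags, sets the matching mode:

re.I: makes matching case-insensitive

re.L: makes matching locale-dependent (in Python 3, only valid with bytes patterns)

re.S: makes . match every character, including newlines

re.M: multi-line matching; affects ^ and $

re.U: interprets characters according to the Unicode character set; this flag affects \w, \W, \b, and \B (already the default for str patterns in Python 3)

re.X: lets you lay the regular expression out more flexibly, with whitespace and comments inside the pattern, so that it is easier to read

Taking re.I and re.S as examples: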

import re

content = "heLLo 123 4567 World_This is a regex Demo"
result = re.match('hello', content, re.I)
print(result.group())

Output: heLLo

Without re.S:

import re

content = '''heLLo 123 4567 World_This is 
a regex Demo'''
result = re.match('.*', content)
print(result.group())

Output: heLLo 123 4567 World_This is

With re.S:

import re

content = '''heLLo 123 4567 World_This is 
a regex Demo'''
result = re.match('.*', content, re.S)
print(result.group())

Now . also matches the newline, so both lines are printed.

Most functions in the re library accept this flags parameter.
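
The examples above cover re.I and re.S. As a rough sketch of two of the other flags, re.M and re.X, consider the following; the sample data log and the name date_re are assumptions for illustration.

```python
import re

# Hypothetical multi-line input
log = "error: disk full\ninfo: ok\nerror: timeout"

# re.M lets ^ and $ match at the start and end of every line
errors = re.findall(r'^error: (.*)$', log, re.M)

# re.X ignores whitespace in the pattern and allows comments inside it
date_re = re.compile(r'''
    (\d{4})    # year
    -(\d{2})   # month
    -(\d{2})   # day
''', re.X)
parts = date_re.match("2024-01-15").groups()

# Flags can be combined with |
combined = re.findall(r'^ERROR: (.*)$', log, re.M | re.I)

print(errors)
print(parts)
print(combined)
```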

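(7) The search function

Function prototype: def search(pattern, string, flags=0)

Scans the whole string and returns the first successful match.

import re

content = '''hahhaha hello 123 4567 world'''
result = re.search(r'hello.*world', content)
print(result.group())

Output: hello 123 4567 world. If search is replaced with match, the call returns None because the string does not start with hello, and result.group() then raises an AttributeError.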
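
The contrast between search and match on the same input can be sketched like this (the variable names found, anchored, and error are illustrative):

```python
import re

content = 'hahhaha hello 123 4567 world'

found = re.search(r'hello.*world', content)     # scans forward for a match
anchored = re.match(r'hello.*world', content)   # only tries position 0

print(found.group())
print(anchored)          # None: the string does not start with "hello"

# Calling group() on None is what triggers the exception
try:
    anchored.group()
except AttributeError as exc:
    error = str(exc)
    print(error)
```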

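(8) The findall function

Function prototype: def findall(pattern, string, flags=0)

Searches the string and returns all matching substrings as a list. When the pattern contains a single group, the list holds the text captured by that group.

import re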

content= '''
    <url>
        <loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Aland-Islands-2</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Albania-3</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Algeria-4</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/American-Samoa-5</loc>
    </url>'''
urls = re.findall(r'<loc>(.*)</loc>', content)  # ASCII parentheses form the capture group
for url in urls:
    print(url)

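Output:

http://example.webscraping.com/places/default/view/Afghanistan-1
http://example.webscraping.com/places/default/view/Aland-Islands-2
http://example.webscraping.com/places/default/view/Albania-3
http://example.webscraping.com/places/default/view/Algeria-4
http://example.webscraping.com/places/default/view/American-Samoa-5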
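
When the pattern contains more than one group, findall returns a list of tuples instead of a list of strings. A short sketch on a trimmed-down copy of the data above; splitting each URL into a name and an id is my own illustration, not something the original post does.

```python
import re

content = '''
<loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc>
<loc>http://example.webscraping.com/places/default/view/Albania-3</loc>
'''

# Two groups, so each result is a (name, id) tuple
pairs = re.findall(r'<loc>.*/view/(.+)-(\d+)</loc>', content)

for name, page_id in pairs:
    print(name, page_id)
```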

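(9) The sub function

Function prototype: def sub(pattern, repl, string, count=0, flags=0)

Replaces every matching substring in the string and returns the resulting string.

import re

content = '''hahhaha hello 123 4567 world'''
replaced = re.sub(r'hello.*world', 'zhangsan', content)  # avoid shadowing the built-in str
print(replaced)

Output: hahhaha zhangsan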
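
Beyond a fixed replacement string, sub also accepts backreferences and a count limit, and the related subn function reports how many replacements were made. A minimal sketch, with variable names chosen for illustration:

```python
import re

content = 'hahhaha hello 123 4567 world'

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r'(\d+) (\d+)', r'\2 \1', content)

# count limits how many matches are replaced
first_only = re.sub(r'\d+', '#', content, count=1)

# subn returns (new_string, number_of_replacements)
masked, n = re.subn(r'\d+', '#', content)

print(swapped)
print(first_only)
print(masked, n)
```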

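(10) compile

Function prototype: def compile(pattern, flags=0)

Compiles a regular expression into a regular expression object so that it can be reused conveniently.

import re

content = '''hahhaha hello 123 4567 world'''
pattern = 'hello.*'
regex = re.compile(pattern)
replaced = re.sub(regex, 'zhangsan', content)  # avoid shadowing the built-in str
print(replaced)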
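
A compiled pattern object also has its own search, findall, and sub methods, which is where the reuse pays off. A small sketch, assuming a made-up list of lines and the illustrative name number_re:

```python
import re

number_re = re.compile(r'\d+')   # compile once

# Hypothetical input lines
lines = ['hello 123 4567 world', 'no digits here', 'room 42']

# Reuse the same object for every line through its own methods
found = [number_re.findall(line) for line in lines]
first = number_re.search(lines[2]).group()
cleaned = number_re.sub('#', lines[0])

print(found)
print(first)
print(cleaned)
```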