
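Common Regular Expressions for Web Scraping and Using re.findall

Common regular expressions for web scraping

A handful of regular expressions come up again and again when writing crawlers. Knowing them makes it much easier to process the text you scrape.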

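Regex symbols

Single characters

. : any character except a newline
[] : [aoe] [a-w] matches any one character in the set
\d : a digit, [0-9]
\D : a non-digit
\w : a digit, letter, underscore, or Chinese character
\W : anything that \w does not match
\s : any whitespace character, including space, tab, form feed, and so on; equivalent to [ \f\n\r\t\v]
\S : any non-whitespace character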
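
As a rough illustration of the character classes listed above, here is a minimal sketch using Python's re module. The sample strings and variable names are invented for this example and do not come from the original article.

```python
import re

# Hypothetical sample text: letters, digits, underscore, a tab, and Chinese characters
text = "id_42: tab\there, 中文 ok!"

digits = re.findall(r"\d", text)               # every single digit
non_digits = re.findall(r"\D+", "a1b22c")      # runs of non-digit characters
words = re.findall(r"\w+", text)               # runs of word characters (Unicode-aware for str)
spaces = re.findall(r"\s", text)               # every whitespace character
vowels = re.findall(r"[aoe]", "banana boat")   # any one character from the set
any_char = re.findall(r".", "a\nb")            # . skips the newline

print(digits)      # ['4', '2']
print(non_digits)  # ['a', 'b', 'c']
print(words)       # ['id_42', 'tab', 'here', '中文', 'ok']
print(spaces)      # [' ', '\t', ' ', ' ']
print(vowels)      # ['a', 'a', 'a', 'o', 'a']
print(any_char)    # ['a', 'b']
```

Raw strings (the r prefix) keep the backslashes intact, which is why they are used for every pattern here.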

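Quantifiers

* : any number of times, >= 0
+ : at least once, >= 1
? : optional, 0 or 1 times
{m} : exactly m times, e.g. hello{3} (the o repeated exactly 3 times)
{m,} : at least m times, e.g. hello{3,} (the o repeated 3 or more times)
{m,n} : between m and n times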
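
The quantifiers above can be compared side by side on one string. This is only a sketch; the sample text and variable names are assumptions made for illustration. Each quantifier applies to the single character (here the letter l) that precedes it.

```python
import re

# Hypothetical sample: "he" + a varying number of "l" + "o"
text = "heo helo hello helllo"

star = re.findall(r"hel*o", text)         # l repeated 0 or more times
plus = re.findall(r"hel+o", text)         # l repeated 1 or more times
optional = re.findall(r"hel?o", text)     # l repeated 0 or 1 times
exact = re.findall(r"hel{2}o", text)      # l repeated exactly 2 times
at_least = re.findall(r"hel{2,}o", text)  # l repeated 2 or more times
between = re.findall(r"hel{1,2}o", text)  # l repeated 1 to 2 times

print(star)      # ['heo', 'helo', 'hello', 'helllo']
print(plus)      # ['helo', 'hello', 'helllo']
print(optional)  # ['heo', 'helo']
print(exact)     # ['hello']
print(at_least)  # ['hello', 'helllo']
print(between)   # ['helo', 'hello']
```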

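Boundaries

$ : matches at the end, i.e. the text ends with the pattern
^ : matches at the start, i.e. the text starts with the pattern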
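
A small sketch of the two anchors, assuming made-up sample strings. Without any flags, ^ and $ refer to the start and end of the whole string.

```python
import re

text = "hello world"  # hypothetical sample

starts_hello = re.findall(r"^hello", text)   # "hello" is at the very start
starts_world = re.findall(r"^world", text)   # "world" is not at the start
ends_world = re.findall(r"world$", text)     # "world" is at the very end
all_digits = re.findall(r"^\d+$", "12345")   # the whole string must be digits
not_all_digits = re.findall(r"^\d+$", "123a5")

print(starts_hello)    # ['hello']
print(starts_world)    # []
print(ends_world)      # ['world']
print(all_digits)      # ['12345']
print(not_all_digits)  # []
```

Combining both anchors, as in the last two calls, is a common way to check that an entire string fits a pattern.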

Grouping

(ab)
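
Grouping changes what re.findall returns, which is easy to trip over. The following sketch uses invented sample strings: with one group, findall returns only the group's text; with several groups, it returns tuples; and a repeated group yields only its last repetition unless it is made non-capturing with (?:...).

```python
import re

# One group: only the text captured by the group is returned
one_group = re.findall(r"user=(\w+)", "user=tom; user=amy")

# Several groups: each match becomes a tuple of the groups
two_groups = re.findall(r"(\w+)@(\w+)\.com", "tom@example.com, amy@test.com")

# A repeated capturing group keeps only its last repetition
repeated = re.findall(r"(ab)+", "ababab")

# A non-capturing group (?:...) returns the whole match instead
non_capturing = re.findall(r"(?:ab)+", "ababab")

print(one_group)      # ['tom', 'amy']
print(two_groups)     # [('tom', 'example'), ('amy', 'test')]
print(repeated)       # ['ab']
print(non_capturing)  # ['ababab']
```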

Greedy mode

.*

Non-greedy (lazy) mode

.*?
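
The difference between .* and .*? shows up as soon as a page contains the same tag more than once. This is a minimal sketch with an invented HTML fragment: the greedy form runs to the last closing tag, while the lazy form stops at the first one.

```python
import re

html = "<b>one</b> and <b>two</b>"  # hypothetical fragment

greedy = re.findall(r"<b>.*</b>", html)          # as much as possible
lazy = re.findall(r"<b>.*?</b>", html)           # as little as possible
lazy_inner = re.findall(r"<b>(.*?)</b>", html)   # lazy, keeping only the tag contents

print(greedy)      # ['<b>one</b> and <b>two</b>']
print(lazy)        # ['<b>one</b>', '<b>two</b>']
print(lazy_inner)  # ['one', 'two']
```

When scraping, the lazy form with a group is usually the one that returns each item separately.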

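Examples

import re

# 1 Extract "python"
key = 'javapythonc++php'
print(re.findall('python', key))             # ['python']
print(re.findall('python', key)[0])          # python

# 2 Extract "hello word"
key = '<html><h1>hello word</h1></html>'
print(re.findall('<h1>.*</h1>', key))        # whole tag: ['<h1>hello word</h1>']
print(re.findall('<h1>(.*)</h1>', key))      # group only: ['hello word']
print(re.findall('<h1>(.*)</h1>', key)[0])   # hello word

# 3 Extract 170
key = 'This girl is 170 cm tall'
print(re.findall(r'\d+', key)[0])            # 170

# 4 Extract http:// and https://
key = 'http://www.baidu.com and https://www.cnblogs.com'
print(re.findall('https?://', key))          # ['http://', 'https://']

# 5 Extract "hello", whatever the letter case of the html tags
key = 'lalala<hTml>hello</HtMl>hahaha'
# The group returns ['hello']; without it the whole <hTml>hello</HtMl> is returned
print(re.findall('<[hH][tT][mM][lL]>(.*)</[hH][tT][mM][lL]>', key))

# 6 Extract "hit." with a lazy match
key = 'qiang@hit.edu.com'
# With ? the match is lazy (as short as possible); without ? it is greedy
# (as long as possible) and would return 'hit.edu.'
print(re.findall(r'h.*?\.', key))            # ['hit.']

# 7 Match every "saas" and "sas"
key = 'saas and sas and saaas'
print(re.findall('sa{1,2}s', key))           # ['saas', 'sas']

# 8 Match the lines that start with "i"
key = """fall in love with you
i love you very much
i love she
i love her
"""
# re.M makes ^ match at the start of every line
print(re.findall('^i.*', key, re.M))         # ['i love you very much', 'i love she', 'i love her']

# 9 Match all lines at once
key = """
<div>A chilling thought:
your teammate is reading,
your best friend is dieting,
your enemy is sharpening his knife,
and Old Wang next door is working on his core.
</div>
"""
# re.S lets . match newlines, so the first match is the whole string
# (followed by an empty match at the very end)
print(re.findall('.*', key, re.S))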

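Using re.findall

1. re.findall can match across multiple lines, and its result depends on the flags passed.

re.findall(pattern, string, re.M)
    - re.M : multi-line mode; ^ and $ match at the start and end of every line
    - re.S : single-line mode; . also matches newlines, so line breaks show up as \n in the result
    - re.I : ignore case
    - re.sub(pattern, replacement, string) : replace every match of the pattern in the string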
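
To make the effect of each flag concrete, here is a short sketch. The three-line sample text and the variable names are assumptions made for this example. Flags can be combined with the | operator.

```python
import re

# Hypothetical three-line sample
text = "Apple pie\napple tart\nBanana split"

no_flag = re.findall(r"^apple", text)                      # ^ only matches the start of the string
multi_line = re.findall(r"^apple", text, re.M)             # ^ matches at the start of each line
multi_ignore = re.findall(r"^apple", text, re.M | re.I)    # and letter case is ignored

dot_default = re.findall(r"pie.*tart", text)               # . stops at the newline
dot_all = re.findall(r"pie.*tart", text, re.S)             # . crosses the newline

# re.sub replaces every match; here each run of whitespace becomes one space
collapsed = re.sub(r"\s+", " ", text)

print(no_flag)       # []
print(multi_line)    # ['apple']
print(multi_ignore)  # ['Apple', 'apple']
print(dot_default)   # []
print(dot_all)       # ['pie\napple tart']
print(collapsed)     # Apple pie apple tart Banana split
```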


Original article: https://www.cnblogs.com/xiangsikai/p/11251620.html