Web Scraping Basics and Simple Regular Expressions

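Web scraping basics
1. The urllib.request module

urlopen(): opens a web connection to the given URL string and returns a file-like object.
The most commonly used method of the object returned by urlopen():
f.read(): reads all the bytes of the response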

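Example:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
data = html.read()           # read() consumes the response, so call it only once
print(data)                  # the raw bytes
print(data.decode('utf-8'))  # the same content decoded to str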
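The behaviour of read() can be checked without a network connection. The sketch below is only an illustration: it assumes a data: URL (which urlopen also accepts) as a stand-in for a real page, and the page content is made up.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical page content, embedded in a data: URL so no network is needed
page = "<html><body><h1>An Interesting Title</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(page)

with urlopen(url) as response:
    raw = response.read()     # bytes: the whole body
    second = response.read()  # the stream is already exhausted

text = raw.decode("utf-8")    # bytes -> str

print(type(raw).__name__, type(text).__name__)
print(second)
```

The second read() returns an empty bytes object, which is why the result of the first call should be stored in a variable before it is printed or decoded.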

2. The Python bs4 module: parsing HTML
bs4: BeautifulSoup

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(),"lxml")

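Note: when a tag is accessed as an attribute, bs4 searches all descendants and does not require the full hierarchy to be spelled out, so every form below works. The second form, which lists every level, is the recommended one.
print(bsObj.h1)
print(bsObj.html.body.h1)  # recommended
print(bsObj.body.h1)
print(bsObj.html.h1)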
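A small offline check of the note above. This is a sketch under assumptions: the inline page is made up to resemble page1.html, and the built-in "html.parser" is used instead of "lxml" so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline page standing in for page1.html
page = ("<html><head><title>A Useful Page</title></head>"
        "<body><h1>An Interesting Title</h1><div>Some text</div></body></html>")
bsObj = BeautifulSoup(page, "html.parser")

# Four different paths to the same <h1>
paths = [bsObj.h1, bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1]

# Every path ends at the very same Tag object in the parse tree
same_tag = all(tag is paths[0] for tag in paths)
print(same_tag)
print(paths[0].get_text())
```

All four expressions land on the same Tag object. Spelling out the whole path costs nothing and makes it obvious where in the document the tag is expected to be.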

Connection reliability and three common exceptions

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # HTTPError: the server could not find the page, or an
        # error occurred while retrieving it
        print(e)
        return None
    except URLError as e:
        # URLError: the server itself could not be found or reached
        # (HTTP status codes come from the remote server, so none
        # is available in this case)
        print(e)
        return None

    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        # AttributeError: BeautifulSoup returns None for a tag that
        # does not exist (here body), and asking that None for .h1
        # raises this exception
        return None

    return title

title = getTitle(
    "http://pythonscraping.com/pages/page1.html")

if title is None:
    print("Title could not be found")
else:
    print(title)
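The order of the except clauses matters, because HTTPError is a subclass of URLError. The sketch below illustrates this offline; safe_open is a hypothetical helper, and the made-up URL scheme is only there to trigger a URLError without touching the network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def safe_open(url):
    """Hypothetical helper: return (response, error_kind)."""
    try:
        return urlopen(url), None
    except HTTPError:
        # The server answered, but with an error status such as 404 or 500
        return None, "http"
    except URLError:
        # No usable response at all (unknown scheme, DNS failure, refused connection)
        return None, "url"

# HTTPError is a subclass of URLError, so it has to be caught first
print(issubclass(HTTPError, URLError))

# An unknown URL scheme fails before any network traffic happens
response, kind = safe_open("nosuchscheme://example/")
print(kind)
```

If the URLError clause came first, it would also swallow every HTTPError and the two cases could no longer be told apart.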
HTML parsing
from bs4 import BeautifulSoup
bsObj = BeautifulSoup('<b id="tag1">Extremely bold</b>','lxml')

Some properties of bsObj
# Value of bsObj:
<html><body><b id="tag1">Extremely bold</b></body></html>

bsObj.name              # [document]: the document as a whole
bsObj.contents          # contents: the complete HTML content
bsObj.contents[0].name  # the top-level tag, "html"

bsObj.body
Returns:
<body><b id="tag1">Extremely bold</b></body>

bsObj.body.contents
Returns:
[<b id="tag1">Extremely bold</b>]

bsObj.b
bsObj.b.contents

bsObj.string  # extracts the text
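The properties above can be tried on a small inline document. This is a sketch, assuming a made-up one-tag page and the built-in "html.parser". Unlike "lxml", that parser does not add the html and body wrappers, so they are written out by hand here.

```python
from bs4 import BeautifulSoup

# Hypothetical document; html/body are written explicitly because
# html.parser does not add them automatically
soup = BeautifulSoup(
    '<html><body><b id="tag1">Extremely bold</b></body></html>',
    "html.parser")

doc_name = soup.name               # name of the document object itself
top_name = soup.contents[0].name   # first child of the document
body_children = soup.body.contents # list holding the <b> tag
bold_children = soup.b.contents    # list holding the text node
text = soup.string                 # works because every level has one child

print(doc_name, top_name)
print(body_children)
print(bold_children)
print(text)
```

.string follows the chain document, html, body, b because each level has exactly one child. As soon as a tag has several children, .string returns None.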

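Methods of bsObj

<html><body><b id="tag1">Extremely bold</b></body></html>

bsObj.find('b')

bsObj.find('b',id='tag1')

Either of the following two forms works:
bsObj.find(id='tag1')
bsObj.find('',{'id': 'tag1'})

bsObj.get_text()  # here it gives the same text as bsObj.string; in general get_text() joins all nested text, while .string is None for a tag with more than one child

# Every tag has its own name and attrs
bsObj.b.name
bsObj.b.attrs
bsObj.b['id']
bsObj.b.attrs['id']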
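A short sketch of find(), attrs, and the difference between .string and get_text(). The markup is made up for illustration, and "html.parser" is assumed so that only bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph with plain text plus one bold tag
soup = BeautifulSoup('<p>Hello <b id="tag1">Extremely bold</b></p>',
                     "html.parser")

by_name = soup.find("b")                   # search by tag name
by_id = soup.find(id="tag1")               # search by attribute only
by_attrs = soup.find("b", {"id": "tag1"})  # tag name plus attribute dict
missing = soup.find("b", id="nope")        # no match

bold_id = soup.b["id"]        # shortcut for soup.b.attrs["id"]
bold_attrs = soup.b.attrs     # all attributes as a dict

p_string = soup.p.string      # <p> has two children, so no single string
p_text = soup.p.get_text()    # all nested text joined together

print(by_name is by_id is by_attrs, missing)
print(bold_id, bold_attrs)
print(p_string, "|", p_text)
```

find() returns None when nothing matches, so the result should be checked before its attributes are used.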

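find() and findAll()
findAll(tag, attributes, recursive, text, limit, keywords) returns a list of all matching tags
find(tag, attributes, recursive, text, keywords) returns only the first matching tag, or None if nothing matches

tag: the HTML tag name, the part inside <>
attributes: the tag's attributes, e.g. the id in <b id="tag1">xxx</b>
recursive: a Boolean; True (the default) means findAll also searches the child nodes, and the children of those children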

Example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")

bsObj = BeautifulSoup(html.read(),"lxml")
nameList = bsObj.findAll(text="the prince")
print(len(nameList))

# aList = bsObj.findAll('span',class='red')  # SyntaxError
class is a reserved word in Python, so it cannot be used as a keyword argument. The dictionary form {'class': 'red'} is generally recommended (bs4 also accepts class_='red'):
aList = bsObj.findAll('span',{'class': 'red'})
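The parameters can be compared side by side on a small inline document. This is a sketch with made-up markup, again assuming "html.parser". It uses find_all, the newer spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two red spans (one nested in <p>) and one green span
page = ('<div id="text">'
        '<span class="red">Anna</span>'
        '<span class="green">Well, Prince</span>'
        '<p><span class="red">Pierre</span></p>'
        '</div>')
soup = BeautifulSoup(page, "html.parser")

red_dict = soup.find_all("span", {"class": "red"})  # attribute dictionary
red_kw = soup.find_all("span", class_="red")        # class_ keyword form
first_only = soup.find_all("span", limit=1)         # stop after one match
direct = soup.div.find_all("span", recursive=False) # direct children only

names = [tag.get_text() for tag in red_dict]
print(names)
print(len(red_kw), len(first_only), len(direct))
```

With recursive=False the span nested inside the p tag is skipped, because only the direct children of the div are examined.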

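Regular expressions (regex)
Introduction and motivation
1. How can a program give the computer the ability to search text for a particular pattern?
2. Regular expressions provide the foundation for advanced text pattern matching, extraction, and search-and-replace.
3. A regular expression is a string made up of ordinary characters and special symbols that describe repetition or stand for multiple characters, so a single pattern can match a whole family of strings with similar features.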

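The re module in Python

. The dot matches any character except a newline
Example:
f.o matches any single character between f and o: fao, f9o, f#o

Template:
prog = re.compile(pattern)
result = prog.match(string)

The two lines above can be combined into one:
m = re.match(pattern, string)

import re
m = re.match('f.o','fao')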
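A quick sketch of what match() returns. The test strings are made up. The main point is that match() only tries at the start of the string and returns None on failure.

```python
import re

prog = re.compile("f.o")            # compile once, reuse many times

m1 = prog.match("fao")              # match object
m2 = re.match("f.o", "f9o!")        # matches the prefix "f9o"
m3 = re.match("f.o", "f\no")        # the dot does not match a newline
m4 = re.match("f.o", "a fao")       # match() only tries position 0
m5 = re.search("f.o", "a fao")      # search() scans the whole string

print(m1.group(), m2.group())
print(m3, m4)
print(m5.group())
```

Because a failed match is None, the result should be tested (for example with `if m:`) before group() is called.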

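Common regular expression symbols
|     separates alternative regexes; means "or"
.     matches any character except a newline
^     matches at the start of the string
$     matches at the end of the string
\b    matches a word boundary
\B    the opposite of \b: matches where there is no word boundary
[ ]   matches any one of the characters listed inside the brackets
?     matches zero or one occurrence of the preceding pattern
\w    matches one word character (word)
\d    matches one digit (digit)
+     matches one or more occurrences of the preceding pattern
{n}   matches exactly n occurrences of the preceding pattern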

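Regex pattern   Strings matched
at|home         at, home
..              any two characters
^From           any string that starts with From
river$          any string that ends with river
\bthe\b         only the whole word the
b[ui]t          the words but and bit
\w{3}           any three word characters, the same as \w\w\w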
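The rows of the table can be checked directly with the re module. This is a sketch with made-up test strings. Raw strings (r"...") are used so that backslashes such as \b reach the regex engine unchanged.

```python
import re

# at|home: either alternative, but not both glued together
alternatives = [bool(re.fullmatch(r"at|home", s))
                for s in ("at", "home", "athome")]

starts = bool(re.search(r"^From", "From: alice"))       # anchored at the start
ends = bool(re.search(r"river$", "down by the river"))  # anchored at the end

# \b requires a word boundary on both sides, \B forbids one
whole_words = re.findall(r"\bthe\b", "the other theme, then the end")
inside_words = re.findall(r"\Bthe\B", "other theme")

but_bit = re.findall(r"b[ui]t", "but bit bat bet")      # character class
three = re.match(r"\w{3}", "cccsdf").group()            # exactly three \w
year = re.match(r"\d+", "2018-04").group()              # one or more digits

print(alternatives, starts, ends)
print(whole_words, inside_words)
print(but_bit, three, year)
```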

Examples:
m = re.match('[cr][23][dp][o2]','c3po')

m = re.match(r'\w\w\w-\d\d\d','abc-123')  # matches
m = re.match(r'\w\w\w-\d\d\d','abc-xyz')  # None: xyz are not digits

m = re.match(r'\w{3}','cccsdf')

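Exercise:
Write a string that satisfies this regex and test it with the match function
pattern = r'\w+@(\w+\.)?\w+\.com'

m = re.match(pattern,'23423@qq.com')

Strings that match:
qeru@asdf.cn.com
23423@qq.com

Note: \w matches a-z, A-Z, 0-9, and the underscore _
Memory aid: these are the same characters that are allowed in Python variable names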
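One possible way to work through the exercise is sketched below. The extra test strings are made up. fullmatch() is included for comparison because match() is satisfied as soon as a prefix of the string fits the pattern.

```python
import re

pattern = r"\w+@(\w+\.)?\w+\.com"

# The optional group (\w+\.)? allows at most one extra domain level
candidates = ["23423@qq.com", "qeru@asdf.cn.com",
              "user@a.b.c.com", "name@qq.cn"]
results = {s: bool(re.match(pattern, s)) for s in candidates}

# match() accepts a matching prefix; fullmatch() requires the whole string
prefix = re.match(pattern, "a@qq.community")
whole = re.fullmatch(pattern, "a@qq.community")

# \w covers letters, digits and the underscore, but not the hyphen
underscore_ok = bool(re.fullmatch(r"\w+", "snake_case_1"))
hyphen_stop = re.match(r"\w+", "a-b").group()

print(results)
print(prefix.group(), whole)
print(underscore_ok, hyphen_stop)
```

When an entire string has to be validated, fullmatch() (or a pattern that ends in $) is usually the safer choice.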

Traversing the child nodes of the document tree
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(
    "http://pythonscraping.com/pages/page3.html")

bsObj = BeautifulSoup(html.read(),'lxml')

print(bsObj.body.table.prettify())

'''
# Iterate over the child nodes
for child in bsObj.find("table",{"id":"giftList"}
                        ).children:
    print(child)

# Iterate over all descendant nodes
for child in bsObj.find("table",{"id":"giftList"}
                        ).descendants:
    print(child)

# Iterate over the sibling nodes
for sibling in bsObj.find("table",{"id":"giftList"}
                          ).tr.next_siblings:
    print(sibling)
'''

'''
Find the tr tag whose id is gift3 and print its next_siblings,
then its previous_siblings,
then its children

for child in bsObj.find("tr",{"id":"gift3"}).children:
    print(child)

print("***********************************")
# Navigate to the parent node
print(bsObj.find("img",
                 {"src":"../img/gifts/img1.jpg"})
      .parent.previous_sibling.get_text())
'''

# Traverse the document tree with a regular expression
'''
Use findAll() to collect the images on the page and print each image's src
'''
# The dots are escaped so that they match literal "." characters only
images = bsObj.findAll("img",
                       {"src":re.compile(
                           r"\.\./img/gifts/img.*\.jpg")})

for image in images:
    print(image["src"])
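The same navigation can be tried offline on a cut-down table. This is only a sketch: the markup is made up to resemble page3.html, it is written without whitespace between tags so that no whitespace text nodes show up among the children, and "html.parser" is assumed instead of "lxml".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, whitespace-free stand-in for the gift table on page3.html
page = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Image</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td>'
    '<td><img src="../img/gifts/img2.jpg"/></td></tr>'
    '</table>'
    '<img src="../img/logo.png"/>'
)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", {"id": "giftList"})

rows = list(table.children)            # direct children: the three <tr> tags
descendants = list(table.descendants)  # every nested tag and text node
later_rows = [row["id"] for row in table.tr.next_siblings]  # rows after the header

# From the image go up to its <td>, then sideways to the previous <td>
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
label = img.parent.previous_sibling.get_text()

# Escaped dots match only literal "." characters
gift_pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
srcs = [tag["src"] for tag in soup.find_all("img", {"src": gift_pattern})]

# An unescaped pattern is looser than intended
loose_hit = bool(re.search("../img/gifts/img.*.jpg", "xx/img/gifts/img1_jpg"))
strict_hit = bool(gift_pattern.search("xx/img/gifts/img1_jpg"))

print(len(rows), len(descendants), later_rows)
print(label)
print(srcs)
print(loose_hit, strict_hit)
```

On a real page the whitespace between tags appears as extra text nodes in children and next_siblings, so code like `row["id"]` needs a check that each node is actually a tag.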
　　

Original article: https://www.cnblogs.com/Han-org/p/8888448.html
