
Web Crawlers (1): Basics

1. Page fetchers: urllib2 (basic) and requests (more full-featured). The three small examples below use urllib2 to show the basics of fetching a page.

#!/usr/bin/env python
# coding: utf-8
# Python 2 code: urllib2 and cookielib exist only in Python 2.

import cookielib
import urllib2

# Method 1: request the URL directly.
response = urllib2.urlopen("https://baidu.com")

# Get the status code; 200 means the request succeeded.
print response.getcode()

# Read the response body.
content = response.read()

# Method 2: build a Request so that data and HTTP headers can be added.
request = urllib2.Request("https://baidu.com")
request.add_header('User-Agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)

# Method 3: add cookie handling.
cj = cookielib.CookieJar()  # create a CookieJar to hold cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # urlopen() now goes through this opener
response3 = urllib2.urlopen("https://baidu.com")
print response3.read()
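On Python 3 the same patterns live in urllib.request and http.cookiejar, because urllib2 and cookielib were renamed there. The sketch below builds the request and the cookie-aware opener without sending anything, so it runs offline. The helper names build_request and build_cookie_opener are made up for illustration.

```python
import http.cookiejar
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Method 2: wrap the URL in a Request and attach a User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    """Method 3: return (opener, jar); the opener stores cookies in the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


request = build_request("https://baidu.com")
opener, jar = build_cookie_opener()

# Nothing has been sent yet. To actually fetch the page, call
# opener.open(request) and then read() the response.
print(request.full_url)
print(request.get_header("User-agent"))  # Request stores header names capitalized
print(len(jar))  # the jar stays empty until a response sets cookies
```

Keeping the opener as a local object, rather than calling install_opener, avoids changing the behaviour of every later urlopen() call in the process.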

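2. Page parsers: tools that extract the valuable data from a web page.

They fall into three kinds: regular expressions, html.parser, and BeautifulSoup (which can use either html.parser or lxml as its backend).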

II: Structured parsing through the DOM tree (the form used in front-end development)

#!/usr/bin/env python
# coding: utf-8

# A simple BeautifulSoup example
import re
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
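soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
print "Regular expression match:"
# Find the first <a> tag whose href contains "ill"
a_node = soup.find('a', href=re.compile(r'ill'))
print a_node
# Find the <p> tag with class "title" and print its tag name and text
link_node = soup.find('p', class_="title")
print link_node.name, link_node.get_text()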
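For comparison with BeautifulSoup, the standard-library html.parser mentioned earlier can extract data on its own, though it works through callbacks instead of a searchable tree. The following is a minimal Python 3 sketch; the class name LinkCollector and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


sample_html = '<p><a href="/view/1.htm">one</a> and <a href="/view/2.htm">two</a></p>'
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

The trade-off is that every query needs its own handler logic, whereas BeautifulSoup exposes find() and find_all() over the parsed tree.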

3. Crawler workflow: determine the target -> analyze the target ->

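1. Fetch the Baidu Baike entry page for Python, together with the introduction to Python on that page:

Links to other entries on a Baidu Baike page look like href="/view/76320.htm". This is a relative link; prefixing it with the site root gives the full URL http://baike.baidu.com/view/76320.htm.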
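Completing a relative link is easiest to leave to the standard library. This is a small Python 3 sketch (in Python 2 the same function is urlparse.urljoin); the page_url value is a made-up example of the page on which the link was found.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that contains the link
page_url = "http://baike.baidu.com/view/21087.htm"

# A root-relative href is resolved against the scheme and host of the page
full_url = urljoin(page_url, "/view/76320.htm")
print(full_url)

# An href that is already absolute is returned unchanged
absolute = urljoin(page_url, "http://example.com/elsie")
print(absolute)
```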

Using the __init__ method in Python: it runs as soon as an object of a class is created, and performs whatever initialization you want for that object.
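A short sketch of that behaviour follows. The UrlManager class is hypothetical, chosen only because a crawler usually needs somewhere to keep its pending and visited URLs.

```python
class UrlManager(object):
    """Hypothetical helper that tracks which URLs still need crawling."""

    def __init__(self, seed_url):
        # Runs automatically when UrlManager(...) is called
        self.new_urls = {seed_url}  # URLs waiting to be crawled
        self.old_urls = set()       # URLs already crawled

    def has_new_url(self):
        return len(self.new_urls) > 0


manager = UrlManager("http://baike.baidu.com/view/76320.htm")
print(manager.new_urls)
print(manager.has_new_url())
```

No explicit call to __init__ appears anywhere: creating the object is what triggers it, so both sets exist before any other method is used.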

**********************************************Divider**************************************************

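1. get_text(): returns the text content inside a tag found with find(). Passing strip=True inside the parentheses strips the leading and trailing whitespace from each piece of text: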
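The effect is easiest to see on a tag with nested markup. This sketch assumes the bs4 package is installed and uses a made-up HTML fragment.

```python
from bs4 import BeautifulSoup

html = "<p>  Hello  <b> world </b> </p>"
p_node = BeautifulSoup(html, "html.parser").find("p")

raw = p_node.get_text()                    # whitespace kept exactly as in the markup
stripped = p_node.get_text(strip=True)     # each text piece stripped, then joined
spaced = p_node.get_text(" ", strip=True)  # same, joined with a single space

print(repr(raw))
print(repr(stripped))
print(repr(spaced))
```

Because the stripped pieces are joined with an empty separator by default, words from neighbouring tags can run together; passing a separator as the first argument avoids that.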

Next, crawl the Baidu Baike entry for Python: (1) fetch the title of the Python entry page, and (2) fetch the summary that describes Python:

import urllib2
from bs4 import BeautifulSoup

# Extract the entry title and summary from the parsed page.
def get_new_data(soup):
    res_data = {}
    title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')
    res_data['title'] = title_node.get_text()

    summary_node = soup.find("div", class_="lemma-summary")
    res_data['summary'] = summary_node.get_text()
    # get_text() returns unicode strings. The Chinese text can show up as
    # escape codes or garbled when the console encoding (e.g. GBK) differs
    # from the UTF-8 used by the page.
    return res_data


url = "http://baike.baidu.com/link?url=FimNNlgCZAvayi8faWZOf1tdh5GpxSsgFKCKEtX7ixvXcJ7Xev2SYa95smVrBpw2k486g9e-EZUVffYjCjS2Iq"
request = urllib2.Request(url)
response = urllib2.urlopen(request)

if response.getcode() == 200:
    html_cont = response.read()  # store the fetched page content in html_cont
else:
    # Stop here; html_cont would otherwise be undefined below.
    raise SystemExit("Failed to fetch the page")

# Tell BeautifulSoup which HTML parser to use.
soup = BeautifulSoup(html_cont, "html.parser", from_encoding='UTF-8')

new_data = get_new_data(soup)

print new_data["summary"]
print ("res_data[summary]:%s" % new_data['summary'].encode("utf-8"))

Building on this, the same approach can be used to match URLs:

import re
import urllib2
import urlparse
from bs4 import BeautifulSoup

def get_new_url(soup, page_url):
    # Step 1: collect every entry link on the page.
    new_urls = set()
    links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))
    for link in links:
        new_url = link['href']  # relative URL; it must be joined with the page URL
        print link['href']
        new_full_url = urlparse.urljoin(page_url, new_url)
        new_urls.add(new_full_url)
    print new_urls
    return new_urls


url = "http://baike.baidu.com/link?url=FimNNlgCZAvayi8faWZOf1tdh5GpxSsgFKCKEtX7ixvXcJ7Xev2SYa95smVrBpw2k486g9e-EZUVffYjCjS2Iq"
request = urllib2.Request(url)
response = urllib2.urlopen(request)

if response.getcode() == 200:
    html_cont = response.read()  # store the fetched page content in html_cont
else:
    # Stop here; html_cont would otherwise be undefined below.
    raise SystemExit("Failed to fetch the page")

# Tell BeautifulSoup which HTML parser to use.
soup = BeautifulSoup(html_cont, "html.parser", from_encoding='UTF-8')

new_url = get_new_url(soup, url)
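The link pattern depends on two backslashes: \d+ means one or more digits and \. means a literal dot. Without them, d+ would match only the letter d and the dot would match any character. The sketch below checks the pattern against a few made-up href values; the name ENTRY_LINK is an assumption for the example.

```python
import re

# Entry links look like /view/<digits>.htm
ENTRY_LINK = re.compile(r"/view/\d+\.htm")

hrefs = [
    "/view/76320.htm",  # digits followed by .htm: matches
    "/view/abc.htm",    # letters instead of digits: no match
    "/view/123xhtm",    # no literal dot before htm: no match
    "/item/Python",     # different path: no match
]

matches = [h for h in hrefs if ENTRY_LINK.search(h)]
print(matches)
```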

*******************************************************************************************

Next comes a simple crawler program from the imooc (慕课网) course; I have written it out and it now runs.

For details, see:


Original article: https://www.cnblogs.com/woainifanfan/p/5700974.html
