Python Web Crawling: Notes on Heibanke's Course

Program:

　　Target URL

　　Content extraction

　　Presentation

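Why:

　　Big data: data keeps ballooning, and there is too much information to know which of it suits you. General search engines such as Google address this.

　　Vertical (industry-specific) search: search within a single industry. The key difference is that a general search engine tells you which web pages suit you, while a vertical search engine tells you which data suits you. For example, Qunar (去哪儿网) tells you which flights suit you, and Lianjia (链家网) tells you which homes suit you.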

What to learn:

　　get && show: that is all a crawler is

　　Install the libraries

　　pip install beautifulsoup4

　　pip install requests

　　pip install selenium

　　beautifulsoup4: treats HTML as a tree

　　
#!/usr/bin/env python
# coding: utf-8
# copyRight by heibanke

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the content of the page with urlopen
html = urlopen('http://baike.baidu.com/view/284853.htm')
# Build a BeautifulSoup object from the response
bs_obj = BeautifulSoup(html, "html.parser")

# findAll(tag, attributes, recursive, text, limit, keywords)
# find(tag, attributes, recursive, text, keywords)
# recursive=False searches direct children only; the default, True, searches the whole subtree.
# findAll("a")
# findAll("a", href="")
# findAll("div", class_="")
# findAll("button", id="")

# a_list = bs_obj.findAll("a")
# Keep only the links whose href matches the regular expression
a_list = bs_obj.findAll("a", href=re.compile(r"\.baidu\.com\w?"))
# "a" is the HTML tag that defines a hyperlink from one page to another.
# Its most important attribute is href, which gives the link target.

print(a_list)
for aa in a_list:
    if not aa.find("img"):  # links that wrap an image are not useful
        if aa.attrs.get('href'):
            print(aa.text, aa.attrs['href'])
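The search calls listed in the comments above can be tried offline on a small HTML string. The snippet below is a minimal sketch: the markup, the `nav` id, and the `item` class are made up for illustration. `find_all` is the current spelling of `findAll`, and `class_` is the keyword BeautifulSoup accepts because `class` is reserved in Python.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only
HTML = """
<div id="nav">
  <a href="http://baike.baidu.com/view/1.htm">Entry one</a>
  <p><a href="http://example.com/other">Elsewhere</a></p>
  <a href="http://baike.baidu.com/pic/2"><img src="2.png"></a>
</div>
<div class="item">first</div>
<div class="item">second</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every <a> anywhere in the tree
all_links = soup.find_all("a")

# Only <a> tags whose href matches a regular expression
baidu_links = soup.find_all("a", href=re.compile(r"\.baidu\.com"))

# recursive=False looks at direct children only, so the <a> nested in <p> is skipped
nav = soup.find("div", id="nav")
direct_links = nav.find_all("a", recursive=False)

# class is a Python keyword, so the filter is spelled class_
items = soup.find_all("div", class_="item")

# Same filtering as the course code: drop links that only wrap an image
text_links = [a["href"] for a in baidu_links if not a.find("img")]
print(text_links)
```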
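　　This is only an introduction. To go further you also need to learn the beautifulsoup4 library itself, for example from its documentation and from blog posts.

　　Level 1: visiting URLs in a loop

　　http://www.heibanke.com/lesson/crawler_ex00/

　　Oddly, the code below was provided by Heibanke, yet it raises an error when I run it, and I do not know why.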

　　
# -*- coding: utf-8 -*-
# CopyRight by heibanke

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.heibanke.com/lesson/crawler_ex00/'
number = ['']
loops = 0

while True:
    # Append the number found on the previous page to the base URL
    content = urlopen(url + number[0])
    bs_obj = BeautifulSoup(content, "html.parser")
    tag_number = bs_obj.find("h3")
    # Extract every run of digits from the <h3> text
    number = re.findall(r'\d+', tag_number.get_text())

    # Stop when the page has no number or after too many hops
    if not number or loops > 100:
        break
    else:
        print(number[0])
        loops += 1

print(bs_obj.text)
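The same loop can be exercised without the network by separating "fetch a page" from "find the next number". This is a sketch, not the course's solution: `next_number`, `follow_numbers`, `fake_fetch`, and the page texts in `PAGES` are hypothetical stand-ins for the real site.

```python
import re

from bs4 import BeautifulSoup


def next_number(html):
    """Return the first run of digits in the page's <h3>, or None if there is none."""
    tag = BeautifulSoup(html, "html.parser").find("h3")
    if tag is None:
        return None
    found = re.findall(r"\d+", tag.get_text())
    return found[0] if found else None


def follow_numbers(fetch, base_url, max_hops=100):
    """Follow base_url + number until a page has no number; return the numbers seen."""
    visited = []
    suffix = ""
    for _ in range(max_hops):
        number = next_number(fetch(base_url + suffix))
        if number is None:
            break
        visited.append(number)
        suffix = number
    return visited


# Hypothetical pages standing in for the real site
PAGES = {
    "/ex00/": "<h3>Append the number 123 to the URL</h3>",
    "/ex00/123": "<h3>The next number is 456</h3>",
    "/ex00/456": "<h3>Congratulations, you reached the end</h3>",
}


def fake_fetch(url):
    return PAGES[url]


print(follow_numbers(fake_fetch, "/ex00/"))
```

Passing a function that downloads the URL and returns its HTML would run the same loop against a live site. Checking for a missing `<h3>` avoids an `AttributeError` when a page does not have the layout the loop expects.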
　　

　　

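　　Level 2:

　　A username is given and the password has to be cracked; the password is a number within 30

　　Needs: POSTing data with requests

　　　　Form submission

　　　　http://www.heibanke.com/lesson/crawler_ex01/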
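A brute-force attempt over such a small numeric range can be sketched as below. The form field names `username` and `password`, the range 0 to 30, and the failure text passed as `failure_marker` are assumptions for illustration; inspect the real form and its responses before relying on any of them. `post_form` is injected so the loop can be tried without the network, and `fake_post` is a made-up server.

```python
def crack_password(post_form, url, username, candidates, failure_marker):
    """Try each candidate; return the first whose response lacks failure_marker."""
    for candidate in candidates:
        # Field names are assumed, not taken from the real form
        page = post_form(url, {"username": username, "password": candidate})
        if failure_marker not in page:
            return candidate
    return None


def post_with_requests(url, data):
    """POST form data with requests and return the response body as text."""
    import requests  # imported here so the search loop itself has no dependency

    response = requests.post(url, data=data, timeout=10)
    response.raise_for_status()
    return response.text


def fake_post(url, data):
    """Hypothetical server that accepts only the password 17."""
    return "welcome" if data["password"] == 17 else "wrong password"


print(crack_password(fake_post, "http://example.invalid/login", "someone",
                     range(31), "wrong password"))
```

Swapping `fake_post` for `post_with_requests`, together with the level's URL and the page's actual failure message, would run the same search against a live form.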

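　　The requests library:

　　　　· Supports the various request types

　　　　　　HTTP request types: GET, POST, PUT (create or replace a resource), DELETE, HEAD, and OPTIONS

　　　　· Supports various kinds of POST, such as file uploads

　　　　· Supports custom headers (some sites check whether a robot, i.e. a crawler, is visiting)

　　　　· Supports parsing JSON data

　　　　· Supports access to cookies

　　　　· Supports redirects

　　　　· Supports setting a timeout: some URLs take too long to respond, so a timeout can be set to give up on them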
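Several of these features can be seen without sending anything, by preparing a request and inspecting it. A minimal sketch: the User-Agent string and the form values are placeholders, and `send_prepared` is a hypothetical helper that is defined but not called here.

```python
import requests

URL = "http://www.heibanke.com/lesson/crawler_ex01/"

session = requests.Session()
# Custom header: some sites reject the default python-requests User-Agent
session.headers.update({"User-Agent": "Mozilla/5.0 (placeholder)"})

# Build a POST with form data; prepare_request merges in the session headers
request = requests.Request("POST", URL, data={"username": "someone", "password": 1})
prepared = session.prepare_request(request)

print(prepared.method)                 # the request type
print(prepared.headers["User-Agent"])  # the custom header
print(prepared.body)                   # the url-encoded form body


def send_prepared(session, prepared):
    """Send the request with a timeout, following redirects (not called here)."""
    response = session.send(prepared, timeout=5, allow_redirects=True)
    # response.history lists any redirects that were followed
    # response.cookies holds the cookies set by this response
    # response.json() would parse a JSON body, if the server returned one
    return response.status_code, response.history, response.cookies
```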

　　　　

　　Level 3:

　　　　Login verification

　　　　CSRF (cross-site request forgery)

　　　　CSRF is a malicious attack; the site's CSRF check is there to block it
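When a login form is protected this way, a script usually has to read the token from the form page and send it back with the credentials. The sketch below assumes a Django-style hidden input named `csrfmiddlewaretoken` and form fields named `username` and `password`; none of these names come from the course notes, so check the page's real markup. `SAMPLE_FORM` is invented for trying the parser offline.

```python
from bs4 import BeautifulSoup


def extract_csrf_token(html, field_name="csrfmiddlewaretoken"):
    """Return the value of the hidden input called field_name, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("input", attrs={"name": field_name})
    return tag.get("value") if tag else None


def login(session, login_url, username, password):
    """GET the form (which sets the cookies), then POST it back with the token."""
    form_page = session.get(login_url, timeout=10)
    payload = {
        "username": username,  # assumed field name
        "password": password,  # assumed field name
        "csrfmiddlewaretoken": extract_csrf_token(form_page.text),
    }
    # The same session object carries the cookies from the GET into the POST
    return session.post(login_url, data=payload, timeout=10)


# Hypothetical form markup
SAMPLE_FORM = """
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="text" name="username">
</form>
"""
print(extract_csrf_token(SAMPLE_FORM))
```

`login` expects an object with the interface of `requests.Session()`, so that the cookie issued with the form is sent back along with the token.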

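　　　　Selenium (named after the element selenium)

　　　　　　A high-level library that simulates a browser, for example to log in

　　　　　　Origin of the name: there used to be a company called Mercury, later acquired by HP, which made testing tools for enterprises. Selenium reduces the toxicity of mercury, which makes it Mercury's nemesis.

　　　　　　· Simulates a user's browser actions; Selenium IDE can record test actions, so no code needs to be written

　　　　　　· Functional testing, automated testing

　　　　　　· Supports multiple languages: Python, Java, Ruby, C#, PHP

　　　　　　· WebDriver supports multiple browsers; Firefox is the most convenient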
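Driving a login form through WebDriver could look roughly like this. It is a sketch under assumptions: the locators (inputs named `username` and `password`, a submit button matched by CSS) are guesses about the page, and `login_in_firefox` needs Selenium 4 with a working Firefox driver. `fill_and_submit` relies only on `find_element`, `clear`, `send_keys`, and `click`, so it can be tried with a stand-in driver object.

```python
def fill_and_submit(driver, fields, submit_locator):
    """Type each value into the element at its locator, then click submit."""
    for locator, value in fields.items():
        element = driver.find_element(*locator)  # locator is a (by, value) pair
        element.clear()
        element.send_keys(value)
    driver.find_element(*submit_locator).click()


def login_in_firefox(url, username, password):
    """Open Firefox, submit the login form, and return the page source."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        fill_and_submit(
            driver,
            {
                (By.NAME, "username"): username,  # assumed locator
                (By.NAME, "password"): password,  # assumed locator
            },
            (By.CSS_SELECTOR, "button[type=submit]"),  # assumed locator
        )
        # A real script would usually wait for the next page to load here
        return driver.page_source
    finally:
        driver.quit()
```

Because the browser itself loads the form and submits it, cookies and hidden form fields are handled the way they would be for a human visitor.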

Original article: https://www.cnblogs.com/shixisheng/p/5926415.html