Python Data Mining, Part 2: Web Scraping

Python web scraping

urllib usage

eg1:
from urllib import request
import re

urlString = "http://www.example.com"  # placeholder URL; any page you want to scrape
data = request.urlopen(urlString).read()  # data holds the page's full source as bytes
data = data.decode("utf-8")  # decode the bytes into a str
pat = '<div class="name">(.*?)</div>'
res = re.findall(pat, data)  # res is a list of every match
eg2:
request.urlretrieve(url, filename=localfilename)  # fetch the page at url and save it to the local file
request.urlcleanup()  # urlretrieve leaves cached temp files behind; urlcleanup removes them

The metadata calls live on the response object returned by urlopen:
resp = request.urlopen(urlString)
resp.getcode()  # HTTP status code of the response
resp.geturl()   # the URL that was actually fetched
resp.info()     # the response headers

timeout caps how long to wait for the server, in seconds:
data = request.urlopen(urlString, timeout=5).read()
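If the server does not answer within the limit, urlopen raises an exception. A minimal sketch of catching it (the URL is a placeholder): a connect timeout surfaces as a URLError whose reason is a socket.timeout, while a stalled read raises socket.timeout directly.

from urllib import request
import urllib.error
import socket

try:
    data = request.urlopen("http://www.example.com", timeout=5).read()
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("request timed out")
    else:
        print(e.reason)
except socket.timeout:
    print("read timed out")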

Simulating an HTTP request (POST)

import urllib.parse  # for encoding the form data
from urllib import request

url = "http://www.xxx.com"
data = urllib.parse.urlencode({
    "name": "xuqiqiang",
    "password": "heaoiwoe"
}).encode("utf-8")
req = request.Request(url, data)  # attaching data makes this a POST request
data = request.urlopen(req).read()
fh = open("D:/loadfile.html", 'wb')
fh.write(data)
fh.close()
----------------- the response has now been saved to disk

Handling crawler exceptions

import urllib.error
try:
    ...
except urllib.error.URLError as e:
    # HTTPError (a URLError subclass) has .code; a plain URLError only has .reason
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
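A runnable sketch of the same pattern, catching the two cases separately (the URL is a placeholder); HTTPError must be caught before URLError, since it is a subclass:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://www.example.com/missing-page")
except urllib.error.HTTPError as e:
    # the server answered, but with an error status (e.g. 404, 403)
    print("HTTP error:", e.code, e.reason)
except urllib.error.URLError as e:
    # the server could not be reached at all (DNS failure, refused connection)
    print("URL error:", e.reason)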

Disguising the crawler as a browser

When a fetch returns HTTP 403, the target server is blocking crawlers. The plain urlopen approach no longer works; the request must be disguised as a browser by sending a browser-like User-Agent header.
url = "http://www.xxx.com"
header = ("User-Agent", "...")  # the second item is the header's value, i.e. a real browser UA string
opener = urllib.request.build_opener()
opener.addheaders = [header]

First approach:

data = opener.open(url).read().decode("utf-8", "ignore")

Second approach:

opener = urllib.request.build_opener()
opener.addheaders = [header]
urllib.request.install_opener(opener)  # every later urlopen call now uses this opener
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
fh = open(filepath, 'w', encoding='utf-8')  # data is already a str, so write in text mode
fh.write(data)
fh.close()
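An equivalent and arguably tidier variant, not in the original post, is to pass the headers straight to Request; the URL here is the same placeholder as above:

import urllib.request

url = "http://www.xxx.com"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")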

News crawler walkthrough

import urllib.request
import urllib.error
import re

data = urllib.request.urlopen("http://news.sina.com.cn").read()
data = data.decode("utf-8", "ignore")
pat = 'href="(http://news.sina.com.cn/.*?)">'
all_url = re.findall(pat, data)
for i in range(len(all_url)):
    thisurl = all_url[i]
    file = "newsFile" + str(i) + ".html"
    try:
        urllib.request.urlretrieve(thisurl, file)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

Dodging blocks with a proxy server

import urllib.request as rq

def use_proxy(url, proxy_addr):
    # route HTTP traffic through the given proxy ("host:port")
    proxy = rq.ProxyHandler({"http": proxy_addr})
    opener = rq.build_opener(proxy, rq.HTTPHandler)
    rq.install_opener(opener)  # later urlopen calls also go through the proxy
    return rq.urlopen(url).read().decode("utf-8", "ignore")
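A sketch of calling it; the proxy address below is hypothetical and would need replacing with a live proxy:

data = use_proxy("http://www.example.com", "127.0.0.1:8888")
print(len(data))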

Taobao image scraping
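The post stops here, but following the same pattern as the news crawler, a minimal sketch might look like the following. The search URL and the image-URL regex are assumptions and would need checking against the live page, which may also require the browser disguise shown earlier:

import urllib.request
import urllib.error
import re

# hypothetical Taobao search-results URL
url = "https://s.taobao.com/search?q=shoes"
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
# assumed pattern for image links; adjust to the page's actual markup
pat = '"pic_url":"(//.*?)"'
images = re.findall(pat, data)
for i, img in enumerate(images):
    try:
        urllib.request.urlretrieve("http:" + img, "taobao" + str(i) + ".jpg")
    except urllib.error.URLError as e:
        print(e.reason)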

Original article: https://www.cnblogs.com/xqqblog/p/12034490.html