  Learning Python3 Web Scraping

      After a few days of studying Python3, I found that what I've learned so far doesn't have much to do with web scraping, so from now on I plan to learn scraping and the language in parallel.

    2016.8.9, evening

      Let's start with the simplest case: fetching the entire content of a given URL:

    #encoding:UTF-8
    import urllib.request

    # this page is served as UTF-8
    url = "http://www.selflink.cn/selflink"
    data = urllib.request.urlopen(url).read()
    data = data.decode('UTF-8')
    print(data)

    #encoding:UTF-8
    import urllib.request

    # this page is served as gbk
    url = "http://www.cma.gov.cn/"
    data = urllib.request.urlopen(url).read()
    data = data.decode('gbk')
    print(data)

    The encodings differ: one page is utf-8, the other is gbk.
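
    If you don't want to hard-code the charset, you can usually read it from the response's Content-Type header. A minimal sketch (it falls back to utf-8 when the server does not declare a charset):

    import urllib.request

    url = "http://www.cma.gov.cn/"
    webPage = urllib.request.urlopen(url)
    # get_content_charset() parses the charset out of the Content-Type header
    charset = webPage.headers.get_content_charset() or 'utf-8'
    data = webPage.read().decode(charset)
    print(charset)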

    Also, when writing the output to an HTML file, opening it may produce mojibake. When this happens, don't blame Python's Chinese support (it is actually very good). Instead, open the scraped file in Notepad and check whether the encoding is wrong; you can re-save it with the correct encoding (utf-8, for example), and when you open the page again the garbling is gone. Take this code, for example:

    import urllib.request

    url = "http://www.douban.com/"
    webPage = urllib.request.urlopen(url)
    data = webPage.read()
    data = data.decode('utf-8')
    # no encoding is given here, so open() falls back to the platform default
    f = open("d:/1.html","w")
    f.write(data)
    f.close()
    print(type(webPage))
    print(webPage.geturl())
    print(webPage.info())
    print(webPage.getcode())

    This happens because open() without an explicit encoding uses the platform's default encoding (typically gbk on a Chinese-locale Windows system) rather than utf-8, so the saved file ends up in an encoding the browser does not expect and is displayed as mojibake. Manually re-saving the file in the right encoding fixes it.
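
    If you want to check what that default is on your machine, a minimal sketch using the standard locale module:

    import locale

    # the encoding open() falls back to when no encoding argument is passed
    print(locale.getpreferredencoding(False))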

    Of course, if you want the file encoding set automatically when writing, one option is the codecs module:

    import urllib.request
    import codecs

    url = "http://www.douban.com/"
    webPage = urllib.request.urlopen(url)
    data = webPage.read()
    data = data.decode('utf-8')
    # codecs.open fixes the output file's encoding up front
    f = codecs.open("d:/1.html","w","utf-8")
    f.write(data)
    f.close()
    print(type(webPage))
    print(webPage.geturl())
    print(webPage.info())
    print(webPage.getcode())

    Or specify the encoding directly when opening the file:

    #coding:utf8
    import urllib.request
    import urllib.parse
    import re

    # search Baidu for the query, then follow the first result link
    s = '你好 百度百科'
    s = urllib.parse.quote(s)
    url = "http://www.baidu.com/s?wd=%s" % (s)
    webPage = urllib.request.urlopen(url)
    data = webPage.read()
    data = data.decode('utf-8')
    # split the page into whitespace-separated tokens and keep the href ones
    k = re.split(r'\s+', data)
    s = []
    sr = []
    sh = []
    for i in k:
        if re.match(r'href', i):
            if not re.match(r'href="#"', i):
                s.append(i)
    f = open("D:/Pythoncode/simplecodes/bd.html", "w", encoding='utf-8')
    # keep only the redirect links of the actual search results
    for i in s:
        if re.match(r'href="http://www.baidu.com/link', i):
            sr.append(i)
    # pull the raw URL out of each href="..." token
    for it in sr:
        m = re.search(r'href="(.*?)"', it)
        iturl = m.group(1)
        sh.append(iturl)
    # fetch the first result page and save it
    iurl = sh[0]
    webPage = urllib.request.urlopen(iurl)
    data = webPage.read()
    data = data.decode('utf-8')
    f.write(data)
    f.close()
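
    Splitting the HTML on whitespace and matching tokens with regular expressions works here, but it is fragile. Here is a sketch of the same link extraction using the standard-library html.parser instead (the LinkCollector class and the baidu.com/link prefix check are illustrative assumptions, not part of the original code):

    from html.parser import HTMLParser
    import urllib.parse
    import urllib.request

    class LinkCollector(HTMLParser):
        """Collect href attributes from <a> tags that point to Baidu result links."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and value.startswith("http://www.baidu.com/link"):
                        self.links.append(value)

    query = urllib.parse.quote('你好 百度百科')
    html = urllib.request.urlopen("http://www.baidu.com/s?wd=%s" % query).read().decode('utf-8')
    parser = LinkCollector()
    parser.feed(html)
    print(parser.links[:5])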

    Of course, there is yet another approach: write the response to the file directly in binary mode.

    import urllib.request

    url = "http://www.douban.com"
    webPage = urllib.request.urlopen(url)
    data = webPage.read()
    #data = data.decode('UTF-8')
    # "wb" writes the raw bytes, so the original encoding is preserved as-is
    f = open("d:/1.html","wb")
    f.write(data)
    f.close()

    The same approach also works for downloading images or other binary files:

    import urllib.request

    url = "http://www.selflink.cn/huoying/naruto.jpg"
    webPage = urllib.request.urlopen(url)
    data = webPage.read()
    # binary data such as an image must not be decoded
    f = open("d:/naruto.jpg","wb")
    f.write(data)
    f.close()
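
    For a straight download like this, urllib.request.urlretrieve saves the response to disk in one call. A minimal sketch using the same sample image URL:

    import urllib.request

    # download the file directly to disk without holding it all in memory
    urllib.request.urlretrieve("http://www.selflink.cn/huoying/naruto.jpg", "d:/naruto.jpg")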