  • My first crawler: scraping the popular images from Haibao (海报网)

    URL: http://pic.haibao.com/hotimage/

    Page element analysis:
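
    Judging from the selectors used in the source code further down, each thumbnail appears to sit in a div with class pagelibox, and the image URL is read from the img tag's data-original attribute. The following is a minimal sketch of that extraction, assuming markup of that shape. The sample HTML, the class LazyImageCollector, and the function extract_lazy_image_urls are made up for illustration, and the sketch uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the selectors in the source code;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="pagelibox">
  <a href="/detail/1"><img src="placeholder.gif" data-original="http://example.com/a.jpg"></a>
</div>
<div class="pagelibox">
  <a href="/detail/2"><img src="placeholder.gif"></a>
</div>
<div class="other">
  <img data-original="http://example.com/ignored.jpg">
</div>
"""


class LazyImageCollector(HTMLParser):
    """Collect data-original URLs from <img> tags inside div.pagelibox."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside a pagelibox div (0 = outside)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter a pagelibox div, or go one level deeper inside one.
            if self.depth or "pagelibox" in classes:
                self.depth += 1
        elif tag == "img" and self.depth:
            url = attrs.get("data-original")
            if url:  # skip images that lack the attribute
                self.urls.append(url)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_lazy_image_urls(html):
    parser = LazyImageCollector()
    parser.feed(html)
    return parser.urls


print(extract_lazy_image_urls(SAMPLE_HTML))
```

    Images outside a pagelibox div, and images without a data-original attribute, are skipped.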

    Result

    Source code

    import os
    import time

    import requests
    from bs4 import BeautifulSoup

    def getHotImgs():
        # Download the hot-images page and parse it (the "html5lib" parser must be installed).
        topPage = requests.get("http://pic.haibao.com/hotimage/").content
        topPageParse = BeautifulSoup(topPage, "html5lib")
        # Each thumbnail sits in a <div class="pagelibox">.
        allLiTags = topPageParse.find_all('div', class_="pagelibox")
        imgs = []
        for liTag in allLiTags:
            imgTag = liTag.img
            # The image URL is stored in data-original; skip entries without it.
            imgSource = imgTag.get('data-original') if imgTag else None
            if imgSource:
                imgs.append(imgSource)
        return imgs

    def saveHotImgs(imgs):
        if not os.path.exists('haibaoHotImg'):
            os.mkdir('haibaoHotImg')
        for i, img in enumerate(imgs):
            image = requests.get(img).content
            # Timestamp plus running index keeps the file names unique.
            fileName = str(timeMillis()) + str(i)
            fileExt = img.rpartition('.')[-1]
            # Open in binary mode: the response body is bytes, not text.
            with open(os.path.join('haibaoHotImg', fileName + '.' + fileExt), 'wb') as imgFile:
                imgFile.write(image)

    def timeMillis():
        # Current time in milliseconds since the epoch.
        return int(round(time.time() * 1000))

    if __name__ == "__main__":
        imgs = getHotImgs()
        saveHotImgs(imgs)
        print("finished")
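
    The source above takes the file extension with rpartition('.'), which would also pick up any query string that follows the file name in a URL. A small sketch of a more defensive variant, written for Python 3 with only the standard library; the helper names file_ext_from_url and build_file_name, the fallback extension "jpg", and the example URL are assumptions for illustration, not part of the original script.

```python
import os
from urllib.parse import urlparse


def file_ext_from_url(url, default="jpg"):
    """Return the extension of the URL's path, without the leading dot."""
    path = urlparse(url).path         # drops the query string and fragment
    ext = os.path.splitext(path)[1]   # ".JPG" for "/img/a.JPG", "" if none
    return ext[1:].lower() if ext else default


def build_file_name(url, timestamp_ms, index):
    """Combine a millisecond timestamp and a counter into a file name."""
    return "{}_{}.{}".format(timestamp_ms, index, file_ext_from_url(url))


print(build_file_name("http://example.com/img/a.JPG?x=1", 1537000000000, 0))
```

    Separating the timestamp and the index with an underscore also keeps the two numbers from running together in the file name.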
  • Original post: https://www.cnblogs.com/night1989/p/9672352.html