  • My first crawler: scraping the popular images from Haibao (海报网)

    URL: http://pic.haibao.com/hotimage/

    Page element analysis:
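
    The selectors used in the script further down suggest that each thumbnail sits in a div with class "pagelibox" and that the image URL is carried in the img tag's data-original attribute. A minimal sketch of that lookup follows. SAMPLE_HTML and extract_image_urls are hypothetical names, the fragment is an assumed stand-in for the real page markup, and the built-in "html.parser" is used here instead of html5lib so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only; the real page markup may differ.
SAMPLE_HTML = """
<div class="pagelibox">
  <a href="/detail/1.htm"><img src="placeholder.gif" data-original="http://example.com/a.jpg"></a>
</div>
<div class="pagelibox">
  <a href="/detail/2.htm"><img src="placeholder.gif"></a>
</div>
<div class="other"><img data-original="http://example.com/ignored.jpg"></div>
"""


def extract_image_urls(html):
    """Collect data-original URLs from images inside div.pagelibox blocks."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for box in soup.find_all("div", class_="pagelibox"):
        img = box.find("img")
        # Skip boxes that have no <img> or no data-original attribute.
        url = img.get("data-original") if img else None
        if url:
            urls.append(url)
    return urls


print(extract_image_urls(SAMPLE_HTML))  # ['http://example.com/a.jpg']
```

    Using .get() rather than indexing means a box without the attribute is skipped instead of raising KeyError.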

    Result

    Source code

    import requests
    from bs4 import BeautifulSoup
    import os
    import time
    
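    def getHotImgs():
        # Fetch the hot-images page and parse it (the "html5lib" parser requires the html5lib package).
        topPage = requests.get("http://pic.haibao.com/hotimage/").content
        topPageParse = BeautifulSoup(topPage, "html5lib")
        # Each thumbnail is wrapped in a <div class="pagelibox">.
        allLiTags = topPageParse.find_all('div', class_="pagelibox")
        imgs = []
        for liTag in allLiTags:
            imgTag = liTag.img
            # The image URL is read from data-original; skip boxes without an <img> or without that attribute.
            imgSource = imgTag.get('data-original') if imgTag else None
            if imgSource:
                imgs.append(imgSource)
        return imgs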
    
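    def saveHotImgs(imgs):
        if not os.path.exists('haibaoHotImg'):
            os.mkdir('haibaoHotImg')
        # The index i keeps file names unique when two downloads share the same millisecond timestamp.
        for i, img in enumerate(imgs):
            image = requests.get(img).content
            timestamp = timeMillis()
            fileName = str(timestamp) + str(i)
            # The file extension is whatever follows the last '.' in the URL.
            fileExt = img.rpartition('.')[2]
            # Open in binary mode: the response body is raw bytes.
            with open(os.path.join('haibaoHotImg', fileName + '.' + fileExt), 'wb') as imgFile:
                imgFile.write(image)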
    
    def timeMillis():
        return int(round(time.time() * 1000))
    
    if __name__ == "__main__":
        imgs = getHotImgs()
        saveHotImgs(imgs)
        print("finished")
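
    The script takes the file extension from whatever follows the last '.' in the image URL. That works for plain URLs, but it would pick up extra text if a URL ever carried a query string. As a hedged alternative, the sketch below looks only at the URL path. extension_from_url and its "jpg" fallback are hypothetical choices, not part of the original script, and the example.com URLs are placeholders.

```python
import os
from urllib.parse import urlparse


def extension_from_url(url, default="jpg"):
    """Return the lower-cased file extension of a URL's path, or a default."""
    # Only the path is inspected, so "?w=200" or "#top" never leaks into the result.
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lstrip(".")
    return ext.lower() if ext else default


print(extension_from_url("http://example.com/img/a.JPG?w=200"))  # jpg
print(extension_from_url("http://example.com/img/b.png"))        # png
print(extension_from_url("http://example.com/img/noext"))        # jpg (fallback)
```

    This sketch targets Python 3, where urlparse lives in urllib.parse.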
  • Original article: https://www.cnblogs.com/night1989/p/9672352.html