
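A Python Script for Scraping Bing's Daily Wallpaper

1. Background

Some of the background images on Bing's search page make good desktop wallpapers, but only some of them are offered for download. Clicking through to download one every day is tedious, so I used this as a first exercise in writing a Python crawler. The result is very simple.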

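2. Related techniques

2.1 Python crawler references

2.2 Python regular expressions

Reference: Python正则表达式指南 (a guide to Python regular expressions)

2.3 Handling login

Some sites, probably most, require a login before their content can be fetched.
For login approaches, see: 模拟登录一些知名的网站 (simulated login for some well-known sites)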
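
As a rough illustration of what handling login involves, the sketch below uses only the standard library: an opener with a cookie jar keeps whatever session cookie a login POST returns, so later requests through the same opener carry it. The endpoint `https://example.com/login` and the form field names `username` and `password` are hypothetical; real sites differ and often require extra tokens or headers.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build a form-encoded POST request; the field names are assumptions."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


if __name__ == "__main__":
    opener, jar = make_session()
    # Hypothetical endpoint, for illustration only.
    req = build_login_request("https://example.com/login", "alice", "secret")
    # opener.open(req) would send the POST; later opener.open(...) calls
    # would then resend any cookies the site set.
    print(req.get_method(), req.full_url)
```

The Bing script in this post does not need any of this, since it fetches the home page without logging in.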

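2.4 logging: the built-in logging library

Reference: python 的日志logging模块学习 (learning Python's logging module)

3. Crawler implementation

The crawler has three parts: request, parse, and save.
Only the main logic is shown below; see Github for the complete code.

3.1 Request script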

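import logging
import urllib.request

def getHtml(url):
    # Fetch the page and return its body decoded as UTF-8 text.
    # Note: urlopen raises an exception on network or HTTP errors,
    # so the warning below only covers an empty response body.
    page = urllib.request.urlopen(url)
    html = page.read()
    if html:
        logging.debug("Get Response:" + str(len(html)))
    else:
        logging.warning("Request failed!")
    return html.decode('utf-8')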
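
A slightly more defensive variant of the request step is sketched below. It is not the author's code: the function name `fetch_text`, the `User-Agent` value, and the 10-second timeout are illustrative assumptions. Because `urlopen` raises an exception when a request fails instead of returning an empty body, the failure is caught and logged explicitly.

```python
import logging
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Return the response body as text, or None if the request fails."""
    # Assumed header value; some servers treat clients without a
    # browser-like User-Agent differently.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as page:
            body = page.read()
    except (urllib.error.URLError, OSError) as exc:
        logging.warning("Request failed: %s", exc)
        return None
    logging.debug("Got response: %d bytes", len(body))
    # Replace undecodable bytes instead of raising on a malformed page.
    return body.decode("utf-8", errors="replace")
```

A caller would check for `None` before passing the result to the parser.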

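3.2 Parsing script

The parsing script is the core of the crawler. It defines two methods: one matches with a regular expression, the other parses the document tree with BeautifulSoup. The document-tree method was first developed against a page saved from the browser, but that saved page turned out to differ from the response returned by requesting http://cn.bing.com/ directly, because JavaScript on the page post-processes the content. The tree-based parser therefore cannot be used on the raw response, so the crawler relies on regular-expression matching instead, which works well enough.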

from bs4 import BeautifulSoup
import json
import re
import logging

# Site root. The excerpt used `host` without showing its definition;
# it is defined here (as an assumption) so the module is self-contained.
host = "http://cn.bing.com"

def getJpg(html):
    # Match strings of the form "url:...jpg". This is not a precise pattern:
    # it simply allows 10 to 90 arbitrary characters between "url:" and "jpg".
    reg = r'(url:.{10,90}jpg)'
    logging.debug("Using re " + reg + " to get Jpg")
    jpgre = re.compile(reg)
    jpglist = re.findall(jpgre, html)
    if jpglist:
        logging.debug("Get jpg list(" + str(len(jpglist)) + "):" + str(jpglist))
        # The first match has the form url: "/path/to/image.jpg
        jpgUrl = jpglist[0].split('"')[1]
        imageUrl = host + jpgUrl
        logging.info("Get jpg url:" + imageUrl)
        return imageUrl
    # Falls through and returns None when nothing matches.

def bingParser(html):
    # Parsing the live response directly fails to find the element:
    #soup = BeautifulSoup(html, "html.parser")
    # Parsing a page saved from the browser as Bing.html worked initially.
    soup = BeautifulSoup(open('Bing.html'), "html.parser")
    print(soup.title)
    print(type(soup.a))
    print(soup.select('#bgDiv'))
    style = (soup.select('#bgDiv')[0].attrs['style']).strip()
    print(style)
    json_style = json.dumps(style)
    print(json_style)
    # The image URL sits in the third-from-last ';'-separated style entry.
    imageurl = style.strip().split(';')[-3:-2]
    imageUrl = imageurl[0].split('"')[1]
    print(imageUrl)
    return imageUrl
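
The regular expression above is admittedly imprecise. One possible tightening, shown only as a sketch, is to capture just the quoted path after `url:` and to undo the `\/` escaping that appears in the second match of the log output in section 4. The names `JPG_PATTERN` and `extract_jpg_paths` are hypothetical, and the pattern is based solely on the two sample strings in that log (with closing quotes added here), so it may not fit Bing's current markup.

```python
import re

# Capture a quoted path ending in .jpg that follows "url:".
# Accepts either quote style and optional whitespace after the colon.
JPG_PATTERN = re.compile(r"""url:\s*["']([^"']+?\.jpg)""")


def extract_jpg_paths(html):
    """Return every matched image path with JavaScript-style '\\/' unescaped."""
    return [path.replace("\\/", "/") for path in JPG_PATTERN.findall(html)]


# The two matches from the log in section 4, with closing quotes added.
sample = (
    'url: "/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg" '
    "url:'\\/az\\/hprichbg\\/rb\\/CallanishSS_ZH-CN12559903397_1920x1080.jpg'"
)
print(extract_jpg_paths(sample))
```

Capturing the path in a group removes the need for the `split('"')` step, and the unescaping makes the second, single-quoted form usable as well.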

3.3 Saving script

The saving script is the one that actually gets run, so it calls the other two scripts.

import urllib.request
import logging
import parseHtml  # local module: the parsing script from section 3.2
import request    # local module: the request script from section 3.1

# Configure logging
logging.basicConfig(level=logging.DEBUG,
                format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                datefmt='%Y-%m-%d %H:%M:%S',
                filename='bingcn.log',
                filemode='a'
                )

host = "http://cn.bing.com"
logging.info("From:" + host)
html = request.getHtml(host)
imageurl = parseHtml.getJpg(html)
logging.info("Image url:" + imageurl)
# The file name is the last path segment of the image URL.
fileName = imageurl.split('/')[-1]
logging.info("Image file name:" + fileName)

def saveImg(imageURL, fileName):
    # Download the image and write it to fileName in the current directory.
    logging.info('Image file url:' + imageURL)
    u = urllib.request.urlopen(imageURL)
    data = u.read()
    # The with block closes the file even if the write fails.
    with open(fileName, 'wb') as f:
        f.write(data)
    logging.info("Save file :" + imageURL)

saveImg(imageurl, fileName)
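
The file-name and writing steps could also be factored into small helpers, as in the sketch below. The helper names `filename_from_url` and `save_bytes` are hypothetical, and the fallback name `bing.jpg` is an arbitrary choice. Parsing the URL first means a query string, if one were ever present, would not end up in the file name.

```python
import os
import posixpath
import urllib.parse


def filename_from_url(image_url, default="bing.jpg"):
    """Derive a local file name from the URL path, ignoring any query string."""
    path = urllib.parse.urlsplit(image_url).path
    return posixpath.basename(path) or default


def save_bytes(data, file_name, directory="."):
    """Write data to directory/file_name and return the full path."""
    target = os.path.join(directory, file_name)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

With these helpers, the script's last step would amount to `save_bytes(data, filename_from_url(imageurl))`, where `data` is the downloaded image body.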

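4. Running

The scripts are written for Python 3; just run saveImage.py.
With file logging configured as above, the log file bingcn.log appears in the current directory, and the downloaded image is saved there as well.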

james@james:~/code/hello-world/code/python/networkong/pycrowler/crowler_bingcn > python3 saveImage.py
2017-06-26 14:36:05 saveImage.py[line:19] INFO From:http://cn.bing.com
2017-06-26 14:36:06 request.py[line:12] DEBUG Get Response:126510
2017-06-26 14:36:06 parseHtml.py[line:91] DEBUG Using re (url:.{10,90}jpg) to get Jpg
2017-06-26 14:36:06 parseHtml.py[line:95] DEBUG Get jpg list(2):['url: "/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg', "url:'\\/az\\/hprichbg\\/rb\\/CallanishSS_ZH-CN12559903397_1920x1080.jpg"]
2017-06-26 14:36:06 parseHtml.py[line:98] INFO Get jpg url:http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg
2017-06-26 14:36:06 saveImage.py[line:24] INFO Image url:http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg
2017-06-26 14:36:06 saveImage.py[line:26] INFO Image file name:MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg
2017-06-26 14:36:06 saveImage.py[line:30] INFO Image file url:http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg
2017-06-26 14:36:06 saveImage.py[line:36] INFO Save file :http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg

Original article: https://www.cnblogs.com/drawnkid/p/7080549.html