  • Running an image-scraping script from a Windows batch file

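     Background

    During testing I needed to upload some images, but I had very few saved locally.

    To keep the test data from looking repetitive, I found a scraper script online. Here is its source: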

    import requests
    import os

    class Image():
        url = 'https://image.baidu.com/search/acjson'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36'
        }
        varlist = []
        dir = './images'
        params = {}

        def __init__(self):
            global page_num,keywords
            page_num = int(input('Enter the number of pages to fetch:\n'))
            keywords = input('Enter a keyword:\n')
            if self.catch_page():
                self.writeData()
            else:
                print('Failed to fetch the pages')

        def catch_page(self):
            # Each page holds 30 results, so 'pn' advances in steps of 30
            for i in range(0,page_num * 30,30):
                self.params = {
                    'tn': 'resultjson_com',
                    'ipn': 'rj',
                    'ct': '201326592',
                    'is': '',
                    'fp': 'result',
                    'queryWord': keywords,
                    'cl': '2',
                    'lm': '-1',
                    'ie': 'utf-8',
                    'oe': 'utf-8',
                    'adpicid': '',
                    'st': '-1',
                    'z': '',
                    'ic': '0',
                    'hd': '',
                    'latest': '',
                    'copyright': '',
                    'word': keywords,
                    's': '',
                    'se': '',
                    'tab': '',
                    'width': '',
                    'height': '',
                    'face': '0',
                    'istype': '2',
                    'qc': '',
                    'nc': '1',
                    'fr': '',
                    'expermode': '',
                    'force': '',
                    'cg': 'girl',
                    'pn': i,
                    'rn': '30',
                    'gsm': '',
                    '1584010126096': ''
                }
                # Note: self.headers is defined above but is not passed here
                res = requests.get(url = self.url,params = self.params).json()['data']
                for j in range(0,30):
                    self.varlist.append(res[j]['thumbURL'])
            if self.varlist != None:
                return True
            return False

        def writeData(self):
            # Create the directory if it does not exist
            if not os.path.exists(self.dir):
                os.mkdir(self.dir)

            for i in range(0,page_num * 30):
                print(f'Downloading item {i}')
                images = requests.get(url = self.varlist[i])
                open(f'./images/{i}.jpg','wb').write(images.content)

    if __name__ == '__main__':
        Image()

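    This code may have worked for its author at the time, but it failed for me with requests.exceptions.TooManyRedirects: Exceeded 30 redirects. After some digging, passing the headers parameter with the request fixed it.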
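    The fix amounts to sending the class's headers dictionary along with every request. A minimal sketch of that idea follows. `build_request_kwargs` is a hypothetical helper that is not part of the original script, and the parameter list is trimmed for brevity; whether the endpoint accepts the shortened list is untested.

```python
# Hypothetical helper, not part of the original script: it only assembles the
# keyword arguments for requests.get() so that the headers are never left out.
SEARCH_URL = 'https://image.baidu.com/search/acjson'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36'
}
PAGE_SIZE = 30  # the script asks for 30 results per page


def build_request_kwargs(keyword, page_index):
    """Return keyword arguments for requests.get() for one result page."""
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'queryWord': keyword,
        'word': keyword,
        'pn': page_index * PAGE_SIZE,  # offset of the first result on this page
        'rn': str(PAGE_SIZE),
    }
    return {'url': SEARCH_URL, 'headers': HEADERS, 'params': params}
```

    With the requests package installed, the result can be used as `requests.get(**build_request_kwargs(keywords, 0))`, which is the same call shape the script uses, plus the headers.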

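    Images are saved relative to the current directory, so storing them somewhere else means moving the scraper file. You could also edit the path in the code, but changing the code every time the target directory changes is not very elegant.

    So why not write a Windows batch script (xxx.bat) and leave the .py file alone? Put the .bat file in whichever directory should receive the images, and the .py file stays in one place.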

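    Solution

    First, this approach is certainly feasible.

    Next, I need to know the path of the .py file.

    Then I can run that .py file.

    When running it, I pass the .bat file's own directory to the .py file.

    Finally, the Python code saves the images under the path it was given.

    Done! The .bat file looks like this: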

    @echo off
    rem D: and D:\spider are the drive and directory where the Python file lives
    D:
    cd D:\spider

    echo Current path: %~dp0
    python drink_pic.py %~dp0
    pause
    exit

    Where:

    • %cd% is the current working directory (a variable, so it changes whenever you cd);
    • %~dp0 is the full directory that contains the batch file itself (fixed).
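    The same distinction exists on the Python side, which may help when reading the script. A small sketch, where `script_dir` and `working_dir` are hypothetical helper names: the working directory plays the role of %cd%, and the directory of a given script file plays the role of %~dp0.

```python
import os
from pathlib import Path


def script_dir(script_file):
    """Directory containing the given script file, regardless of the cwd."""
    return Path(script_file).resolve().parent


def working_dir():
    """Current working directory; it changes whenever the process calls chdir."""
    return Path(os.getcwd())
```

    Inside a script, `script_dir(__file__)` stays the same no matter where the script is launched from, while `working_dir()` follows the caller. Because the batch file does `cd D:\spider` before running Python, the working directory is not the .bat file's folder, which is why %~dp0 has to be passed explicitly.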

    Here is the modified image scraper source:

    import requests
    import os
    import sys

    class Image():
        url = 'https://image.baidu.com/search/acjson'
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
            'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding':'gzip, deflate',
            'Accept-Language':'zh-CN,zh;q=0.9',
            'Connection':'keep-alive',
            'Cookie':'BDqhfp=%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95logo%26%26NaN-1undefined-1undefined%26%262928%26%266; BAIDUID=50559E09CC89BCB4A35AE534A4AFBD93:FG=1; PSTM=1613793192; BIDUPSID=994A62B2BBC179C9D5FDDD4576FD1138; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; __yjs_duid=1_b93b073db4b3095e4b6ca8bdad9666671613879345923; H_PS_PSSID=33512_33241_33257_33344_31254_33601_33585_26350_33264; delPer=0; PSINO=5; ZD_ENTRY=baidu; BA_HECTOR=2081a48k040k852hlm1g3c5g40r; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; userFrom=www.baidu.com; indexPageSugList=%5B%22%E9%85%92%22%5D; cleanHistoryStatus=0',
            'Host':'image.baidu.com',
            'Referer':'https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E9%85%92',
            'Upgrade-Insecure-Requests':'1',
        }
        varlist = []
        dir = './images'
        params = {}

        def __init__(self, pt):
            global page_num,keywords
            page_num = int(input('Enter the number of pages to fetch:\n'))
            keywords = input('Enter a keyword:\n')
            if self.catch_page():
                self.writeData(pt)
            else:
                print('Failed to fetch the pages')

        def catch_page(self):
            # Each page holds 30 results, so 'pn' advances in steps of 30
            for i in range(0,page_num * 30,30):
                self.params = {
                    'tn': 'resultjson_com',
                    'ipn': 'rj',
                    'ct': '201326592',
                    'is': '',
                    'fp': 'result',
                    'queryWord': keywords,
                    'cl': '2',
                    'lm': '-1',
                    'ie': 'utf-8',
                    'oe': 'utf-8',
                    'adpicid': '',
                    'st': '-1',
                    'z': '',
                    'ic': '0',
                    'hd': '',
                    'latest': '',
                    'copyright': '',
                    'word': keywords,
                    's': '',
                    'se': '',
                    'tab': '',
                    'width': '',
                    'height': '',
                    'face': '0',
                    'istype': '2',
                    'qc': '',
                    'nc': '1',
                    'fr': '',
                    'expermode': '',
                    'force': '',
                    'cg': 'girl',
                    'pn': i,
                    'rn': '30',
                    'gsm': '',
                    '1584010126096': ''
                }
                # Passing headers here is what avoids the TooManyRedirects error
                res = requests.get(url = self.url,headers = self.headers, params = self.params).json()['data']
                print("---------res=", res)
                for j in range(0,30):
                    self.varlist.append(res[j]['thumbURL'])
            # An empty list means nothing was collected
            if self.varlist:
                print(self.varlist)
                return True
            return False

        def writeData(self, pt):
            # Create the target directory if it does not exist
            pt = pt + 'images/'
            if not os.path.exists(pt):
                os.mkdir(pt)
            print(pt)
            for i in range(0,page_num * 30):
                print(f'Downloading item {i}')
                images_data = requests.get(self.varlist[i])
                images_content = images_data.content
                # The with block closes the file once the image is written
                with open(pt + f'{i}.jpg','wb') as f:
                    f.write(images_content)

    if __name__ == '__main__':
        # sys.argv[1] is the first argument passed on the command line; multiple arguments are separated by spaces
        print("Argument [1]:", sys.argv[1])
        pt = sys.argv[1]
        # pt = 'E:/图片视频/'
        pt1 = pt.replace('\\', '/')
        print('path',pt1)
        im= Image(pt1)
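    The argument handling at the end of the script assumes that a directory is always passed and that it ends with a separator, which holds for %~dp0. A slightly more defensive sketch follows. `resolve_output_dir` is a hypothetical name, and falling back to the current working directory when no argument is given is an assumption, not something the original script does. Stripping a trailing double quote covers the case where the batch file passes a quoted path ending in a backslash, which Windows argument parsing can turn into a literal quote.

```python
import os


def resolve_output_dir(argv, subdir='images'):
    """Return (and create) the folder where downloaded images should go.

    argv is expected to look like sys.argv: [script_name, optional_directory].
    """
    # Assumption: with no argument, fall back to the current working directory.
    base = argv[1] if len(argv) > 1 else os.getcwd()
    # A quoted path that ends in a backslash may arrive with a stray trailing
    # quote, so drop it along with any surrounding whitespace.
    base = base.strip().rstrip('"')
    # os.path.join copes with a base that has or lacks a trailing separator.
    target = os.path.join(base, subdir)
    os.makedirs(target, exist_ok=True)  # no error if it already exists
    return target
```

    In the script this would be called as `resolve_output_dir(sys.argv)`, and each image path would then be built with `os.path.join(target, f'{i}.jpg')` instead of string concatenation.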
  • Original post: https://www.cnblogs.com/qgc1995/p/15111325.html