  • Building a dataset (part 1): batch-downloading images from Google Images

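    Background: for a project, I wanted to build a dataset of photos of flat, reflective materials, such as wood-grain tabletops and door panels, flat tiled floors and walls, and reflective metal surfaces. I could not find an existing dataset that met these needs, so I built my own.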

    Method: first, here is the link to the original article: https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/

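    • Search Google Images for the pictures you want to download. This method only works in Google Chrome. In mainland China, Google can be reached through the GHelper extension; search online for how to install it.
    • The steps are as follows:
    • 1. Search Google Images for "C Ronaldo".
    • 2. Scroll down the page until every image you want to download has loaded.
    • 3. Press "ctrl + shift + j" to open the JavaScript console.
    • 4. Paste the code below into the console. A download prompt for a file named 'urls.txt' then appears; choose a suitable location and save it.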
    • /**
       * simulate a right-click event so we can grab the image URL using the
       * context menu alleviating the need to navigate to another page
       *
       * attributed to @jmiserez: http://pyimg.co/9qe7y
       *
       * @param   {object}  element  DOM Element
       *
       * @return  {void}
       */
      function simulateRightClick( element ) {
          var event1 = new MouseEvent( 'mousedown', {
              bubbles: true,
              cancelable: false,
              view: window,
              button: 2,
              buttons: 2,
              clientX: element.getBoundingClientRect().x,
              clientY: element.getBoundingClientRect().y
          } );
          element.dispatchEvent( event1 );
          var event2 = new MouseEvent( 'mouseup', {
              bubbles: true,
              cancelable: false,
              view: window,
              button: 2,
              buttons: 0,
              clientX: element.getBoundingClientRect().x,
              clientY: element.getBoundingClientRect().y
          } );
          element.dispatchEvent( event2 );
          var event3 = new MouseEvent( 'contextmenu', {
              bubbles: true,
              cancelable: false,
              view: window,
              button: 2,
              buttons: 0,
              clientX: element.getBoundingClientRect().x,
              clientY: element.getBoundingClientRect().y
          } );
          element.dispatchEvent( event3 );
      }
      
      /**
       * grabs a URL Parameter from a query string because Google Images
       * stores the full image URL in a query parameter
       *
       * @param   {string}  queryString  The Query String
       * @param   {string}  key          The key to grab a value for
       *
       * @return  {string}               value
       */
      function getURLParam( queryString, key ) {
          var vars = queryString.replace( /^\?/, '' ).split( '&' );
          for ( let i = 0; i < vars.length; i++ ) {
              let pair = vars[ i ].split( '=' );
              if ( pair[0] == key ) {
                  return pair[1];
              }
          }
          return false;
      }
      
      
      
      /**
       * Generate and automatically download a txt file from the URL contents
       *
       * @param   {string}  contents  The contents to download
       *
       * @return  {void}
       */
      function createDownload( contents ) {
          var hiddenElement = document.createElement( 'a' );
          hiddenElement.href = 'data:attachment/text,' + encodeURI( contents );
          hiddenElement.target = '_blank';
          hiddenElement.download = 'urls.txt';
          hiddenElement.click();
      }
      
      /**
       * grab all URLs via a Promise that resolves once all URLs have been
       * acquired
       *
       * @return  {object}  Promise object
       */
      function grabUrls() {
          var urls = [];
          return new Promise( function( resolve, reject ) {
              var count = document.querySelectorAll(
                  '.isv-r a:first-of-type' ).length,
                  index = 0;
              Array.prototype.forEach.call( document.querySelectorAll(
                  '.isv-r a:first-of-type' ), function( element ) {
                  // using the right click menu Google will generate the
                  // full-size URL; won't work in Internet Explorer
                  // (http://pyimg.co/byukr)
                  simulateRightClick( element.querySelector( ':scope img' ) );
                  // Wait for it to appear on the <a> element
                  var interval = setInterval( function() {
                      if ( element.href.trim() !== '' ) {
                          clearInterval( interval );
                          // extract the full-size version of the image
                          let googleUrl = element.href.replace( /.*(\?)/, '$1' ),
                              fullImageUrl = decodeURIComponent(
                                  getURLParam( googleUrl, 'imgurl' ) );
                          if ( fullImageUrl !== 'false' ) {
                              urls.push( fullImageUrl );
                          }
                          // sometimes the URL returns a "false" string and
                          // we still want to count those so our Promise
                          // resolves
                          index++;
                          if ( index == ( count - 1 ) ) {
                              resolve( urls );
                          }
                      }
                  }, 10 );
              } );
          } );
      }
      
      /**
       * Call the main function to grab the URLs and initiate the download
       */
      grabUrls().then( function( urls ) {
          urls = urls.join( '\n' );
          createDownload( urls );
      } );
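    • The `getURLParam` helper above exists because each result link carries the full-size image URL in its `imgurl` query parameter. The same extraction can be sketched in Python with the standard library. This is only an illustration: the function name `extract_imgurl` and the example link are hypothetical, and it assumes the link has the `...?imgurl=<encoded URL>&...` shape that the JavaScript snippet reads.

```python
from urllib.parse import urlsplit, parse_qs


def extract_imgurl(google_href):
    """Return the decoded `imgurl` query parameter, or None if it is absent."""
    query = urlsplit(google_href).query
    params = parse_qs(query)  # parse_qs also percent-decodes each value
    values = params.get("imgurl")
    return values[0] if values else None


# Hypothetical link, shaped like the hrefs the console script inspects.
href = ("https://www.google.com/imgres"
        "?imgurl=https%3A%2F%2Fexample.com%2Fwood.jpg"
        "&imgrefurl=https%3A%2F%2Fexample.com%2F")
print(extract_imgurl(href))
```

      Returning `None` for a missing parameter avoids the string `'false'` comparison that the JavaScript version needs after `decodeURIComponent`.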
    • Downloading the images with Python
    • Create a file named "download_images.py", install the packages it needs inside a virtual environment, and copy the code below into "download_images.py".
    • # import the necessary packages
      from imutils import paths
      import argparse
      import requests
      import cv2
      import os
      # construct the argument parse and parse the arguments
      ap = argparse.ArgumentParser()
      ap.add_argument("-u", "--urls", required=True,
          help="path to file containing image URLs")    # path to the urls.txt saved earlier
      ap.add_argument("-o", "--output", required=True,
          help="path to output directory of images")    # directory where the images are saved
      args = vars(ap.parse_args())
      # grab the list of URLs from the input file, then initialize the
      # total number of images downloaded thus far
      rows = open(args["urls"]).read().strip().split("\n")
      total = 0
      # loop the URLs
      for url in rows:
          # build the output path first so it is defined even if the
          # download fails on the very first URL
          p = os.path.sep.join([args["output"], "{}.jpg".format(
              str(total).zfill(8))])
          try:
              # try to download the image
              r = requests.get(url, timeout=60)
              # save the image to disk
              with open(p, "wb") as f:
                  f.write(r.content)
              # update the counter
              print("[INFO] downloaded: {}".format(p))
              total += 1
          # handle if any exceptions are thrown during the download process
          except Exception:
              print("[INFO] error downloading {}...skipping".format(p))
      # loop over the image paths we just downloaded
      for imagePath in paths.list_images(args["output"]):
          # initialize if the image should be deleted or not
          delete = False
          # try to load the image
          try:
              image = cv2.imread(imagePath)
              # if the image is `None` then we could not properly load it
              # from disk, so delete it
              if image is None:
                  delete = True
          # if OpenCV cannot load the image then the image is likely
          # corrupt so we should delete it
          except:
              print("Except")
              delete = True
          # check to see if the image should be deleted
          if delete:
              print("[INFO] deleting {}".format(imagePath))
              os.remove(imagePath)
    • Run the script, adjusting the two paths as needed:
      $ python download_images.py --urls urls.txt --output images/santa
      [INFO] downloaded: images/santa/00000000.jpg
      [INFO] downloaded: images/santa/00000001.jpg
      [INFO] downloaded: images/santa/00000002.jpg
      [INFO] downloaded: images/santa/00000003.jpg
      ...
      [INFO] downloaded: images/santa/00000519.jpg
      [INFO] error downloading images/santa/00000519.jpg...skipping
      [INFO] downloaded: images/santa/00000520.jpg
      ...
      [INFO] deleting images/santa/00000211.jpg
      [INFO] deleting images/santa/00000199.jpg
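    • The clean-up loop above deletes files that `cv2.imread` cannot load, which typically happens when a URL returned an HTML error page instead of an image. A lighter check on the raw bytes can be sketched as follows. It is not part of the original script: the name `sniff_image_type` and the choice of three formats are illustrative assumptions, and it only inspects the leading file signature, so it does not detect truncated or otherwise corrupt images.

```python
# File signatures (magic bytes) for a few common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_image_type(data):
    """Return the format name whose signature prefixes `data`, else None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# A JPEG-like header versus the start of an HTML error page.
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 8))
print(sniff_image_type(b"<!DOCTYPE html>"))
```

      In the download loop, `r.content` could be passed to such a function before the file is written, so that obvious non-images are skipped without a later decode attempt.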
  • Original article: https://www.cnblogs.com/LuckBelongsToStrugglingMan/p/12900993.html