  • Building a Dataset (Part 2): Using Bing for a Faster, Cleaner Dataset!

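    Background: the previous method for building a dataset worked, but the resulting dataset was not clean (it contained many advertisements, for example), so this time the images are collected with Bing. Note that although Bing can be reached without a VPN, you do need to register some information, which is not much trouble.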

    Method: first, here is the original article by the author: https://www.pyimagesearch.com/2018/04/09/how-to-quickly-build-a-deep-learning-image-dataset/

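    • Register an account for Microsoft's Bing Image Search API, which offers a 7-day free trial: https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/
    • After registering successfully, you will see a page listing your keys; make a note of Key 1 and Key 2.
    • It is worth reading two pieces of the Bing documentation: the one describing the format of the .json response, and the one explaining the meaning of count and offset. Both make the code below much easier to follow.
    • Next, install the requests package in your virtual environment (the script also imports cv2, so OpenCV must be installed as well, for example via the opencv-python package):
    • $ pip install requests
    • Create a new file named "search_bing_api.py" and copy the following code into it: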
        # import the necessary packages
        from requests import exceptions
        import argparse
        import requests
        import cv2
        import os
        # construct the argument parser and parse the arguments
        ap = argparse.ArgumentParser()
        ap.add_argument("-q", "--query", required=True,
            help="search query to search Bing Image API for")    # the term you want to search for
        ap.add_argument("-o", "--output", required=True,
            help="path to output directory of images")
        args = vars(ap.parse_args())
        
        # store the images in a subdirectory of the output path named after
        # the query, creating it if it does not exist yet
        args['output'] = os.path.join(args['output'], args['query'])
        os.makedirs(args['output'], exist_ok=True)
        
        # set your Microsoft Cognitive Services API key along with (1) the
        # maximum number of results for a given search and (2) the group size
        # for results (the number of results requested per call)
        API_KEY = "YOUR_API_KEY_GOES_HERE"    # paste Key 1 or Key 2 from the registration step here
        MAX_RESULTS = 250
        GROUP_SIZE = 50    # sent as `count`; the API allows at most 150 per request, see the documentation on count and offset
        # set the endpoint API URL
        URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
        
        # when attempting to download images from the web both the Python
        # programming language and the requests library have a number of
        # exceptions that can be thrown so let's build a list of them now
        # so we can filter on them
        EXCEPTIONS = set([IOError, FileNotFoundError,
            exceptions.RequestException, exceptions.HTTPError,
            exceptions.ConnectionError, exceptions.Timeout])
        
        # store the search term in a convenience variable then set the
        # headers and search parameters
        term = args["query"]
        headers = {"Ocp-Apim-Subscription-Key" : API_KEY}
        params = {"q": term, "offset": 0, "count": GROUP_SIZE}
        # make the search
        print("[INFO] searching Bing API for '{}'".format(term))
        search = requests.get(URL, headers=headers, params=params)
        search.raise_for_status()
        # grab the results from the search, including the total number of
        # estimated results returned by the Bing API
        results = search.json()
        estNumResults = min(results["totalEstimatedMatches"], MAX_RESULTS)
        print("[INFO] {} total results for '{}'".format(estNumResults,
            term))
        # initialize the total number of images downloaded thus far
        total = 0
        
        # loop over the estimated number of results in `GROUP_SIZE` groups
        for offset in range(0, estNumResults, GROUP_SIZE):
            # update the search parameters using the current offset, then
            # make the request to fetch the results
            print("[INFO] making request for group {}-{} of {}...".format(
                offset, offset + GROUP_SIZE, estNumResults))
            params["offset"] = offset
            search = requests.get(URL, headers=headers, params=params)
            search.raise_for_status()
            results = search.json()
            print("[INFO] saving images for group {}-{} of {}...".format(
                offset, offset + GROUP_SIZE, estNumResults))
        
            # loop over the results
            for v in results["value"]:
                # try to download the image
                try:
                    # make a request to download the image
                    print("[INFO] fetching: {}".format(v["contentUrl"]))
                    r = requests.get(v["contentUrl"], timeout=30)
                    # build the path to the output image
                    ext = v["contentUrl"][v["contentUrl"].rfind("."):]
                    p = os.path.sep.join([args["output"], "{}{}".format(
                        str(total).zfill(8), ext)])
                    # write the image to disk
                    with open(p, "wb") as f:
                        f.write(r.content)
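                # catch any errors that would prevent us from downloading the
                # image
                except Exception as e:
                    # skip this URL if the error is one of the expected
                    # download errors; isinstance also matches subclasses,
                    # which an exact `type(e) in EXCEPTIONS` test would miss
                    if isinstance(e, tuple(EXCEPTIONS)):
                        print("[INFO] skipping: {}".format(v["contentUrl"]))
                        continue
                    # anything else is unexpected, so re-raise it rather than
                    # carry on with a missing or stale file path
                    raise
                # try to load the image from disk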
                image = cv2.imread(p)
                # if the image is `None` then we could not properly load the
                # image from disk (so it should be ignored)
                if image is None:
                    print("[INFO] deleting: {}".format(p))
                    os.remove(p)
                    continue
                # update the counter
                total += 1
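
The script above takes the file extension with `rfind(".")` on the whole URL, so a URL that carries a query string or has a dot only in a directory name can yield an odd extension. The following is a minimal sketch of one possible alternative, not part of the original script: `guess_extension`, the `ALLOWED_EXTS` whitelist, and the `.jpg` fallback are all hypothetical choices made for illustration.

```python
import os
from urllib.parse import urlparse

# Assumed whitelist of extensions worth keeping (illustrative only).
ALLOWED_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def guess_extension(url, default=".jpg"):
    """Return a lowercase extension taken from the URL path only.

    Hypothetical helper: falls back to `default` when the path has no
    recognised extension.
    """
    # urlparse(...).path drops any "?query" and "#fragment" parts
    path = urlparse(url).path
    # splitext only looks at the last path component
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in ALLOWED_EXTS else default


if __name__ == "__main__":
    print(guess_extension("https://example.com/a/b/photo.PNG?width=500"))
    print(guess_extension("https://example.com/v1.2/file"))
```

In the script, such a helper could stand in for the `ext = ...` line; files that OpenCV cannot read are still removed afterwards by the `cv2.imread` check.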
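    • Run the script from a terminal, passing the search term and the output directory. Because the script appends the query to the output path and creates that folder itself, pass the parent directory:
        $ python search_bing_api.py --query "charmander" --output dataset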
        [INFO] searching Bing API for 'charmander'
        [INFO] 250 total results for 'charmander'
        [INFO] making request for group 0-50 of 250...
        [INFO] saving images for group 0-50 of 250...
        [INFO] fetching: https://fc06.deviantart.net/fs70/i/2012/355/8/2/0004_c___charmander_by_gaghiel1987-d5oqbts.png
        [INFO] fetching: https://th03.deviantart.net/fs71/PRE/f/2010/067/5/d/Charmander_by_Woodsman819.jpg
        [INFO] fetching: https://fc05.deviantart.net/fs70/f/2011/120/8/6/pokemon___charmander_by_lilnutta10-d2vr4ov.jpg
        ...
        [INFO] making request for group 50-100 of 250...
        [INFO] saving images for group 50-100 of 250...
        ...
        [INFO] fetching: https://38.media.tumblr.com/f0fdd67a86bc3eee31a5fd16a44c07af/tumblr_nbhf2vTtSH1qc9mvbo1_500.gif
        [INFO] deleting: dataset/charmander/00000174.gif
        ...
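
The "group 0-50 of 250" lines in the log come from stepping the `offset` parameter in increments of the group size. As a rough illustration of how `count` and `offset` divide a result set into requests, here is a small sketch; `page_requests` is a hypothetical helper, and unlike the script above (which always sends `count=GROUP_SIZE`) it trims the final page to the number of results still needed.

```python
def page_requests(total, group_size):
    """Return (offset, count) pairs that together cover `total` results.

    Hypothetical helper for illustration: `offset` is the number of
    results to skip, `count` is how many to ask for in that request.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    pages = []
    for offset in range(0, total, group_size):
        # the last page may need fewer than group_size results
        count = min(group_size, total - offset)
        pages.append((offset, count))
    return pages


if __name__ == "__main__":
    # same numbers as MAX_RESULTS and GROUP_SIZE in the script
    for offset, count in page_requests(250, 50):
        print("offset={} count={}".format(offset, count))
```

With 250 results and a group size of 50 this gives five requests, at offsets 0, 50, 100, 150 and 200, which matches the groups reported in the log.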
  • Original post: https://www.cnblogs.com/LuckBelongsToStrugglingMan/p/12901158.html