zoukankan      html  css  js  c++  java
  • 如何自建数据集

    自建数据集

    (1) 学会使用爬虫爬取图像和视频,从视频中提取图片。

    (2) 对获得的图片数据进行整理,包括重命名,格式统一,去重。

    爬取图片

    有些任务没有直接对应的开源数据集,或者开源数据集中的数据比较少,这就需要我们通过搜索引擎自行爬取图片。

    百度图片爬虫

    Download images from Google, Bing, Baidu. 谷歌、百度、必应图片下载

    Google, Naver multiprocess image web crawler (Selenium)

    数据集整理

    爬取的如果是视频需要先转换成图片,如果是图片就要做好统一格式、数据清洗的工作。

    视频转换成图片

    使用爬虫爬取数据,如果是视频可以使用 python getimagefromvideo.py <video_path> 将视频转换为图片

    #coding:utf8
    import cv2
    import dlib
    import numpy as np
    import sys
    import os
    
    video_capture = cv2.VideoCapture(sys.argv[1])
    video_id = sys.argv[1].split('.')[0]
    os.mkdir(video_id)
    
    count = 0
    while True:
        is_sucessfully_read, im = video_capture.read()
        if is_sucessfully_read == False:
            break
        cv2.imwrite(os.path.join(video_id,str(count)+'.jpg'),im)
        print "image shape=",im.shape
        count = count + 1
        print count
    

    统一图片后缀格式

    统一后缀格式可以减少以后写数据 API 时的压力,也可以测试图片是不是可以正常的读取,及时防止未知问题的出现,这很重要。

    使用 python reformat_image.py <images_folder_path> 将图片全部转换为 jpg 格式,这也是所有框架支持的格式。

    import os
    import sys
    import cv2
    import numpy as np
    
    def listfiles(rootDir):
        list_dirs = os.walk(rootDir)
        for root, dirs, files in list_dirs:
            for d in dirs:
                print os.path.join(root,d)
            for f in files:
                    
                fileid = f.split('.')[0] 
    
                filepath = os.path.join(root,f)
                try:
                    src = cv2.imread(filepath,1)
                    print "src=",filepath,src.shape
                    os.remove(filepath)
                    cv2.imwrite(os.path.join(root,fileid+".jpg"),src)
                except:
                    os.remove(filepath)
                    continue
    
    listfiles(sys.argv[1])
    

    按格式重命名图片

    统一格式的命名有利于区分和整理数据

    mkdir tmp
    ./rename_files_function.sh <images_folder_path> ./tmp/ <label>
    
    i=0
    dir=$1
    resultdir=$2
    app=$3
    for file in $dir""* 
    do 
    	arr=$(echo $file | tr "/" "
    ")
    	for x in $arr
    	do
    		filename=$x
    	done
    
    	brr=$(echo $filename | tr "." "
    ")
    	brrs=( $brr )
    	fileid=${brrs[0]}
    
    	num=${#brrs[@]}  
    	index=$(expr $num - 1)
    	fileformat=${brrs[index]}
    	
    	echo file=""$file
    	echo fileid=""$fileid
    	echo fileformat=""$fileformat
    
    	if [ $fileformat == jpeg -o $fileformat == png -o $fileformat == jpg -o $fileformat == bmp ] ;
        	then
            #echo "good"
    		i=$(expr $i + 1)
    		resultfile=$resultdir""$app""$i"".$fileformat
    		echo file=""$file"",resultfile=""$resultfile
    		mv "$file" "$resultfile"
    	else
    		echo $file""not good
        fi
    done
    
    echo 执行删除""$dir""*
    #rm $dir""*
    echo 执行mv""$resultdir""*
    mv $resultdir""* $dir
    

    去重

    如果你使用多个关键词或者使用不同的搜索引擎同样的关键词,或者从视频中提取图片,那么爬取回来的图片很可能有重复或者非常的相似,这样的样本应该被去除。

    去除有很多种方法,比如直接比较两幅图像是不是完全相同,通过 hash 等相似度方法来进行相似度,这里我们提供一个方法,利用相似度来进行去重。

    # sudo pip install python-Levenshtein
    conda install -c conda-forge python-levenshtein 
    python remove_repeat.py <image_path>
    
    https://anaconda.org/conda-forge/python-levenshtein
    
    #!/usr/bin/env python
    #coding:utf8
    import math
    
    from PIL import Image
    import Levenshtein
    
    class BWImageCompare(object):
        """Compares two images (b/w)."""
    
        _pixel = 255
        _colour = False
    
        def __init__(self, imga, imgb, maxsize=64):
            """Save a copy of the image objects."""
    
            sizea, sizeb = imga.size, imgb.size
    
            newx = min(sizea[0], sizeb[0], maxsize)
            newy = min(sizea[1], sizeb[1], maxsize)
    
            # Rescale to a common size:
            imga = imga.resize((newx, newy), Image.BICUBIC)
            imgb = imgb.resize((newx, newy), Image.BICUBIC)
    
            if not self._colour:
                # Store the images in B/W Int format
                imga = imga.convert('I')
                imgb = imgb.convert('I')
    
            self._imga = imga
            self._imgb = imgb
    
            # Store the common image size
            self.x, self.y = newx, newy
    
        def _img_int(self, img):
            """Convert an image to a list of pixels."""
    
            x, y = img.size
    
            for i in xrange(x):
                for j in xrange(y):
                    yield img.getpixel((i, j))
    
        @property
        def imga_int(self):
            """Return a tuple representing the first image."""
    
            if not hasattr(self, '_imga_int'):
                self._imga_int = tuple(self._img_int(self._imga))
    
            return self._imga_int
    
        @property
        def imgb_int(self):
            """Return a tuple representing the second image."""
    
            if not hasattr(self, '_imgb_int'):
                self._imgb_int = tuple(self._img_int(self._imgb))
    
            return self._imgb_int
    
        @property
        def mse(self):
            """Return the mean square error between the two images."""
    
            if not hasattr(self, '_mse'):
                tmp = sum((a-b)**2 for a, b in zip(self.imga_int, self.imgb_int))
                self._mse = float(tmp) / self.x / self.y
    
            return self._mse
    
        @property
        def psnr(self):
            """Calculate the peak signal-to-noise ratio."""
    
            if not hasattr(self, '_psnr'):
                self._psnr = 20 * math.log(self._pixel / math.sqrt(self.mse), 10)
    
            return self._psnr
    
        @property
        def nrmsd(self):
            """Calculate the normalized root mean square deviation."""
    
            if not hasattr(self, '_nrmsd'):
                self._nrmsd = math.sqrt(self.mse) / self._pixel
    
            return self._nrmsd
    
        @property
        def levenshtein(self):
            """Calculate the Levenshtein distance."""
    
            if not hasattr(self, '_lv'):
                stra = ''.join((chr(x) for x in self.imga_int))
                strb = ''.join((chr(x) for x in self.imgb_int))
    
                lv = Levenshtein.distance(stra, strb)
    
                self._lv = float(lv) / self.x / self.y
    
            return self._lv
    
    
    class ImageCompare(BWImageCompare):
        """Compares two images (colour)."""
    
        _pixel = 255 ** 3
        _colour = True
    
        def _img_int(self, img):
            """Convert an image to a list of pixels."""
    
            x, y = img.size
    
            for i in xrange(x):
                for j in xrange(y):
                    pixel = img.getpixel((i, j))
                    yield pixel[0] | (pixel[1]<<8) | (pixel[2]<<16)
    
        @property
        def levenshtein(self):
            """Calculate the Levenshtein distance."""
    
            if not hasattr(self, '_lv'):
                stra_r = ''.join((chr(x>>16) for x in self.imga_int))
                strb_r = ''.join((chr(x>>16) for x in self.imgb_int))
                lv_r = Levenshtein.distance(stra_r, strb_r)
    
                stra_g = ''.join((chr((x>>8)&0xff) for x in self.imga_int))
                strb_g = ''.join((chr((x>>8)&0xff) for x in self.imgb_int))
                lv_g = Levenshtein.distance(stra_g, strb_g)
    
                stra_b = ''.join((chr(x&0xff) for x in self.imga_int))
                strb_b = ''.join((chr(x&0xff) for x in self.imgb_int))
                lv_b = Levenshtein.distance(stra_b, strb_b)
    
                self._lv = (lv_r + lv_g + lv_b) / 3. / self.x / self.y
    
            return self._lv
    
    
    class FuzzyImageCompare(object):
        """Compares two images based on the previous comparison values."""
    
        def __init__(self, imga, imgb, lb=1, tol=15):
            """Store the images in the instance."""
    
            self._imga, self._imgb, self._lb, self._tol = imga, imgb, lb, tol
    
        def compare(self):
            """Run all the comparisons."""
    
            if hasattr(self, '_compare'):
                return self._compare
    
            lb, i = self._lb, 2
    
            diffs = {
                'levenshtein': [],
                'nrmsd': [],
                'psnr': [],
            }
    
            stop = {
                'levenshtein': False,
                'nrmsd': False,
                'psnr': False,
            }
    
            while not all(stop.values()):
                cmp = ImageCompare(self._imga, self._imgb, i)
    
                diff = diffs['levenshtein']
                if len(diff) >= lb+2 and 
                    abs(diff[-1] - diff[-lb-1]) <= abs(diff[-lb-1] - diff[-lb-2]):
                    stop['levenshtein'] = True
                else:
                    diff.append(cmp.levenshtein)
    
                diff = diffs['nrmsd']
                if len(diff) >= lb+2 and 
                    abs(diff[-1] - diff[-lb-1]) <= abs(diff[-lb-1] - diff[-lb-2]):
                    stop['nrmsd'] = True
                else:
                    diff.append(cmp.nrmsd)
    
                diff = diffs['psnr']
                if len(diff) >= lb+2 and 
                    abs(diff[-1] - diff[-lb-1]) <= abs(diff[-lb-1] - diff[-lb-2]):
                    stop['psnr'] = True
                else:
                    try:
                        diff.append(cmp.psnr)
                    except ZeroDivisionError:
                        diff.append(-1)  # to indicate that the images are identical
    
                i *= 2
    
            self._compare = {
                'levenshtein': 100 - diffs['levenshtein'][-1] * 100,
                'nrmsd': 100 - diffs['nrmsd'][-1] * 100,
                'psnr': diffs['psnr'][-1] == -1 and 100.0 or diffs['psnr'][-1],
            }
    
            return self._compare
    
        def similarity(self):
            """Try to calculate the image similarity."""
    
            cmp = self.compare()
    
            lnrmsd = (cmp['levenshtein'] + cmp['nrmsd']) / 2
            return lnrmsd
            return min(lnrmsd * cmp['psnr'] / self._tol, 100.0)  # TODO: fix psnr!
    
    
    if __name__ == '__main__':
    
        import sys
        import os
        
        srcimages = os.listdir(sys.argv[1])
        srcimages.sort()
       
        tot = len(srcimages)
        tot = (tot ** 2 - tot) / 2
    
        print 'Comparing %d images:' % tot
    	
        images = {}
    
        ###向后删除图片
        similarity_thresh = 0.5 ##相似度阈值,超过即判断为相同图片
        i = 0
        while(i < len(srcimages)-1):
            print "i=", i,"num of srcimages",len(srcimages)
    
            imga = Image.open(os.path.join(sys.argv[1],srcimages[i]))
            imgb = Image.open(os.path.join(sys.argv[1],srcimages[i+1]))
            cmp = FuzzyImageCompare(imga, imgb)
            sim = cmp.similarity() / 100
            print "image ",os.path.join(sys.argv[1],srcimages[i])," and image",os.path.join(sys.argv[1],srcimages[i+1])," sim=",sim
            if sim > similarity_thresh:
                print "delete ",os.path.join(sys.argv[1],srcimages[i+1])
                os.remove(os.path.join(sys.argv[1],srcimages[i+1]))
                srcimages.pop(i+1)
            else:
                i = i+1 
    
        '''
        results, i = {}, 1
        for namea, imga in images.items():
            for nameb, imgb in images.items():
                if namea == nameb or (nameb, namea) in results:
                    continue
                print ' * %2d / %2d:' % (i, tot),
                print namea, nameb, '...',
                cmp = FuzzyImageCompare(imga, imgb)
                sim = cmp.similarity()
                results[(namea, nameb)] = sim
                print '%.2f %%' % sim
                i += 1
        res = max(results.values())
        imgs = [k for k, v in results.iteritems() if v == res][0]
        print 'Most similar images: %s %s (%.2f %%)' % (imgs[0], imgs[1], res)
        
        '''
    

    在此之后还需要自己手动筛选图片,工作量其实也不小,不过经过去重还是可以减少不少工作量的。

    数据集标注

    爬取的图片需要自己标注,可以使用下面这些标注工具。

    https://github.com/tzutalin/labelImg

    LabelImg is a graphical image annotation tool and label object bounding boxes in images https://youtu.be/p0nR2YsCY_U

    https://github.com/wkentaro/labelme

    Image Polygonal Annotation with Python (polygon, rectangle, circle, line, point and image-level flag annotation).

    https://github.com/Microsoft/VoTT

    Visual Object Tagging Tool: An electron app for building end to end Object Detection Models from Images and Videos.

    数据集划分

    一般会按照 8:1:1 将数据集划分为训练集、验证集、测试集。这个要根据自己的情况编写 shell 脚本,下面是我用 darknet 训练 yolov3 模型时划分数据的脚本。

    #!/bin/sh
    
    if [ $# != 1 ];then
        echo "Usage: $0 <full path>"
        exit -1
    fi
    
    path=$1
    
    for sub_dir in `ls $path`
        do
            # 获取子文件夹的全路径
            sub_dir_path=$path/$sub_dir
            if [ -d $sub_dir_path ]
                then
                    # 将子目录下所有文件移动到父目录中
                    `mv $sub_dir_path/* $path`
                    # 删除子目录
                    `rm -rf $sub_dir_path`
            fi
    
            # 给所有文件添加前缀
    
        done
    
    `rm tmp.txt`
    # 将文件夹下指定类型的文件写到文件中
    # ***** 问题:最后会有个空行 *****
    # 图片文件存在对应的 txt 文件,则将图片路径追加到 tmp.txt 文件中
    for image in `find $path | grep -E 'jpg|png|JPEG|JPG|PNG'`
    do
        
        txt=${image%.*}".txt"
    
        if [ -f $txt ]
            then
                echo ${image} 
                `echo ${image} >> tmp.txt`
        fi
        
    done
    
    # 将路径 8:1:1 放到 train.txt,val.txt,test.txt
    # 1. 计算 tmp,txt 文件行数
    # 2. 计算得出分配到各个文件的行号
    # 3. 将对应行数的内容写到对应文件夹中
    line=`cat tmp.txt | wc -l`
    line1=$(($line/10*8))
    line2=$(($line/10*8+line/10+1))
    
    `sed -n 1,${line1}p tmp.txt >> train.txt`
    `sed -n $((${line1}+1)),${line2}p tmp.txt >> val.txt`
    `sed -n $((${line2}+1)),$((${line}-1))p tmp.txt >> test.txt`
    

    yolov3 的标注格式如下所示

    9 0.732955 0.591102 0.270317 0.193503
    

    统计标签的时候可以使用

    awk '{print $1}' *.txt | sort -g | uniq -c
    

    以上就是自己建立一个数据集的流程:爬取图片->整理图片->标注图片->训练。

  • 相关阅读:
    Python基础之元组tuple(带了枷锁的列表)
    Python基础之元组tuple(带了枷锁的列表)
    Python基础之列表
    Python基础之列表
    穷举法解决这个问题(1,2)
    二分查找
    正則表達式
    Restful WebService简介
    杭电1285确定比赛名次
    ACdream区域赛指导赛之手速赛系列(5) 题解
  • 原文地址:https://www.cnblogs.com/shuiyj/p/13185113.html
Copyright © 2011-2022 走看看