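While processing images crawled by a colleague, I found that a large batch of them were corrupted, caused by image-type and network problems during crawling. The corrupted files need to be removed and a brief record kept.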
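Key points:
In Python, the what() function in the imghdr module can be used to check whether an image file is corrupted. It inspects the file header and returns None if the header matches no known image format; otherwise it returns the image type as a string, such as 'jpeg'. Because only the header is examined, a file that was cut off after a valid header will still be reported as an image. Note that imghdr was deprecated in Python 3.11 and removed in Python 3.13. See the imghdr documentation: https://docs.python.org/3/library/imghdr.html
The progressbar module shows how far the processing has progressed.
The os module handles operations on local folders and files.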
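To make the header check more concrete, here is a minimal sketch of the kind of signature matching a header-based detector performs. It is a simplified illustration, not imghdr's actual implementation: the names `SIGNATURES` and `sniff_image_type` are hypothetical, and only three formats are covered.

```python
# Simplified illustration of header-based image type detection.
# SIGNATURES and sniff_image_type are hypothetical names, not part of imghdr.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(header: bytes):
    """Return a type name if the leading bytes match a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
truncated_jpeg = b"\xff\xd8\xff\xe0"  # valid start, but no image data follows
html_error_page = b"<html><body>404</body></html>"  # e.g. a failed download

print(sniff_image_type(png_header))       # png
print(sniff_image_type(truncated_jpeg))   # jpeg
print(sniff_image_type(html_error_page))  # None
```

The second case shows the limitation of any header-only check: a download that was interrupted after the first few bytes still looks like a JPEG. Catching that would require actually decoding the image, which is outside what this check does.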
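Workflow:
Select the folder that holds the images to process (including the files it contains), collect the set of images, and determine each file's type. If a file is corrupted (its type is None), delete it and record its path in a local txt file.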
Code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# __author__ = "NYA"
"""
Use imghdr.what() to detect image types and remove corrupted files.
"""
import os
import imghdr  # deprecated in Python 3.11, removed in Python 3.13
from progressbar import ProgressBar

path = '/home/lab/images'
original_images = []

# This way of collecting every image under the folder is not suitable for large datasets
'''
for root, dirs, filenames in os.walk(path):
    for filename in filenames:
        original_images.append(os.path.join(root, filename))
'''
for file in os.listdir(path):
    file_path = os.path.join(path, file)
    # Skip subdirectories: imghdr.what() can only open regular files
    if os.path.isfile(file_path):
        original_images.append(file_path)
original_images = sorted(original_images)
print('totalNum:', len(original_images))

error_images = []
progress = ProgressBar()
# Open the log in text mode so str paths can be written; one path per line
with open('/home/lab/check_error.txt', 'w', encoding='utf-8') as f:
    for filename in progress(original_images):
        # what() returns None when the header matches no known image format
        if imghdr.what(filename) is None:
            f.write(filename + '\n')
            os.remove(filename)
            error_images.append(filename)
print('errorFileNum:', len(error_images))
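The comment in the code notes that collecting every path up front does not suit large datasets. One possible way to ease this, assuming the concern is holding and sorting the full path list in memory, is to stream paths from a generator and check each file as it is found. The sketch below is only an illustration: `iter_files`, `remove_broken`, the `dry_run` flag and the stand-in check `is_empty` are all hypothetical names, and the check is passed in as a function so the example runs without imghdr.

```python
import os
import tempfile


def iter_files(root):
    """Yield file paths under root one at a time, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def remove_broken(root, is_broken, log_path, dry_run=True):
    """Log every file for which is_broken(path) is true; delete it unless dry_run."""
    flagged = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for path in iter_files(root):
            if is_broken(path):
                log.write(path + "\n")
                if not dry_run:
                    os.remove(path)
                flagged += 1
    return flagged


def is_empty(path):
    """Stand-in check for the demo: treat zero-byte files as broken."""
    return os.path.getsize(path) == 0


# Demo on a throwaway folder; the log file is kept outside the scanned folder.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "images")
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "ok.png"), "wb") as fh:
        fh.write(b"\x89PNG\r\n\x1a\n")
    open(os.path.join(root, "sub", "bad.jpg"), "wb").close()

    log_path = os.path.join(tmp, "check_error.txt")
    count = remove_broken(root, is_empty, log_path, dry_run=False)
    remaining = sorted(os.listdir(root))
    print(count, remaining)
```

Running with `dry_run=True` first writes the log without deleting anything, which gives a chance to review the list before files are removed. The trade-off is that a generator has no known length, so a progress bar can show a running count but not a percentage.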