  • Batch-merging CSV files in Python and saving selected rows to a txt file

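    I found three ways to merge the files in bulk:

    • open(): read each file, then write its contents out
    • the pandas concat method
    • HDF5

    I tried the first two; the third is still to be tested. With the pandas concat method the merged columns came out misaligned, and I have not yet found the cause. The first approach works.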
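    The cause of the misalignment is not diagnosed in this post. One possible explanation, stated here only as an assumption, is that `pd.concat` aligns frames by column label, so files whose header rows differ (or that have no header row) end up with values in different columns. A minimal sketch that aligns by position instead; `concat_by_position` and its parameters are hypothetical names, and header rows are kept as ordinary data rows, just as in the byte-level merge:

```python
import glob

import pandas as pd


def concat_by_position(pattern, encoding="gb18030"):
    """Concatenate CSV files, aligning columns by position instead of by label."""
    frames = []
    for name in sorted(glob.glob(pattern)):
        # header=None gives every frame the integer column labels 0, 1, 2, ...
        # dtype=str keeps the values exactly as they are written in the files.
        frames.append(pd.read_csv(name, header=None, dtype=str, encoding=encoding))
    if not frames:
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, axis=0, ignore_index=True)
```

    Calling `concat_by_position('*.csv')` in the data directory would return one frame; whether this removes the misalignment depends on what actually caused it in the original files.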

    1. A few of the files are in xlsx format and must be converted to csv first; all the others are already csv. Conversion code:

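    import glob
    import os

    import pandas as pd


    def xlsx_to_csv():
        # Convert every .xlsx file in the current directory to a .csv file.
        for name in glob.glob('*.xlsx'):
            # index_col=0 uses the first column as the row index.
            rdata = pd.read_excel(name, index_col=0)
            # splitext drops only the final extension, even if the name has other dots.
            rdata.to_csv(os.path.splitext(name)[0] + '.csv', encoding='gb18030')


    if __name__ == '__main__':
        xlsx_to_csv()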
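
    One detail of the conversion step is how the output name is derived. Splitting on the first dot (`name.split('.')[0]`) keeps only the text before that dot, so a file name containing extra dots would be truncated, while `os.path.splitext` removes only the final extension. A small sketch of the difference; the helper `csv_name_for` and the sample file name are made up for illustration:

```python
import os


def csv_name_for(xlsx_name):
    """Return the .csv name for an .xlsx file, replacing only the final extension."""
    base, _ext = os.path.splitext(xlsx_name)
    return base + ".csv"


sample = "report.2020.04.xlsx"  # hypothetical file name
print(sample.split(".")[0] + ".csv")  # report.csv (text after the first dot is lost)
print(csv_name_for(sample))           # report.2020.04.csv
```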
    

      

    2. Code that merges the files in bulk and filters the rows:

    import os

    import pandas as pd


    class ArgsError(Exception):
        """Raised when an input file is missing or still open elsewhere."""


    class concatAndScreenData(object):
        def __init__(self, path=None):
            self.path = path
            self.person_list = []

        def set_path(self):
            # Default to the current working directory.
            if not self.path:
                self.path = os.getcwd()

        def concat_data(self):
            out_path = os.path.join(self.path, '数据源.csv')
            # Skip the output file itself so a rerun does not append it to itself.
            csv_list = [f for f in os.listdir(self.path)
                        if os.path.splitext(f)[1] == '.csv' and f != '数据源.csv']
            # Alternative: csv_list = glob.glob('*.csv')
            # pd.concat was also tried but produced misaligned columns (unresolved):
            # csv_concat = pd.concat([pd.read_csv(i, encoding='gb18030') for i in csv_list], axis=0, ignore_index=False)
            # csv_concat.to_csv('数据源.csv', index=0, encoding='gb18030', sep=',')
            # Append the raw bytes of every file, header rows included.
            with open(out_path, 'ab') as out:
                for name in csv_list:
                    with open(os.path.join(self.path, name), 'rb') as src:
                        out.write(src.read())

        def read_person(self):
            person_file = os.path.join(self.path, '人员名单.xlsx')
            if os.path.exists(person_file):
                # The original called xlrd.open_workbook; xlrd 2.x no longer reads
                # .xlsx files, so pandas.read_excel is used here instead.
                table = pd.read_excel(person_file, sheet_name='ty', header=None)
                # The names are in the first column of sheet 'ty'.
                self.person_list.extend(table.iloc[:, 0].tolist())
            return self.person_list

        def screen_data(self):
            src_path = os.path.join(self.path, '数据源.csv')
            if not os.path.exists(src_path):
                print('File does not exist!')
                return
            # header=None assigns integer column labels starting from 0.
            reader = pd.read_csv(src_path, header=None, chunksize=100000,
                                 encoding='gb18030', low_memory=False)
            with open(os.path.join(self.path, 'data.txt'), 'a', errors='ignore') as out:
                for chunk in reader:
                    for row in chunk.itertuples(index=False):
                        # Column 2 holds the person's name.
                        if row[2] in self.person_list:
                            # Write one comma-separated line per matching row.
                            out.write(','.join(str(v) for v in row) + '\n')

        def run(self, path=None):
            if path:
                self.path = path
            self.set_path()
            # Merge the files and load the person list.
            try:
                self.concat_data()
                self.read_person()
            except OSError as exc:
                # IOError and WindowsError are both aliases of OSError in Python 3.
                raise ArgsError('Wrong file path, or the file is still open') from exc
            # Filter the merged data.
            self.screen_data()


    if __name__ == '__main__':
        app = concatAndScreenData()
        app.run(path=None)
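
    The row-by-row loop in `screen_data` can be slow on large files. As an alternative sketch (not the method used in this post), each chunk can be filtered with `Series.isin` and appended to the output with `to_csv`. The function name `filter_rows` and its parameters are hypothetical; the name column index 2 and the `gb18030` input encoding follow the code above, while writing the output as UTF-8 is an assumption:

```python
import pandas as pd


def filter_rows(src, dst, names, name_col=2, chunksize=100000, encoding="gb18030"):
    """Append rows of src whose name_col value is in names to dst; return the row count."""
    wanted = set(names)
    kept = 0
    reader = pd.read_csv(src, header=None, dtype=str,
                         chunksize=chunksize, encoding=encoding)
    for chunk in reader:
        # Boolean mask: True where the name column matches a listed person.
        hits = chunk[chunk[name_col].isin(wanted)]
        # mode="a" appends, so rows from earlier chunks are preserved.
        # The output encoding is an assumption; the original used the locale default.
        hits.to_csv(dst, mode="a", header=False, index=False, encoding="utf-8")
        kept += len(hits)
    return kept
```

    For example, `filter_rows('数据源.csv', 'data.txt', person_list)` would write the matching rows in a single pass over the merged file. Unlike the loop above, empty cells are written as empty fields rather than the string `nan`.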
    

      

  • Original post: https://www.cnblogs.com/hqczsh/p/12811659.html