  • Data processing with shell, example ① (step-by-step tutorial edition)

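    Introduction:

    At work I often have to process all sorts of small data sets. Colleagues use a variety of tools for this, such as Java, Python, Perl, or even manual editing in UE, but none of them seems all that convenient.

    Today I will work through one example in shell, to show that a single line of shell can solve a problem that takes many lines in other languages.

    Requirement:

    There are two files, device_jingweidu.txt and device.txt, whose contents have the following format: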

    device_jingweidu.txt:

    08659100001310000171,08659100001110000171@QQY.SZF,1,三明尤溪水东水库坝上,118.210124,26.176165
    08659100001310000172,08659100001110000172@QQY.SZF,1,三明尤溪水东水库坝下,118.210124,26.176165
    08659100001310000365,08659100001110000365@QQY.SZF,1,北高镇东乡村码头水位监控,119.156551,25.382036
    08659100001310000607,08659100001110000607@QQY.SZF,1,海洋渔业无线接入摄像头,0,0
    08659100001310000621,08659100001110000621@QQY.SZF,1,福州连江山仔水库上游,119.423026,26.225916
    08659100001310000622,08659100001110000622@QQY.SZF,1,福州连江山仔水库下游,119.420331,26.230514
    08659100001310000696,08659100001110000696@QQY.SZF,1,南平沙溪口水库2,118.09497,26.589413
    08659100001310001026,08659100001110001026@QQY.SZF,1,077乌山路市政府门口B,119.303413,26.080324
    08659100001310001094,08659100001110001094@QQY.SZF,1,三明安砂水库坝上,117.119716,26.038324
    08659100001310001095,08659100001110001095@QQY.SZF,1,三明安砂水库坝下,117.119716,26.038324
    08659100001310001096,08659100001110001096@QQY.SZF,1,南平沙溪口水库1,118.09497,26.589413
    08659100001310001188,08659100001110001188@QQY.SZF,1,福鼎-沙埕渔港(省),120.424978,27.167177
    08659100001310001346,08659100001110001346@QQY.SZF,1,福清东张水库大坝上弧门旁,119.296539,25.70688
    08659100001310001356,08659100001110001356@QQY.SZF,1,惠女水库溢洪道桥,118.589045,25.099911
    08659100001310001392,08659100001110001392@QQY.SZF,1,福州_连江黄歧渔港,119.889881,26.324397
    08659100001310001393,08659100001110001393@QQY.SZF,1,福州_罗源磬里渔排区,119.723905,26.468112

    The last two columns are the longitude and latitude.

    device.txt:

    08659100001310001026@QQY.SZF,08659100001110001026@QQY.SZF,1,077乌山路市政府门口B,0,0
    08659100001310001346@QQY.SZF,08659100001110001346@QQY.SZF,1,福清东张水库大坝上弧门旁,0,0
    08659100001310001417@QQY.SZF,08659100001110001417@QQY.SZF,1,宁德_霞浦三沙渔港,0,0
    08659100001310001420@QQY.SZF,08659100001110001420@QQY.SZF,1,南平_九峰桥头,0,0
    08659100001310001544@QQY.SZF,08659100001110001544@QQY.SZF,1,(暂拆)湄洲镇湄洲岛渔港-010038,0,0
    08659100001310003450@QQY.SZF,08659100001110003450@QQY.SZF,1,092_北大路华林路口,0,0
    08659100001310003657@QQY.SZF,08659100001110003657@QQY.SZF,1,城厢区下尾渔港-010050,0,0
    08659100001310004612@QQY.SZF,08659100001110004612@QQY.SZF,1,01闽清水口水库水工楼楼顶,0,0
    08659100001310004613@QQY.SZF,08659100001110004613@QQY.SZF,1,02闽清水口水库船闸屋顶,0,0
    08659100001310005130@QQY.SZF,08659100001110005130@QQY.SZF,1,惠安崇武码头,0,0

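    Here the last two columns (longitude and latitude) are all 0.

    We need to compare device.txt against device_jingweidu.txt and update device.txt with the longitude and latitude.

    The requirement is simple: use the location name (the fourth column) as the key and write the matching longitude and latitude into device.txt.

    Implementation:

    The approach is simple: load the source file into memory, then read the file to be updated and compare line by line; where a match exists, replace the longitude and latitude.

    Let's first see how Python does it. It is simple enough: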

    #coding:utf-8
    '''
    Created on 2016-5-4
    @author: Administrator
    '''
    # Python 2 script: it relies on unicode() and byte-string file I/O.

    source_text=r'f:\Work\device(带经纬度).txt'
    uipath=unicode(source_text,'utf8')
    dest_text=r'F:\Work\device.txt'

    # Map each location name (4th column) to its "longitude,latitude" text.
    source_dict={}
    with open(uipath) as source:
        for line in source:
            new=line.split(',',4)
            source_dict[new[3]]=new[4].strip()

    # Raw string, so that "\n" in the path is not read as a newline.
    dd=open(r'f:\Work\new_devices.txt','wb')

    with open(dest_text) as dest:
        for line in dest:
            new=line.split(',')

            #print new[3],
            if new[3] in source_dict:
                # This runs on Windows, so lines end with "\r\n".
                dd.write(",".join(new[:-2])+","+source_dict[new[3]]+"\r\n")
            else:
                # No match: write the placeholder "0.0".
                dd.write(",".join(new[:-2])+","+"0.0"+"\r\n")

    dd.close()

    The result:

    08659100001310001026@QQY.SZF,08659100001110001026@QQY.SZF,1,077乌山路市政府门口B,119.303413,26.080324
    08659100001310001346@QQY.SZF,08659100001110001346@QQY.SZF,1,福清东张水库大坝上弧门旁,119.296539,25.70688
    08659100001310001417@QQY.SZF,08659100001110001417@QQY.SZF,1,宁德_霞浦三沙渔港,0.0
    08659100001310001420@QQY.SZF,08659100001110001420@QQY.SZF,1,南平_九峰桥头,118.189546,26.639557
    08659100001310001544@QQY.SZF,08659100001110001544@QQY.SZF,1,(暂拆)湄洲镇湄洲岛渔港-010038,0.0
    08659100001310003450@QQY.SZF,08659100001110003450@QQY.SZF,1,092_北大路华林路口,119.299107,26.104303
    08659100001310003657@QQY.SZF,08659100001110003657@QQY.SZF,1,城厢区下尾渔港-010050,0.0
    08659100001310004612@QQY.SZF,08659100001110004612@QQY.SZF,1,01闽清水口水库水工楼楼顶,118.839738,26.298528
    08659100001310004613@QQY.SZF,08659100001110004613@QQY.SZF,1,02闽清水口水库船闸屋顶,118.839738,26.298528

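    But this is too much trouble: writing a script every time some text needs processing, and what if Python is not installed?

    As it happens, the data is already on Linux, so let's process it directly in shell; awk is the first choice.

    How does awk handle two files?

    With NR and FNR. NR counts records across all input files, while FNR restarts at 1 for each file, as the output below shows: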

    # awk '{print NR,$0}' file1 file2
    1 a b c d
    2 a b d c
    3 a c b d
    4 aa bb cc dd
    5 aa bb dd cc
    6 aa cc bb dd
    
    # awk '{print FNR,$0}' file1 file2
    1 a b c d
    2 a b d c
    3 a c b d
    1 aa bb cc dd
    2 aa bb dd cc
    3 aa cc bb dd
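
    The difference between the two counters is what makes the NR==FNR test work. The following is a minimal sketch, not part of the original post: the temporary files and their contents are made up for illustration, and the variable name nr_demo is hypothetical.

```shell
# Create two small sample files (hypothetical contents, mirroring the demo above).
tmp=$(mktemp -d)
printf 'a b c d\na b d c\n' > "$tmp/file1"
printf 'aa bb cc dd\n' > "$tmp/file2"

# NR==FNR is true only while the first file is being read;
# "next" skips the second rule for those lines.
nr_demo=$(awk 'NR==FNR{print "first:", $0; next} {print "second:", $0}' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$nr_demo"

rm -rf "$tmp"
```

    One caveat with this idiom: if the first file is empty, NR==FNR is also true for the second file, so the test assumes the first file has at least one line.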

    Once we know this, our requirement becomes easy to handle:

    awk 'BEGIN{FS=OFS=","}NR==FNR{w[$4]=$5","$6}NR>FNR{for(a in w) if (a==$4){print $1,$2,$3,$4,w[a]}}' device_jingweidu.txt device.txt 

    The result:

    08659100001310001026@QQY.SZF,08659100001110001026@QQY.SZF,1,077乌山路市政府门口B,119.303413,26.080324
    08659100001310001346@QQY.SZF,08659100001110001346@QQY.SZF,1,福清东张水库大坝上弧门旁,119.296539,25.70688
    08659100001310001420@QQY.SZF,08659100001110001420@QQY.SZF,1,南平_九峰桥头,118.189546,26.639557
    08659100001310003450@QQY.SZF,08659100001110003450@QQY.SZF,1,092_北大路华林路口,119.299107,26.104303
    08659100001310004612@QQY.SZF,08659100001110004612@QQY.SZF,1,01闽清水口水库水工楼楼顶,118.839738,26.298528
    08659100001310004613@QQY.SZF,08659100001110004613@QQY.SZF,1,02闽清水口水库船闸屋顶,118.839738,26.298528

    Can't follow it? Let me explain:

    1. The structure of an awk program:

    awk 'BEGIN{action}
    pattern{action}
    END{action}'
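
    To make that structure concrete, here is a small sketch on made-up input (the three input lines and the variable name structure_demo are illustrative assumptions). BEGIN runs before any input is read, the /b/ block runs for each matching line, and END runs after the last line.

```shell
# BEGIN prints a header, /b/ counts the lines containing "b", END reports the count.
structure_demo=$(printf 'abc\nxyz\nbcd\n' | awk 'BEGIN{print "start"} /b/{n++} END{print "matched:", n}')
printf '%s\n' "$structure_demo"
```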

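    2. BEGIN{FS=OFS=","}: FS sets the input field separator and OFS the output field separator. For details, see my earlier blog post explaining FS, OFS, IFS and so on: http://www.cnblogs.com/landhu/p/4962521.html

    Every field here is separated by a comma, so we set both separators to "," up front.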
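
    Why set OFS as well as FS? When a field is assigned, awk rebuilds the line using OFS, which defaults to a space. A quick sketch on a made-up record (the values and the variable names fs_demo and no_ofs_demo are hypothetical):

```shell
# With FS and OFS both set to ",", the rebuilt line keeps its commas.
fs_demo=$(printf 'id1,name1,0,0\n' | awk 'BEGIN{FS=OFS=","} {$3="118.2"; $4="26.1"; print}')
printf '%s\n' "$fs_demo"

# With only FS set, assigning a field rebuilds the line with the default OFS (a space).
no_ofs_demo=$(printf 'id1,name1,0,0\n' | awk 'BEGIN{FS=","} {$3="118.2"; print}')
printf '%s\n' "$no_ofs_demo"
```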

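    3. NR==FNR{w[$4]=$5","$6}

    NR==FNR: as the NR and FNR demonstration above shows, NR==FNR holds only while the first file is being processed.

    w[$4]=$5","$6: this is an awk associative array. The key is the fourth column of the first file and the value is "$5,$6", stored in the array w.

    Associative arrays in awk are one of the harder parts of shell scripting and differ from arrays in other languages. For details, see: http://www.cnblogs.com/chengmo/archive/2010/10/08/1846190.html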
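
    As a small illustration of how such an array behaves, the sketch below stores coordinates under string keys and then tests membership with the "in" operator. The keys dam_a, dam_b and dam_x and the variable name assoc_demo are invented for this example.

```shell
# Build w[name] = "longitude,latitude", then look keys up in the END block.
# ("key" in w) tests for membership without creating the element.
assoc_demo=$(printf 'dam_a,118.21,26.17\ndam_b,119.42,26.22\n' |
  awk -F, '{w[$1]=$2","$3}
           END{ if ("dam_b" in w) print w["dam_b"]
                if (!("dam_x" in w)) print "dam_x missing" }')
printf '%s\n' "$assoc_demo"
```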

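    4. NR>FNR

    This holds while the second file is being processed.

    {for(a in w) if (a==$4){print $1,$2,$3,$4,w[a]}}

    This iterates over the associative array and compares each key with the fourth column of the second file; when they match, the line is printed with the stored longitude and latitude.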
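
    The loop visits every key for every line of the second file. Since the array is keyed by location name, the same matches can be found with a direct membership test. This is an alternative sketch, not the author's command; the file names src.txt and dst.txt and their contents are made up.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for device_jingweidu.txt and device.txt.
printf 'x1,y1,1,site_a,118.21,26.17\nx2,y2,1,site_b,119.42,26.22\n' > "$tmp/src.txt"
printf 'd1,e1,1,site_b,0,0\nd2,e2,1,site_c,0,0\n' > "$tmp/dst.txt"

# First file: fill the array and skip to the next line.
# Second file: print only lines whose 4th column is a key in w.
lookup_demo=$(awk 'BEGIN{FS=OFS=","}
  NR==FNR{w[$4]=$5","$6; next}
  $4 in w{print $1,$2,$3,$4,w[$4]}' "$tmp/src.txt" "$tmp/dst.txt")
printf '%s\n' "$lookup_demo"

rm -rf "$tmp"
```

    Like the loop version, this prints matched lines only; site_c has no entry in the array and is dropped.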

    Optimization

    The script works, but it has one problem: unmatched lines are not shown, and the script is a bit long. Let's optimize it:

    awk 'BEGIN{FS=OFS=","}NR==FNR{w[$4]=$5","$6}NR>FNR && gsub(/0,0$/,w[$4])' device_jingweidu.txt device.txt

    The result:

    08659100001310001026@QQY.SZF,08659100001110001026@QQY.SZF,1,077乌山路市政府门口B,119.303413,26.080324
    08659100001310001346@QQY.SZF,08659100001110001346@QQY.SZF,1,福清东张水库大坝上弧门旁,119.296539,25.70688
    08659100001310001417@QQY.SZF,08659100001110001417@QQY.SZF,1,宁德_霞浦三沙渔港,
    08659100001310001420@QQY.SZF,08659100001110001420@QQY.SZF,1,南平_九峰桥头,118.189546,26.639557
    08659100001310001544@QQY.SZF,08659100001110001544@QQY.SZF,1,(暂拆)湄洲镇湄洲岛渔港-010038,
    08659100001310003450@QQY.SZF,08659100001110003450@QQY.SZF,1,092_北大路华林路口,119.299107,26.104303

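    This is better: unmatched lines are still there, shown with an empty longitude and latitude.

    Explaining the script:

    It uses gsub, an awk built-in function for substitution, to replace the trailing 0,0 directly with the value from the array.

    Documentation on awk built-in functions: http://www.cnblogs.com/chengmo/archive/2010/10/08/1845913.html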
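
    Two details of that one-liner are easy to miss, so here is a sketch on made-up input (all sample lines and variable names are hypothetical). First, gsub returns the number of substitutions it made, and a bare expression used as a pattern prints the line whenever it is non-zero; that is why unmatched lines still appear, just with an empty value. Second, the `$` anchor in this sketch pins the match to the end of the line, so a name that ends in 0 is left alone.

```shell
# gsub returns how many substitutions were made on the line.
count_demo=$(printf 'a,0,0\nb,1,2\n' | awk '{n=gsub(/0,0$/,"X"); print n, $0}')
printf '%s\n' "$count_demo"

# As a bare pattern: only lines where gsub returned non-zero are printed.
pattern_demo=$(printf 'a,0,0\nb,1,2\n' | awk 'gsub(/0,0$/,"")')
printf '%s\n' "$pattern_demo"

# Anchored versus unanchored on a name that ends in 0.
anchored=$(printf 'p-010050,0,0\n' | awk 'gsub(/0,0$/,"")')
unanchored=$(printf 'p-010050,0,0\n' | awk 'gsub(/0,0/,"")')
printf '%s\n%s\n' "$anchored" "$unanchored"
```

    In the unanchored case the first "0,0" that awk finds starts at the final 0 of the name, so part of the name is eaten and a stray ",0" is left behind.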

  • Original article: https://www.cnblogs.com/landhu/p/5461006.html