  • Python regular expressions: a worked optimization example

    1. The problem

    We need to extract parameter names and parameter values from an XML file, where each entry looks like this:

    <p name="actOlLaPdcch">true</p>

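    The fields we need are shown above: the name attribute (actOlLaPdcch, highlighted in red in the original post) is the parameter name, and the element text (true, highlighted in blue) is the parameter value. The real document, of course, contains plenty of other content that gets in the way.

    The steps are: open the file, use a regular expression to find the parent element we need (r'<managedObject class="LNCEL"'), then start matching the parameters that follow it.

    A problem came up along the way. While debugging, I used a small sample like this: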

    <managedObject class="LNCEL" distName="MRBTS-421961/LNBTS-421961/LNCEL-1" operation="create" version="TL16A">
    <p name="a1TimeToTriggerDeactInterMeas">480ms</p>
    <p name="a2RedirectQci1">disabled</p>
    <p name="a2TimeToTriggerActGERANMeas">320ms</p>
    <p name="a2TimeToTriggerActInterFreqMeas">480ms</p>
    <p name="a2TimeToTriggerRedirect">1024ms</p>
    <p name="a3Offset">6</p>
    

    The regular expressions used were:

    p1 = r'((?<=")\w+(?="))'  # p1: regex for the parameter name
    p2 = r"(?<=>)\w+(?=<)"    # p2: regex for the parameter value
    pattern1 = re.compile(p1)
    pattern2 = re.compile(p2)
    get = pattern1.findall(line)[0]
    get2 = pattern2.findall(line)[0]
    lncel1_pa.append(get)
    lncel1_va.append(get2)
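
    The two patterns can be tried on a few of the sample lines in isolation. This is a minimal sketch, not the author's script: the variable names are made up for illustration, and the backslashes in \w are written out explicitly.

```python
import re

# Same lookbehind/lookahead patterns as above.
name_pattern = re.compile(r'(?<=")\w+(?=")')   # word characters between double quotes
value_pattern = re.compile(r"(?<=>)\w+(?=<)")  # word characters between > and <

sample_lines = [
    '<p name="a1TimeToTriggerDeactInterMeas">480ms</p>',
    '<p name="a2RedirectQci1">disabled</p>',
    '<p name="a3Offset">6</p>',
]

names = []
values = []
for line in sample_lines:
    # findall() returns a list; on these lines it holds exactly one item.
    names.append(name_pattern.findall(line)[0])
    values.append(value_pattern.findall(line)[0])

print(names)
print(values)
```

    On these three lines both lists are filled without error, which is why the problem did not show up during debugging.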
    

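    Debugging against the sample went perfectly well: matching digits and letters was all that was needed. But when the script was run against the whole document, it failed with the following error: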

      File "E:/pgtool/Files/filesprocess.py", line 32, in <module>
        get2=pattern2.findall(line)[0]

    IndexError: list index out of range

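    This means that even index 0 was out of range while extracting a parameter value. In other words, the list returned by findall was empty. So what went wrong?

    Stepping through the code line by line with Spyder's debugger finally revealed the cause. To extract values we use the regular expression r"(?<=>)\w+(?=<)", which captures the letters and digits between the greater-than and less-than signs, for example 108 in <p name="eutraCelId">108</p>. Some parameter values, however, contain spaces, and not always the same number of them: some have a single space, others have several words separated by spaces, as shown here: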

    <p name="dlsOldtcTarget">6 db</p>
    <p name="dlsUsePartPrb">PRBs with PSS or SSS and PBCH used</p>

    Because \w does not match a space, the regular expression finds nothing on these lines.
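
    One way to locate such lines without stepping through a debugger is to collect every line on which the pattern finds nothing, instead of indexing [0] directly. The following is a sketch under that assumption; the list name unmatched is hypothetical.

```python
import re

value_pattern = re.compile(r"(?<=>)\w+(?=<)")  # the original value pattern

lines = [
    '<p name="eutraCelId">108</p>',
    '<p name="dlsOldtcTarget">6 db</p>',
    '<p name="dlsUsePartPrb">PRBs with PSS or SSS and PBCH used</p>',
]

# Keep the lines for which findall() returns an empty list,
# i.e. the ones where indexing [0] would raise IndexError.
unmatched = [line for line in lines if not value_pattern.findall(line)]

for line in unmatched:
    print("no value matched:", line)
```

    Only the two lines whose values contain spaces are reported.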

    2. Improvements made for this problem

    To allow for spaces, the first revision of the regular expression added an optional space: \w+ ?

    p2 = r"(?<=>)\w+(?=<)"  # p2: regex for the parameter value
    changed to -->
    p2 = r"(?<=>)\w+ ?(?=<)"  # p2: regex for the parameter value

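    This revision only tolerates at most one space directly after a single run of word characters, so it is still far too narrow. A value made of several space-separated words, such as <p name="dlsUsePartPrb">PRBs with PSS or SSS and PBCH used</p>, is not matched. Strictly speaking, even <p name="dlsOldtcTarget">6 db</p> is not matched by \w+ ? as written, because the word after the space is not covered; that would need something like \w+ ?\w*. So a second revision changed the value pattern to .*, meaning that the value between the greater-than and less-than signs may be any characters other than a newline. That resolved the problem.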

    p2 = r"(?<=>)\w+(?=<)"  # p2: regex for the parameter value
    changed to -->
    p2 = r"(?<=>).*(?=<)"  # p2: regex for the parameter value
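
    The effect of each revision can be checked side by side. This is an illustrative sketch: the dictionary keys are just labels chosen here, and the two test lines are taken from the examples in this article.

```python
import re

candidates = {
    "original": re.compile(r"(?<=>)\w+(?=<)"),
    "first revision": re.compile(r"(?<=>)\w+ ?(?=<)"),
    "second revision": re.compile(r"(?<=>).*(?=<)"),
}

lines = [
    '<p name="eutraCelId">108</p>',
    '<p name="dlsUsePartPrb">PRBs with PSS or SSS and PBCH used</p>',
]

# For every pattern, record what findall() returns on each line.
results = {}
for label, pattern in candidates.items():
    results[label] = [pattern.findall(line) for line in lines]
    print(label, results[label])
```

    Only the .* version returns a value for the multi-word line; the other two return an empty list there.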

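    3. Further improvement

      Of course, .* is greedy. If two entries ever run together on one line, it would match from the first greater-than sign all the way to the last less-than sign. To avoid this, we add a ? after .* to make the match non-greedy.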

    p2 = r"(?<=>)\w+(?=<)"  # p2: regex for the parameter value
    changed to -->
    p2 = r"(?<=>).*?(?=<)"  # p2: regex for the parameter value; the extra ? avoids greedy matching
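
    The difference between the greedy and non-greedy forms shows up as soon as two elements share a line. The input below is a hypothetical line built for illustration; it does not come from the real file.

```python
import re

greedy = re.compile(r"(?<=>).*(?=<)")
lazy = re.compile(r"(?<=>).*?(?=<)")

# Hypothetical input: two <p> elements that ended up on the same line.
line = '<p name="a3Offset">6</p><p name="lcrId">1</p>'

# Take the first match, as the script does with findall(line)[0].
greedy_first = greedy.findall(line)[0]
lazy_first = lazy.findall(line)[0]

print(greedy_first)  # runs on to the last "<" on the line
print(lazy_first)    # stops at the first "<"
```

    The greedy pattern swallows the closing tag and the whole second element, while the non-greedy one returns just 6.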

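      At present the complete code can only extract the parameters under a single LNCEL. A later revision will extend it to the other LNCELs. The rough idea is to put the parameters extracted for each LNCEL into its own DataFrame, join the DataFrames horizontally, and finally export the result to Excel.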
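
    A possible shape for that multi-LNCEL step is sketched below. It is not the author's follow-up code: the function collect_cells, both patterns, and the second LNCEL block in the sample are assumptions made for illustration, and the result is kept as plain nested dictionaries rather than DataFrames.

```python
import re

# Hypothetical patterns: one for the opening LNCEL tag, one for a <p> entry.
cell_start = re.compile(r'<managedObject class="LNCEL".*?LNBTS-(\d+)/LNCEL(-\d+)')
param = re.compile(r'<p name="(\w+)">(.*?)</p>')

def collect_cells(lines):
    """Return {cell_id: {parameter_name: value}} for every LNCEL block."""
    cells = {}
    current = None  # id of the LNCEL block being read, if any
    for raw in lines:
        line = raw.strip()
        start = cell_start.match(line)
        if start:
            current = start.group(1) + start.group(2)
            cells[current] = {}
        elif line.startswith("</managedObject>"):
            current = None  # keep scanning instead of stopping at the first block
        elif current is not None:
            found = param.match(line)
            if found:
                cells[current][found.group(1)] = found.group(2)
    return cells

# Two-cell input; the second block is made up for illustration.
sample = """\
<managedObject class="LNCEL" distName="MRBTS-421961/LNBTS-421961/LNCEL-1" operation="create" version="TL16A">
<p name="a3Offset">6</p>
<p name="ulsSchedMethod">channel aware</p>
</managedObject>
<managedObject class="LNCEL" distName="MRBTS-421961/LNBTS-421961/LNCEL-2" operation="create" version="TL16A">
<p name="a3Offset">4</p>
<p name="ulsSchedMethod">channel aware</p>
</managedObject>
""".splitlines()

cells = collect_cells(sample)
print(cells)
```

    Passing such a dictionary of dictionaries to pd.DataFrame gives one column per cell with the parameter names as the index, which is the horizontally joined layout described above. One limitation of this sketch: parameter names that repeat inside one LNCEL (for example drxInactivityT in several drxProfile lists) overwrite each other in the inner dictionary.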

    4. Complete code

    Code:

    # -*- coding: utf-8 -*-
    """
    Created on Thu Jul 12 15:19:37 2018
    @author: lishuixing_nok
    """
    import re
    import pandas as pd

    # Compile the patterns once, outside the loop.
    cell_pattern = re.compile(r"(LNBTS-)(\d+)(/LNCEL)(-\d+)")
    pattern1 = re.compile(r'((?<=")\w+(?="))')  # parameter name
    pattern2 = re.compile(r"(?<=>).*?(?=<)")    # parameter value (non-greedy)

    lncel1_pa = []  # parameter names
    lncel1_va = []  # parameter values
    cels = []       # cell identifiers, e.g. "421961-1"

    with open('gudi.xml', encoding='utf-8') as fp:
        for line in fp:
            if re.match(r'<managedObject class="LNCEL"', line):
                m = cell_pattern.search(line)
                cels.append(m.group(2) + m.group(4))
            elif re.match(r"</managedObject>", line):
                break  # stop at the first closing tag: only one LNCEL is handled
            elif re.match(r"<p name=", line):
                lncel1_pa.append(pattern1.findall(line)[0])
                lncel1_va.append(pattern2.findall(line)[0])

    # print(lncel1_pa, '\n', lncel1_va)  # print the parameter names and values
    pd1 = pd.DataFrame(lncel1_va, index=lncel1_pa, columns=cels)
    pd1.index.name = 'parameter'
    print(pd1)
    pd1.to_excel(r'E:\pgtool\Files\lncelvalue.xlsx', sheet_name='lncel')
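
    The column label used for the DataFrame comes from the distName attribute of the opening tag. The short sketch below isolates that step; the names cell_pattern and cell_id are chosen here for illustration.

```python
import re

# Groups: 1 = "LNBTS-", 2 = base station number, 3 = "/LNCEL", 4 = "-" plus cell number.
cell_pattern = re.compile(r"(LNBTS-)(\d+)(/LNCEL)(-\d+)")

line = ('<managedObject class="LNCEL" '
        'distName="MRBTS-421961/LNBTS-421961/LNCEL-1" '
        'operation="create" version="TL16A">')

match = cell_pattern.search(line)
# Joining groups 2 and 4 gives "<station number>-<cell number>".
cell_id = match.group(2) + match.group(4)
print(cell_id)
```

    Searching once and reusing the match object avoids running the same regular expression twice on the line.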

    Sample of the document being parsed:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <raml xmlns="raml21.xsd" version="2.1">
    <cmData id="3221225472" scope="all" type="plan">
    <header>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T09:50:29" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T09:55:00" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T10:15:18" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T10:27:07" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T11:15:18" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T11:56:23" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T12:15:18" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T13:05:55" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T16:15:18" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T16:17:01" user=""/>
    <log action="create" appInfo="" appVersion="" dateTime="2018-07-09T16:21:01" user=""/>
    <managedObject class="LNCEL" distName="MRBTS-421961/LNBTS-421961/LNCEL-1" operation="create" version="TL16A">
    <p name="a1TimeToTriggerDeactInterMeas">480ms</p>
    <p name="a2RedirectQci1">disabled</p>
    <p name="a2TimeToTriggerActGERANMeas">320ms</p>
    <p name="a2TimeToTriggerActInterFreqMeas">480ms</p>
    <p name="a2TimeToTriggerRedirect">1024ms</p>
    <p name="a3Offset">6</p>
    <p name="a3ReportInterval">640ms</p>
    <p name="a3TimeToTrigger">320ms</p>
    <p name="a5ReportInterval">640ms</p>
    <p name="a5TimeToTrigger">640ms</p>
    <p name="cqiPerSbCycK">1</p>
    <p name="cqiPerSimulAck">false</p>
    <p name="csgType">openAccess</p>
    <p name="dSrTransMax">64n</p>
    <p name="dlsOldtcTarget">98</p>
    <p name="dlsUsePartPrb">PRBs with PSS or SSS and PBCH used</p>
    <p name="eUlLaAtbPeriod">30</p>
    <p name="eUlLaBlerAveWin">30</p>
    <p name="eUlLaDeltaMcs">3</p>
    <p name="eUlLaLowMcsThr">1</p>
    <p name="eUlLaLowPrbThr">1</p>
    <p name="eUlLaPrbIncDecFactor">16</p>
    <p name="enableAmcPdcch">true</p>
    <p name="filterCoefficientRSRQ">fc4</p>
    <p name="filterCoefficientRSSI">fc2</p>
    <p name="gbrCongHandling">l2andl3</p>
    <p name="grpAssigPUSCH">0</p>
    <p name="harqMaxMsg3">4</p>
    <p name="harqMaxTrDl">5</p>
    <p name="harqMaxTrUlTtiBundling">8</p>
    <p name="harqMaxTxUl">7</p>
    <p name="hopModePusch">interSubFrame</p>
    <p name="hysA3Offset">0</p>
    <p name="hysThreshold2GERAN">2</p>
    <p name="hysThreshold2InterFreq">0</p>
    <p name="hysThreshold2Wcdma">0</p>
    <p name="hysThreshold2a">0</p>
    <p name="hysThreshold3">0</p>
    <p name="hysThreshold4">0</p>
    <p name="idleLBPercCaUe">0</p>
    <p name="idleLBPercentageOfUes">0</p>
    <p name="ilReacTimerUl">50</p>
    <p name="inactivityTimer">10</p>
    <p name="iniMcsDl">6</p>
    <p name="lbLoadFilCoeff">4</p>
    <p name="lcrId">1</p>

    <p name="ulsSchedMethod">channel aware</p>
    <list name="dFListPucch">
    <item>
    <p name="dFpucchF1">0</p>
    <p name="dFpucchF1b">3</p>
    <p name="dFpucchF2">1</p>
    <p name="dFpucchF2a">2</p>
    <p name="dFpucchF2b">2</p>
    </item>
    </list>
    <list name="drxProfile102">
    <item>
    <p name="drxInactivityT">2560</p>
    <p name="drxLongCycle">1280ms</p>
    <p name="drxOnDuratT">10</p>
    <p name="drxProfileIndex">102</p>
    <p name="drxProfilePriority">30</p>
    <p name="drxRetransT">16</p>
    </item>
    </list>
    <list name="drxProfile103">
    <item>
    <p name="drxInactivityT">10</p>
    <p name="drxLongCycle">160ms</p>
    <p name="drxOnDuratT">6</p>
    <p name="drxProfileIndex">103</p>
    <p name="drxProfilePriority">30</p>
    <p name="drxRetransT">16</p>
    </item>
    </list>
    <list name="mimoClConfig">
    <item>
    <p name="mimoClCqiThD">70</p>
    <p name="mimoClCqiThU">90</p>
    <p name="mimoClRiThD">28</p>
    <p name="mimoClRiThU">32</p>
    </item>
    </list>
    <list name="mimoOlConfig">
    <item>
    <p name="mimoOlCqiThD">50</p>
    <p name="mimoOlCqiThU">60</p>
    <p name="mimoOlRiThD">20</p>
    <p name="mimoOlRiThU">21</p>
    </item>
    </list>
    <list name="qci1eVTTConfig">
    <item>
    <p name="qci1DlTargetBler">30</p>
    <p name="qci1HarqMaxTrDl">5</p>
    <p name="qci1HarqMaxTrUl">5</p>
    <p name="qci1ReconStopTimer">10</p>
    <p name="qci1ThroughputFactorDl">16</p>
    <p name="qci1ThroughputFactorUl">16</p>
    <p name="qci1UlTargetBler">3</p>
    </item>
    </list>
    <list name="ulpcPucchConfig">
    <item>
    <p name="ulpcLowlevCch">-103</p>
    <p name="ulpcLowqualCch">1</p>
    <p name="ulpcUplevCch">-98</p>
    <p name="ulpcUpqualCch">4</p>
    </item>
    </list>
    <list name="ulpcPuschConfig">
    <item>
    <p name="ulpcLowlevSch">-103</p>
    <p name="ulpcLowqualSch">18</p>
    <p name="ulpcUplevSch">-96</p>
    <p name="ulpcUpqualSch">20</p>
    </item>
    </list>
    <list name="antCablingMappingConfig">
    <item>
    <p name="rruPort1AntId">0</p>
    <p name="rruPort2AntId">1</p>
    <p name="rruPort3AntId">2</p>
    <p name="rruPort4AntId">3</p>
    <p name="rruPort5AntId">4</p>
    <p name="rruPort6AntId">5</p>
    <p name="rruPort7AntId">6</p>
    <p name="rruPort8AntId">7</p>
    </item>
    </list>
    <list name="drxProfile101">
    <item>
    <p name="drxRetransT">4</p>
    <p name="drxInactivityT">10</p>
    <p name="drxLongCycle">40ms</p>
    <p name="drxOnDuratT">6</p>
    <p name="drxProfileIndex">101</p>
    <p name="drxProfilePriority">30</p>
    </item>
    </list>
    <list name="furtherPlmnIdL">
    <item>
    <p name="autoRemovalAllowed">false</p>
    <p name="cellReserve">Not Reserved</p>
    <p name="mcc">460</p>
    <p name="mnc">29</p>
    <p name="mncLength">2</p>
    </item>
    </list>
    <list name="redBwRbUlConfig">
    <item>
    <p name="redBwMaxRbUl">100</p>
    <p name="redBwMinRbUl">1</p>
    </item>
    </list>
    <list name="srsDlMimoModeDepConf">
    <item>
    <p name="beamformingType">nonBeamforming</p>
    <p name="srsBandwidth">2hbw</p>
    <p name="srsUePeriodicity">40ms</p>
    </item>
    </list>
    <list name="tmSwitchThresholdDef">
    <item>
    <p name="tm3to7CqiTh">40</p>
    <p name="tm8to3CqiTh">70</p>
    <p name="tm9to3CqiTh">150</p>
    </item>
    </list>
    </managedObject>

  • Original article: https://www.cnblogs.com/mrtop/p/9303866.html