zoukankan      html  css  js  c++  java
  • 14. 类似正则表达式的字符处理问题

    SQL Serve提供了简单的字符模糊匹配功能,比如:like, patindex,不过对于某些字符处理场景还显得并不足够,日常碰到的几个问题有:

    • 1. 同一个字符/字符串,出现了多少次
    • 2. 同一个字符,第N次出现的位置
    • 3. 多个相同字符连续,合并为一个字符
    • 4. 是否为有效IP/身份证号/手机号等
    • 5. 去除所有数字/字母

     

    . 同一个字符/字符串,出现了多少次

    同一个字符,将其替换为空串,即可计算

    declare @text varchar(1000)
    declare @str  varchar(10)
    set @text = 'ABCBDBE'
    set @str = 'B'
    
    select len(@text) - len(replace(@text,@str,''))

    同一个字符串,仍然是替换,因为是多个字符,方法1替换后需要做一次除法;方法2替换时增加一个字符,则不需要

    --方法1
    declare @text varchar(1000)
    declare @str  varchar(10)
    set @text = 'ABBBCBBBDBBBE'
    set @str = 'BBB'
    
    select (len(@text) - len(replace(@text,@str,'')))/len(@str)
    
    --方法2
    declare @text varchar(1000)
    declare @str  varchar(10)
    set @text = 'ABBBCBBBDBBBE'
    set @str = 'BBB'
    
    select len(replace(@text,@str,@str+'_')) - len(@text)

    . 同一个字符/字符串,第N次出现的位置

    SQL SERVER定位字符位置的函数为CHARINDEX:

    CHARINDEX ( expressionToFind , expressionToSearch [ , start_location ] )

    可以从指定位置起开始检索,但是不能取第N次出现的位置,需要自己写SQL来补充,有以下几种思路:

    1. 自定义函数, 循环中每次为charindex加一个计数,直到为N

    if object_id('NthChar','FN') is not null
        drop function Nthchar
    GO
    
    create function NthChar
    (
    @source_string as nvarchar(4000), 
    @sub_string    as nvarchar(1024),
    @nth           as int
    ) 
    returns int 
    as 
    begin 
        declare @postion int 
        declare @count   int 
    
        set @postion = CHARINDEX(@sub_string, @source_string) 
        set @count = 0 
      
        while @postion > 0 
        begin 
            set @count = @count + 1 
            if @count = @nth 
            begin 
                break 
            end
            set @postion = CHARINDEX(@sub_string, @source_string, @postion + 1) 
        End 
        return @postion 
    end 
    GO
    
    --select dbo.NthChar('abcabc','abc',2)
    --4

    2. 通过CTE,对待处理的整个表字段操作, 递归中每次为charindex加一个计数,直到为N

    if  object_id('tempdb..#T') is not null
        drop table #T
    
    create table #T
    (
    source_string nvarchar(4000)
    )
    
    insert into #T values (N'我们我们')
    insert into #T values (N'我我哦我')
    
    declare @sub_string nvarchar(1024)
    declare @nth        int
    set @sub_string = N'我们'
    set @nth = 2
    
    ;with T(source_string, starts, pos, nth) 
    as (
        select source_string, 1, charindex(@sub_string, source_string), 1 from #t
        union all
        select source_string, pos + 1, charindex(@sub_string, source_string, pos + 1), nth+1 from T
        where pos > 0
    )
    select 
        source_string, pos, nth
    from T
    where pos <> 0
      and nth = @nth
    order by source_string, starts
    
    --source_string    pos    nth
    --我们我们    3    2

    3. 借助数字表 (tally table),到不同起点位置去做charindex,需要先自己构造个数字表

    --numbers/tally table
    IF EXISTS (select * from dbo.sysobjects where id = object_id(N'[dbo].[Numbers]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
        DROP TABLE dbo.Numbers
    
    --===== Create and populate the Tally table on the fly
     SELECT TOP 1000000 
            IDENTITY(int,1,1) AS number
       INTO dbo.Numbers
       FROM master.dbo.syscolumns sc1,
            master.dbo.syscolumns sc2
    
    --===== Add a Primary Key to maximize performance
      ALTER TABLE dbo.Numbers
            ADD CONSTRAINT PK_numbers_number PRIMARY KEY CLUSTERED (number)
    
    --===== Allow the general public to use it
      GRANT SELECT ON dbo.Numbers TO PUBLIC
    
    --以上数字表创建一次即可,不需要每次都重复创建
    
    DECLARE @source_string    nvarchar(4000), 
            @sub_string       nvarchar(1024), 
            @nth              int
    SET @source_string = 'abcabcvvvvabc'
    SET @sub_string = 'abc'
    SET @nth = 2 
    
    ;WITH T 
    AS
    (            
    SELECT ROW_NUMBER() OVER(ORDER BY number) AS nth,
           number AS [Position In String]
      FROM dbo.Numbers n 
     WHERE n.number <= LEN(@source_string)    
       AND CHARINDEX(@sub_string, @source_string, n.number)-number = 0
       ----OR
       --AND SUBSTRING(@source_string,number,LEN(@sub_string)) = @sub_string
    ) 
    SELECT * FROM T WHERE nth = @nth

    4. 通过CROSS APPLY结合charindex,适用于N值较小的时候,因为CROSS APPLY的次数要随着N的变大而增加,语句也要做相应的修改

    declare @T table
    (
    source_string nvarchar(4000)
    )
    
    insert into @T values
    ('abcabc'),
    ('abcabcvvvvabc')
    
    declare @sub_string nvarchar(1024)
    set @sub_string = 'abc'
    
    select source_string,
           p1.pos as no1,
           p2.pos as no2,
           p3.pos as no3
    from @T
    cross apply (select (charindex(@sub_string, source_string))) as P1(Pos)
    cross apply (select (charindex(@sub_string, source_string, P1.Pos+1))) as P2(Pos)
    cross apply (select (charindex(@sub_string, source_string, P2.Pos+1))) as P3(Pos)

    5. 在SSIS里有内置的函数,但T-SQL中并没有

    --FINDSTRING in SQL Server 2005 SSIS
    FINDSTRING([yourColumn], "|", 2),
    
    --TOKEN in SQL Server 2012 SSIS
    TOKEN(Col1,"|",3)

    注:不难发现,这些方法和字符串拆分的逻辑是类似的,只不过一个是定位,一个是截取,如果要获取第N个字符左右的一个/多个字符,有了N的位置,再结合substring去截取即可;

    . 多个相同字符连续,合并为一个字符

    最常见的就是把多个连续的空格合并为一个空格,解决思路有两个:

    1. 比较容易想到的就是用多个replace

    但是究竟需要replace多少次并不确定,所以还得循环多次才行

    --把两个连续空格替换成一个空格,然后循环,直到charindex检查不到两个连续空格
    declare @str varchar(100)
    set @str='abc        abc     kljlk     kljkl'
    while(charindex('  ',@str)>0)
    begin
        select @str=replace(@str,'  ',' ')
    end
    select @str

    2. 按照空格把字符串拆开

    对每一段拆分开的字符串trim或者replace后,再用一个空格连接,有点繁琐,没写代码示例,如何拆分字符串可参考:“第N次出现的位置”;

    . 是否为有效IP/身份证号/手机号等

    类似IP/身份证号/手机号等这些字符串,往往都有自身特定的规律,通过substring去逐位或逐段判断是可以的,但SQL语句的方式往往性能不佳,建议尝试正则函数,见下。

    . 去除所有数字/字母

    上面定位第N个字符的方式也都可以尝试使用,只是在定位的同时将数字/字母替换掉即可;

    另外也可以通过模糊匹配来做:

    --去除所有非数字
    declare @pos smallint
    declare @string varchar(100)
    
    set @string = '1109A><":{$%^&*4DSE2@!~$%^&567KJHGT'
    
    --如果是1102e53这样的字符串,会被isnumeric认为是科学计数法数字,加上e0就会返回0
    --如果是普通数字,加上e0也不会影响返回1
    while isnumeric(@string+'e0') = 0 --也可同下用patindex判断
    begin
        set @pos = (select patindex('%[^0-9]%',@string))
        set @string = (select replace(@string,substring(@string,@pos,1),''))
    end
    
    select @string
    
    --去除所有数字
    declare @pos smallint
    declare @string varchar(100)
    
    set @string = '1109A><":{$%^&*4DSE2@!~$%^&567KJHGT'
    
    while patindex('%[0-9]%',@string) > 0
    begin
        set @pos = (select patindex('%[0-9]%',@string))
        set @string = (select replace(@string,substring(@string,@pos,1),''))
    end
    
    select @string
    
    --去除所有字母
    declare @pos smallint
    declare @string varchar(100)
    
    set @string = '1109A><":{$%^&*4DSE2@!~$%^&567KJHGT'
    
    while patindex('%[A-Za-z]%',@string) > 0
    begin
        set @pos = (select patindex('%[A-Za-z]%',@string))
        set @string = (select replace(@string,substring(@string,@pos,1),''))
    end
    
    select @string

    . 正则表达式函数

    1. Oracle

    从10g开始,可以在查询中使用正则表达式,它通过一些支持正则表达式的函数来实现:

    Oracle 10 g

    REGEXP_LIKE

    REGEXP_REPLACE

    REGEXP_INSTR

    REGEXP_SUBSTR

    Oracle 11g (新增)

    REGEXP_COUNT

    Oracle用REGEXP函数处理上面几个问题:

    (1) 同一个字符/字符串,出现了多少次

    select length(regexp_replace('123-345-566', '[^-]', '')) from dual;
    select REGEXP_COUNT('123-345-566', '-') from dual; --Oracle 11g

    (2) 同一个字符/字符串,第N次出现的位置

    不需要正则,ORACLE的instr可以直接查找位置:

    instr('source_string','sub_string' [,n][,m])

    n表示从第n个字符开始搜索,缺省值为1,m表示第m次出现,缺省值为1。

    select instr('abcdefghijkabc','abc', 1, 2) position from dual; 

    (3) 多个相同字符连续,合并为一个字符

    select regexp_replace(trim('agc f   f  '),'s+',' ') from dual;

    (4) 是否为有效IP/身份证号/手机号等

    --是否为有效IP
    WITH IP
    AS(
    SELECT '10.20.30.40' ip_address FROM dual UNION ALL
    SELECT 'a.b.c.d' ip_address FROM dual UNION ALL
    SELECT '256.123.0.254' ip_address FROM dual UNION ALL
    SELECT '255.255.255.255' ip_address FROM dual
    )
    SELECT *
    FROM IP
    WHERE REGEXP_LIKE(ip_address, '^(([0-9]{1}|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]).){3}([0-9]{1}|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$');
    --是否为有效身份证/手机号,暂未举例

    (5) 去除所有数字/字母

    --去除非数字
    SELECT regex_replace('1109A><":{$%^&*4DSE2@!~$%^&567KJHGT','[^0-9]','') FROM dual;
    
    --去除数字
    SELECT regex_replace('1109A><":{$%^&*4DSE2@!~$%^&567KJHGT','[0-9]','') FROM dual;
    
    --去除字母
    SELECT regex_replace('1109A><":{$%^&*4DSE2@!~$%^&567KJHGT','[A-Za-z]','') FROM dual;

    2. SQL Server

    目前最新版本为SQL Server 2017,还没有对REGEXP函数的支持,需要通用CLR来扩展,如下为CLR实现REG_REPLACE:

    --1. 开启 CLR 
    EXEC sp_configure 'show advanced options' , '1'
    GO
    RECONFIGURE
    GO
    EXEC sp_configure 'clr enabled' , '1'
    GO
    RECONFIGURE
    GO
    EXEC sp_configure 'show advanced options' , '0';
    GO
    --首次创建时,应该是从dll文件创建
    --CREATE ASSEMBLY [RegexUtility] FROM 'C:xxxxxxx.dll'
    --GO
    
    --创建好后,可以生成脚本出来部署到其他支持CLR的SQL Server上
    CREATE ASSEMBLY [RegexUtility]
    FROM 0x4D5A90000300000004000000FFFF0000B800000000000000400000000000000000000000000000000000000000000000000000000000000000000000800000000E1FBA0E00B409CD21B8014CCD21546869732070726F6772616D2063616E6E6F742062652072756E20696E20444F53206D6F64652E0D0D0A2400000000000000504500004C0103002073E9470000000000000000E0000E210B010800000A00000006000000000000EE2900000020000000400000000040000020000000020000040000000000000004000000000000000080000000020000000000000300400500001000001000000000100000100000000000001000000000000000000000009C2900004F000000004000003803000000000000000000000000000000000000006000000C00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000080000000000000000000000082000004800000000000000000000002E74657874000000F409000000200000000A000000020000000000000000000000000000200000602E72737263000000380300000040000000040000000C0000000000000000000000000000400000402E72656C6F6300000C0000000060000000020000001000000000000000000000000000004000004200000000000000000000000000000000D0290000000000004800000002000500EC200000B008000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007A0F00280E00000A0F01280E00000A280F00000A6F1000000A281100000A2A001330030047000000010000110F01280E00000A731200000A0A060F00280E00000A6F1300000A6F1000000A2C200F00280E00000A0F01280E00000A0F02280E00000A281400000A281500000A2A7E1600000A2A820F00280E00000A0F01280E00000A0F02280E00000A281400000A281500000A2A1E02281800000A2A42534A4201000100000000000C00000076322E302E35303732370000000005006C0000007C020000237E0000E8020000F402000023537472696E677300000000DC0500000800000023555300E4050000100000002347554944000000F4050000BC02000023426C6F620000000000000002000001471502000900000000FA0133001600000100000014000000020000000400000008000000180000000D00000001000000010000000300000000000A000100000000000600380031000A0060004B000A006B004B000600B900A7000600F100D20006000501A70006001E01A70006003901A70006005401A70006006D01A70006008601A7000600A501A7000600C201A7000600F901D90106001902D9010A00520237020E009E027F020E00A4027F020E00AA027F020A00E10237020000000001000000000001000100010010001B000000050001000100502000000000960075000A0001007020000000009600800013000300C3200000000096008D0013000600E420000000008618A1001E000900000001006702000002006D02000001006702000002006D0200000300C802000001006702000002006D0200000300C8022100A10022002900A10027003100A10022003900A10022004100A10022004900A10022005100A10022005900A10022006100A10022006900A10022007100A1002C007900A1001E008100A1001E0019007502D5008900A402D9009900B002E0001100BC02E4008900A10022008900A402EA008900D402F0001900BC02F7001900DC02FD00A100A1001E000900A1001E0020006B0031002E0033007F022E003B0079022E0063009A022E00230079022E002B0079022E00430079022E00130079022E004B0079022E0053007F022E005B00910240006B0031006000BB00060101010480000001000000BE0B28760000000000001B000000020000000000000000000000010028000000000002000000000000000000000001003F0000000000020000000000000000000000010031000000000000000000003C4D6F64756C653E0052656765785574696C6974792E646C6C0052656765785574696C697479006D73636F726C69620053797374656D004F626A6563740053797374656D2E446174610053797374656D2E446174612E53716C54797065730053716C426F6F6C65616E0053716C537472696E670052656765784D617463680052656765785265706C6163650052656765785265706C61636544656661756C74002E63746F720053797374656D2E5265666C656374696F6E00417373656D626C7956657273696F6E4174747269627574650053797374656D2E52756E74696D652E496E7465726F70536572766963657300436F6D56697369626C6541747472696275746500417373656D626C7943756C7475726541747472696275746500417373656D626C7954726164656D61726B41747472696275746500417373656D626C79436F7079726967687441747472696275746500417373656D626C7950726F6475637441747472696275746500417373656D626C79436F6D70616E7941747472696275746500417373656D626C79436F6E66696775726174696F6E41747472696275746500417373656D626C794465736372697074696F6E41747472696275746500417373656D626C795469746C654174747269627574650053797374656D2E52756E74696D652E436F6D70696C6572536572766963657300436F6D70696C6174696F6E52656C61786174696F6E734174747269627574650052756E74696D65436F6D7061746962696C697479417474726962757465004D6963726F736F66742E53716C5365727665722E5365727665720053716C46756E6374696F6E41747472696275746500696E707574007061747465726E006765745F56616C75650053797374656D2E546578742E526567756C617245787072657373696F6E73005265676578004D617463680047726F7570006765745F53756363657373006F705F496D706C69636974007265706C6163656D656E74005265706C616365004E756C6C0053716C4D6574686F64417474726962757465000003200000000000253F930E81E4D84AAC7FD0EB9C3404A30008B77A5C561934E0890800021109110D110D0A0003110D110D110D110D03200001042001010E0420010102042001010880A20100020054020F497344657465726D696E6973746963015455794D6963726F736F66742E53716C5365727665722E5365727665722E446174614163636573734B696E642C2053797374656D2E446174612C2056657273696F6E3D322E302E302E302C2043756C747572653D6E65757472616C2C205075626C69634B6579546F6B656E3D623737613563353631393334653038390A44617461416363657373000000000320000E06000212490E0E0320000205000111090205200112490E0600030E0E0E0E050001110D0E0306110D040701124581710100060054020F497344657465726D696E6973746963015402094973507265636973650154020A4F6E4E756C6C43616C6C005455794D6963726F736F66742E53716C5365727665722E5365727665722E446174614163636573734B696E642C2053797374656D2E446174612C2056657273696F6E3D322E302E302E302C2043756C747572653D6E65757472616C2C205075626C69634B6579546F6B656E3D623737613563353631393334653038390A446174614163636573730000000054557F4D6963726F736F66742E53716C5365727665722E5365727665722E53797374656D446174614163636573734B696E642C2053797374656D2E446174612C2056657273696F6E3D322E302E302E302C2043756C747572653D6E65757472616C2C205075626C69634B6579546F6B656E3D623737613563353631393334653038391053797374656D4461746141636365737300000000540E044E616D651572656765785F7265706C6163655F64656661756C740501000000001101000C52656765785574696C69747900000801000800000000001E01000100540216577261704E6F6E457863657074696F6E5468726F777301000000C42900000000000000000000DE290000002000000000000000000000000000000000000000000000D0290000000000000000000000005F436F72446C6C4D61696E006D73636F7265652E646C6C0000000000FF250020400000000000000000000000000000000000000000000000000000000100100000001800008000000000000000000000000000000100010000003000008000000000000000000000000000000100000000004800000058400000E00200000000000000000000E00234000000560053005F00560045005200530049004F004E005F0049004E0046004F0000000000BD04EFFE00000100000001002876BE0B000001002876BE0B3F000000000000000400000002000000000000000000000000000000440000000100560061007200460069006C00650049006E0066006F00000000002400040000005400720061006E0073006C006100740069006F006E00000000000000B00440020000010053007400720069006E006700460069006C00650049006E0066006F0000001C020000010030003000300030003000340062003000000044000D000100460069006C0065004400650073006300720069007000740069006F006E0000000000520065006700650078005500740069006C006900740079000000000040000F000100460069006C006500560065007200730069006F006E000000000031002E0030002E0033003000300036002E00330030003200340038000000000044001100010049006E007400650072006E0061006C004E0061006D0065000000520065006700650078005500740069006C006900740079002E0064006C006C00000000002800020001004C006500670061006C0043006F0070007900720069006700680074000000200000004C00110001004F0072006900670069006E0061006C00460069006C0065006E0061006D0065000000520065006700650078005500740069006C006900740079002E0064006C006C00000000003C000D000100500072006F0064007500630074004E0061006D00650000000000520065006700650078005500740069006C006900740079000000000044000F000100500072006F006400750063007400560065007200730069006F006E00000031002E0030002E0033003000300036002E00330030003200340038000000000048000F00010041007300730065006D0062006C0079002000560065007200730069006F006E00000031002E0030002E0033003000300036002E0033003000320034003800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000C000000F03900000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
    WITH PERMISSION_SET = SAFE
    
    GO
    2. 创建 Assembly
    --3. 创建 CLR 函数
    CREATE FUNCTION [dbo].[regex_replace](@input [nvarchar](4000), @pattern [nvarchar](4000), @replacement [nvarchar](4000))
    RETURNS [nvarchar](4000) WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
    AS 
    EXTERNAL NAME [RegexUtility].[RegexUtility].[RegexReplaceDefault]
    GO
    --4. 使用regex_replace替换多个空格为一个空格
    select dbo.regex_replace('agc f   f  ','s+',' ');
    --5. 使用regex_replace去除数字/字母
    --去除非数字
    select dbo.regex_replace('1109A><":{$%^&*4DSE2@!~$%^&567KJHGT','[^0-9]','');
    
    --去除数字
    select dbo.regex_replace('1109A><":{$%^&*4DSE2@!~$%^&567KJHGT','[0-9]','');
    
    --去除字母
    select dbo.regex_replace('1109A><":{$%^&*4DSE2@!~$%^&567KJHGT','[A-Za-z]','');

    注:通过CLR实现更多REGEXP函数,如果有高级语言开发能力,可以自行开发;或者直接使用一些开源贡献也行,比如:http://devnambi.com/2016/sql-server-regex/

    简单的正则表达式符号使用说明:

    f -> 匹配一个换页

    -> 匹配一个换行符

    -> 匹配一个回车符

    -> 匹配一个制表符

    v -> 匹配一个垂直制表符

    s 匹配任何空白字符,包括空格、制表符、换页符等等, 等价于[ f v]

    s+ 匹配任意多个空白字符

    S 匹配非空白字符

     

    ^ 匹配最前边的字符

    $ 与^类似,匹配最末的字符

    + 匹配+号前面的字符1次或n次.等价于{1,}

    * 匹配*前面的字符0次或n次

    ? 匹配?前面的字符0次或1次

    . (小数点)匹配除换行符外的所有单个的字符

     

    小结:

    1. 非正则SQL语句的思路,对不同数据库往往都适用;

    2. 正则表达式中的规则(pattern) 在不同开发语言里,有很多语法是相通的,通常是遵守perl或者linux shell中的sed等工具的规则;

    3. 从性能上来看,非循环写法的通用SQL判断 > REGEXP函数 > 自定义SQL函数;

    4. SQL SERVER 除了CLR的方式扩展正则表达式函数,也可以开启OLE Automation通过VBS来扩展正式表达式函数,曾经测试过性能,比CLR要差很多。

    参考:

    regular expression enhancements in 11g

    http://www.oracle-developer.net/display.php?id=508

    Using Regular Expressions With Oracle Database

    https://docs.oracle.com/cd/B12037_01/appdev.101/b10795/adfns_re.htm

    Replace non numeric characters in string

    https://www.sqlservercentral.com/Forums/Topic470379-338-1.aspx

    IsNumeric, IsInt, IsNumber

    https://www.tek-tips.com/faqs.cfm?fid=6423

  • 相关阅读:
    UVA 10618 Tango Tango Insurrection
    UVA 10118 Free Candies
    HDU 1024 Max Sum Plus Plus
    POJ 1984 Navigation Nightmare
    CODEVS 3546 矩阵链乘法
    UVA 1625 Color Length
    UVA 1347 Tour
    UVA 437 The Tower of Babylon
    UVA 1622 Robot
    UVA127-"Accordian" Patience(模拟)
  • 原文地址:https://www.cnblogs.com/seusoftware/p/7760086.html
Copyright © 2011-2022 走看看