  • Python regular expressions

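    Regular expressions

    A regular expression is a string made up of ordinary text and special characters. It describes a pattern that can match a whole family of strings.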

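    Special symbols and characters

    Notation      Description                                                                 Example
    literal       Matches the literal text of the string                                      foo
    re1|re2       Matches re1 or re2                                                          foo|bar
    .             Matches any character except \n                                             b.b
    ^             Matches the start of the string                                             ^a (starts with a)
    $             Matches the end of the string                                               ^/bin/*sh$
    *             Matches 0 or more occurrences of the preceding regex                        [0-9]*
    +             Matches 1 or more occurrences of the preceding regex                        [0-9]+
    ?             Matches 0 or 1 occurrence of the preceding regex                            [0-9]?
    {N}           Matches exactly N occurrences of the preceding regex                        [0-9]{3}
    {M,N}         Matches M to N occurrences of the preceding regex                           [0-9]{3,7}
    [...]         Matches any single character in the brackets                                [abc]
    [..x-y..]     Matches any single character in the range x to y                            [0-9a-zA-Z]
    [^...]        Matches any single character not in the brackets                            [^0-9a-zA-Z]
    (*|+|?|{})?   Non-greedy versions of the repetition symbols (*, +, ?, {})                 .*?[a-z]
    (...)         Matches the enclosed regex and saves it as a subgroup for later extraction  ([0-9]{3})?, f(oo|u)bar
    \d            Matches a decimal digit, same as [0-9]; \D is the opposite                  data\d+.txt
    \w            Matches a word character, same as [A-Za-z0-9_]; \W is the opposite          [A-Za-z0-9]\w+
    \s            Matches a whitespace character, same as [ \n\t\r\v\f]; \S is the opposite   of\sthe
    \b            Matches a word boundary; \B is the opposite                                 \bThe\b
    \N            Matches saved subgroup N (see (...) above)                                  Price: \16
    \c            Matches the special character c literally                                   \., \\, \*
    \A (\Z)       Matches the start (end) of the string (see ^ and $)                         \ADear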
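
    A few of the notations above can be tried directly in the interpreter. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not from the original article.

```python
import re

# Greedy vs. non-greedy repetition: .* grabs as much as possible, .*? as little.
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.*>", html)
lazy = re.findall(r"<.*?>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Groups save part of a match; \1 refers back to what group 1 matched.
doubled = re.search(r"\b(\w+)\s+\1\b", "this is is a test")
print(doubled.group(1))  # is

# \d for digits, \b for word boundaries, \. for a literal dot.
filename = re.search(r"\bdata\d+\.txt\b", "see data42.txt for details")
print(filename.group())  # data42.txt
```

    Writing patterns as raw strings (r"...") keeps the backslashes intact, which matters for \b, \d, \1 and friends.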

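    Flags:

    Flag                   Description
    re.I, re.IGNORECASE    Case-insensitive matching.
    re.L, re.LOCALE        Makes \w, \W, \b, \B, \s and \S depend on the current locale.
    re.M, re.MULTILINE     ^ and $ match at the start and end of each line of the target string, not only at the start and end of the whole string.
    re.S, re.DOTALL        "." normally matches any single character except \n; with this flag "." matches every character, including \n.
    re.X, re.VERBOSE       Whitespace and # (together with the rest of that line) are ignored in the pattern unless escaped with a backslash or placed inside a character class. This allows comments in the pattern and improves readability.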
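
    The effect of the most common flags can be seen on a small multi-line string. This is only an illustrative sketch; the sample text and the variable names are assumptions, not part of the original article.

```python
import re

text = "first line\nSecond LINE\nthird line"

# re.I: case-insensitive matching.
any_case = re.findall(r"line", text, flags=re.I)
print(any_case)  # ['line', 'LINE', 'line']

# re.M: ^ matches at the start of every line, not only at the start of the string.
string_start = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, flags=re.M)
print(string_start)  # ['first']
print(line_starts)   # ['first', 'Second', 'third']

# re.S: "." also matches the newline character.
without_dotall = re.search(r"first.*third", text)
with_dotall = re.search(r"first.*third", text, flags=re.S)
print(without_dotall)           # None
print(with_dotall is not None)  # True

# re.X: whitespace and # comments inside the pattern are ignored.
date_pat = re.compile(r"""
    (\d+) /   # month
    (\d+) /   # day
    (\d+)     # year
""", re.X)
date_parts = date_pat.match("11/27/2012").groups()
print(date_parts)  # ('11', '27', '2012')
```

    Several flags can be combined with the | operator, for example re.I | re.M.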


    1.1 re.compile

    re.compile(pattern, flags=0)
    Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
    The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
    The sequence
    prog = re.compile(pattern)
    result = prog.match(string)
    is equivalent to
    result = re.match(pattern, string)
    but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
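    If you plan to do a lot of matching and searching, it is best to compile the regular expression first so that it can be reused. The module-level functions cache recently compiled patterns, so they do not cost much performance, but a precompiled pattern still saves the cache lookup and some extra processing.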
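
    As a rough sketch of that advice, a pattern can be compiled once and then reused across many strings. The list of lines below is invented for the example.

```python
import re

# Compile once, outside the loop, and reuse the pattern object.
date_pat = re.compile(r"\d+/\d+/\d+")

lines = ["Today is 11/27/2012.", "no date here", "PyCon starts 3/13/2013"]

found = []
for line in lines:
    m = date_pat.search(line)
    if m:  # search() returns None when nothing matches
        found.append(m.group())

print(found)  # ['11/27/2012', '3/13/2013']
```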

    1.2 re.search

    re.search(pattern, string, flags=0)
    Scans the whole string and returns the first match; returns None if nothing matches.
    Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
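
    The difference between "no match" and "zero-length match" is easy to miss, so here is a small sketch of both cases. The sample text is borrowed from the date examples later in the article; the variable names are assumptions.

```python
import re

text = "Today is 11/27/2012. PyCon starts 3/13/2013"

# search() finds the first match anywhere in the string.
first = re.search(r"\d+/\d+/\d+", text)
print(first.group())  # 11/27/2012
print(first.span())   # (9, 19)

# No position matches: the result is None, so test it before calling .group().
missing = re.search(r"\d{5}", text)
print(missing)  # None

# A zero-length match is still a match object, not None.
empty = re.search(r"x*", text)
print(empty.span())  # (0, 0)
```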

    1.3 re.match

    re.match(pattern, string, flags=0)
    Matches from the start of the string; returns a match object on success, otherwise None.
    If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
    Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
    For example:
    import re
    # Match a date string of the form month/day/year
    text = '11/27/2012'
    if re.match(r'\d+/\d+/\d+', text):
        print('yes')
    else:
        print('no')
    m = re.match(r'\d+/\d+/\d+', text)
    print(m.group())  # 11/27/2012

    1.4 re.fullmatch

    re.fullmatch(pattern, string, flags=0)
    If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
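
    A short comparison may make the difference between match(), fullmatch() and search() clearer. This is a sketch with made-up input strings.

```python
import re

pat = re.compile(r"\d+/\d+/\d+")

# match() only requires the pattern to match at the beginning of the string.
prefix_ok = pat.match("11/27/2012abc") is not None
# fullmatch() requires the pattern to cover the entire string.
full_with_tail = pat.fullmatch("11/27/2012abc") is not None
full_exact = pat.fullmatch("11/27/2012") is not None
# match() does not skip ahead; search() does.
match_later = pat.match("on 11/27/2012") is not None
search_later = pat.search("on 11/27/2012") is not None

print(prefix_ok, full_with_tail, full_exact)  # True False True
print(match_later, search_later)              # False True
```

    fullmatch() is a convenient choice for validating a whole input, since it removes the need to add ^ and $ to the pattern by hand.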

    1.5 re.split

    re.split(pattern, string, maxsplit=0, flags=0)
    Splits the string into a list, using matches of the pattern as separators, and returns that list. At most maxsplit splits are made.
    Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
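
    The three behaviours described above (plain split, capturing groups, maxsplit) can be sketched as follows. The sample line is an assumption chosen to mix several separators.

```python
import re

line = "asdf fjdk; afed, fjek,asdf,   foo"

# Split on a semicolon, comma or whitespace, followed by any extra whitespace.
fields = re.split(r"[;,\s]\s*", line)
print(fields)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

# With a capturing group, the separators are returned as well.
with_seps = re.split(r"(;|,|\s)\s*", line)
print(with_seps)  # ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

# maxsplit=2: two splits, then the rest of the string as the last element.
limited = re.split(r"[;,\s]\s*", line, maxsplit=2)
print(limited)  # ['asdf', 'fjdk', 'afed, fjek,asdf,   foo']
```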

    1.6 re.findall

    re.findall(pattern, string, flags=0)
    Finds every occurrence of the pattern in the string and returns the matches as a list.
    Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
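
    How the return value changes with the number of groups is easiest to see side by side. A minimal sketch, reusing the date text from the re.sub examples:

```python
import re

text = "Today is 11/27/2012. PyCon starts 3/13/2013"

# No groups: a list of the full matches.
whole = re.findall(r"\d+/\d+/\d+", text)
print(whole)  # ['11/27/2012', '3/13/2013']

# One group: a list of that group's text only.
years = re.findall(r"\d+/\d+/(\d+)", text)
print(years)  # ['2012', '2013']

# Several groups: a list of tuples, one tuple per match.
parts = re.findall(r"(\d+)/(\d+)/(\d+)", text)
print(parts)  # [('11', '27', '2012'), ('3', '13', '2013')]
```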

    1.7 re.finditer

    re.finditer(pattern, string, flags=0)
    Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
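
    Unlike findall(), finditer() yields match objects, so positions and groups are both available. A small sketch, again using the date text from the other examples; the result list is only there to collect what the loop sees.

```python
import re

text = "Today is 11/27/2012. PyCon starts 3/13/2013"

results = []
for m in re.finditer(r"(\d+)/(\d+)/(\d+)", text):
    month, day, year = m.groups()
    # m.span() gives the (start, end) position of this match in text.
    results.append((m.span(), f"{year}-{month}-{day}"))

print(results)  # [((9, 19), '2012-11-27'), ((34, 43), '2013-3-13')]
```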

    1.8 re.sub

    re.sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
    If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:
    >>> def dashrepl(matchobj):
    ...     if matchobj.group(0) == '-': return ' '
    ...     else: return '-'
    >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
    'pro--gram files'
    >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
    'Baked Beans & Spam'
    Example 1: # Replace every date with 'today'
    import re
    text1 = 'Today is 11/27/2012. PyCon starts 3/13/2013'
    datePat = re.compile(r'\d+/\d+/\d+')
    m = datePat.sub('today', text1)
    print(m)  # Today is today. PyCon starts today
    Example 2: # Reformat the dates, turning 11/27/2012 into 2012-11-27
    text1 = 'Today is 11/27/2012. PyCon starts 3/13/2013'
    datePat = re.compile(r'(\d+)/(\d+)/(\d+)')
    m = datePat.sub(r'\3-\1-\2', text1)
    print(m)  # Today is 2012-11-27. PyCon starts 2013-3-13
    
    Example 3: # For more complex substitutions, pass a replacement callback function
    def dashrepl(matchobj):
        print(matchobj.group(0))  # --   --  -
        if matchobj.group(0) == '-':
            return ' '
        else:
            return '-'
    m = re.sub('-{1,2}', dashrepl, 'pro----gram-files')
    print(m)  # pro--gram files
    Example 4: # Another callback: choose the replacement by the case of the matched text
    import re
    text = 'UPPER PYTHON, lower python, Mixed Python'
    def matchcase(word):
        print(word)
        # <re.Match object; span=(6, 12), match='PYTHON'>
        # <re.Match object; span=(20, 26), match='python'>
        # <re.Match object; span=(34, 40), match='Python'>
        if word.group() == 'PYTHON':
            return 'SNAKE'
        elif word.group() == 'python':
            return 'snake'
        elif word.group() == 'Python':
            return 'Snake'
    m = re.sub('python', matchcase, text, flags=re.IGNORECASE)
    print(m)  # UPPER SNAKE, lower snake, Mixed Snake
    Example 5: # General version
    import re
    text = 'UPPER PYTHON, lower python, Mixed Python'
    def matchcase(word):
        # word is 'snake'
        def replace(m):
            # <re.Match object; span=(6, 12), match='PYTHON'>
            # <re.Match object; span=(20, 26), match='python'>
            # <re.Match object; span=(34, 40), match='Python'>
            text = m.group()  # PYTHON python Python
            if text.isupper():
                return word.upper()
            elif text.islower():
                return word.lower()
            elif text[0].isupper():
                return word.capitalize()
            else:
                return word
        return replace
    m = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
    print(m)  # UPPER SNAKE, lower snake, Mixed Snake

    1.9 re.subn

    re.subn(pattern, repl, string, count=0, flags=0)
    Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
    Example 1: # Reformat the dates, turning 11/27/2012 into 2012-11-27, and count the replacements
    import re
    text1 = 'Today is 11/27/2012. PyCon starts 3/13/2013'
    datePat = re.compile(r'(\d+)/(\d+)/(\d+)')
    m, n = datePat.subn(r'\3-\1-\2', text1)
    print(m)  # Today is 2012-11-27. PyCon starts 2013-3-13
    print(n)  # 2

    1.10 re.escape

    re.escape(string)
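    Escape special characters in string. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. (Before Python 3.7, every character except ASCII letters, numbers and '_' was escaped; since 3.7 only characters that can have a special meaning in a regular expression are escaped.)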
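
    A typical use is searching for user-supplied text that happens to contain metacharacters. The sketch below uses an invented needle string to show the difference between the escaped and unescaped forms.

```python
import re

needle = "a.b*c"  # contains the metacharacters . and *
literal_pat = re.compile(re.escape(needle))

# After escaping, only the exact text matches.
hit_exact = literal_pat.search("xx a.b*c yy") is not None
hit_other = literal_pat.search("xx aXbbc yy") is not None
print(hit_exact, hit_other)  # True False

# Without escaping, the same text is interpreted as a pattern.
unescaped_hit = re.search(needle, "xx aXbbc yy") is not None
print(unescaped_hit)  # True
```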

    1.11 re.purge

    re.purge()
    Clear the regular expression cache.

    1.12 Common regular expressions

    IP address:

    ^(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}$

    Mobile phone number:

    ^1[3458][0-9]\d{8}$

    Email:

    [a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+

    Collapse runs of whitespace into a single space:
    re.sub(r"[\x00-\x20]+", " ", value).strip()
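
    The patterns above can be wrapped in small helper functions and checked against a few inputs. This is only a sketch: the helper names and test strings are hypothetical, and the phone pattern is written with the character class [3458] (a | inside square brackets would be matched literally).

```python
import re

IP_PAT = re.compile(
    r"^(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}$"
)
PHONE_PAT = re.compile(r"^1[3458][0-9]\d{8}$")
EMAIL_PAT = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+")


def is_ipv4(value):
    """Return True if value looks like a dotted-quad IPv4 address."""
    return IP_PAT.match(value) is not None


def is_mobile(value):
    """Return True if value is an 11-digit number accepted by PHONE_PAT."""
    return PHONE_PAT.match(value) is not None


def is_email(value):
    """Return True if the whole value matches the simple email pattern."""
    return EMAIL_PAT.fullmatch(value) is not None


def squeeze_spaces(value):
    """Collapse runs of control characters and spaces into one space."""
    return re.sub(r"[\x00-\x20]+", " ", value).strip()


print(is_ipv4("192.168.1.1"), is_ipv4("256.1.1.1"))      # True False
print(is_mobile("13812345678"), is_mobile("12812345678"))  # True False
print(is_email("user_1@example.com"), is_email("user@"))   # True False
print(squeeze_spaces("  a \t\n b  "))                      # a b
```

    These patterns are deliberately simple; real-world email and phone validation usually needs looser or more up-to-date rules.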
  • Original article: https://www.cnblogs.com/xiaoming279/p/6372764.html