    Dive into RE in Python

    Python's standard re module is a powerful tool for text manipulation such as searching, matching and splitting, and it is worth learning before such tasks come up. Since the official documentation of re is a bit terse, I mainly follow the book 《Mastering Regular Expression》 by Felix Lopez to sketch the whole picture of regular expressions, for anyone who wants an in-depth view of re but lacks systematic learning material.

    import re
    

    Chapter 1 Basic syntax of regular expressions

    1.1 Literal and metacharacter

    Regular expressions consist of two components: (1) literals, for example 'file', '.xml', '145'; (2) metacharacters, for example '?' and '*', which have a special meaning in a regular expression. The following 12 metacharacters must be escaped with a backslash if they are to be used with their literal meaning:

    • Backslash \
    • Caret ^
    • Dollar sign $
    • Dot .
    • Pipe symbol |
    • Question mark ?
    • Asterisk *
    • Plus sign +
    • Opening square bracket [
    • Opening parenthesis (
    • Closing parenthesis )
    • Opening curly brace {

    The meaning of these metacharacters will be illustrated by code examples later.
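
As a minimal sketch of what escaping changes, the snippet below compares an unescaped dot with an escaped one. The sample text and variable names are made up for illustration.

```python
import re

text = 'file.xml fileAxml'

# Unescaped: '.' is a metacharacter and matches any character.
unescaped = re.findall(r'file.xml', text)

# Escaped: '\.' matches only a literal dot.
escaped = re.findall(r'file\.xml', text)

print(unescaped)  # ['file.xml', 'fileAxml']
print(escaped)    # ['file.xml']
```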

    1.2 Character classes

    A character class is a set of characters. There are 2 main kinds of character classes:
    (1) User-defined, using the metacharacters '[]'. For example, 'licen[cs]e' will match 'license' or 'licence'. Furthermore, '-' can be used inside '[]' to denote a range of characters: [a-z] matches any lowercase letter, and [a-z123] matches any lowercase letter plus '1', '2' and '3'. '^' as the first character inside '[]' negates the set.
    (2) Predefined

    • '.' matches any character except newline
    • '\d' matches any decimal digit, equivalent to [0-9]
    • '\D' matches any non-digit character, equivalent to [^0-9]
    • '\s' matches any whitespace character, equivalent to [ \t\n\r\f\v]
    • '\S' matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
    • '\w' matches any alphanumeric character, equivalent to [a-zA-Z0-9_] (including '_')
    • '\W' matches any non-alphanumeric character, equivalent to [^a-zA-Z0-9_]
    re.match(r'\w','2a_d').group(0)
    
    '2'
    
    re.match(r'\w','_ad').group(0) # '\w' includes '_'
    
    '_'
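
The user-defined classes described above (a plain set, a range, and a negated set) can be sketched as follows. The sample strings are invented for illustration.

```python
import re

# A plain set: either 'c' or 's' in the fifth position.
spellings = re.findall(r'licen[cs]e', 'license licence licenze')

# A range combined with single characters.
lower_or_123 = re.findall(r'[a-z123]', 'aB1x9')

# A negated set: anything that is not a digit.
not_digits = re.findall(r'[^0-9]', 'a1b2')

print(spellings)     # ['license', 'licence']
print(lower_or_123)  # ['a', '1', 'x']
print(not_digits)    # ['a', 'b']
```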
    

    1.3 Alternation

    A character class is a set of characters, while alternation is a set of regular expressions separated by '|'. One thing to note: inside a bigger regular expression, we will probably need to wrap the alternation in parentheses to express that only that part is alternated and not the whole expression. For instance, 'License:yes|no' matches either 'License:yes' or 'no', not 'License:' followed by 'yes' or 'no'; to achieve the latter, use 'License:(yes|no)'.

    re.findall(r'License:yes|no',r'...License:yes,no..')
    
    ['License:yes', 'no']
    
    re.findall(r'License:(yes|no)',r'...License:yes,no...')
    
    ['yes']
    
    re.findall(r'License:(no|yes)','...License:yes,no...')
    
    ['yes']
    
    help(re.findall)
    
    Help on function findall in module re:
    
    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
        
        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.
        
        Empty matches are included in the result.
    

    1.4 Quantifiers

    Quantifiers are the mechanism that determines how many times a character, metacharacter, character class or group can be repeated.

    • '?' means optional (0 or 1 repetitions)
    • '*' means zero or more times
    • '+' means one or more times
    • '{n,m}' means n to m times; either bound can be omitted: '{n,}' means at least n times and '{,m}' means at most m times
    re.findall(r'\d+',r'..123a6bcd45...') # find runs of one or more decimal digits
    
    ['123', '6', '45']
    
    re.findall(r'\d{1}',r'...123a6bcd45...') # find exactly one decimal digit, equivalent to re.findall(r'\d',r'...123a6bcd45...')
    
    ['1', '2', '3', '6', '4', '5']
    
    re.findall(r'\d',r'...123a6bcd45...') # the same as above
    
    ['1', '2', '3', '6', '4', '5']
    
    re.findall(r'\d{1,4}',r'...123a6bcd45...') # find one to four decimal digits
    
    ['123', '6', '45']
    
    re.findall(r'\d{2}',r'...123a6bcd45...') # find exactly 2 decimal digits
    
    ['12', '45']
    
    re.findall(r'\d{2,}',r'...123a6bcd45...') # find at least 2 decimal digits
    
    ['123', '45']
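
The examples above cover '+' and the '{n,m}' forms with a lower bound. As a small sketch of the two remaining cases, '?' and '{,m}', the snippet below uses invented sample strings and re.fullmatch so that the whole input has to fit the pattern.

```python
import re

# '?' makes the preceding 'u' optional (0 or 1 times).
optional = re.findall(r'colou?r', 'color colour colouur')

# '{,2}' allows at most 2 digits; fullmatch requires the whole string to match.
two_ok = re.fullmatch(r'\d{,2}', '12') is not None
three_ok = re.fullmatch(r'\d{,2}', '123') is not None

print(optional)  # ['color', 'colour']
print(two_ok)    # True
print(three_ok)  # False
```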
    

    1.5 Greedy and non-greedy behaviour

    Greedy behaviour tries to match as much as possible, giving the longest possible match, while non-greedy behaviour does the opposite and tries to match as little as possible, giving the shortest possible match.
    For example, given '##html## tab ##xmml##', '##.*##' matches the whole string, which is greedy mode, while '##.*?##' matches just '##html##', which is non-greedy mode.

    re.match(r'##.*##','##html## tab ##xmml##').group(0)
    
    '##html## tab ##xmml##'
    
    re.match(r'##.*?##','##html## tab ##xmml##').group(0)
    
    '##html##'
    

    1.6 Boundary matchers

    Until now, we have just tried to find regular expressions anywhere within a text. Sometimes, when we need to match a whole line, we also need to match at the beginning of a line or at the end. This can be done thanks to boundary matchers.

    • ^ matches the beginning of a line (only the beginning of the string unless the re.M flag is set, see 2.4)
    • $ matches the end of a line (only the end of the string unless the re.M flag is set)
    • \b matches a word boundary
    • \B is the opposite of \b: it matches anything that is not a word boundary
    • \A matches the beginning of the input
    • \Z matches the end of the input
    re.findall(r'\d+',r'123efg3')
    
    ['123', '3']
    
    re.findall(r'^\d+',r'123efg3') # find decimal digits at the beginning of a line
    
    ['123']
    
    re.findall(r'\d+$',r'123efg3') # find decimal digits at the end of a line
    
    ['3']
    
    re.findall(r'\w+',r'@..abc-123%ef5d#$')
    
    ['abc', '123', 'ef5d']
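
The examples above exercise '^' and '$'. As a rough sketch of the other boundary matchers in the list (\b, \B, \A and \Z), the snippet below uses made-up sample text.

```python
import re

text = 'cat concat cats'

# \b on both sides: only the standalone word 'cat'.
whole_word = [m.span() for m in re.finditer(r'\bcat\b', text)]

# \B before 'cat': 'cat' that does not start a word (inside 'concat').
inside_word = [m.span() for m in re.finditer(r'\Bcat', text)]

lines = 'one\ntwo'
# With re.M, '^' matches at every line start, while \A and \Z stay tied to the whole input.
every_line = re.findall(r'^\w+', lines, re.M)
input_start = re.findall(r'\A\w+', lines, re.M)
input_end = re.findall(r'\w+\Z', lines, re.M)

print(whole_word)   # [(0, 3)]
print(inside_word)  # [(7, 10)]
print(every_line, input_start, input_end)  # ['one', 'two'] ['one'] ['two']
```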
    

    Chapter 2 Regular expression with Python

    In the previous chapter, the basic syntax of regular expressions was introduced with some simple code for illustration. This chapter puts the emphasis on regular expressions in Python.
    We can either use the module-level functions directly or compile a pattern and use the methods of the resulting object. The reason to use a compiled pattern object is to avoid compiling the same pattern again on every call; note that the re module also caches recently compiled patterns, so later module-level calls with the same pattern can reuse them.

    re.match('.*','Hello re').group(0) # directly using module to perform match task
    
    'Hello re'
    
    patt=re.compile(r'.*')
    patt.match('Hello re').group(0) #using pattern obj to perform match task
    
    'Hello re'
    

    There are 2 kinds of objects in the re module: (1) the Pattern Object, representing a compiled regular expression; (2) the Match Object, representing a successful match.

    2.1 Pattern Object

    pattern=re.compile(r'<')
    type(pattern)
    
    _sre.SRE_Pattern
    

    Method match

    help(pattern.match)
    
    Help on built-in function match:
    
    match(string=None, pos=0, endpos=9223372036854775807, *, pattern=None) method of _sre.SRE_Pattern instance
        Matches zero or more characters at the beginning of the string.
    
    help(re.match)
    
    Help on function match in module re:
    
    match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found.
    

    The key point is that this method tries to match the compiled pattern only at the beginning of the string. If there is a match, it returns a MatchObject; otherwise it returns None.

    pattern.match('<HTML>') # Returns a MatchObject if the pattern matches at the beginning of the string
    
    <_sre.SRE_Match object; span=(0, 1), match='<'>
    
    type(pattern.match('  <HTML>')) # Returns None if the pattern does not match at the beginning of the string
    
    NoneType
    
    pattern.search('  <HTML>') # search does not require the match to be at the beginning
    
    <_sre.SRE_Match object; span=(2, 3), match='<'>
    

    The optional parameter pos specifies where to start searching, as shown in the following code:

    pattern.match('  <HTML>',2)
    
    <_sre.SRE_Match object; span=(2, 3), match='<'>
    

    But a pos bigger than 0 does not mean that the string starts at that index: '^' still refers to the real beginning of the string, as the following code shows:

    pattern2=re.compile(r'^<HTML>')
    pattern2.match(r'<HTML>')
    
    <_sre.SRE_Match object; span=(0, 6), match='<HTML>'>
    
    type(pattern2.match(r'  <HTML>',2))
    
    NoneType
    
    pattern2.match(r'  <HTML>'[2:])  # it works because slicing creates a new string
    
    <_sre.SRE_Match object; span=(0, 6), match='<HTML>'>
    
    re.match(r'^\d','..123\n12a',flags=re.M)  # returns None: match only tries the start of the string, even with re.M
    

    The optional parameter endpos limits how far into the string the match may go, in the same way that pos sets where it starts.
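
A minimal sketch of endpos, reusing the '<' style pattern from above. The variable names are illustrative only.

```python
import re

pattern = re.compile(r'<HTML>')
text = '  <HTML>'

# Start matching at index 2 and let the match run to the end of the string.
full = pattern.match(text, 2)

# With endpos=7 only indices 0..6 are visible, so the closing '>' is cut off.
limited = pattern.match(text, 2, 7)

print(full.span())  # (2, 8)
print(limited)      # None
```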

    Method search

    help(pattern.search)
    
    Help on built-in function search:
    
    search(string=None, pos=0, endpos=9223372036854775807, *, pattern=None) method of _sre.SRE_Pattern instance
        Scan through string looking for a match, and return a corresponding match object instance.
        
        Return None if no position in the string matches.
    
    help(re.search)
    
    Help on function search in module re:
    
    search(pattern, string, flags=0)
        Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found.
    

    Notice that the parameters of re.search and pattern.search are different: flags can be set in re.search but not in pattern.search, while pattern.search allows pos and endpos to be set. Also note that with the MULTILINE flag, the ^ symbol matches at the beginning of the string and at the beginning of each line, which changes the behaviour of search.

    pattern=re.compile(r'^\d+')
    pattern.search('abc10\n12ac4',pattern=re.M)  # wrong: flags cannot be passed to pattern.search
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-160-737c5def7a9f> in <module>
          1 pattern=re.compile(r'^\d+')
    ----> 2 pattern.search('abc10\n12ac4',pattern=re.M)
    
    
    TypeError: Argument given by name ('pattern') and position (1)
    
    re.search(r'\d+','abc10de23\n12ac4',re.M).group(0) # re.M has no impact on a regular expression without ^
    
    '10'
    
    re.find(r'\d+','abc10de23\n12ac4')  # the re module has no find function; use search or findall instead
    
    ---------------------------------------------------------------------------
    
    AttributeError                            Traceback (most recent call last)
    
    <ipython-input-177-1ae42c0d5a8f> in <module>
    ----> 1 re.find(r'\d+','abc10de23\n12ac4')
    
    
    AttributeError: module 're' has no attribute 'find'
    
    type(re.search(r'^\d+','abc10\n12ac4'))
    
    NoneType
    
    re.search(r'^\d+','abc10\n12ac4',re.M) # re.M has an impact on a regular expression with ^
    
    <_sre.SRE_Match object; span=(6, 8), match='12'>
    
    pattern=re.compile(r'\d+',re.M)
    pattern.search('abc10de23\n12ac4') # re.M has no impact on a regular expression without ^
    
    <_sre.SRE_Match object; span=(3, 5), match='10'>
    
    pattern=re.compile(r'^\d+',re.M)
    pattern.search('abc10de23\n12ac4') # re.M has an impact on a regular expression with ^
    
    <_sre.SRE_Match object; span=(10, 12), match='12'>
    

    The pos and endpos parameters have the same meaning as in the match operation.
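
As an illustration of pos and endpos with search, here is a short sketch on a sample string similar to the ones above. The variable names are assumptions for the example.

```python
import re

pattern = re.compile(r'\d+')
text = 'abc10de23'

first = pattern.search(text)          # scans from index 0
second = pattern.search(text, 5)      # scans from index 5 onwards
nothing = pattern.search(text, 5, 7)  # only 'de' is visible, so there are no digits

print(first.group(0), first.span())    # 10 (3, 5)
print(second.group(0), second.span())  # 23 (7, 9)
print(nothing)                         # None
```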

    Method findall

    help(pattern.findall)
    
    Help on built-in function findall:
    
    findall(string=None, pos=0, endpos=9223372036854775807, *, source=None) method of _sre.SRE_Pattern instance
        Return a list of all non-overlapping matches of pattern in string.
    
    help(re.findall)
    
    Help on function findall in module re:
    
    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
        
        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.
        
        Empty matches are included in the result.
    

    The previous operations, match and search, return only one match at a time. For example:

    re.match(r'a+','abc')
    
    <_sre.SRE_Match object; span=(0, 1), match='a'>
    
    re.search(r'b+','abcb')
    
    <_sre.SRE_Match object; span=(1, 2), match='b'>
    
    re.findall('b','abcb')
    
    ['b', 'b']
    
    re.findall(r'b+','abcb') # findall() returns a list with all the non-overlapping occurrences of a pattern
    
    ['b', 'b']
    
    re.findall(r'b*','abcb') # * allows 0 or more repetitions of the regular expression
    
    ['', 'b', '', 'b', '']
    
    re.findall(r'b?','abcb') # ? allows 0 or 1 repetition of the regular expression
    
    ['', 'b', '', 'b', '']
    
    re.findall(r'\d$','ab3c5\ng6u7',re.M)
    
    ['5', '7']
    
    re.findall(r'^\d','ab3\n4a56\nab\n7ah',re.M)
    
    ['4', '7']
    
    re.findall(r'^\d','ab3\n4a56\nab\n7ah')
    
    []
    
    re.findall(r'\d','ab3\n4a56\nab\n7ah')
    
    ['3', '4', '5', '6', '7']
    
    re.findall(r'\d+','ab3\n4a56\nab\n7ah')
    
    ['3', '4', '56', '7']
    

    Special notice shall be taken about * and ?. * allows 0 or more repetitions of the regular expression and ? allows 0 or 1 repetition, which means both of them produce an (empty) match even at positions where the expression is not found.

    The process of re.findall(r'b*','abcb') above is as follows:

    • at 'a': b* ----> matches ''
    • at 'b': b* ----> matches 'b'
    • at 'c': b* ----> matches ''
    • at 'b': b* ----> matches 'b'
    • at the end of the string: b* ----> matches ''
    • So it returns ['','b','','b','']
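
If the empty matches in the walk-through above are not wanted, two simple options are to use '+' instead of '*', or to filter the result. This is only a sketch, not the only way to handle it.

```python
import re

text = 'abcb'

with_star = re.findall(r'b*', text)        # includes empty matches
with_plus = re.findall(r'b+', text)        # requires at least one 'b'
non_empty = [m for m in with_star if m]    # drop the empty strings afterwards

print(with_star)  # ['', 'b', '', 'b', '']
print(with_plus)  # ['b', 'b']
print(non_empty)  # ['b', 'b']
```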

    '^' and '$' should be used together with the re.M flag when findall is meant to match at the start or end of every line rather than only at the start or end of the whole string.

    Method finditer

    It works just like findall, but it returns an iterator in which each element is a MatchObject.

    pattern=re.compile(r'(\w+)->(\w+)')
    it=pattern.finditer(r'Hello->world->hola->mundo')
    
    next(it).groups()
    
    ('Hello', 'world')
    
    next(it).groups()
    
    ('hola', 'mundo')
    
    next(it).groups()
    
    ---------------------------------------------------------------------------
    
    StopIteration                             Traceback (most recent call last)
    
    <ipython-input-218-f61d210e015b> in <module>
    ----> 1 next(it).groups()
    
    
    StopIteration: 
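
In practice the iterator is usually consumed with a for loop, which stops cleanly instead of raising StopIteration. A minimal sketch using the same pattern and text as above:

```python
import re

pattern = re.compile(r'(\w+)->(\w+)')
text = 'Hello->world->hola->mundo'

pairs = []
for match in pattern.finditer(text):
    # Each element is a match object, so spans and groups are both available.
    pairs.append((match.span(), match.groups()))

print(pairs)  # [((0, 12), ('Hello', 'world')), ((14, 25), ('hola', 'mundo'))]
```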
    

    Method split

    help(re.split)
    
    Help on function split in module re:
    
    split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.  If
        capturing parentheses are used in pattern, then the text of all
        groups in the pattern are also returned as part of the resulting
        list.  If maxsplit is nonzero, at most maxsplit splits occur,
        and the remainder of the string is returned as the final element
        of the list.
    
    pattern=re.compile(r'\W+')
    pattern.split('hello--->world')
    
    ['hello', 'world']
    
    pattern.split('hello-->world--<halo-->mungo',2) 
    
    ['hello', 'world', 'halo-->mungo']
    

    If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list, as in the following:

    pattern1=re.compile(r'(\W+)')
    pattern1.split('hello-->world--<halo-->mungo')
    
    ['hello', '-->', 'world', '--<', 'halo', '-->', 'mungo']
    

    Method sub

    help(pattern.sub)
    
    Help on built-in function sub:
    
    sub(repl, string, count=0) method of _sre.SRE_Pattern instance
        Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
    
    help(re.sub)
    
    Help on function sub in module re:
    
    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.
    
    pattern=re.compile(r'[0-9]+')
    pattern.sub('*','order1-->order2-->\norder3--->order4')
    
    'order*-->order*-->\norder*--->order*'
    
    re.compile(r'^\d',re.M).sub('*','acd34u\n3gh\n6yds') # '^' must be combined with re.M to match at the start of each line
    
    'acd34u\n*gh\n*yds'
    
    re.compile(r'^\d').sub('*','acd34u\n3gh\n6yds')
    
    'acd34u\n3gh\n6yds'
    

    The 'repl' argument can also be a function, in which case it receives as its argument a MatchObject for each match of 'pattern' in the string, and the string it returns is used as the replacement.

    def normalize_orders(matchobj):
        # group(1) is the text captured by the only group in the pattern
        if matchobj.group(1)=='-':
            return 'A'
        else:
            return 'B'
    
    re.sub('(-|[A-Z])',normalize_orders,'-1234 A193 -345 B876')
    
    'A1234 B193 A345 B876'
    

    A backreference is a feature that uses '\number' to refer to a previously matched group; we will learn it in detail later. Here it can also be used in the sub operation, as in the following code:

    text='imagine a new *world*, a magic *world*'
    pattern=re.compile(r'\*(.*?)\*')
    pattern.sub(r'<b>\1</b>',text)
    
    'imagine a new <b>world</b>, a magic <b>world</b>'
    
    There are 2 main points to notice here. (1) The non-greedy (.*?) lets us match each '*world*' separately, rather than '*world*, a magic *world*' as a whole.
    (2) \1 is the backreference, referring to the first and only group, (.*?).
    

    Method subn

    help(pattern.subn)
    
    Help on built-in function subn:
    
    subn(repl, string, count=0) method of _sre.SRE_Pattern instance
        Return the tuple (new_string, number_of_subs_made) found by replacing the leftmost non-overlapping occurrences of pattern with the replacement repl.
    
    text='imagine a new *world*, a magic *world*'
    pattern=re.compile(r'\*(.*?)\*')
    pattern.subn(r'<b>\1</b>',text)
    
    ('imagine a new <b>world</b>, a magic <b>world</b>', 2)
    

    2.2 MatchObject

    This object represents a successful match; we get one every time a match, search or finditer operation succeeds.

    Method group

    help(match.group)
    
    Help on built-in function group:
    
    group(...) method of _sre.SRE_Match instance
        group([group1, ...]) -> str or tuple.
        Return subgroup(s) of the match by indices or names.
        For 0 returns the entire match.
    

    If this method is invoked with no arguments or with 0, it returns the entire match; if one or more group identifiers are passed, the corresponding groups' matches are returned.

    pattern=re.compile(r'(\w+) (\w+)')  # 2 pairs of (), so there are 2 groups to be matched
    match=pattern.search('Hello world JohnYang')
    
    match.group(0)
    
    'Hello world'
    
    match.group()
    
    'Hello world'
    
    match.group(1)
    
    'Hello'
    
    match.group(2)
    
    'world'
    
    match.group(3)
    
    ---------------------------------------------------------------------------
    
    IndexError                                Traceback (most recent call last)
    
    <ipython-input-33-de3fb602d165> in <module>
    ----> 1 match.group(3)
    
    
    IndexError: no such group
    
    match.group(0,2)
    
    ('Hello world', 'world')
    

    If the pattern has named groups (the format is (?P<groupname>R), where R is a regular expression), they can be accessed using either the name or the index.

    pattern=re.compile(r'(?P<first>\w+) (?P<second>\w+)')
    match=pattern.search('Hello world')
    
    match.group(0)
    
    'Hello world'
    
    match.group('first')
    
    'Hello'
    
    match.group('second')
    
    'world'
    
    match.group(1,'first',2,'second')
    
    ('Hello', 'Hello', 'world', 'world')
    

    Method groups

    help(match.groups)
    
    Help on built-in function groups:
    
    groups(default=None) method of _sre.SRE_Match instance
        Return a tuple containing all the subgroups of the match, from 1.
        
        default
          Is used for groups that did not participate in the match.
    
    pattern=re.compile(r'\w+ \w+')
    match=pattern.search('Hello world')
    match.groups()                              # no subgroups because there are no () in the regular expression
    
    ()
    
    pattern=re.compile(r'(\w+) (\w+)')
    match=pattern.search('Hello world')
    
    match.groups()
    
    ('Hello', 'world')
    
    match.groups()
    
    ('Hello', 'world')
    
    pattern=re.compile(r'(\w+) (\w+)?') # the ? after (\w+) makes the second group optional; without it, the search below would fail to match
    match1=pattern.search(r'hello ')
    
    match1.groups()
    
    ('hello', None)
    
    match1.groups('default setting for None')
    
    ('hello', 'default setting for None')
    

    Method groupdict

    This method is used in cases where named groups have been used. It returns a dictionary with all the named groups that were found.

    pattern=re.compile(r'(?P<first>\w+) (?P<second>\w+)')
    pattern.search('Hellow world').groupdict()
    
    {'first': 'Hellow', 'second': 'world'}
    

    Method start

    Sometimes it is useful to know the index where the pattern matched. As with all the operations related to groups, if the argument group is 0, the operation works with the whole matched string.

    help(match.start)
    
    Help on built-in function start:
    
    start(group=0, /) method of _sre.SRE_Match instance
        Return index of the start of the substring matched by group.
    
    pattern=re.compile(r'(?P<first>\w+) (?P<second>\w+)?') # the trailing ? is necessary
    match=pattern.search(r' Hello ')
    
    match.start(1)
    
    1
    
    match.start(2) # if a group did not take part in the match, -1 is returned
    
    -1
    

    Method end

    The end operation behaves exactly the same as start, except that it returns the end of the substring matched by the group.

    match=pattern.search('Hello  ')
    
    match.groups()
    
    ('Hello', None)
    
    match.end(1)
    
    5
    

    Method span

    help(match.span)
    
    Help on built-in function span:
    
    span(group=0, /) method of _sre.SRE_Match instance
        For MatchObject m, return the 2-tuple (m.start(group), m.end(group)).
    
    match.span(1)
    
    (0, 5)
    

    Method expand

    This operation returns the template string with its backreferences replaced by the corresponding groups of the match. It is similar to sub.

    help(match.expand)
    
    Help on built-in function expand:
    
    expand(template) method of _sre.SRE_Match instance
        Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
    
    text='imagine a new *world* a magic *world*'
    match=re.search(r'\*(.*?)\*',text)
    match.groups()
    
    ('world',)
    
    match.expand(r'<b>\1</b>')
    
    '<b>world</b>'
    

    2.3 Module Operations

    There are 2 useful operations from the module.

    Method escape

    help(re.escape)
    
    Help on function escape in module re:
    
    escape(pattern)
        Escape all the characters in pattern except ASCII letters, numbers and '_'.
    
    re.escape('^')
    
    '\\^'
    
    re.findall(r'\^',r'^like^')
    
    ['^', '^']
    
    re.findall(re.escape('^'),'^^like^^')
    
    ['^', '^', '^', '^']
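
A typical use of escape is searching for a literal string that happens to contain several metacharacters. The following sketch uses an invented literal and text.

```python
import re

literal = '1+1=2 (maybe?)'
text = 'They say 1+1=2 (maybe?) but 11=2 is wrong.'

# re.escape turns the metacharacters in the literal into ordinary characters.
escaped = re.escape(literal)
hits = re.findall(escaped, text)

print(hits)  # ['1+1=2 (maybe?)']
```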
    

    Method purge

    help(re.purge)
    
    Help on function purge in module re:
    
    purge()
        Clear the regular expression caches
    

    2.4 Compilation flags

    When compiling a pattern string into a pattern object, it is possible to modify the standard behaviour of the pattern. In order to do that, we have to use the compilation flags. These can be combined using '|'. Let's see examples of some important flags.
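
As a small sketch of combining flags with '|', the pattern below needs both re.I and re.M to find both lines. The sample text and names are assumptions for illustration.

```python
import re

text = 'Date: 12/01/2013\ndate: 11/01/2013'

# re.I ignores case in 'date'; re.M lets ^ and $ match at every line.
both = re.compile(r'^date:\s*(\S+)$', re.I | re.M).findall(text)

# With re.I alone, '$' cannot match before the newline in the middle of the text.
only_ignorecase = re.compile(r'^date:\s*(\S+)$', re.I).findall(text)

print(both)             # ['12/01/2013', '11/01/2013']
print(only_ignorecase)  # []
```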

    re.IGNORECASE or re.I

    With this flag, the pattern matches both lower case and upper case letters.

    pattern=re.compile(r'[a-z]+',re.I)
    pattern1=re.compile(r'[a-z]+')
    
    pattern.findall('Felix')
    
    ['Felix']
    
    pattern1.findall('Felix')
    
    ['elix']
    

    re.MULTILINE or re.M

    This flag changes the behaviour of two metacharacters:

    • ^ : which now matches at the beginning of the string and at the beginning of each new line;
    • $ : which now matches at the end of the string and at the end of each line.
    pattern=re.compile(r'^\w+:\s*(\w+/\w+/\w+)')
    
    pattern.findall('date:   12/01/2013 \ndate: 11/01/2013')
    
    ['12/01/2013']
    
    pattern1=re.compile(r'^\w+:\s*(\w+/\w+/\w+)',re.M)
    
    pattern1.findall('date:   12/01/2013 \ndate: 11/01/2013')
    
    ['12/01/2013', '11/01/2013']
    

    re.DOTALL or re.S

    With this flag, the metacharacter '.' matches any character, including the newline.

    re.findall(r'^\d(.)','1\ne')
    
    []
    
    re.findall(r'^\d(.)','1\ne',re.S) # '.' now matches even the newline '\n'
    
    ['\n']
    

    Chapter 3 Grouping

    3.1 Introduction

    We have already seen groups in several examples throughout chapter 2. Grouping is accomplished with two metacharacters, the parentheses ().
    The first use of parentheses is building a subexpression. For example:

    re.match(r'\d-\w','1-a2-b3-v').group(0)  # Without a group, the match ends as soon as the expression is satisfied, ignoring the rest of the string.
    
    '1-a'
    
    re.match(r'(\d-\w)+','1-a2-b3-v').group(0,1) # With a group, the subexpression can be repeated as a unit, so the match walks through the whole string; group 1 keeps the last repetition
    
    ('1-a2-b3-v', '3-v')
    
    re.search(r'(ab)+c','ababcab').group(0,1) # matches repeated 'ab' followed by 'c', so the trailing 'ab' is not part of the match
    
    ('ababc', 'ab')
    

    The second simple use is limiting the scope of alternation. For example, to search for 'JohnYang' and 'JohnWang', we can use the regular expression 'John(Yang|Wang)'. In contrast, 'John[Yang|Wang]' does not do this: the square brackets define a character class, so it matches 'John' followed by any single one of those characters.

    re.search('John[Yang|Wang]','JohnY').group(0) # 'JohnY' should not be found; we only want JohnYang and JohnWang
    
    'JohnY'
    
    re.search('John(Yang|Wang)','JohnY').group(0) # JohnY is not matched, so search returns None and .group raises an error
    
    ---------------------------------------------------------------------------
    
    AttributeError                            Traceback (most recent call last)
    
    <ipython-input-176-928736c9aaff> in <module>
    ----> 1 re.search('John(Yang|Wang)','JohnY').group(0)
    
    
    AttributeError: 'NoneType' object has no attribute 'group'
    

    3.2 Backreference

    A backreference is written as \number, where number is the index of the corresponding group in the regular expression. The best known example to bring some clarity is the regular expression to find duplicated words, as shown in the following code:

    pattern=re.compile(r'(\w+) \1')
    match=pattern.search(r'hello hello world')
    
    match.groups()
    
    ('hello',)
    

    Another example, applied in the sub method:

    pattern=re.compile(r'(\d+)-(\w+)')  # two groups in the regular expression
    
    pattern.sub(r'\2-\1','1-a\n20-bear\n34-afcr')
    
    'a-1\nbear-20\nafcr-34'
    

    Backreferences can be used with the first 99 groups. Obviously, as the number of groups increases, reading and maintaining the regular expression becomes more complex. This is something that can be reduced with named groups.

    3.3 Named groups

    Let's see how it works with the previous example by the way of named groups:

    pattern=re.compile(r'(?P<country>\d+)-(?P<id>\w+)')
    
    pattern.sub(r'\g<id>-\g<country>','1-a\n20-bear\n34-afcr')
    
    'a-1\nbear-20\nafcr-34'
    

    As we have seen in the previous example, in order to reference a group by name in the sub operation, we have to use \g<name>. We can also use named groups inside the pattern itself, as seen in the following example:

    pattern=re.compile(r'(?P<word>\w+) (?P=word)')
    
    pattern.search(r'hello hello world').group(0,1)
    
    ('hello hello', 'hello')
    

    Summary

    Use                                        Syntax

    • Inside a pattern                         (?P=name)
    • In the repl string of the sub operation  \g<name>
    • In any operation of the MatchObject      match.group('name')
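
The three usages in the summary can be put side by side in one short sketch. The bracketed replacement format is only an illustration.

```python
import re

# (?P=word) refers back to the named group inside the pattern itself.
pattern = re.compile(r'(?P<word>\w+) (?P=word)')
text = 'hello hello world'

match = pattern.search(text)
word = match.group('word')                 # lookup by name on the match object
marked = pattern.sub(r'[\g<word>]', text)  # \g<word> inside the replacement string

print(word)    # hello
print(marked)  # [hello] world
```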

    3.4 Non-capturing groups

    The syntax of a non-capturing group is (?:pattern). We use non-capturing groups when we need parentheses only for grouping: they save resources, and the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
    In short, a non-capturing group does not show up in group() or groups(), and cannot be referenced with \number.

    re.search(r'(a|b)+','abbacabda').groups() # the group captures a single character; a repeated group keeps only its last repetition
    
    ('a',)
    
    re.search(r'((a|b)+)','abbacabda').groups()  # Adding an outer pair of '()' changes things: the outer group captures the whole run of a's and b's, while the inner one still holds a single character.
    
    ('abba', 'a')
    
    re.search(r'((a|b)+)','abbacabda').group(1,2) # Here we can see that group 1 is the outer one and group 2 is the inner one.
    
    ('abba', 'a')
    
    re.search(r'(?:(a|b)+)','abbacabda').groups() # Making the outer group non-capturing leaves only the inner group
    
    ('a',)
    
    re.search(r'((?:a|b)+)','abbacabda').groups() # Making the inner group non-capturing leaves only the outer group
    
    ('abba',)
    

    3.5 Special cases with groups

    Flags per group

    The syntax is just a special form of grouping: (?iLmsux), whose letters stand for re.I, re.L, re.M, re.S, re.U and re.X respectively. Note that in Python such a flag group applies to the whole pattern, so it belongs at the very start of the expression.

    re.findall(r'(?m)(^\d+)','at35e\n3gjh\n7yhg') # equivalent to re.findall(r'(^\d+)','at35e\n3gjh\n7yhg',flags=re.M)
    
    ['3', '7']
    
    re.findall(r'(^\d+)','at35e\n3gjh\n7yhg',flags=re.M)
    
    ['3', '7']
    
    re.findall(r'(^\d+)','at35e\n3gjh\n7yhg')
    
    []
    

    yes-pattern|no-pattern

    This is a very useful case of groups. The syntax is (?(id/name)yes-pattern|no-pattern). This expression means: if the group with the given id or name has been matched, then the yes-pattern has to match at this point of the string; if the group has not been matched, then the no-pattern has to match. The no-pattern is optional and can be omitted. It works just like an if-else statement.

    pattern=re.compile(r'(\d\d-)?(\w{3,4})-(?(1)(\d\d)|([a-z]{3,4}))$') # if group 1, (\d\d-), matched, the tail must match (\d\d), which becomes group 3; otherwise the tail must match ([a-z]{3,4}), which becomes group 4
    
    pattern.match('34-erte-22').groups() 
    
    ('34-', 'erte', '22', None)
    
    pattern.findall('34-erte-22')
    
    [('34-', 'erte', '22', '')]
    
    pattern.findall('erte-abcd')
    
    [('', 'erte', '', 'abcd')]
    
    pattern.match('ert-abcd').groups()
    
    (None, 'ert', None, 'abcd')
    

    Overlapping groups

    help(re.findall)
    
    Help on function findall in module re:
    
    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
        
        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.
        
        Empty matches are included in the result.
    
    re.findall(r'(a|b)+','ababcab') #expected group is one character either a or b.
    
    ['b', 'b']
    

    In the code above, findall finds two non-overlapping matches: 'abab' and 'ab'. Because the group (a|b) is repeated by '+', each repetition overwrites the captured value, so the group keeps only the last character matched, 'b'. When the pattern contains a group, findall returns the group rather than the whole match, hence ['b','b'].

    re.findall('(a|b)?','ababcab')
    
    ['a', 'b', 'a', 'b', '', 'a', 'b', '']
    
    re.search(r'((a|b)+)','ababcab').groups() # The first group is the outer group holding the whole run; the second is the inner group, which keeps only the last character matched.
    
    ('abab', 'b')
    
    re.search(r'((a|b))','ababcab').groups() # Without '+', the outer group is the same as the inner group: both contain one character
    
    ('a', 'a')
    
    re.findall('(a|b)*','ababcab')
    
    ['b', '', 'b', '']
    
    re.findall(r'((a|b)+)','ababcab')
    
    [('abab', 'b'), ('ab', 'b')]
    
    re.search(r'(?:(a|b)+)','ababcab').groups() # Non capture the outer group
    
    ('b',)
    
    re.search(r'((?:a|b)+)','ababcab').groups() # Non capture the inner group
    
    ('abab',)
    
    re.findall(r'((?:a|b)+)','ababcab') # find all outer-group pattern
    
    ['abab', 'ab']
    
    re.findall(r'(?:a|b)+','ababcab') # No capturing groups here (confirmed by the code below), so findall returns the whole matches.
    
    ['abab', 'ab']
    
    re.search(r'(?:a|b)+','ababcab').groups()
    
    ()
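
To see the whole match and the captured group side by side, one option is finditer, which yields match objects instead of the group values that findall returns. This is only a sketch on the same sample string as above.

```python
import re

rows = []
for m in re.finditer(r'(a|b)+', 'ababcab'):
    # group(0) is the whole match; group(1) keeps only the last repetition.
    rows.append((m.span(), m.group(0), m.group(1)))

for row in rows:
    print(row)
```

Each row pairs the span and full text of a match ('abab', then 'ab') with the single character left in the repeated group.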
    

    Chapter 4 Look around

    So far, every mechanism we have seen matches characters by consuming them. A character that has been matched cannot be compared again, and the only way to reach an upcoming character is to consume the ones before it.

    The exceptions to this are a number of metacharacters called zero-width assertions. For example, '^' and '$' are both zero-width assertions: they only check that the position in the input is correct, without consuming or matching any character.

    A more powerful kind of zero-width assertion is look around, a mechanism that checks whether a pattern matches just before (look behind) or just after (look ahead) the current position. Look-around assertions consume no characters; they only report whether the check succeeded or failed.

    Both look ahead and look behind could be subdivided into another two types each:positive and negative.

    • Positive look ahead: The syntax is (?=pattern). It succeeds if pattern matches the forthcoming input.
    • Negative look ahead: The syntax is (?!pattern). It succeeds if pattern does not match the forthcoming input.
    • Positive look behind: The syntax is (?<=pattern). It succeeds if pattern matches the input just before the current position.
    • Negative look behind: The syntax is (?<!pattern). It succeeds if pattern does not match the input just before the current position.
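
The four forms listed above can be tried on one made-up string. The snippet is a rough sketch; the sample text and variable names are assumptions for illustration.

```python
import re

text = 'price: $100, cost: 200'  # made-up sample

# Positive look behind: digits preceded by '$'.
after_dollar = re.findall(r'(?<=\$)\d+', text)
# Negative look behind: digit runs not preceded by '$' or another digit.
no_dollar = re.findall(r'(?<![\$\d])\d+', text)
# Positive look ahead: words followed by ':'.
labels = re.findall(r'\w+(?=:)', text)
# Negative look ahead: digit runs not followed by ',' or another digit.
not_before_comma = re.findall(r'\d+(?![,\d])', text)

print(after_dollar, no_dollar, labels, not_before_comma)
```

The extra \d inside the negative assertions stops the engine from satisfying the check by matching only part of a number.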

    Look ahead

    pat=re.compile(r'fox')
    
    pat.search('This is a fox').span()
    
    (10, 13)
    
    pat1=re.compile(r'(?=fox)')
    
    pat1.search('This is a fox').span() # This shows look ahead is a zero-width assertion
    
    (10, 10)
    
    pat2=re.compile(r'(\w+(?=,))') # the group captures a run of word characters that is followed by ',' (the comma itself is not included)
    
    pat2.findall('They were three: Felix,Victor,and Carlos.')
    
    ['Felix', 'Victor']
    
    pat3=re.compile(r'(\w+,)')
    
    pat3.findall('They were three: Felix,Victor,and Carlos.') # here ',' has to be matched. So ',' will also be returned.
    
    ['Felix,', 'Victor,']
    

    Note that a look ahead is itself a full subexpression, so it can use all the power of regular expressions. (This is not true of look behind, as we will see later.)

    pattern=re.compile(r'(\w+(?=,|\.))') # note that '.' must be escaped with a backslash to get its literal meaning rather than its metacharacter role.
    
    pattern.findall('They were three: Felix,Victor,and Carlos.')
    
    ['Felix', 'Victor', 'Carlos']
    

    Negative look ahead

    pattern=re.compile(r'John(?!\sSmith)')
    
    result=pattern.finditer('I would rather go out with John McLane rather with John Smith or John Bon Jovi')
    
    for i in result:
        print(i.span())
    
    (27, 31)
    (65, 69)
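
Negative look ahead is also handy for filtering whole strings. The sketch below keeps every name that does not end in '.tmp'. The file names are hypothetical and chosen only for illustration.

```python
import re

# (?!.*\.tmp$) is checked at the start: fail if the name ends in '.tmp'.
keep = re.compile(r'^(?!.*\.tmp$).+$')

names = ['report.txt', 'cache.tmp', 'notes.md', 'tmp.log']  # hypothetical names
kept = [name for name in names if keep.match(name)]
print(kept)
```

'tmp.log' survives because the assertion only rejects names where '.tmp' is the ending, not names that merely contain 'tmp'.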
    

    Look around and substitution

    The zero-width nature of the look around operation is especially useful in substitutions.One typical example of look ahead and substitution would be the conversion of a number composed of just numeric characters, such as 1234567890,into a comma separated number, that is 1,234,567,890.

    pattern=re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')  # Look ahead allows variable-length patterns
    
    for i in pattern.finditer('1234567890'):
        print(i.start(),i.end())
    
    0 1
    1 4
    4 7
    

    The match itself is \d{1,3}; the look ahead requires it to be followed by one or more groups of three digits (\d{3}) and then by a non-digit or the end of the string.

    pattern.sub(r'\g<0>,','1234567890')
    
    '1,234,567,890'
    
    pattern.sub(r'\g<0>,','123456789')
    
    '123,456,789'
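
The substitution above can be wrapped in a small helper and checked against Python's built-in grouping. This is a sketch, not a definitive implementation: the name add_commas is hypothetical, and the inner group is made non-capturing because its value is never used.

```python
import re

# Same look-ahead idea as above; add_commas is a hypothetical helper name.
_GROUPING = re.compile(r'\d{1,3}(?=(?:\d{3})+(?!\d))')

def add_commas(digits):
    """Put a comma after each digit group that is followed by complete triples."""
    return _GROUPING.sub(r'\g<0>,', digits)

for sample in ['1234567890', '123456789', '999', '1000']:
    # format(int, ',') is the built-in way to get the same grouping.
    print(sample, add_commas(sample), format(int(sample), ','))
```

For plain integers, format(n, ',') is simpler; the regular-expression version is mainly useful when the digits are embedded in a longer text.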
    

    In summary, (?=pattern) asserts that pattern matches starting at the current position without moving past it, so the surrounding match ends just in front of pattern.

    Look behind

    pattern=re.compile(r'(?<=John\s)McLane') # (?<=John\s) requires 'John' plus a whitespace character immediately before 'McLane'
    
    result=pattern.finditer(r'I would rather go out with John McLane than with John Smith or John Bon Jovi')
    
    for i in result:
        print(i.start(),i.end())
    
    32 38
    

    Attention
    In Python's re module there is, however, a fundamental difference between how look ahead and look behind are implemented. The look behind mechanism can only match fixed-width patterns. Fixed-width patterns cannot contain variable-length constructs such as the quantifiers '*', '+' and '?'. Backreferences are accepted only when the referenced group itself has a fixed length (supported since Python 3.5). Alternation is allowed, but only if all alternatives have the same length.

    pattern=re.compile(r'(?<=(John|Jonathan)\s)McLane')
    
    ---------------------------------------------------------------------------
    
    error                                     Traceback (most recent call last)
    
    <ipython-input-30-07cadf3808e8> in <module>()
    ----> 1 pattern=re.compile(r'(?<=(John|Jonathan)\s)McLane')
    
    
    D:\Anaconda\lib\re.py in compile(pattern, flags)
        232 def compile(pattern, flags=0):
        233     "Compile a regular expression pattern, returning a Pattern object."
    --> 234     return _compile(pattern, flags)
        235 
        236 def purge():
    
    
    D:\Anaconda\lib\re.py in _compile(pattern, flags)
        284     if not sre_compile.isstring(pattern):
        285         raise TypeError("first argument must be string or compiled pattern")
    --> 286     p = sre_compile.compile(pattern, flags)
        287     if not (flags & DEBUG):
        288         if len(_cache) >= _MAXCACHE:
    
    
    D:\Anaconda\lib\sre_compile.py in compile(p, flags)
        766         pattern = None
        767 
    --> 768     code = _code(p, flags)
        769 
        770     if flags & SRE_FLAG_DEBUG:
    
    
    D:\Anaconda\lib\sre_compile.py in _code(p, flags)
        605 
        606     # compile the pattern
    --> 607     _compile(code, p.data, flags)
        608 
        609     code.append(SUCCESS)
    
    
    D:\Anaconda\lib\sre_compile.py in _compile(code, pattern, flags)
        180                 lo, hi = av[1].getwidth()
        181                 if lo != hi:
    --> 182                     raise error("look-behind requires fixed-width pattern")
        183                 emit(lo) # look behind
        184             _compile(code, av[1], flags)
    
    
    error: look-behind requires fixed-width pattern
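
One possible workaround for the error above, assuming the set of alternatives is small and known, is to write one fixed-width look behind per alternative and combine them with alternation outside the assertions. The sample sentence is made up for illustration.

```python
import re

# Each look behind is fixed-width on its own; the alternation happens outside.
either_name = re.compile(r'(?:(?<=John\s)|(?<=Jonathan\s))McLane')

text = 'John McLane, Jonathan McLane and Jack McLane'  # made-up sample
spans = [m.span() for m in either_name.finditer(text)]
print(spans)
```

The first two occurrences of 'McLane' are found, while the one after 'Jack' is skipped because neither look behind succeeds there.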
    
    pattern=re.compile(r'(?<=\B@)[\w_]+')
    
    pattern.findall('Know your big data=5 for $50 on eBooks and 40% off all eBooks until Friday #hadoop @HadoopNews packtpub.com/bigdataoffers')
    
    ['HadoopNews']
    

    Negative look behind

    Negative look behind has the same limitation as positive look behind: the pattern must be fixed-width.

    pattern=re.compile(r'(?<!John\s)Doe')
    
    result=pattern.finditer('John Doe,Calvin Doe,Hobbes Doe')
    
    for i in result:
        print(i.start(),i.end())
    
    16 19
    27 30
    

    Look around and groups

    Another beneficial use of look-around constructions is inside groups. Typically, when groups are used, a very specific result has to be matched and returned inside the group. Since we don't want to pollute the groups with information that is not required, look around is one favorable option among others.

    pattern=re.compile(r'\w+\s[\d-]+\s[\d:,]+\s(.*(?<!authentication\s)failed)')
    

    (.*(?<!authentication\s)failed) is the capturing group. The negative look behind is checked at the position just before 'failed': the match succeeds only when 'failed' is not immediately preceded by 'authentication' and a whitespace character.

    pattern.findall('Info 2020-02-24 23:43:44,487 authentication failed')
    
    []
    
    pattern.findall('Info 2020-02-24 23:43:44,487 something failed')
    
    ['something failed']
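
As a rough sketch of the same idea with named groups, the pattern below labels each field so that the look behind keeps unwanted lines out of the result entirely. The field names level, date, time and message are assumptions about the log format, not something the original defines.

```python
import re

# Hypothetical log format: level, date, time, then a free-text message.
log_line = re.compile(
    r'(?P<level>\w+)\s(?P<date>[\d-]+)\s(?P<time>[\d:,]+)\s'
    r'(?P<message>.*(?<!authentication\s)failed)'
)

lines = [
    'Info 2020-02-24 23:43:44,487 authentication failed',
    'Info 2020-02-24 23:43:44,487 something failed',
]
# Lines rejected by the look behind yield None and are filtered out.
messages = [m.group('message') for m in map(log_line.match, lines) if m]
print(messages)
```

Only the second line produces a match, so messages holds just 'something failed'.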
    
    
    
    ##### May you gain ground inch by inch and renew yourself little by little #####
  • Original article: https://www.cnblogs.com/johnyang/p/12359699.html