Dive into RE in Python
Python's standard re module is a powerful tool for text manipulation such as searching, matching and splitting, so it is worth learning when such tasks appear. Since the official documentation of re is a bit obscure, I mainly refer to the book 《Mastering Regular Expression》 written by Felix Lopez to depict the whole picture of regular expressions and to help anyone who wants an in-depth view of re but lacks systematic learning materials.
import re
Chapter 1 Basic syntax of regular expressions
1.1 Literal and metacharacter
Regular expressions consist of two components: (1) literals, for example 'file', '.xml', '145'; (2) metacharacters, for example '?', '*', which have a special meaning in a regular expression. The following 12 metacharacters must be escaped with a backslash if they are to be used with their literal meaning:
- Backslash \
- Caret ^
- Dollar sign $
- Dot .
- Pipe symbol |
- Question mark ?
- Asterisk *
- Plus sign +
- Opening square bracket [
- Opening parenthesis (
- Closing parenthesis )
- Opening curly brace {
The meaning of these metacharacters will be illustrated by code examples later.
1.2 Character classes
A character class is a set of characters. There are 2 main kinds of character classes:
(1) User-defined, using the metacharacter '[]'. For example, 'licen[cs]e' will match 'license' or 'licence'. Furthermore, '-' can be used inside '[]' to represent a range of characters: [a-z] matches any lowercase letter, and [a-z123] matches any lowercase letter or one of '1', '2', '3'. '^' as the first character inside '[]' negates the set.
(2) Predefined (the equivalences below hold for ASCII text)
- '.' matches any character except a newline
- '\d' matches any decimal digit, equivalent to [0-9]
- '\D' matches any non-digit character, equivalent to [^0-9]
- '\s' matches any whitespace character, equivalent to [ \t\n\r\f\v]
- '\S' matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
- '\w' matches any alphanumeric character or '_', equivalent to [a-zA-Z0-9_]
- '\W' matches any character that is neither alphanumeric nor '_', equivalent to [^a-zA-Z0-9_]
re.match(r'\w','2a_d').group(0)
'2'
re.match(r'\w','_ad').group(0) #Including '_'
'_'
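The user-defined classes described above can be sketched in the same way. This is a minimal illustration; the sample strings are made up for the example.

```python
import re

# 'licen[cs]e' accepts either spelling, but not any other letter in that slot.
spellings = re.findall(r'licen[cs]e', 'license, licence, licenze')

# A range plus extra members: lowercase letters or the digits 1, 2, 3.
in_set = re.findall(r'[a-z123]', 'aZ1-9')

# A leading '^' negates the set: every character that is not a lowercase letter.
not_lower = re.findall(r'[^a-z]', 'aZ1-9')

print(spellings)  # ['license', 'licence']
print(in_set)     # ['a', '1']
print(not_lower)  # ['Z', '1', '-', '9']
```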
1.3 Alternation
A character class is a set of characters, while alternation is a set of regular expressions separated by '|'. One thing to mention: inside a bigger regular expression, we usually need to wrap the alternation in parentheses to express that only that part is alternated and not the whole expression. For instance, 'License:yes|no' matches 'License:yes' or 'no', not 'License:yes' or 'License:no'. To achieve the latter, use 'License:(yes|no)'. Note that findall then returns only the text captured by the group.
re.findall(r'License:yes|no',r'...License:yes,no..')
['License:yes', 'no']
re.findall(r'License:(yes|no)',r'...License:yes,no...')
['yes']
re.findall(r'License:(no|yes)','...License:yes,no...')
['yes']
help(re.findall)
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
1.4 Quantifiers
Quantifiers are the mechanism that determines how many times a character, metacharacter, character class or group can be repeated.
- '?' means optional (0 or 1 repetitions)
- '*' means zero or more times
- '+' means one or more times
- '{n,m}' means between n and m times; either bound can be omitted, so '{n,}' means at least n times and '{,m}' means at most m times
re.findall(r'\d+',r'..123a6bcd45...') # find runs of one or more decimal digits
['123', '6', '45']
re.findall(r'\d{1}',r'...123a6bcd45...') # find exactly one decimal digit, equivalent to re.findall(r'\d',r'...123a6bcd45...')
['1', '2', '3', '6', '4', '5']
re.findall(r'\d',r'...123a6bcd45...') # the same as above
['1', '2', '3', '6', '4', '5']
re.findall(r'\d{1,4}',r'...123a6bcd45...') # find one to four decimal digits
['123', '6', '45']
re.findall(r'\d{2}',r'...123a6bcd45...') # find exactly 2 decimal digits
['12', '45']
re.findall(r'\d{2,}',r'...123a6bcd45...') # find at least 2 decimal digits
['123', '45']
1.5 Greedy and non-greedy behaviour
Greedy behaviour tries to match as much as possible, giving the longest possible match, while non-greedy behaviour does the opposite and tries to match as little as possible, giving the shortest possible match.
For example, for '##html## tab ##xmml##', '##.*##' matches the whole string, which is greedy mode, while '##.*?##' matches only '##html##', which is non-greedy mode.
re.match(r'##.*##','##html## tab ##xmml##').group(0)
'##html## tab ##xmml##'
re.match(r'##.*?##','##html## tab ##xmml##').group(0)
'##html##'
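The same switch works for the other quantifiers: appending '?' to '+' or '{n,m}' makes them non-greedy too. A small sketch, using a made-up HTML fragment as input:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy '.+' runs to the last '>' in the string, so there is a single match.
greedy_tags = re.findall(r'<.+>', html)

# Non-greedy '.+?' stops at the first '>' it can, so each tag is matched alone.
lazy_tags = re.findall(r'<.+?>', html)

# '{2,4}?' takes the minimum of 2 digits each time instead of up to 4.
lazy_digits = re.findall(r'\d{2,4}?', '123456')

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
print(lazy_digits)  # ['12', '34', '56']
```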
1.6 Boundary matchers
Until now, we have just tried to find regular expressions anywhere within a text. Sometimes, when a whole line must be matched, we also need to anchor the match at the beginning or at the end of a line. This can be done thanks to boundary matchers.
- ^ matches the beginning of a line (only the beginning of the string unless re.M is set)
- $ matches the end of a line (only the end of the string unless re.M is set)
- \b matches a word boundary
- \B is the opposite of \b: it matches any position that is not a word boundary
- \A matches the beginning of the input
- \Z matches the end of the input
re.findall(r'\d+',r'123efg3')
['123', '3']
re.findall(r'^\d+',r'123efg3') # find decimal digits at the beginning of a line
['123']
re.findall(r'\d+$',r'123efg3') # find decimal digits at the end of a line
['3']
re.findall(r'\w+',r'@..abc-123%ef5d#$')
['abc', '123', 'ef5d']
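The word-boundary and input-boundary matchers from the list above are not exercised by these examples, so here is a small sketch of them. The sample strings are invented for illustration.

```python
import re

text = 'cat concat category'

# \b on both sides: only the standalone word 'cat' qualifies.
whole_words = re.findall(r'\bcat\b', text)

# \B before 'cat': the position must NOT be a word boundary,
# so 'cat' has to be preceded by another word character (as in 'concat').
inside_starts = [m.start() for m in re.finditer(r'\Bcat', text)]

# \A and \Z anchor to the whole input, even when re.M is set.
first_only = re.findall(r'\A\d+', '12\n34', re.M)
last_only = re.findall(r'\d+\Z', '12\n34', re.M)

print(whole_words)    # ['cat']
print(inside_starts)  # [7]
print(first_only)     # ['12']
print(last_only)      # ['34']
```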
Chapter 2 Regular expression with Python
In the previous chapter, the basic syntax of regular expressions was introduced with some simple code for illustration. This chapter puts the emphasis on regular expressions in Python.
We can either call the module-level functions directly or compile the pattern into a pattern object and call its methods. Compiling once and reusing the pattern object avoids compiling the same pattern again; the re module also caches recently compiled patterns for the module-level functions.
re.match('.*','Hello re').group(0) # directly using module to perform match task
'Hello re'
patt=re.compile(r'.*')
patt.match('Hello re').group(0) #using pattern obj to perform match task
'Hello re'
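A typical reason to hold on to a pattern object is a loop over many inputs. The following is only a sketch; the `lines` list and the `id=` format are hypothetical sample data.

```python
import re

# Hypothetical input lines, used only for illustration.
lines = ['id=17', 'name=x', 'id=204']

# Compile once, then reuse the same pattern object for every line.
id_pattern = re.compile(r'id=(\d+)')

ids = []
for line in lines:
    m = id_pattern.match(line)
    if m is not None:              # match() returns None when the line does not match
        ids.append(int(m.group(1)))

print(ids)  # [17, 204]
```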
There are 2 kinds of objects in the re module: (1) the Pattern Object, representing a compiled regular expression; (2) the Match Object, representing a successful match.
2.1 Pattern Object
pattern=re.compile(r'<')
type(pattern)
_sre.SRE_Pattern
Method match
help(pattern.match)
Help on built-in function match:
match(string=None, pos=0, endpos=9223372036854775807, *, pattern=None) method of _sre.SRE_Pattern instance
Matches zero or more characters at the beginning of the string.
help(re.match)
Help on function match in module re:
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found.
The key point is that this method tries to match the compiled pattern only at the beginning of the string. If there is a match, it returns a MatchObject; otherwise it returns None.
pattern.match('<HTML>') #Return matchObj if matching at the begining of string
<_sre.SRE_Match object; span=(0, 1), match='<'>
type(pattern.match('  <HTML>')) #Return None if not matched at the beginning of the string
NoneType
pattern.search('  <HTML>') #The search method does not care about the beginning
<_sre.SRE_Match object; span=(2, 3), match='<'>
The optional parameter pos specifies where to start searching, as shown in the following code:
pattern.match('  <HTML>',2)
<_sre.SRE_Match object; span=(2, 3), match='<'>
But a pos bigger than 0 does not mean that the string starts at that index: '^' still matches only at the real beginning of the string. See the following code:
pattern2=re.compile(r'^<HTML>')
pattern2.match(r'<HTML>')
<_sre.SRE_Match object; span=(0, 6), match='<HTML>'>
type(pattern2.match(r'  <HTML>',2))
NoneType
pattern2.match(r'  <HTML>'[2:]) # it works because slicing creates a new string
<_sre.SRE_Match object; span=(0, 6), match='<HTML>'>
re.match(r'^\d','..123\n12a',flags=re.M) # returns None: match only tries the beginning of the string, even with re.M
The optional parameter endpos works like pos, but it limits where the search stops: the string is treated as if it were endpos characters long.
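A short sketch of pos and endpos used together; the sample text is invented for the example.

```python
import re

digits = re.compile(r'\d+')
text = 'ab1234cd'

# Search only the window text[2:4]; characters from index 4 on are out of reach.
limited = digits.search(text, 2, 4)
limited_span = limited.span()
limited_text = limited.group(0)

# '$' matches at endpos, as if the string ended there.
ends_here = re.compile(r'\d+$').search(text, 0, 6)
ends_here_text = ends_here.group(0)

print(limited_span, limited_text)  # (2, 4) 12
print(ends_here_text)              # 1234
```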
Method search
help(pattern.search)
Help on built-in function search:
search(string=None, pos=0, endpos=9223372036854775807, *, pattern=None) method of _sre.SRE_Pattern instance
Scan through string looking for a match, and return a corresponding match object instance.
Return None if no position in the string matches.
help(re.search)
Help on function search in module re:
search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning
a match object, or None if no match was found.
Notice that the parameters of re.search and pattern.search differ: flags are accepted by re.search (and by re.compile) but not by pattern.search, while pattern.search accepts pos and endpos. Also note that with the MULTILINE flag the ^ symbol matches at the beginning of the string and at the beginning of each line, which changes the behaviour of search.
pattern=re.compile(r'^\d+')
pattern.search('abc10\n12ac4',pattern=re.M)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-160-737c5def7a9f> in <module>
1 pattern=re.compile(r'^\d+')
----> 2 pattern.search('abc10\n12ac4',pattern=re.M)
TypeError: Argument given by name ('pattern') and position (1)
re.search(r'\d+','abc10de23\n12ac4',re.M).group(0) # re.M has no impact on a regular expression without ^
'10'
re.find(r'\d+','abc10de23\n12ac4')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-177-1ae42c0d5a8f> in <module>
----> 1 re.find(r'\d+','abc10de23\n12ac4')
AttributeError: module 're' has no attribute 'find'
type(re.search(r'^\d+','abc10\n12ac4'))
NoneType
re.search(r'^\d+','abc10\n12ac4',re.M) # re.M has an impact on a regular expression with ^
<_sre.SRE_Match object; span=(6, 8), match='12'>
pattern=re.compile(r'\d+',re.M)
pattern.search('abc10de23\n12ac4') # re.M has no impact on a regular expression without ^
<_sre.SRE_Match object; span=(3, 5), match='10'>
pattern=re.compile(r'^\d+',re.M)
pattern.search('abc10de23\n12ac4') # re.M has an impact on a regular expression with ^
<_sre.SRE_Match object; span=(10, 12), match='12'>
The pos and endpos parameters have the same meaning as in the match operation.
Method findall
help(pattern.findall)
Help on built-in function findall:
findall(string=None, pos=0, endpos=9223372036854775807, *, source=None) method of _sre.SRE_Pattern instance
Return a list of all non-overlapping matches of pattern in string.
help(re.findall)
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
The previous operations, match and search, return only one match at a time. For example:
re.match(r'a+','abc')
<_sre.SRE_Match object; span=(0, 1), match='a'>
re.search(r'b+','abcb')
<_sre.SRE_Match object; span=(1, 2), match='b'>
re.findall('b','abcb')
['b', 'b']
re.findall(r'b+','abcb') # findall() returns a list with all the non-overlapping occurrences of a pattern
['b', 'b']
re.findall(r'b*','abcb') # * allows 0 or more repetitions of the regular expression
['', 'b', '', 'b', '']
re.findall(r'b?','abcb') # ? allows 0 or 1 repetition of the regular expression
['', 'b', '', 'b', '']
re.findall(r'\d$','ab3c5\ng6u7',re.M)
['5', '7']
re.findall(r'^\d','ab3\n4a56\nab\n7ah',re.M)
['4', '7']
re.findall(r'^\d','ab3\n4a56\nab\n7ah')
[]
re.findall(r'\d','ab3\n4a56\nab\n7ah')
['3', '4', '5', '6', '7']
re.findall(r'\d+','ab3\n4a56\nab\n7ah')
['3', '4', '56', '7']
Special care shall be taken with * and ?. * allows 0 or more repetitions of the regular expression and ? allows 0 or 1 repetition, which means both of them also produce an empty match at positions where the expression is not found.
The process of the above re.findall(r'b*','abcb') is as follows:
- [ ] a b*---->Match ''
- [x] b b*---->Match 'b'
- [ ] c b*---->Match ''
- [x] b b*---->Match 'b'
- [ ] $ b*---->Match ''
- So it returns ['','b','','b','']
'^' and '$' shall be used in conjunction with the re.M flag if they are to match at the beginning and end of every line rather than only at the beginning and end of the whole string.
Method finditer
It works just like findall, but it returns an iterator in which each element is a MatchObject.
pattern=re.compile(r'(\w+)->(\w+)')
it=pattern.finditer(r'Hello->world->hola->mundo')
next(it).groups()
('Hello', 'world')
next(it).groups()
('hola', 'mundo')
next(it).groups()
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-218-f61d210e015b> in <module>
----> 1 next(it).groups()
StopIteration:
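In practice the iterator is usually consumed with a for loop, which stops cleanly instead of raising StopIteration. A minimal sketch with the same pattern:

```python
import re

pattern = re.compile(r'(\w+)->(\w+)')

pairs = []
for m in pattern.finditer('Hello->world->hola->mundo'):
    # Each m is a match object, so span() and groups() are both available.
    pairs.append((m.span(), m.groups()))

print(pairs)
# [((0, 12), ('Hello', 'world')), ((14, 25), ('hola', 'mundo'))]
```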
Method split
help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
pattern=re.compile(r'\W+')
pattern.split('hello--->world')
['hello', 'world']
pattern.split('hello-->world--<halo-->mungo',2)
['hello', 'world', 'halo-->mungo']
If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list, as in the following:
pattern1=re.compile(r'(\W+)') # the separators are captured, so they appear in the result
pattern1.split('hello-->world--<halo-->mungo')
['hello', '-->', 'world', '--<', 'halo', '-->', 'mungo']
Method sub
help(pattern.sub)
Help on built-in function sub:
sub(repl, string, count=0) method of _sre.SRE_Pattern instance
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
help(re.sub)
Help on function sub in module re:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
pattern=re.compile(r'[0-9]+')
pattern.sub('*','order1-->order2-->\norder3--->order4')
'order*-->order*-->\norder*--->order*'
re.compile(r'^\d',re.M).sub('*','acd34u\n3gh\n6yds') #'^' shall be used in conjunction with re.M
'acd34u\n*gh\n*yds'
re.compile(r'^\d').sub('*','acd34u\n3gh\n6yds')
'acd34u\n3gh\n6yds'
The 'repl' argument can also be a function, in which case it receives the match object for each match of 'pattern' in the string as its argument, and the string it returns is used as the replacement.
def normalize_orders(matchobj):
    if matchobj.group(1)=='-':return 'A' # Note that: group(1)
    else:return 'B'
re.sub('(-|[A-Z])',normalize_orders,'-1234 A193 -345 B876')
'A1234 B193 A345 B876'
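A callable is also handy when the replacement has to be computed from the matched text. The helper name `double` below is made up for this sketch.

```python
import re

def double(matchobj):
    # The callable receives the match object and must return a string.
    return str(int(matchobj.group(0)) * 2)

doubled = re.sub(r'\d+', double, 'a1 b20 c300')

print(doubled)  # a2 b40 c600
```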
Backreference is a feature that uses '\number' to reference a previously matched group; we will learn it in detail later. It can also be used in the sub operation, as in the following code:
text='imagin a new *world*, a magic *world*'
pattern=re.compile(r'\*(.*?)\*')
pattern.sub(r'<b>\1<\\b>',text)
'imagin a new <b>world<\\b>, a magic <b>world<\\b>'
There are 2 main points to notice here. (1) The non-greedy (.*?) allows us to match '*world*' rather than '*world*, a magic *world*'.
(2) \1 is the backreference, meaning the first and only group, which is (.*?).
Method subn
help(pattern.subn)
Help on built-in function subn:
subn(repl, string, count=0) method of _sre.SRE_Pattern instance
Return the tuple (new_string, number_of_subs_made) found by replacing the leftmost non-overlapping occurrences of pattern with the replacement repl.
text='imagin a new *world*, a magic *world*'
pattern=re.compile(r'\*(.*?)\*')
pattern.subn(r'<b>\1<\\b>',text)
('imagin a new <b>world<\\b>, a magic <b>world<\\b>', 2)
2.2 MatchObject
This object represents a successful match of the pattern. We get one every time a match, search or finditer operation succeeds.
Method group
help(match.group)
Help on built-in function group:
group(...) method of _sre.SRE_Match instance
group([group1, ...]) -> str or tuple.
Return subgroup(s) of the match by indices or names.
For 0 returns the entire match.
If this method is invoked with no arguments or 0, it returns the entire match; while if one or more group identifiers are passed,the corresponding group's matches will be returned.
pattern=re.compile(r'(\w+) (\w+)') # 2 pairs of (), so there are 2 groups to be matched
match=pattern.search('Hello world JohnYang')
match.group(0)
'Hello world'
match.group()
'Hello world'
match.group(1)
'Hello'
match.group(2)
'world'
match.group(3)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-33-de3fb602d165> in <module>
----> 1 match.group(3)
IndexError: no such group
match.group(0,2)
('Hello world', 'world')
If the pattern has named groups (the format is (?P<groupname>R), where R is a regular expression), they can be accessed using either the name or the index.
pattern=re.compile(r'(?P<first>\w+) (?P<second>\w+)')
match=pattern.search('Hello world')
match.group(0)
'Hello world'
match.group('first')
'Hello'
match.group('second')
'world'
match.group(1,'first',2,'second')
('Hello', 'Hello', 'world', 'world')
Method groups
help(match.groups)
Help on built-in function groups:
groups(default=None) method of _sre.SRE_Match instance
Return a tuple containing all the subgroups of the match, from 1.
default
Is used for groups that did not participate in the match.
pattern=re.compile(r'\w+ \w+')
match=pattern.search('Hello world')
match.groups() # no subgroups because there are no () in the regular expression
()
pattern=re.compile(r'(\w+) (\w+)')
match=pattern.search('Hello world')
match.groups()
('Hello', 'world')
match.groups()
('Hello', 'world')
pattern=re.compile(r'(\w+) (\w+)?') # the ? after (\w+) makes the second group optional; without it, the following search would fail to match
match1=pattern.search(r'hello ')
match1.groups()
('hello', None)
match1.groups('default setting for None')
('hello', 'default setting for None')
Method groupdict
This method is used in the cases where named groups have been used. It returns a dictionary with all the named groups that were found.
pattern=re.compile(r'(?P<first>\w+) (?P<second>\w+)')
pattern.search('Hellow world').groupdict()
{'first': 'Hellow', 'second': 'world'}
Method start
Sometimes it is useful to know the index where the pattern matched. As with all the operations related to groups, if the argument group is 0, the operation works with the whole matched string.
help(match.start)
Help on built-in function start:
start(group=0, /) method of _sre.SRE_Match instance
Return index of the start of the substring matched by group.
pattern=re.compile(r'(?P<first>\w+) (?P<second>\w+)?') # ? is necessary
match=pattern.search(r' Hello ')
match.start(1)
1
match.start(2) # if a group did not take part in the match, -1 is returned
-1
Method end
The end operation behaves exactly the same as start, except that it returns the end of the substring matched by the group.
match=pattern.search('Hello ')
match.groups()
('Hello', None)
match.end(1)
5
Method span
help(match.span)
Help on built-in function span:
span(group=0, /) method of _sre.SRE_Match instance
For MatchObject m, return the 2-tuple (m.start(group), m.end(group)).
match.span(1)
(0, 5)
Method expand
This operation returns the template string after replacing its backreferences with the groups of the match. It is similar to sub.
help(match.expand)
Help on built-in function expand:
expand(template) method of _sre.SRE_Match instance
Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
text='imagin a new *world* a magic *world*'
match=re.search(r'\*(.*?)\*',text)
match.groups()
('world',)
match.expand(r'<b>\1<b>')
'<b>world<b>'
2.3 Module Operations
There are 2 useful operations from the module.
Method escape
help(re.escape)
Help on function escape in module re:
escape(pattern)
Escape all the characters in pattern except ASCII letters, numbers and '_'.
re.escape('^')
'\\^'
re.findall(r'\^',r'^like^')
['^', '^']
re.findall(re.escape('^'),'^^like^^')
['^', '^', '^', '^']
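The usual reason to call re.escape is to search for text that comes from outside the program and may contain metacharacters. A sketch, where `needle` stands for hypothetical user input:

```python
import re

needle = '1+1=2 (true?)'          # hypothetical user input, full of metacharacters
haystack = 'They say 1+1=2 (true?) and 11=2 (true) too.'

# Unescaped, '+', '(', ')' and '?' keep their special meaning,
# so the literal text is not found.
unescaped = re.findall(needle, haystack)

# Escaped, the pattern matches the input character for character.
escaped = re.findall(re.escape(needle), haystack)

print(unescaped)  # []
print(escaped)    # ['1+1=2 (true?)']
```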
Method purge
help(re.purge)
Help on function purge in module re:
purge()
Clear the regular expression caches
2.4 Compilation flags
When compiling a pattern string into a pattern object, it is possible to modify the standard behaviour of the pattern by using compilation flags. They can be combined using '|'. Let's see examples of some important flags.
re.IGNORECASE or re.I
With this flag the pattern matches both lower case and upper case.
pattern=re.compile(r'[a-z]+',re.I)
pattern1=re.compile(r'[a-z]+')
pattern.findall('Felix')
['Felix']
pattern1.findall('Felix')
['elix']
re.MULTILINE or re.M
This flag changes the behaviour of two metacharacters:
- ^ : which now matches at the beginning of the string and at the beginning of each new line;
- $ : which now matches at the end of the string and at the end of each line.
pattern=re.compile(r'^\w+:\s*(\w+/\w+/\w+)')
pattern.findall('date: 12/01/2013\ndate: 11/01/2013')
['12/01/2013']
pattern1=re.compile(r'^\w+:\s*(\w+/\w+/\w+)',re.M)
pattern1.findall('date: 12/01/2013\ndate: 11/01/2013')
['12/01/2013', '11/01/2013']
re.DOTALL or re.S
With this flag the metacharacter '.' matches any character, including the newline.
re.findall(r'^\d(.)','1\ne')
[]
re.findall(r'^\d(.)','1\ne',re.S) # with re.S, '.' matches even the newline '\n'
['\n']
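Combining flags with '|', as mentioned at the start of this section, can be sketched as follows. The sample text is invented; the inline form (?im) is the same pair of flags written inside the pattern.

```python
import re

text = 'Date: 12/01/2013\nDATE: 11/01/2013'

# re.I makes 'date' case-insensitive, re.M lets '^' match at every line start.
pattern = re.compile(r'^date:\s*(\S+)', re.I | re.M)
dates = pattern.findall(text)

# The same two flags written inline at the start of the pattern.
inline_dates = re.findall(r'(?im)^date:\s*(\S+)', text)

print(dates)         # ['12/01/2013', '11/01/2013']
print(inline_dates)  # ['12/01/2013', '11/01/2013']
```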
Chapter 3 Grouping
3.1 Introduction
We have already seen groups in several examples throughout chapter 2. Grouping is accomplished through two metacharacters, the parentheses ().
The first use of parentheses is building a subexpression. For example:
re.match(r'\d-\w','1-a2-b3-v').group(0) # Without a group, the match ends as soon as the expression has matched once, regardless of the rest of the string.
'1-a'
re.match(r'(\d-\w)+','1-a2-b3-v').group(0,1) # With a group, the subexpression can be repeated, so the match covers the whole string; group 1 keeps the last repetition
('1-a2-b3-v', '3-v')
re.search(r'(ab)+c','ababcab').group(0,1) # matches repetitions of ab followed by c, so the last ab is not part of the match
('ababc', 'ab')
The second simple use is limiting the scope of alternation. For example, to search for 'JohnYang' and 'JohnWang' we can use the regular expression 'John(Yang|Wang)'. In contrast, 'John[Yang|Wang]' does not do this: the square brackets define a character class, so it matches 'John' followed by any single one of those characters.
re.search('John[Yang|Wang]','JohnY').group(0) # JohnY should not be found, we just want JohnYang and JohnWang
'JohnY'
re.search('John(Yang|Wang)','JohnY').group(0) # JohnY is not matched, so search returns None
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-176-928736c9aaff> in <module>
----> 1 re.search('John(Yang|Wang)','JohnY').group(0)
AttributeError: 'NoneType' object has no attribute 'group'
3.2 Backreference
A backreference is written as \number, where number is the index of the corresponding group in the regular expression. The best-known example is the regular expression to find duplicated words, as shown in the following code:
pattern=re.compile(r'(\w+) \1')
match=pattern.search(r'hello hello world')
match.groups()
('hello',)
Another example, applied in the sub method:
pattern=re.compile(r'(\d+)-(\w+)') # two groups in the regular expression
pattern.sub(r'\2-\1','1-a\n20-bear\n34-afcr')
'a-1\nbear-20\nafcr-34'
Backreferences can be used with the first 99 groups. Obviously, as the number of groups increases, reading and maintaining the regular expression becomes more complex. This complexity can be reduced with named groups.
3.3 Named groups
Let's see how it works with the previous example by the way of named groups:
pattern=re.compile(r'(?P<country>\d+)-(?P<id>\w+)')
pattern.sub(r'\g<id>-\g<country>','1-a\n20-bear\n34-afcr')
'a-1\nbear-20\nafcr-34'
As we have seen in the previous example, in order to reference a group by name in the sub operation, we have to use \g<name>. We can also use named groups inside the pattern itself, as seen in the following example:
pattern=re.compile(r'(?P<word>\w+) (?P=word)')
pattern.search(r'hello hello world').group(0,1)
('hello hello', 'hello')
Summary
Use: Syntax
- Inside a pattern: (?P=name)
- In the repl string of the sub operation: \g<name>
- In any of the operations of the MatchObject: match.group('name')
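The three forms in the summary can be seen side by side in one short sketch, reusing the duplicated-word pattern:

```python
import re

# (?P<word>...) defines the group; (?P=word) refers back to it inside the pattern.
pattern = re.compile(r'(?P<word>\w+) (?P=word)')

# match.group('word') reads the group from the match object.
m = pattern.search('hello hello world')
by_name = m.group('word')

# \g<word> refers to the group inside the replacement string of sub.
collapsed = pattern.sub(r'\g<word>', 'hello hello world')

print(by_name)    # hello
print(collapsed)  # hello world
```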
3.4 Non-capturing groups
The syntax of non-capturing groups is (?:pattern). We use them when we need the grouping but not the captured text: they save resources, and the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In a word, a non-capturing group is not returned by group() or groups() and cannot be referenced using \number.
re.search(r'(a|b)+','abbacabda').groups() # the group holds a single character, the last repetition
('a',)
re.search(r'(a|b)+','abbacabda').groups() # the group here is again a single character
('a',)
re.search(r'((a|b)+)','abbacabda').groups() # Adding an outer layer of '()' changes things: the outer group captures the whole run of a or b, the inner one is the same as above, just one character.
('abba', 'a')
re.search(r'((a|b)+)','abbacabda').group(1,2) # Here we can see that group 1 is the outer one and group 2 is the inner one.
('abba', 'a')
re.search(r'(?:(a|b)+)','abbacabda').groups() # The outer group, the run of a or b, is non-capturing, so only the inner group is returned
('a',)
re.search(r'((?:a|b)+)','abbacabda').groups() # The inner group is non-capturing, leaving only the outer group.
('abba',)
3.5 Special cases with groups
Flags per group
The syntax is just a special form of grouping: (?iLmsux), whose letters represent re.I, re.L, re.M, re.S, re.U and re.X respectively.
re.findall(r'(?m)(^\d+)','at35e\n3gjh\n7yhg') # equivalent to re.findall(r'(^\d+)','at35e\n3gjh\n7yhg',flags=re.M)
['3', '7']
re.findall(r'(^\d+)','at35e\n3gjh\n7yhg',flags=re.M)
['3', '7']
re.findall(r'(^\d+)','at35e\n3gjh\n7yhg')
[]
yes-pattern|no-pattern
This is a very useful case of groups. The syntax is (?(id/name)yes-pattern|no-pattern). This expression means: if the group with the given id or name has been matched, then at this point of the string the yes-pattern has to match. If the group has not been matched, then the no-pattern has to match; the no-pattern is optional and can be omitted. It's just like an if-else statement.
pattern=re.compile(r'(\d\d-)?(\w{3,4})-(?(1)(\d\d)|([a-z]{3,4}))$') # when group 1 (\d\d-) has matched, the text after the second '-' must match (\d\d), which is group 3; when group 1 has not matched, it must match ([a-z]{3,4}), which is group 4
pattern.match('34-erte-22').groups()
('34-', 'erte', '22', None)
pattern.findall('34-erte-22')
[('34-', 'erte', '22', '')]
pattern.findall('erte-abcd')
[('', 'erte', '', 'abcd')]
pattern.match('ert-abcd').groups()
(None, 'ert', None, 'abcd')
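Another common use of the conditional is to require a closing delimiter only when the opening one was present. The following sketch is not from the book; the simplified address pattern is an assumption for illustration and is far from a complete e-mail validator.

```python
import re

# Group 1 is the optional '<'. If it matched, '>' must follow the address;
# otherwise the address must run to the end of the string.
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

bracketed = pattern.match('<user@host.com>') is not None
bare = pattern.match('user@host.com') is not None
unclosed = pattern.match('<user@host.com') is not None
unopened = pattern.match('user@host.com>') is not None

print(bracketed, bare, unclosed, unopened)  # True True False False
```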
Overlapping groups
help(re.findall)
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
re.findall(r'(a|b)+','ababcab') # the group is one character, either a or b.
['b', 'b']
In the above code, the pattern matches twice without overlapping: first 'abab', then 'ab'. The group (a|b) is repeated inside each match and keeps only the last character it captured, which is 'b' both times. Since findall returns the groups when the pattern contains one, the result is ['b','b'].
re.findall('(a|b)?','ababcab')
['a', 'b', 'a', 'b', '', 'a', 'b', '']
re.search(r'((a|b)+)','ababcab').groups() # The first group is the outer concatenating group, the second group is the inner group; search returns as soon as the pattern has matched once.
('abab', 'b')
re.search(r'((a|b))','ababcab').groups() # Without '+', the outer group is the same as the inner group: both contain one character
('a', 'a')
re.findall('(a|b)*','ababcab')
['b', '', 'b', '']
re.findall(r'((a|b)+)','ababcab')
[('abab', 'b'), ('ab', 'b')]
re.search(r'(?:(a|b)+)','ababcab').groups() # Non capture the outer group
('b',)
re.search(r'((?:a|b)+)','ababcab').groups() # Non capture the inner group
('abab',)
re.findall(r'((?:a|b)+)','ababcab') # find all outer-group pattern
['abab', 'ab']
re.findall(r'(?:a|b)+','ababcab') # No groups here,proved by the below code,so matching the concatenating pattern group.
['abab', 'ab']
re.search(r'(?:a|b)+','ababcab').groups()
()
Chapter 4 Look around
Until now, we have learned different mechanisms that match characters and consume them. A character that has already been matched cannot be compared again, and the only way to match an upcoming character is to consume the ones before it.
The exceptions to this are a number of metacharacters called zero-width assertions. For example, '^' and '$' are both zero-width assertions: they just ensure that the position in the input is correct without actually consuming or matching any character.
A more powerful kind of zero-width assertion is look around, a mechanism with which it is possible to match a certain previous (look behind) or ulterior (look ahead) value to the current position. They effectively do assertion without consuming characters; they just return a positive or negative result of the match.
Both look ahead and look behind can be subdivided into two types each: positive and negative.
- Positive look ahead: the syntax is (?=pattern), and it matches if the passed pattern matches against the forthcoming input.
- Negative look ahead: the syntax is (?!pattern), and it matches if the passed pattern does not match against the forthcoming input.
- Positive look behind: the syntax is (?<=pattern), and it matches if the passed pattern matches against the previous input.
- Negative look behind: the syntax is (?<!pattern), and it matches if the passed pattern does not match against the previous input.
Look ahead
pat=re.compile(r'fox')
pat.search('This is a fox').span()
(10, 13)
pat1=re.compile(r'(?=fox)')
pat1.search('This is a fox').span() # This shows look ahead is a zero-width assertion
(10, 10)
pat2=re.compile(r'(\w+(?=,))') # the group is a run of alphanumeric characters followed by ',' (not included)
pat2.findall('They were three: Felix,Victor,and Carlos.')
['Felix', 'Victor']
pat3=re.compile(r'(\w+,)')
pat3.findall('They were three: Felix,Victor,and Carlos.') # here ',' has to be matched, so ',' is also returned.
['Felix,', 'Victor,']
It's noteworthy that the look ahead mechanism is another subexpression that can be leveraged with all the power of regular expressions (unlike the look behind mechanism, as we will discover later).
pattern=re.compile(r'(\w+(?=,|\.))') # notice that '.' has to be escaped with a backslash to get its literal meaning rather than its metacharacter role.
pattern.findall('They were three: Felix,Victor,and Carlos.')
['Felix', 'Victor', 'Carlos']
Negative look ahead
pattern=re.compile(r'John(?!\sSmith)')
result=pattern.finditer('I would rather go out with John McLane rather with John Smith or John Bon Jovi')
for i in result:
    print(i.span())
(27, 31)
(65, 69)
Look around and substitution
The zero-width nature of the look around operation is especially useful in substitutions.One typical example of look ahead and substitution would be the conversion of a number composed of just numeric characters, such as 1234567890,into a comma separated number, that is 1,234,567,890.
pattern=re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))') # Look ahead allows variable-length patterns
for i in pattern.finditer('1234567890'):
    print(i.start(),i.end())
0 1
1 4
4 7
The matched text is \d{1,3}, and it must be followed by one or more groups of \d{3} and then by something that is not \d.
pattern.sub(r'\g<0>,','1234567890')
'1,234,567,890'
pattern.sub(r'\g<0>,','123456789')
'123,456,789'
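Wrapped in a small helper, the substitution above might look like this. The names `_THOUSANDS` and `add_commas` are made up for the sketch, and it assumes the input holds plain integers: digits after a decimal point would be grouped as well.

```python
import re

# Same pattern as above, compiled once at module level.
_THOUSANDS = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')

def add_commas(number_text):
    """Insert a comma after each digit run that is followed by whole groups of three digits."""
    return _THOUSANDS.sub(r'\g<0>,', number_text)

big = add_commas('1234567890')
four = add_commas('1234')
small = add_commas('123')
in_sentence = add_commas('total 1234567 items')

print(big)          # 1,234,567,890
print(four)         # 1,234
print(small)        # 123
print(in_sentence)  # total 1,234,567 items
```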
In summary, we can understand look ahead this way: (?=pattern) checks that pattern matches starting at the current position, but the position does not move, so the characters it looked at are still available to the rest of the expression.
Look behind
pattern=re.compile(r'(?<=John\s)McLane') # (?<=John\s) asserts that 'John' and a whitespace character come right before 'McLane'
result=pattern.finditer(r'I would rather go out with John McLane than with John Smith or John Bon Jovi')
for i in result:
    print(i.start(),i.end())
32 38
Attention
In Python's re module there is, however, a fundamental difference between how look ahead and look behind are implemented. The look behind mechanism is only able to match fixed-width patterns. Fixed-width patterns do not contain variable-length matches such as the quantifiers. Other variable-length constructions such as backreferences are not allowed either. Alternation is allowed, but only if the alternatives have the same length.
pattern=re.compile(r'(?<=(John|Jonathan)\s)McLane')
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-30-07cadf3808e8> in <module>()
----> 1 pattern=re.compile(r'(?<=(John|Jonathan)\s)McLane')
D:\Anaconda\lib\re.py in compile(pattern, flags)
232 def compile(pattern, flags=0):
233 "Compile a regular expression pattern, returning a Pattern object."
--> 234 return _compile(pattern, flags)
235
236 def purge():
D:\Anaconda\lib\re.py in _compile(pattern, flags)
284 if not sre_compile.isstring(pattern):
285 raise TypeError("first argument must be string or compiled pattern")
--> 286 p = sre_compile.compile(pattern, flags)
287 if not (flags & DEBUG):
288 if len(_cache) >= _MAXCACHE:
D:\Anaconda\lib\sre_compile.py in compile(p, flags)
766 pattern = None
767
--> 768 code = _code(p, flags)
769
770 if flags & SRE_FLAG_DEBUG:
D:\Anaconda\lib\sre_compile.py in _code(p, flags)
605
606 # compile the pattern
--> 607 _compile(code, p.data, flags)
608
609 code.append(SUCCESS)
D:\Anaconda\lib\sre_compile.py in _compile(code, pattern, flags)
180 lo, hi = av[1].getwidth()
181 if lo != hi:
--> 182 raise error("look-behind requires fixed-width pattern")
183 emit(lo) # look behind
184 _compile(code, av[1], flags)
error: look-behind requires fixed-width pattern
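One possible way around this limitation, offered here as a sketch rather than as the book's solution, is to write one fixed-width look behind per alternative and put the alternation outside them. The sample sentence is invented.

```python
import re

# Each look behind is fixed-width on its own; the alternation sits outside them.
pattern = re.compile(r'(?:(?<=John\s)|(?<=Jonathan\s))McLane')

text = 'John McLane and Jonathan McLane, not Joe McLane'
spans = [m.span() for m in pattern.finditer(text)]

print(spans)  # [(5, 11), (25, 31)]
```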
pattern=re.compile(r'(?<=\B@)[\w_]+')
pattern.findall('Know your big data=5 for $50 on eBooks aand 40% offf all eBooks until Friday #hadoop @HadoopNews paacktpub.com/bigdataofffers')
['HadoopNews']
Negative look behind
Negative look behind has the same limitation as positive look behind: the pattern must be fixed-width.
pattern=re.compile(r'(?<!John\s)Doe')
result=pattern.finditer('John Doe,Calvin Doe,Hobbes Doe')
for i in result:
    print(i.start(),i.end())
16 19
27 30
Look around and groups
Another beneficial use of look around constructions is inside groups. Typically, when groups are used, a very specific result has to be matched and returned inside the group. As we don't want to pollute the groups with information that is not required, look around is one favorable solution among other potential options.
pattern=re.compile(r'\w+\s[\d-]+\s[\d:,]+\s(.*(?<!authentication\s)failed)')
(.*(?<!authentication\s)failed) is a group. The negative look behind lets 'failed' match only when it is not immediately preceded by 'authentication' and a whitespace character, so authentication failures are left out without adding anything to the captured text.
pattern.findall('Info 2020-02-24 23:43:44,487 authentication failed')
[]
pattern.findall('Info 2020-02-24 23:43:44,487 something failed')
['something failed']