  • Regular Expressions

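    Note: every regular expression in this article is written between forward slashes; the slashes on either side are delimiters and not part of the regular expression itself. Also: this article was written on February 20, 2019 (now: 2020).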

    Basic Regular Expression Patterns

    Simple Matching

    RE              Match         Example Patterns
    /woodchucks/    woodchucks    “interesting links to woodchucks and lemurs”
    /a/             a             “Mary Ann stopped by Mona’s”
    /!/             !             “You’ve left the burglar behind again!” said Nori

    Bracket Matching

    Use square brackets [] to specify a disjunction of characters: the pattern matches any one of the characters listed inside the brackets.

    RE                Match                     Example Patterns
    /[wW]oodchuck/    Woodchuck or woodchuck    Woodchuck
    /[abc]/           ‘a’, ‘b’, or ‘c’          “In uomini, in soldati”
    /[1234567890]/    any digit                 “plenty of 7 to 5”

    Range Matching

    Use square brackets [] plus the dash - to specify a range of characters.

    RE         Match                    Example Patterns
    /[A-Z]/    an upper case letter     “we should call it ‘Drenched Blossoms’ ”
    /[a-z]/    a lower case letter      “my beans were impatient to be hoed!”
    /[0-9]/    a single digit           “Chapter 1: Down the Rabbit Hole”
    /[b-g]/    b, c, d, e, f, or g
    /[2-5]/    2, 3, 4, or 5
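The bracket and range patterns in the tables above can be tried out with Python's `re` module. This is a minimal sketch; the helper name `first_match` is made up for illustration and simply returns the first substring that matches.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"[wW]oodchuck", "Woodchuck"))                         # Woodchuck
print(first_match(r"[abc]", "In uomini, in soldati"))                    # a
print(first_match(r"[1234567890]", "plenty of 7 to 5"))                  # 7
print(first_match(r"[A-Z]", "we should call it 'Drenched Blossoms'"))    # D
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))          # 1
print(first_match(r"[2-5]", "room 1729"))                                # 2
```

Note that `re.search` reports only the first match, which is why `/[1234567890]/` returns 7 and not 5.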

    Negation

    RE          Match                       Example Patterns
    /[^A-Z]/    not an upper case letter    “Oyfn pripetchik”
    /[^Ss]/     neither ‘S’ nor ‘s’         “I have no exquisite reason for’t”
    /[^.]/      not a period                “our resident Djinn”
    /[e^]/      either ‘e’ or ‘^’           “look up ^ now”
    /a\^b/      the pattern ‘a^b’           “look up a^b now”
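A short Python sketch of the negation table above. The caret negates only when it is the first character inside the brackets; elsewhere inside brackets it is a literal caret, and outside brackets it has to be escaped to be taken literally. The helper `first_match` is a hypothetical name used only for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"[^A-Z]", "Oyfn pripetchik"))                  # y
print(first_match(r"[^Ss]", "I have no exquisite reason for't"))  # I
print(first_match(r"[^.]", "our resident Djinn"))                 # o
print(first_match(r"[e^]", "look up ^ now"))                      # ^
print(first_match(r"a\^b", "look up a^b now"))                    # a^b
```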

    Optional Matching

    The question mark ? makes the preceding character optional.

    RE               Match                      Example Patterns
    /woodchucks?/    woodchuck or woodchucks    woodchuck
    /colou?r/        color or colour            colour

    Matching Multiple Occurrences

    • Kleene * means “zero or more occurrences of the immediately previous character or regular expression”
    • Kleene + means “one or more occurrences of the immediately previous character or regular expression”
    • The wildcard . matches any single character (except a carriage return)

    RE                Match
    /a*/              zero or more a’s: a, aaaaaa, and also Minor (which has zero a’s)
    /aa*/             one or more a’s, that is, one a followed by zero or more a’s
    /[ab]*/           zero or more a’s or b’s: aaaa, ababab, or bbbb
    /[0-9][0-9]*/     a sequence of one or more digits
    /[0-9]+/          a sequence of one or more digits
    /baaa*!/          baa!, baaa!, baaaa!, baaaaa!, ...
    /baa+!/           baa!, baaa!, baaaa!, baaaaa!, ...
    /beg.n/           any single character between beg and n: begin, beg’n, begun

    • The wildcard is often used together with the Kleene star to mean “any string of characters”. For example, suppose we want to find any line in which a particular word, for example, aardvark, appears twice. We can specify this with the regular expression /aardvark.*aardvark/.
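The optional marker, the Kleene operators, and the wildcard can be checked as follows. This is a sketch using Python's `re`; `matches_whole` is a made-up helper that tests whether a pattern matches an entire string.

```python
import re

def matches_whole(pattern, text):
    """True if pattern matches the whole of text (not just a substring)."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"colou?r", "color"), matches_whole(r"colou?r", "colour"))  # True True
print(matches_whole(r"baa+!", "baa!"), matches_whole(r"baa+!", "ba!"))          # True False
print(matches_whole(r"baaa*!", "baaaa!"))                                       # True
print(matches_whole(r"beg.n", "begun"), matches_whole(r"beg.n", "begn"))        # True False

# /a*/ also matches the empty string, so it "matches" Minor with zero a's.
print(repr(re.search(r"a*", "Minor").group(0)))                                 # ''

# Wildcard plus Kleene star: any line in which aardvark appears twice.
line = "an aardvark met another aardvark"
print(re.search(r"aardvark.*aardvark", line) is not None)                       # True
```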

    Anchors

    • ^ matches the start of a line
    • $ matches the end of a line
    • \b matches a word boundary
    • \B matches a non-boundary

    RE               Match
    /^The/           the word The only at the start of a line
    / $/             a space at the end of a line
    /^The dog\.$/    a line that contains only the phrase The dog. (We have to use the backslash here since we want the . to mean “period” and not the wildcard.)
    /\bthe\b/        the word the but not the word other
    /\b99\b/         the 99 in “There are 99 bottles” and in “$99”, but not in “There are 299 bottles”
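The anchors can be exercised with a few calls to Python's `re`. This is an illustrative sketch; the sample strings other than the bottle sentences are made up.

```python
import re

# ^ and $ anchor the match to the start and end of the line.
print(re.search(r"^The", "The dog barked") is not None)    # True
print(re.search(r"^The", "See The dog") is not None)       # False
print(re.search(r" $", "trailing space ") is not None)     # True
print(re.search(r"^The dog\.$", "The dog.") is not None)   # True
print(re.search(r"^The dog\.$", "The dogs") is not None)   # False

# \b requires a word boundary, so "other" and "theology" are skipped.
print(re.findall(r"\bthe\b", "the other theology the"))    # ['the', 'the']

# \b99\b matches a free-standing 99, including after a dollar sign.
print(re.findall(r"\b99\b", "There are 99 bottles"))       # ['99']
print(re.findall(r"\b99\b", "There are 299 bottles"))      # []
print(re.findall(r"\b99\b", "$99"))                        # ['99']

# \B is the opposite: "the" strictly inside a longer word.
print(re.search(r"\Bthe\B", "other") is not None)          # True
```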

    Thus, the caret ^ has three uses:

    1. to match the start of a line;
    2. to indicate a negation inside of square brackets;
    3. to mean a caret. (What are the contexts that allow grep or Python to know which function a given caret is supposed to have?)

    Disjunction, Grouping, and Operator Precedence

    The disjunction operator | and the grouping parentheses ()

    RE               Match
    /cat|dog/        either the string cat or the string dog
    /gupp(y|ies)/    guppy or guppies

    An example (taken verbatim from Speech and Language Processing):
    Perhaps we have a line that has column labels of the form Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not match any number of columns; instead, it will match a single column followed by any number of spaces! The star here applies only to the space that precedes it, not to the whole sequence. With the parentheses, we could write the expression /(Column [0-9]+ *)*/ to match the word Column, followed by a number and optional spaces, the whole pattern repeated zero or more times.
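The difference between the two Column patterns is easy to see by printing what each one actually consumes. A minimal sketch with Python's `re.match`, which anchors the match at the start of the string:

```python
import re

line = "Column 1 Column 2 Column 3"

# Without parentheses the star applies only to the space before it,
# so a single column (plus trailing spaces) is matched.
single = re.match(r"Column [0-9]+ *", line)
print(repr(single.group(0)))     # 'Column 1 '

# With parentheses the star repeats the whole "Column N" unit.
repeated = re.match(r"(Column [0-9]+ *)*", line)
print(repr(repeated.group(0)))   # 'Column 1 Column 2 Column 3'
```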

    Operator Precedence

    The following table gives the order of RE operator precedence, from highest precedence to lowest precedence:

    Parenthesis              ()
    Counters                 * + ? {}
    Sequences and anchors    the ^my end$
    Disjunction              |
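The effect of this precedence order can be observed directly. In the sketch below (Python's `re`, with strings chosen for illustration), counters bind more tightly than sequences, and sequences bind more tightly than disjunction; parentheses override both.

```python
import re

def matches_whole(pattern, text):
    """True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# Counters outrank sequences: /the*/ is "th" followed by zero or more e's.
print(matches_whole(r"the*", "theeee"))     # True
print(matches_whole(r"the*", "thethe"))     # False
print(matches_whole(r"(the)*", "thethe"))   # True

# Sequences outrank disjunction: /the|any/ is "the" or "any".
print(matches_whole(r"the|any", "any"))     # True
print(matches_whole(r"the|any", "thany"))   # False
print(matches_whole(r"th(e|a)ny", "thany")) # True
```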

    Greedy and Non-Greedy Matching (to be continued)

    More Operators

    Aliases for common sets of characters

    RE    Expansion       Match                          First Matches
    \d    [0-9]           any digit                      Party of 5
    \D    [^0-9]          any non-digit                  Blue moon
    \w    [a-zA-Z0-9_]    any alphanumeric/underscore    Daiyu
    \W    [^\w]           a non-alphanumeric             !!!!
    \s    [ \r\t\n\f]     whitespace (space, tab)
    \S    [^\s]           non-whitespace                 in Concord

    /a\.{24}z/ matches a followed by 24 dots followed by z.
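A quick check of the aliases in Python. One caveat worth stating as an assumption about the environment: for `str` patterns Python's `re` treats `\d`, `\w`, and `\s` as Unicode-aware by default, so the ASCII expansions in the table apply exactly only when the `re.ASCII` flag is passed.

```python
import re

print(re.search(r"\d", "Party of 5").group(0))          # 5
print(re.search(r"\D", "Blue moon").group(0))           # B
print(re.search(r"\w", "Daiyu").group(0))               # D
print(re.search(r"\W", "!!!!").group(0))                # !
print(repr(re.search(r"\s", "in Concord").group(0)))    # ' '
print(re.search(r"\S", "in Concord").group(0))          # i

# With re.ASCII, \w is exactly [a-zA-Z0-9_]; without it, \w also matches 正.
print(re.search(r"\w", "正", flags=re.ASCII) is None)   # True
print(re.search(r"\w", "正") is not None)               # True

# An escaped period with a counter: a, then 24 literal dots, then z.
dotted = "a" + "." * 24 + "z"
print(re.fullmatch(r"a\.{24}z", dotted) is not None)    # True
```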

    Regular expression operators for counting

    RE       Match
    *        zero or more occurrences of the previous char or expression
    +        one or more occurrences of the previous char or expression
    ?        exactly zero or one occurrence of the previous char or expression
    {n}      n occurrences of the previous char or expression
    {n,m}    from n to m occurrences of the previous char or expression
    {n,}     at least n occurrences of the previous char or expression
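The brace counters behave as follows in Python's `re`. This is a small sketch; `matches_whole` is a hypothetical helper that checks a match against the entire string, so that the upper bounds are visible.

```python
import re

def matches_whole(pattern, text):
    """True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"a{3}", "aaa"))            # True
print(matches_whole(r"a{3}", "aaaa"))           # False
print(matches_whole(r"[0-9]{2,4}", "123"))      # True
print(matches_whole(r"[0-9]{2,4}", "12345"))    # False
print(matches_whole(r"[0-9]{2,}", "1234567"))   # True
print(matches_whole(r"[0-9]{2,}", "1"))         # False
```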

    Some characters that need to be backslashed.

    RE    Match                  First Patterns Matched
    \*    an asterisk “*”        “KAPLA*N”
    \.    a period “.”           “Dr. Livingston, I presume”
    \?    a question mark “?”    “Why don’t they come and lend a hand?”
    \n    a newline
    \t    a tab
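In Python the same escapes are written inside raw strings so that the backslash reaches the regex engine unchanged. The sketch below also uses `re.escape`, a standard-library function that adds the backslashes automatically when an arbitrary string has to be matched literally.

```python
import re

print(re.search(r"\*", "KAPLA*N").group(0))                               # *
print(re.search(r"\.", "Dr. Livingston, I presume").start())              # 2
print(re.search(r"\?", "Why don't they come and lend a hand?").group(0))  # ?
print(re.split(r"\n", "line one\nline two"))                              # ['line one', 'line two']
print(re.split(r"\t", "col1\tcol2"))                                      # ['col1', 'col2']

# re.escape builds a pattern that matches the given text literally.
literal = re.escape("a.b*c?")
print(re.fullmatch(literal, "a.b*c?") is not None)                        # True
print(re.fullmatch(literal, "aXbc") is not None)                          # False
```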

    Substitution

    (Taken verbatim from Speech and Language Processing.)
    It is often useful to be able to refer to a particular subpart of the string matching the first pattern. For example, suppose we wanted to put angle brackets around all integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d like a way to refer to the integer we’ve found so that we can easily add the brackets. To do this, we put parentheses ( and ) around the first pattern and use the number operator \1 in the second pattern to refer back. Here’s how it looks:

    s/([0-9]+)/<\1>/
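The s/…/…/ notation comes from tools such as sed and Perl. A rough Python equivalent, offered as a sketch, is `re.sub`, where the replacement string can refer to the first capture group as `\1`:

```python
import re

text = "the 35 boxes"
bracketed = re.sub(r"([0-9]+)", r"<\1>", text)
print(bracketed)   # the <35> boxes

# Every integer in the text is wrapped, not just the first one.
print(re.sub(r"([0-9]+)", r"<\1>", "3 boxes and 12 crates"))   # <3> boxes and <12> crates
```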

    The parenthesis and number operators can also specify that a certain string or expression must occur twice in the text. For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”, where we want to constrain the two X’s to be the same string. We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator \1, as follows:

    /the (.*)er they were, the \1er they will be/

    Here the \1 will be replaced by whatever string matched the first item in parentheses. So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they will be.

    This use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. If you match two different sets of parentheses, \2 means whatever matched the second capture group. Thus /the (.*)er they (.*), the \1er we \2/ will match The faster they ran, the faster we ran but not The faster they ran, the faster we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
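Backreferences inside a pattern work the same way in Python. In this sketch the `re.IGNORECASE` flag is an addition of this example, needed because the sample sentences start with a capitalized The while the patterns are written in lower case.

```python
import re

one_group = re.compile(r"the (.*)er they were, the \1er they will be", re.IGNORECASE)
print(one_group.search("The bigger they were, the bigger they will be") is not None)  # True
print(one_group.search("The bigger they were, the faster they will be") is not None)  # False

two_groups = re.compile(r"the (.*)er they (.*), the \1er we \2", re.IGNORECASE)
m = two_groups.search("The faster they ran, the faster we ran")
print(m.group(1), m.group(2))                                                         # fast ran
print(two_groups.search("The faster they ran, the faster we ate") is not None)        # False
```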

    Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).
    /(?:some|a few) (people|cats) like some \1/ will match some cats like some cats but not some cats like some a few, because \1 refers to (people|cats), the only capturing group.
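A brief Python sketch showing that a non-capturing group does not occupy a register, so `\1` refers to the (people|cats) group:

```python
import re

pattern = re.compile(r"(?:some|a few) (people|cats) like some \1")

print(pattern.groups)                                             # 1 (only one capturing group)
m = pattern.search("some cats like some cats")
print(m.group(1))                                                 # cats
print(pattern.search("a few people like some people").group(1))   # people
print(pattern.search("some cats like some people") is None)       # True
print(pattern.search("some cats like some a few") is None)        # True
```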

    Two Instructive Examples

    A Simple Example

    Suppose we wanted to write a RE to find cases of the English article the. A simple (but incorrect) pattern might be:

    /the/

    One problem is that this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The). This might lead us to the following pattern:

    /[tT]he/

    But we will still incorrectly return texts with the embedded in other words (e.g., other or theology). So we need to specify that we want instances with a word boundary on both sides:

    /\b[tT]he\b/

    Suppose we wanted to do this without the use of /\b/. We might want this since /\b/ won’t treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underlines or numbers nearby (the_ or the25). We need to specify that we want instances in which there are no alphabetic letters on either side of the the:

    /[^a-zA-Z][tT]he[^a-zA-Z]/

    But there is still one more problem with this pattern: it won’t find the word the
    when it begins a line. This is because the regular expression [ˆa-zA-Z], which
    we used to avoid embedded instances of the, implies that there must be some single
    (although non-alphabetic) character before the the. We can avoid this by specifying
    that before the the we require either the beginning-of-line or a non-alphabetic
    character, and the same at the end of the line:

    /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
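The final pattern can be compared with the word-boundary version on a handful of strings. This is a sketch in Python; the function names `has_the` and `has_the_wordboundary` and the test strings are made up for the illustration.

```python
import re

FINAL = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")
WORD_BOUNDARY = re.compile(r"\b[tT]he\b")

def has_the(text):
    """True if text contains the article according to the final pattern."""
    return FINAL.search(text) is not None

def has_the_wordboundary(text):
    """True if text contains the article according to the \\b version."""
    return WORD_BOUNDARY.search(text) is not None

for text in ["the cat", "The", "other", "theology", "the25", "the_"]:
    print(f"{text!r:12} final={has_the(text)!s:6} word_boundary={has_the_wordboundary(text)}")
```

Both versions reject other and theology, but only the final pattern accepts the25 and the_, since digits and underscores count as word characters for \b.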

    The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The. Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:

    • Increasing precision (minimizing false positives)
    • Increasing recall (minimizing false negatives)
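To make the two notions concrete, the sketch below scores the first and the last pattern from this example on a tiny, made-up labeled set. It uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); the data and the function name `precision_recall` are assumptions of this illustration, not part of the original text.

```python
import re

# Hypothetical labeled data: (text, does it really contain the article "the"?)
examples = [
    ("the cat", True),
    ("The dog", True),
    ("other", False),
    ("theology", False),
    ("no article", False),
]

def precision_recall(pattern, labeled):
    """Score a regex as a yes/no detector against gold labels."""
    tp = fp = fn = 0
    for text, gold in labeled:
        predicted = re.search(pattern, text) is not None
        if predicted and gold:
            tp += 1          # correctly found
        elif predicted and not gold:
            fp += 1          # false positive
        elif gold:
            fn += 1          # false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

naive = precision_recall(r"the", examples)
final = precision_recall(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)", examples)
print(naive)   # precision 1/3, recall 1/2
print(final)   # (1.0, 1.0)
```

On this toy set the naive /the/ wrongly fires on other and theology (hurting precision) and misses The (hurting recall), while the final pattern gets all five examples right.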

    A More Complex Example

    Let’s try out a more significant example of the power of REs. Suppose we want to build an application to help a user buy a computer on the Web. The user might want “any machine with more than 6 GHz and 500 GB of disk space for less than $1000”. To do this kind of retrieval, we first need to be able to look for expressions like 6 GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some simple regular expressions for this task.

    First, let’s complete our regular expression for prices. Here’s a regular expression for a dollar sign followed by a string of digits:

    /$[0-9]+/

    Note that the $ character has a different function here than the end-of-line function we discussed earlier. Regular expression parsers are in fact smart enough to realize that $ here doesn’t mean end-of-line. (As a thought experiment, think about how regex parsers might figure out the function of $ from the context.)

    Now we just need to deal with fractions of dollars. We’ll add a decimal point and two digits afterwards:

    /$[0-9]+\.[0-9][0-9]/

    This pattern only allows $199.99 but not $199. We need to make the cents optional and to make sure we’re at a word boundary:

    /$[0-9]+(\.[0-9][0-9])?\b/
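A Python version of the price pattern, as a sketch. One difference from the notation in the text is an assumption forced by the tool: Python's `re` always treats an unescaped `$` as the end-of-string anchor, so the dollar sign is escaped here. The names `PRICE` and `find_prices` are made up for the example.

```python
import re

# \$ is a literal dollar sign; the cents are optional; \b ends the number cleanly.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")

def find_prices(text):
    """Return every price-like substring in text."""
    return [m.group(0) for m in PRICE.finditer(text)]

print(find_prices("It costs $199.99 or $199 on sale, less than $1000."))
# ['$199.99', '$199', '$1000']
```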

    How about specifications for processor speed? Here’s a pattern for that:

    /\b[0-9]+ *(GHz|[Gg]igahertz)\b/

    Note that we use / */ to mean “zero or more spaces” since there might always be extra spaces lying around. For disk space we also need to allow for optional fractions again (5.5 GB); note the use of ? for making the final s optional:

    /\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
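The processor-speed and disk-space patterns can be run over a sample product description. This is a sketch; the description string and the names `SPEED`, `DISK`, and `find_all` are invented for the illustration.

```python
import re

SPEED = re.compile(r"\b[0-9]+ *(GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

def find_all(pattern, text):
    """Return the full text of every match of a compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]

ad = "A 6 GHz machine with 500 GB of disk and 5.5 gigabytes of RAM"
print(find_all(SPEED, ad))   # ['6 GHz']
print(find_all(DISK, ad))    # ['500 GB', '5.5 gigabytes']
```

`finditer` is used instead of `findall` because, with capture groups in the pattern, `findall` would return the group contents and not the whole match.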


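    Reference Table of Common Regular Expressions

    1. Match every character that is not a Chinese character (so that what remains is approximately all Chinese): [^\u4e00-\u9fa5]|[$a-z{}+\-*()()&/, .:]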

    2. next
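Entry 1 of the list above can be used to strip everything except Chinese characters from a string. The sketch below uses only the first alternative of that pattern and assumes, as the entry does, that the range U+4E00 to U+9FA5 is an adequate approximation of "Chinese characters"; `keep_chinese` is a made-up helper name.

```python
import re

# Anything outside the range U+4E00..U+9FA5 counts as "not a Chinese character".
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")

def keep_chinese(text):
    """Delete every character outside the U+4E00..U+9FA5 range."""
    return NON_CHINESE.sub("", text)

print(keep_chinese("正则表达式 regex 2019!"))   # 正则表达式
```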


    References:
    [1] Speech and Language Processing
    [2] Thinking in Java

  • Original article: https://www.cnblogs.com/cheaptalk/p/12369662.html