
Regular Expressions

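Note: every regular expression in this article is delimited by slashes; the slashes on either side are not part of the regular expression. Also: this article was written on February 20, 2019 (now: 2020).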

Basic Regular Expression Patterns

Simple Matching

RE	Match	Example Patterns
/woodchucks/	woodchucks	“interesting links to woodchucks and lemurs”
/a/	a	“Mary Ann stopped by Mona’s”
/!/	!	“You’ve left the burglar behind again!” said Nori

Bracket Matching

The brackets [] specify a disjunction of characters: the pattern matches any one of the characters listed inside them.

RE	Match	Example Patterns
/[wW]oodchuck/	Woodchuck or woodchuck	“Woodchuck”
/[abc]/	‘a’, ‘b’, or ‘c’	“In uomini, in soldati”
/[1234567890]/	any digit	“plenty of 7 to 5”

Range Matching

The brackets [] plus the dash - specify a range: the pattern matches any one character in that range.

RE	Match	Example Patterns
/[A-Z]/	an upper case letter	“we should call it ‘Drenched Blossoms’ ”
/[a-z]/	a lower case letter	“my beans were impatient to be hoed!”
/[0-9]/	a single digit	“Chapter 1: Down the Rabbit Hole”
/[b-g]/	b, c, d, e, f, or g
/[2-5]/	2, 3, 4, or 5
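
The bracket and range patterns above can be tried out directly. The following is a minimal sketch using Python's standard re module; the helper name first_match is introduced here for illustration and does not come from the original text.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# (pattern, sample text) pairs taken from the tables above.
samples = [
    (r"woodchucks", "interesting links to woodchucks and lemurs"),
    (r"[wW]oodchuck", "Woodchuck"),
    (r"[abc]", "In uomini, in soldati"),
    (r"[A-Z]", "we should call it 'Drenched Blossoms'"),
    (r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
    (r"[2-5]", "plenty of 7 to 5"),
]

for pattern, text in samples:
    print(pattern, "->", first_match(pattern, text))
```

re.search scans left to right and stops at the first position where the pattern matches, which is why the tables speak of the first match in each sample.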

Negation with the Caret

RE	Match	Example Patterns
/[^A-Z]/	not an upper case letter	“Oyfn pripetchik”
/[^Ss]/	neither ‘S’ nor ‘s’	“I have no exquisite reason for’t”
/[^.]/	not a period	“our resident Djinn”
/[e^]/	either ‘e’ or ‘^’	“look up ^ now”
/a^b/	the pattern ‘a^b’	“look up a^b now”
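
A short sketch of the caret cases, assuming Python's re module (first_match is a helper name made up for this example). One difference from the table is worth noting: Python treats a caret outside brackets as an anchor, so the literal pattern ‘a^b’ has to be written with an escaped caret there.

```python
import re

def first_match(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Caret as the first character inside brackets: negation.
print(first_match(r"[^A-Z]", "Oyfn pripetchik"))
print(first_match(r"[^Ss]", "I have no exquisite reason for't"))
print(first_match(r"[^.]", "our resident Djinn"))

# Caret anywhere else inside brackets: a literal caret.
print(re.findall(r"[e^]", "look up ^ now"))

# Outside brackets, Python's re reads ^ as a start anchor, so it is escaped
# here to match a literal caret; the unescaped form finds nothing.
print(first_match(r"a\^b", "look up a^b now"))
print(first_match(r"a^b", "look up a^b now"))
```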

Optional Matching

RE	Match	Example Patterns
/woodchucks?/	woodchuck or woodchucks	“woodchuck”
/colou?r/	color or colour	“colour”
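
The question mark makes only the character immediately before it optional. A minimal sketch in Python (matching_words is a hypothetical helper; re.fullmatch is used so that the whole word must fit the pattern):

```python
import re

def matching_words(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching_words(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"]))
print(matching_words(r"colou?r", ["color", "colour", "colouur", "colr"]))
```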

Matching Multiple Occurrences

The Kleene * means “zero or more occurrences of the immediately previous character or regular expression”.
The Kleene + means “one or more occurrences of the previous character or regular expression”.
The wildcard . matches any single character (except a carriage return).

RE	Match	Example Patterns
/a*/	zero or more a’s	a, aaaaaa, or “Minor” (Minor has zero a’s)
/aa*/	one or more a’s, i.e., one a followed by zero or more a’s
/[ab]*/	zero or more a’s or b’s	aaaa, ababab, or bbbb
/[0-9][0-9]*/	a sequence of one or more digits
/[0-9]+/	a sequence of one or more digits
/baaa*!/	baa!, baaa!, baaaa!, baaaaa!, ...
/baa+!/	baa!, baaa!, baaaa!, baaaaa!, ...
/beg.n/	any character between beg and n	begin, beg’n, begun

The wildcard is often used together with the Kleene star to mean “any string of characters”. For example, suppose we want to find any line in which a particular word, for example, aardvark, appears twice. We can specify this with the regular expression /aardvark.*aardvark/.
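
The counters and the wildcard can be checked with a few lines of Python. This is only a sketch; full_matches and mentions_twice are names invented for the example.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

sheep = ["ba!", "baa!", "baaa!", "baaaaa!"]
print(full_matches(r"baaa*!", sheep))   # two a's, then zero or more a's
print(full_matches(r"baa+!", sheep))    # the same language written with +
print(full_matches(r"beg.n", ["begin", "beg'n", "begun", "begn"]))
print(re.findall(r"[0-9]+", "plenty of 7 to 5, or 42"))

def mentions_twice(word, line):
    """True if word occurs at least twice on the line (word.*word)."""
    w = re.escape(word)  # treat the word literally
    return re.search(w + ".*" + w, line) is not None

print(mentions_twice("aardvark", "the aardvark saw another aardvark"))
print(mentions_twice("aardvark", "only one aardvark here"))
```

Note that /a*/ succeeds on “Minor” by matching the empty string, which is what “zero a’s” means in practice.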

Anchors

^ matches the start of a line
$ matches the end of a line
\b matches a word boundary
\B matches a non-boundary

RE	Match	Example Patterns
/^The/	the word The only at the start of a line
/ $/	a space at the end of a line
/^The dog\.$/	a line that contains only the phrase The dog. (We have to use the backslash here since we want the . to mean “period” and not the wildcard.)
/\bthe\b/	the word the but not the word other
/\b99\b/	matches 99 in "There are 99 bottles" and in "$99", but not in "There are 299 bottles"

Thus, the caret ^ has three uses:

to match the start of a line;
to indicate a negation inside of square brackets;
to mean a caret. (What are the contexts that allow grep or Python to know which function a given caret is supposed to have?)
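
The anchors can be exercised as follows. This is a sketch assuming Python's re module, where ^ and $ anchor to the whole string unless the re.MULTILINE flag is passed; has_99 is a helper name made up here.

```python
import re

# ^ and $ together: the line must consist of exactly this phrase.
lines = ["The dog.", "The dogs", "See The dog."]
print([ln for ln in lines if re.search(r"^The dog\.$", ln)])

# With re.MULTILINE, ^ also matches right after each newline.
text = "The cat\nThe dog"
print(re.findall(r"^The", text))
print(re.findall(r"^The", text, flags=re.MULTILINE))

# \b requires a boundary between a word character and a non-word character.
print(re.findall(r"\bthe\b", "the other one, then the end"))

def has_99(s):
    """True if 99 appears as a standalone number in s."""
    return re.search(r"\b99\b", s) is not None

for s in ["There are 99 bottles", "There are 299 bottles", "$99"]:
    print(s, "->", has_99(s))
```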

Disjunction, Grouping, and Operator Precedence

The disjunction operator | and the grouping parentheses ()

RE	Match	Example Patterns
/cat|dog/	either the string cat or the string dog
/gupp(y|ies)/	guppy or guppies

An example (taken verbatim from Speech and Language Processing):
Perhaps we have a line that has column labels of the form
Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not match any number of columns; instead, it will match a single column followed by any number of spaces! The star here applies only to the space that precedes it, not to the whole sequence. With the parentheses, we could write the expression /(Column [0-9]+ *)*/ to match the word Column, followed by a number and optional spaces, the whole pattern repeated any number of times.
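
The difference between the two Column patterns shows up immediately when both are run on the same line. A minimal sketch with Python's re.match, which anchors the match at the start of the string:

```python
import re

line = "Column 1 Column 2 Column 3"

# Without parentheses the star applies only to the preceding space.
single = re.match(r"Column [0-9]+ *", line).group(0)

# With parentheses the star repeats the whole "Column N" unit.
repeated = re.match(r"(Column [0-9]+ *)*", line).group(0)

print(repr(single))
print(repr(repeated))
```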

Operator Precedence

The following table gives the order of RE operator precedence, from highest precedence to lowest precedence.


Parenthesis	()
Counters	* + ? {}
Sequences and anchors	the ^my end$
Disjunction	|
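
Two consequences of this ordering are easy to verify. The sketch below uses the patterns /the*/ and /the|any/ as illustrations; the helper name matches is an assumption of this example.

```python
import re

def matches(pattern, s):
    """True if the pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None

# Counters bind tighter than sequences: the* is "th" followed by zero or
# more e's, not zero or more copies of "the".
print(matches(r"the*", "theee"), matches(r"the*", "thethe"))

# Parentheses have the highest precedence, so (the)* repeats the whole word.
print(matches(r"(the)*", "thethe"))

# Sequences bind tighter than disjunction: the|any is "the" or "any",
# not "th" + ("e" or "a") + "ny".
print(matches(r"the|any", "the"), matches(r"the|any", "any"), matches(r"the|any", "thany"))
print(matches(r"th(e|a)ny", "thany"))
```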

Greedy and Non-Greedy Matching (to be continued)

RE	Expansion	Match	First Matches
\d	[0-9]	any digit	Party of 5
\D	[^0-9]	any non-digit	Blue moon
\w	[a-zA-Z0-9_]	any alphanumeric/underscore	Daiyu
\W	[^\w]	a non-alphanumeric	!!!!
\s	[ \r\t\n\f]	whitespace (space, tab)
\S	[^\s]	Non-whitespace	in Concord
/a\.{24}z/		a followed by 24 dots followed by z

RE	Match
*	zero or more occurrences of the previous char or expression
+	one or more occurrences of the previous char or expression
?	exactly zero or one occurrence of the previous char or expression
{n}	n occurrences of the previous char or expression
{n,m}	from n to m occurrences of the previous char or expression
{n,}	at least n occurrences of the previous char or expression
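
The aliases and counters above, sketched in Python. One caveat: for str patterns Python's \d and \w are Unicode-aware by default, so they accept more than [0-9] and [a-zA-Z0-9_]; passing re.ASCII restricts them to the expansions shown in the table. The helper names are made up for this example.

```python
import re

def first_match(pattern, text, flags=re.ASCII):
    """Return the first substring matched by pattern, or None."""
    m = re.search(pattern, text, flags)
    return m.group(0) if m else None

print(first_match(r"\d", "Party of 5"))
print(first_match(r"\D", "Blue moon"))
print(first_match(r"\w", "Daiyu"))
print(first_match(r"\W", "!!!!"))
print(first_match(r"\S", "in Concord"))

def full_matches(pattern, candidates):
    """Return the candidates matched from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# {n}: exactly n; {n,m}: from n to m; {n,}: at least n.
dots = ["a" + "." * 23 + "z", "a" + "." * 24 + "z"]
print([len(s) for s in full_matches(r"a\.{24}z", dots)])
print(full_matches(r"[0-9]{2,4}", ["1", "12", "1234", "12345"]))
print(full_matches(r"[0-9]{3,}", ["12", "123", "12345"]))
```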

RE	Match	First Patterns Matched
\*	an asterisk “*”	“KAPLA*N”
\.	a period “.”	“Dr. Livingston, I presume”
\?	a question mark	“Why don’t they come and lend a hand?”
\n	a newline
\t	a tab
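
A small sketch of escaping special characters in Python. Raw strings (r"...") keep the backslash intact for the regex engine, and re.escape builds the escaped form of an arbitrary literal string; the sample sentence with the price is invented for the example.

```python
import re

# A backslash turns a special character into a literal one.
print(re.search(r"\*", "KAPLA*N").group(0))
print(re.search(r"\.", "Dr. Livingston, I presume").start())
print(re.search(r"\?", "Why don't they come and lend a hand?").start())

# \n and \t match a newline and a tab.
print(re.split(r"[\n\t]", "one\ttwo\nthree"))

# re.escape escapes every special character in a literal string.
literal = "$9.99?"
print(re.search(re.escape(literal), "is it $9.99?") is not None)
```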

Regular Expression Substitution

(Taken verbatim from Speech and Language Processing)
It is often useful to be able to refer to a particular subpart of the string matching the first pattern. For example, suppose we wanted to put angle brackets around all integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d like a way to refer to the integer we’ve found so that we can easily add the brackets. To do this, we put parentheses ( and ) around the first pattern and use the number operator \1 in the second pattern to refer back. Here’s how it looks: s/([0-9]+)/<\1>/

The parenthesis and number operators can also specify that a certain string or expression must occur twice in the text. For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”, where we want to constrain the two X’s to be the same string. We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator \1, as follows: /the (.*)er they were, the \1er they will be/

Here the \1 will be replaced by whatever string matched the first item in parentheses. So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they will be.

This use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. If you match two different sets of parentheses, \2 means whatever matched the second capture group. Thus /the (.*)er they (.*), the \1er we \2/ will match The faster they ran, the faster we ran but not The faster they ran, the faster we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.

Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/ will match some cats like some cats but not some cats like some a few.
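
The substitution and backreference examples translate to Python roughly as follows. This is a sketch, not the book's code: s/.../.../ becomes re.sub, and re.IGNORECASE is an assumption added here so that the capitalized The at the start of the sample sentences still matches the lowercase the in the patterns.

```python
import re

# s/([0-9]+)/<\1>/ : wrap every integer in angle brackets.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))

# \1 must repeat exactly what the first group captured.
same_x = re.compile(r"the (.*)er they were, the \1er they will be", re.IGNORECASE)
print(bool(same_x.search("The bigger they were, the bigger they will be")))
print(bool(same_x.search("The bigger they were, the faster they will be")))

# Two capture groups, referenced as \1 and \2.
two = re.compile(r"the (.*)er they (.*), the \1er we \2", re.IGNORECASE)
print(bool(two.search("The faster they ran, the faster we ran")))
print(bool(two.search("The faster they ran, the faster we ate")))

# (?:...) groups without capturing, so \1 refers to (people|cats).
noncap = re.compile(r"(?:some|a few) (people|cats) like some \1")
print(bool(noncap.search("some cats like some cats")))
print(bool(noncap.search("some cats like some a few")))
```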

Two Instructive Examples

A Simple Example

Suppose we wanted to write a RE to find cases of the English article the. A simple (but incorrect) pattern might be:

/the/

One problem is that this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The). This might lead us to the following pattern:

/[tT]he/

But we will still incorrectly return texts with the embedded in other words (e.g., other or theology). So we need to specify that we want instances with a word boundary on both sides:

/\b[tT]he\b/

Suppose we wanted to do this without the use of /\b/. We might want this since /\b/ won’t treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underlines or numbers nearby (the_ or the25). We need to specify that we want instances in which there are no alphabetic letters on either side of the the:

/[^a-zA-Z][tT]he[^a-zA-Z]/

But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [^a-zA-Z], which
we used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specifying
that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:

/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/

The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The. Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:

Increasing precision (minimizing false positives)
Increasing recall (minimizing false negatives)
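
The refinement steps above can be scored against a handful of labeled strings. The sketch below assumes the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); the labeled examples and the function name precision_recall are invented for illustration.

```python
import re

# Hypothetical hand-labeled examples: True means the text really contains
# the article "the" as a separate word.
LABELED = {
    "The cat": True,
    "see the_ here": True,
    "the25": True,
    "other": False,
    "theology": False,
    "over there": False,
}

def precision_recall(pattern, labeled):
    """Score a pattern against labeled texts; returns (precision, recall)."""
    tp = fp = fn = 0
    for text, has_the in labeled.items():
        predicted = re.search(pattern, text) is not None
        if predicted and has_the:
            tp += 1   # correctly found
        elif predicted:
            fp += 1   # false positive: matched inside another word
        elif has_the:
            fn += 1   # false negative: a real "the" was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for pattern in [r"the", r"[tT]he", r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"]:
    print(pattern, precision_recall(pattern, LABELED))
```

On this toy set, allowing the capital T raises recall, while the non-alphabetic context on both sides removes the false positives.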

A More Complex Example

Let’s try out a more significant example of the power of REs. Suppose we want to build an application to help a user buy a computer on the Web. The user might want “any machine with more than 6 GHz and 500 GB of disk space for less than $1000”. To do this kind of retrieval, we first need to be able to look for expressions like 6 GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some simple regular expressions for this task.

First, let’s complete our regular expression for prices. Here’s a regular expression for a dollar sign followed by a string of digits:

/$[0-9]+/

Note that the $ character has a different function here than the end-of-line function we discussed earlier. Regular expression parsers are in fact smart enough to realize that $ here doesn’t mean end-of-line. (As a thought experiment, think about how regex parsers might figure out the function of $ from the context.)

Now we just need to deal with fractions of dollars. We’ll add a decimal point and two digits afterwards:

/$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the cents optional and to make sure we’re at a word boundary:

/$[0-9]+(\.[0-9][0-9])?\b/

How about specifications for processor speed? Here’s a pattern for that:

/[0-9]+ *(GHz|[Gg]igahertz)/

Note that we use / */ to mean “zero or more spaces” since there might always be extra spaces lying around. We also need to allow for optional fractions again (5.5 GB); note the use of ? for making the final s optional:

/[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)/
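
Put together, the three patterns can pull prices, clock speeds, and disk sizes out of a product description. This is a sketch under a few assumptions: the ad text is invented, find_all is a hypothetical helper, and the dollar sign is escaped because Python's re always treats an unescaped $ as an end anchor.

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"[0-9]+ *(GHz|[Gg]igahertz)")
DISK = re.compile(r"[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)")

def find_all(pattern, text):
    """Return every full match of a compiled pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

ad = "A 6 GHz machine with 500 GB of disk for $999.99, or 5.5 gigabytes for $199"

print(find_all(PRICE, ad))
print(find_all(SPEED, ad))
print(find_all(DISK, ad))
```

finditer is used instead of findall because findall would return only the contents of the capture groups rather than the whole matched expression.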

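Common Regular Expression Reference

Match characters other than Chinese characters (so that what remains is approximately all Chinese): [^\u4e00-\u9fa5]|[$a-z{}+\-*()（）&/，．：]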
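
One way to use this pattern is to delete everything it matches, leaving only the Chinese characters. The sketch below uses just the first alternative, since the negated range already covers the punctuation in the second; keep_chinese is a name made up for the example, and the range \u4e00-\u9fa5 covers only the common CJK ideographs.

```python
import re

# Anything outside the range U+4E00..U+9FA5 counts as "not Chinese" here.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")

def keep_chinese(text):
    """Remove every character outside the common CJK ideograph range."""
    return NON_CHINESE.sub("", text)

print(keep_chinese("正则regex表达式123!"))
```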

References:
[1] Speech and Language Processing
[2] Thinking in Java


Original article: https://www.cnblogs.com/cheaptalk/p/12369662.html