  • Regular Expressions in Java

    For my Data Mining project, I need regular expressions to process a large amount of HTML text.

    I have used regular expressions on Linux (with grep) before and found them an efficient way to process text, especially in large volumes.

    Introduction

    Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit, or manipulate text and data. Creating them requires learning a specific syntax, one that goes beyond the normal syntax of the Java programming language. Regular expressions vary in complexity, but once you understand the basics of how they're constructed, you'll be able to decipher (or create) most of the regular expressions you come across.
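
    To make "search, edit, or manipulate" concrete, here is a minimal sketch that uses only the regex-aware methods of String. The class name RegexIntro and its helper methods are my own, chosen for illustration. Note that inside a Java string literal every backslash of the regex has to be doubled.

```java
import java.util.Arrays;

class RegexIntro {
    // Search: does the whole string look like a date such as 2012-07-03?
    static boolean looksLikeDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // Edit: collapse every run of whitespace into a single space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Manipulate: split on commas, ignoring spaces around each comma.
    static String[] splitOnCommas(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2012-07-03"));                     // true
        System.out.println(collapseWhitespace("a  \t b\n\nc"));              // a b c
        System.out.println(Arrays.toString(splitOnCommas("html , css,js"))); // [html, css, js]
    }
}
```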

    The java.util.regex package

    It primarily consists of three classes:

    Pattern: a compiled representation of a regular expression.

    Matcher: interprets the Pattern and performs match operations against an input string.

    PatternSyntaxException: indicates a syntax error in a regular expression pattern.
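
    The three classes fit together roughly as follows. This is only a sketch: the class name RegexClassesDemo and the helpers firstMatch and isValidRegex are hypothetical names I made up, while Pattern.compile, Pattern.matcher, Matcher.find and Matcher.group are the standard calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexClassesDemo {
    // Compile the regex, run it over the input, and return the first match (or null).
    static String firstMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // compiled representation
        Matcher matcher = pattern.matcher(input);   // match engine bound to this input
        return matcher.find() ? matcher.group() : null;
    }

    // Pattern.compile throws PatternSyntaxException for a malformed regex.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<[a-z]+>", "text <b>bold</b>")); // <b>
        System.out.println(isValidRegex("[a-z]+"));                     // true
        System.out.println(isValidRegex("[a-z"));                       // false (unclosed class)
    }
}
```

    Since PatternSyntaxException is unchecked, the compiler will not force you to catch it; it is worth catching explicitly whenever the regex comes from user input.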

    A simple regular expression program

    package regexTestHarness;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.regex.PatternSyntaxException;

    public class RegexTestHarness {
        public static void main(String[] args) {
            // Wrap System.in once; a second BufferedReader could miss input
            // that the first one has already buffered.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(System.in))) {
                // printf (unlike println) expands %n to the platform line separator.
                System.out.printf("%nEnter your regex: ");
                String regex = br.readLine();

                System.out.printf("%nEnter your text: ");
                String text = br.readLine();

                // readLine() returns null at end of input.
                if (regex == null || text == null) {
                    return;
                }

                Pattern pattern = Pattern.compile(regex);
                Matcher matcher = pattern.matcher(text);

                boolean found = false;
                // Each call to find() moves on to the next match in the text.
                while (matcher.find()) {
                    System.out.printf("I found the text \"%s\" starting at index %d and ending at index %d.%n",
                            matcher.group(), matcher.start(), matcher.end());
                    found = true;
                }
                if (!found) {
                    System.out.println("No match found.");
                }
            } catch (PatternSyntaxException e) {
                System.err.println("Invalid regex: " + e.getDescription());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

     

    Character classes and Predefined classes

    Construct         Description
    [abc]             a, b, or c (simple class)
    [^abc]            Any character except a, b, or c (negation)
    [a-zA-Z]          a through z, or A through Z, inclusive (range)
    [a-d[m-p]]        a through d, or m through p: [a-dm-p] (union)
    [a-z&&[def]]      d, e, or f (intersection)
    [a-z&&[^bc]]      a through z, except for b and c: [ad-z] (subtraction)
    [a-z&&[^m-p]]     a through z, and not m through p: [a-lq-z] (subtraction)
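
    A quick way to check the character classes above is to keep only the characters of a string that a class accepts. The sketch below does that; CharClassDemo and matchingChars are hypothetical names of mine, not part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns the characters of input that the given character class matches, in order.
    static String matchingChars(String charClass, String input) {
        Matcher m = Pattern.compile(charClass).matcher(input);
        StringBuilder kept = new StringBuilder();
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefmnopxyz";
        System.out.println(matchingChars("[abc]", input));         // abc
        System.out.println(matchingChars("[^abc]", input));        // defmnopxyz
        System.out.println(matchingChars("[a-d[m-p]]", input));    // abcdmnop
        System.out.println(matchingChars("[a-z&&[def]]", input));  // def
        System.out.println(matchingChars("[a-z&&[^bc]]", input));  // adefmnopxyz
        System.out.println(matchingChars("[a-z&&[^m-p]]", input)); // abcdefxyz
    }
}
```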

    Construct         Description
    .                 Any character (may or may not match line terminators)
    \d                A digit: [0-9]
    \D                A non-digit: [^0-9]
    \s                A whitespace character: [ \t\n\x0B\f\r]
    \S                A non-whitespace character: [^\s]
    \w                A word character: [a-zA-Z_0-9]
    \W                A non-word character: [^\w]
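
    The predefined classes can be tried the same way. In this sketch (PredefinedClassDemo, findAll and dotMatchesNewline are names I made up) the last two lines show what "may or may not match line terminators" means for the dot: by default it does not match \n, and with the Pattern.DOTALL flag it does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Does "a.b" match the three characters a, newline, b under the given flags?
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        String input = "id_42 = 7\t(ok)";
        System.out.println(findAll("\\d+", input)); // [42, 7]
        System.out.println(findAll("\\w+", input)); // [id_42, 7, ok]
        System.out.println(findAll("\\S+", input)); // [id_42, =, 7, (ok)]
        System.out.println(dotMatchesNewline(0));              // false
        System.out.println(dotMatchesNewline(Pattern.DOTALL)); // true
    }
}
```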

     

    Quantifiers

    Greedy     Reluctant    Possessive    Meaning
    X?         X??          X?+           X, once or not at all
    X*         X*?          X*+           X, zero or more times
    X+         X+?          X++           X, one or more times
    X{n}       X{n}?        X{n}+         X, exactly n times
    X{n,}      X{n,}?       X{n,}+        X, at least n times
    X{n,m}     X{n,m}?      X{n,m}+       X, at least n but not more than m times
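
    The three columns match the same text in principle but differ in how much they grab and whether they give anything back. A small sketch (QuantifierDemo and findAll are hypothetical names of mine) on the input xfooxxxxxxfoo: the greedy .* takes everything and backtracks just enough for foo to match, the reluctant .*? stops at the first foo it can reach, and the possessive .*+ takes everything and never backtracks, so foo can no longer match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(findAll(".*foo", input));     // greedy:     [xfooxxxxxxfoo]
        System.out.println(findAll(".*?foo", input));    // reluctant:  [xfoo, xxxxxxfoo]
        System.out.println(findAll(".*+foo", input));    // possessive: []
        System.out.println(findAll("a{2,3}", "aaaaa"));  // bounded greedy: [aaa, aa]
    }
}
```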

    Chinese Characters

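    The following character class matches a single Chinese character in the basic CJK Unified Ideographs range (U+4E00 to U+9FA5):

    [\u4e00-\u9fa5]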
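
    As a sketch of how this class might be used on mixed text, the code below pulls out each run of Chinese characters. ChineseTextDemo, chineseRuns and containsHan are names I invented for the example, and the sample string is written with Unicode escapes (it spells two Chinese words, U+6B63 U+5219 and U+6D4B U+8BD5) so that the source file stays ASCII-only. The range above does not cover ideographs added to Unicode later; as an alternative, Java 7 and newer also accept the script class \p{IsHan}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChineseTextDemo {
    // One or more characters in the range U+4E00 to U+9FA5.
    private static final Pattern CJK_RUN = Pattern.compile("[\\u4e00-\\u9fa5]+");

    // Any single character whose Unicode script is Han (broader than the range above).
    private static final Pattern HAN_CHAR = Pattern.compile("\\p{IsHan}");

    // Returns each contiguous run of Chinese characters found in input.
    static List<String> chineseRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = CJK_RUN.matcher(input);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    static boolean containsHan(String input) {
        return HAN_CHAR.matcher(input).find();
    }

    public static void main(String[] args) {
        String mixed = "Java \u6b63\u5219 regex \u6d4b\u8bd5";
        System.out.println(chineseRuns(mixed));          // two runs of two characters each
        System.out.println(containsHan(mixed));          // true
        System.out.println(containsHan("plain ASCII"));  // false
    }
}
```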

  • Original article: https://www.cnblogs.com/johnpher/p/2573865.html