
Dynamic crawling in Java: using jsoup and regular expressions

1. jsoup is an HTML parser for Java. It can parse HTML directly from a URL or from a string of HTML text. Official site: http://jsoup.org/

2. Parsing a URL

Document doc = Jsoup
        .connect(url)
        // Set the User-Agent header so the request looks like a browser
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0")
        .timeout(5000) // timeout in milliseconds
        .get();


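// Select elements by class name, then narrow down by tag name
Elements elements = doc.getElementsByClass("desc");
Elements subelements = elements.get(0).getElementsByTag("li");
// eachDayElement and firstElement are Element objects taken from earlier selections (not shown)
Elements dayElements = eachDayElement.getElementsByTag("tr");
Elements firstSubElements = firstElement.getElementsByTag("td");
// Visible text of the first matching element
String text = elements.get(0).text();

// Class-level fields. Backslashes must be doubled inside a Java string literal.
// The pattern matches "published by the Central Meteorological Observatory at HH:MM".
private static String regEx_publishDate = "由中央气象台\\s*(\\d+):(\\d+)\\s*发布的";
private static Pattern pattern_publishDate = Pattern.compile(regEx_publishDate);

// Inside a method: pull the hour and minute out of the text
Matcher matcher = pattern_publishDate.matcher(text);
if (matcher.find()) {
    int hour = Integer.parseInt(matcher.group(1));
    int minute = Integer.parseInt(matcher.group(2));
}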
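The regex step above can be tried on its own, without jsoup or a network connection. Below is a minimal sketch that uses only java.util.regex. The class name PublishTimeDemo, the method parsePublishTime, and the sample sentence are illustrative assumptions and not part of the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PublishTimeDemo {
    // Same pattern as above; each backslash is doubled in Java source.
    private static final Pattern PUBLISH_TIME =
            Pattern.compile("由中央气象台\\s*(\\d+):(\\d+)\\s*发布的");

    /** Returns {hour, minute}, or null when the text contains no publish time. */
    public static int[] parsePublishTime(String text) {
        Matcher matcher = PUBLISH_TIME.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(matcher.group(1)),
            Integer.parseInt(matcher.group(2))
        };
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for elements.get(0).text()
        String text = "由中央气象台 18:00 发布的天气预报";
        int[] time = parsePublishTime(text);
        if (time != null) {
            System.out.println(String.format("%02d:%02d", time[0], time[1]));
        }
    }
}
```

Returning null on a failed match keeps the caller honest: a page whose layout has changed is then detected instead of silently producing a wrong time.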


3. The jsoup jar must be on the classpath.

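4. \s matches any whitespace character; \S matches any non-whitespace character; \d matches a digit; + repeats the preceding item one or more times; * repeats it zero or more times.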
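A quick way to get a feel for these metacharacters is to test them against short strings. The following is a small sketch using java.util.regex; the class RegexBasics and the helper fullMatch are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

public class RegexBasics {
    /** True when the entire input matches the regex. */
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \d with +: one or more digits
        System.out.println(fullMatch("\\d+", "2015"));
        // + requires at least one digit, so the empty string fails
        System.out.println(fullMatch("\\d+", ""));
        // * accepts zero repetitions, so the empty string passes
        System.out.println(fullMatch("\\s*", ""));
        // a space is not a \S character, so "a b" is not all non-whitespace
        System.out.println(fullMatch("\\S+", "a b"));
        // two non-whitespace runs separated by any amount of whitespace
        System.out.println(fullMatch("\\S+\\s+\\S+", "a \t b"));
    }
}
```

Note that Pattern.matches requires the whole input to match, whereas Matcher.find (used in the crawler code) looks for a match anywhere in the text.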

demo:

(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2})发布
(\S+过敏\S+)：\s+(\S+)\s+(\S+)
\s+(感冒\S+)：\s+(\S+)\s+(\S+)
\s*(\S+)\s*
首要污染物：\s*(\S+)\s*
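As a rough illustration of how two of the demo patterns might be applied, the sketch below extracts a publish timestamp and a primary pollutant name from sample strings. The class DemoPatterns, its method names, and the sample inputs are assumptions made for this example; the real page text may be laid out differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DemoPatterns {
    // First demo pattern: "yyyy-MM-dd HH:mm发布" (发布 means "published")
    private static final Pattern PUBLISHED_AT =
            Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})\\s+(\\d{2}):(\\d{2})发布");
    // Last demo pattern: "首要污染物：" means "primary pollutant:" (full-width colon)
    private static final Pattern PRIMARY_POLLUTANT =
            Pattern.compile("首要污染物：\\s*(\\S+)\\s*");

    /** Returns {year, month, day, hour, minute}, or null if no timestamp is found. */
    public static int[] parsePublishedAt(String text) {
        Matcher m = PUBLISHED_AT.matcher(text);
        if (!m.find()) {
            return null;
        }
        int[] parts = new int[5];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = Integer.parseInt(m.group(i + 1)); // groups are 1-based
        }
        return parts;
    }

    /** Returns the pollutant name, or null if the label is not present. */
    public static String parsePrimaryPollutant(String text) {
        Matcher m = PRIMARY_POLLUTANT.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs
        int[] ts = parsePublishedAt("2015-06-30 08:00发布");
        System.out.println(ts == null ? "no timestamp" : ts[0] + "-" + ts[1] + "-" + ts[2]);
        System.out.println(parsePrimaryPollutant("首要污染物： PM2.5 "));
    }
}
```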


Regular expression syntax reference:

https://msdn.microsoft.com/zh-cn/library/ae5bf541%28v=vs.80%29.aspx


Original article: https://www.cnblogs.com/dobestself-994395/p/4610303.html