1. jsoup is an HTML parser for Java. It can parse HTML directly from a URL or from an HTML string. Official site: http://jsoup.org/
2. Parsing a URL
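```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// url is the address of the page to fetch
Document doc = Jsoup
        .connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0") // set the User-Agent
        .timeout(5000) // set the timeout, in milliseconds
        .get();
```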
```java
// Select elements by class name and by tag name.
Elements elements = doc.getElementsByClass("desc");
Elements subelements = elements.get(0).getElementsByTag("li");
// eachDayElement and firstElement are Element objects obtained earlier (not shown).
Elements dayElements = eachDayElement.getElementsByTag("tr");
Elements firstSubElements = firstElement.getElementsByTag("td");
String text = elements.get(0).text();

// Class-level fields: pattern for the publish time (hour:minute) in the page text.
// Backslashes must be doubled inside a Java string literal.
private static String regEx_publishDate = "由中央气象台\\s*(\\d+):(\\d+)\\s*发布的";
private static Pattern pattern_publishDate = Pattern.compile(regEx_publishDate);

// Extract the hour and minute from the text.
Matcher matcher = pattern_publishDate.matcher(text);
if (matcher.find()) {
    int hour = Integer.parseInt(matcher.group(1));
    int minute = Integer.parseInt(matcher.group(2));
}
```
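The publish-time extraction can be tried on its own, without jsoup or a network connection. The following is a minimal sketch using only `java.util.regex`; the class name `PublishTimeDemo`, the method `parsePublishTime`, and the sample text are hypothetical and not taken from a real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PublishTimeDemo {
    // Publish-time pattern; backslashes are doubled because this is a Java string literal.
    private static final Pattern PUBLISH_TIME =
            Pattern.compile("由中央气象台\\s*(\\d+):(\\d+)\\s*发布的");

    /** Returns {hour, minute}, or null when the text contains no publish time. */
    static int[] parsePublishTime(String text) {
        Matcher m = PUBLISH_TIME.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        // Hypothetical sample text for illustration only.
        int[] t = parsePublishTime("由中央气象台 18:05 发布的天气预报");
        System.out.println(String.format("%02d:%02d", t[0], t[1]));
    }
}
```

Returning null for text without a match keeps the "not found" case explicit, so the caller decides what to do when the page layout changes.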
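3. The jsoup JAR must be on the classpath.
4. Regular expression basics: \s matches any whitespace character; \S matches any non-whitespace character; \d matches a digit; + repeats the preceding item one or more times; * repeats it zero or more times.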
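The behaviour of these metacharacters can be checked quickly with `Pattern.matches`, which requires the whole input to match. This is a small sketch; the class name `RegexBasicsDemo` and the helper `fullMatch` are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class RegexBasicsDemo {
    /** True when the entire input matches the regular expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\s+", " \t"));  // true: one or more whitespace characters
        System.out.println(fullMatch("\\S+", "abc"));  // true: one or more non-whitespace characters
        System.out.println(fullMatch("\\S+", "a b"));  // false: the space is whitespace
        System.out.println(fullMatch("\\d+", "2015")); // true: one or more digits
        System.out.println(fullMatch("\\d+", ""));     // false: + needs at least one digit
        System.out.println(fullMatch("\\d*", ""));     // true: * also accepts zero digits
    }
}
```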
Example patterns:
```
(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2})发布
(\S+过敏\S+):\s+(\S+)\s+(\S+)
\s+(感冒\S+):\s+(\S+)\s+(\S+)
\s*(\S+)\s*
首要污染物:\s*(\S+)\s*
```
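To show how the capture groups in such patterns are read back in Java, here is a hedged sketch that applies two of them (the publish-date pattern and the primary-pollutant pattern) to made-up input. The class name `DemoPatterns`, the helper `groups`, and the sample strings are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DemoPatterns {
    // Date and time followed by the word "发布" (published).
    static final Pattern PUBLISHED_AT =
            Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})\\s+(\\d{2}):(\\d{2})发布");
    // The label "首要污染物:" (primary pollutant) followed by its value.
    static final Pattern PRIMARY_POLLUTANT =
            Pattern.compile("首要污染物:\\s*(\\S+)\\s*");

    /** Returns the captured groups of the first match, or an empty array if none. */
    static String[] groups(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i); // group 0 is the whole match, so start at 1
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        System.out.println(String.join(",", groups(PUBLISHED_AT, "2015-04-20 08:00发布")));
        System.out.println(String.join(",", groups(PRIMARY_POLLUTANT, "首要污染物: PM2.5 ")));
    }
}
```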
Regular expression syntax reference:
https://msdn.microsoft.com/zh-cn/library/ae5bf541%28v=vs.80%29.aspx