  • Java Web Crawler

    WikiScraper.java

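    package master.haku.scrape;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/Python");
        }

        // Fetches a Wikipedia article and prints the text of its first paragraph.
        public static void scrapeTopic(String url) {
            String html = getUrl("https://en.wikipedia.org" + url);
            Document doc = Jsoup.parse(html);
            // first() returns null when the selector matches nothing,
            // so check before calling text().
            Element firstParagraph = doc.select("#mw-content-text > p").first();
            if (firstParagraph == null) {
                System.out.println("No matching paragraph found.");
                return;
            }
            System.out.println(firstParagraph.text());
        }

        // Downloads the page at the given URL and returns its body, or "" on failure.
        public static String getUrl(String url) {
            URL urlObj;
            try {
                urlObj = new URL(url);
            } catch (MalformedURLException e) {
                System.out.println("The url was malformed!");
                return "";
            }

            StringBuilder outputText = new StringBuilder();
            try {
                URLConnection urlCon = urlObj.openConnection();
                // try-with-resources closes the reader even if reading fails.
                // The response is decoded as UTF-8 rather than the platform default.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(urlCon.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // readLine() strips the line terminator, so put it back.
                        outputText.append(line).append('\n');
                    }
                }
            } catch (IOException e) {
                System.out.println("There was an error connecting to the URL");
                return "";
            }

            return outputText.toString();
        }
    }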

    Output:

    A python is a constricting snake belonging to the Python (genus), or, more generally, any snake in the family Pythonidae (containing the Python genus).
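
The reading loop inside getUrl can be pulled out into a small helper that accepts any InputStream, which makes it possible to try the loop without a network connection. The sketch below is one way to do that, not the original author's code: the class name `StreamText` and the method `readAll` are hypothetical, and the input is assumed to be UTF-8. Because `BufferedReader.readLine()` strips line terminators, the helper appends `'\n'` after each line it reads.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
public class StreamText {
    // Reads the whole stream as UTF-8 text (an assumption about the input)
    // and returns it with a single '\n' after every line.
    public static String readAll(InputStream stream) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the stream) when done.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stand-in for a downloaded page.
        byte[] page = "<p>first</p>\n<p>second</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page)));
    }
}
```

With a helper like this, getUrl would only need to open the connection and pass `urlCon.getInputStream()` along, and the StringBuilder avoids the repeated string copying that `+=` in a loop causes.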

  • Original article: https://www.cnblogs.com/davidgu/p/4836305.html