  • Java Web Crawler

    WikiScraper.java

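    package master.haku.scrape;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.net.*;
    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/Python");
        }

        // Fetches a Wikipedia article and prints the text of its first paragraph.
        public static void scrapeTopic(String url) {
            String html = getUrl("https://en.wikipedia.org" + url);
            Document doc = Jsoup.parse(html);
            // first() returns null when nothing matches, e.g. after a failed
            // download or if the page layout differs from what the selector expects.
            Element firstParagraph = doc.select("#mw-content-text > p").first();
            if (firstParagraph == null) {
                System.out.println("No matching paragraph found");
                return;
            }
            System.out.println(firstParagraph.text());
        }

        // Downloads the page at the given URL and returns its body, or "" on failure.
        public static String getUrl(String url) {
            URL urlObj;
            try {
                urlObj = new URL(url);
            } catch (MalformedURLException e) {
                System.out.println("The url was malformed!");
                return "";
            }

            StringBuilder outputText = new StringBuilder();

            // try-with-resources closes the reader even if reading fails.
            // UTF-8 is assumed here rather than read from the response headers.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    urlObj.openConnection().getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Keep the line break so words on adjacent lines do not fuse together.
                    outputText.append(line).append('\n');
                }
            } catch (IOException e) {
                System.out.println("There was an error connecting to the URL");
                return "";
            }

            return outputText.toString();
        }
    }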

    Output:

    A python is a constricting snake belonging to the Python (genus), or, more generally, any snake in the family Pythonidae (containing the Python genus).
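
The stream-reading step inside getUrl can be tried on its own, without a network connection. The following is a minimal sketch, not part of the original post: the class name StreamText and the helper readAll are hypothetical, and UTF-8 is assumed as the encoding. It reads any InputStream into a String with a StringBuilder and try-with-resources, keeping the line breaks.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamText {
    // Hypothetical helper: reads the whole stream as UTF-8 text (an assumption),
    // appending a '\n' after every line that readLine() returns.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The reader (and the wrapped stream) is closed automatically,
        // even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for urlCon.getInputStream().
        InputStream in = new ByteArrayInputStream(
                "<p>first</p>\n<p>second</p>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

Appending to a StringBuilder avoids creating a new String on every loop iteration, which matters for large pages. Passing the stream in as a parameter also lets the reading logic be exercised with in-memory data.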

  • Original article: https://www.cnblogs.com/davidgu/p/4836305.html