  • Java Web Crawler

    WikiScraper.java

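    package master.haku.scrape;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/Python");
        }

        // Downloads a Wikipedia article and prints the text of its first paragraph.
        public static void scrapeTopic(String url) {
            String html = getUrl("https://en.wikipedia.org" + url);
            Document doc = Jsoup.parse(html);
            // first() returns null when the selector matches nothing
            // (for example after a failed download or a change in page markup).
            Element firstParagraph = doc.select("#mw-content-text > p").first();
            if (firstParagraph == null) {
                System.out.println("No paragraph matched the selector");
                return;
            }
            System.out.println(firstParagraph.text());
        }

        // Fetches the page at the given URL and returns its body, or "" on failure.
        public static String getUrl(String url) {
            URL urlObj;
            try {
                urlObj = new URL(url);
            } catch (MalformedURLException e) {
                System.out.println("The url was malformed!");
                return "";
            }

            StringBuilder outputText = new StringBuilder();
            // try-with-resources closes the reader even if reading fails;
            // the charset is stated explicitly instead of relying on the platform default.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    urlObj.openConnection().getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Keep the line break so words on adjacent lines are not fused.
                    outputText.append(line).append('\n');
                }
            } catch (IOException e) {
                System.out.println("There was an error connecting to the URL");
                return "";
            }

            return outputText.toString();
        }
    }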

    Output:

    A python is a constricting snake belonging to the Python (genus), or, more generally, any snake in the family Pythonidae (containing the Python genus).
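    The line-reading loop inside getUrl can be tried without a network connection. Below is a minimal sketch, assuming a hypothetical class ReadAllSketch and helper readAll that are not part of the original post. It drains any Reader into a String and puts a newline after each line, because readLine() strips line terminators and plain concatenation would otherwise join the last word of one line to the first word of the next.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadAllSketch {
    // Hypothetical helper (not in the original post): reads every line from
    // the given Reader and returns the text with '\n' after each line.
    public static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the HTTP response stream.
        String sample = "A python is a\nconstricting snake.";
        System.out.print(readAll(new StringReader(sample)));
    }
}
```

    In getUrl, the same loop would receive an InputStreamReader wrapped around the connection's input stream instead of the StringReader used here.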

  • Original article: https://www.cnblogs.com/davidgu/p/4836305.html