
Lao Li Shares: Implementing a Web Crawler in Java

poptest is a training institution in China focused on training test-development engineers, with the goal of preparing students for automated testing, performance testing, and test-tool development work. If you are interested in the courses, contact QQ 908821478 or call 010-84505200.

1. Design Approach

(1) A queue that collects links from the whole site (or a given subdomain)

(2) A queue of URLs waiting to be visited (this overlaps with the above; it trades space for time to speed up crawling)

(3) A data structure recording URLs that have already been visited

With the data structures in place, the next question is the algorithm. Breadth-first crawling is generally recommended, so the crawler does not fall into anti-crawler traps that generate loops of unlimited depth.

The implementation uses jsoup (an HTML parsing library) and httpclient (an HTTP request library) to simplify the code.
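The breadth-first strategy can be illustrated without any networking. The sketch below uses a hypothetical in-memory "site" represented as a link map (my own illustration, not the article's crawler) to show how a FIFO queue plus a visited set keeps the crawl bounded even when pages link in cycles:

```java
import java.util.*;

public class BfsCrawlSketch {
    // Breadth-first traversal of a link graph with a visited set.
    // The cycle b -> a does not trap the crawl, unlike naive depth-first recursion.
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.offer(seed);
        visited.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            order.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (visited.add(next)) {   // add() returns false if already seen
                    queue.offer(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("a", "d")); // cycle back to "a"
        System.out.println(crawl("a", links)); // [a, b, c, d]
    }
}
```

Pages are visited level by level from the seed, which is exactly the property that defeats infinite-depth link traps.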

      

2. Implementation

The three data structures described above:

    // URLs already crawled <URL, isAccessed>
    final static ConcurrentHashMap<String, Boolean> urlQueue = new ConcurrentHashMap<String, Boolean>();

    // URLs waiting to be fetched
    final static ConcurrentLinkedDeque<String> urlWaitingQueue = new ConcurrentLinkedDeque<String>();

    // URLs waiting to be scanned for links
    final static ConcurrentLinkedDeque<String> urlWaitingScanQueue = new ConcurrentLinkedDeque<String>();
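The choice of ConcurrentHashMap for the visited set matters: putIfAbsent is atomic, so when several worker threads reach the same URL at once, exactly one of them "claims" it. A small standalone illustration (not part of the crawler code itself):

```java
import java.util.concurrent.ConcurrentHashMap;

public class DedupDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Boolean> visited = new ConcurrentHashMap<>();
        // putIfAbsent returns null only for the first caller;
        // later callers get the existing value and can skip the URL.
        Boolean first = visited.putIfAbsent("http://example.com/a", Boolean.TRUE);
        Boolean second = visited.putIfAbsent("http://example.com/a", Boolean.TRUE);
        System.out.println(first == null);   // true: this caller claimed the URL
        System.out.println(second == null);  // false: already claimed
    }
}
```

This is why the crawler can safely test containsKey in one thread while another thread is inserting.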

Enqueueing URLs for scanning:

    /**
     * Take a URL from the scan queue and scan it for links in a new thread.
     * @param originalUrl seed URL (note: currently unused; URLs are taken from urlWaitingScanQueue)
     * @throws Exception
     */
    private static void enterWaitingQueue(final String originalUrl) throws Exception {
        String url = urlWaitingScanQueue.poll();
        // if already accessed, skip the url
        /*while (urlQueue.containsKey(url)) {
            url = urlWaitingScanQueue.poll();
        }*/
        final String finalUrl = url;
        Thread.sleep(600); // crude rate limiting between scans
        new Thread(new Runnable() {
            public void run() {
                try {
                    if (finalUrl != null) {
                        Connection conn = Jsoup.connect(finalUrl);
                        Document doc = conn.get();
                        //urlQueue.putIfAbsent(finalUrl, Boolean.TRUE); // mark accessed
                        logger.info("Scanning page URL: " + finalUrl);
                        Elements links = doc.select("a[href]");
                        for (int linkNum = 0; linkNum < links.size(); linkNum++) {
                            Element element = links.get(linkNum);
                            String suburl = element.attr("href");
                            // enqueue if it matches the crawl criteria and has not been visited
                            if (!urlQueue.containsKey(suburl)) {
                                urlWaitingScanQueue.offer(suburl);
                                urlWaitingQueue.offer(suburl);
                                logger.info("URL enqueued " + linkNum + ": " + suburl);
                            }
                        }
                    }
                } catch (Exception ee) {
                    logger.error("multi-thread executing error, url: " + finalUrl, ee);
                }
            }
        }).start();
    }
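One refinement the scan step above leaves out is filtering the href values: a[href] also matches relative links, mailto: links, and links to other hosts. A hypothetical helper using java.net.URI (my own addition, not in the original code) could keep the crawl inside one domain before enqueueing:

```java
import java.net.URI;

public class UrlFilter {
    // Returns true if url is an absolute http(s) URL on the given host.
    // Relative links, mailto:, javascript: and foreign hosts are rejected.
    static boolean sameDomain(String url, String host) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return ("http".equals(scheme) || "https".equals(scheme))
                    && host.equalsIgnoreCase(uri.getHost());
        } catch (Exception e) {
            return false; // malformed URLs are skipped
        }
    }

    public static void main(String[] args) {
        System.out.println(sameDomain("http://www.dxy.cn/page", "www.dxy.cn")); // true
        System.out.println(sameDomain("mailto:someone@dxy.cn", "www.dxy.cn")); // false
        System.out.println(sameDomain("/relative/path", "www.dxy.cn"));        // false
    }
}
```

A fuller version would also resolve relative links against the current page with URI.resolve before applying the check.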

Visiting pages:

    private static void viewPages() throws Exception {
        Thread.sleep(500);
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    while (!urlWaitingQueue.isEmpty()) {
                        // poll (not peek), so the URL is actually removed from the queue;
                        // peek would process the same URL forever
                        final String finalUrl = urlWaitingQueue.poll();
                        if (finalUrl == null) {
                            continue; // queue drained by another thread
                        }
                        // build a client, like opening a browser
                        CloseableHttpClient httpClient = HttpClients.createDefault();
                        // create a request, like typing the URL into the browser
                        //HttpGet httpGet = new HttpGet("http://www.dxy.cn");
                        HttpPost httpPost = new HttpPost(finalUrl);
                        StringBuffer stringBuffer = new StringBuffer();
                        HttpResponse response;
                        //List<NameValuePair> keyValue = new ArrayList<NameValuePair>();
                        // POST parameters
                        //keyValue.add(new BasicNameValuePair("username", "zhu"));
                        //httpPost.setEntity(new UrlEncodedFormEntity(keyValue, "UTF-8"));
                        // execute the request and get the response
                        response = httpClient.execute(httpPost);
                        // record the URL as accessed
                        urlQueue.putIfAbsent(finalUrl, Boolean.TRUE);
                        if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                            HttpEntity httpEntity = response.getEntity();
                            if (httpEntity != null) {
                                logger.info("viewPages visiting URL: " + finalUrl);
                                BufferedReader reader = new BufferedReader(
                                        new InputStreamReader(httpEntity.getContent(), "UTF-8"));
                                String line = null;
                                if (httpEntity.getContentLength() > 0) {
                                    stringBuffer = new StringBuffer((int) httpEntity.getContentLength());
                                    while ((line = reader.readLine()) != null) {
                                        stringBuffer.append(line);
                                    }
                                    System.out.println(finalUrl + " content: " + stringBuffer);
                                }
                                reader.close();
                            }
                        }
                        httpClient.close();
                    }
                } catch (Exception e) {
                    logger.error("view pages error", e);
                }
            }
        }).start();
    }

3. Summary and Future Work

The core modules of a simple Java crawler are shown above; the code is basically ready to test as-is.

Crawl-rate control (a scheduler module) and access through proxy IPs (a proxy-collection module) are left for you to add gradually in your own version...
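For the crawl-rate control mentioned above, the fixed Thread.sleep calls could be replaced by a small limiter that enforces a minimum interval between requests across all worker threads. A minimal sketch (my own illustration, not from the original post):

```java
public class RateLimiter {
    private final long minIntervalMillis;
    private long nextAllowed = 0;

    public RateLimiter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Blocks until at least minIntervalMillis has passed since the previous acquire.
    // synchronized makes the interval hold across threads sharing one limiter.
    public synchronized void acquire() throws InterruptedException {
        long now = System.currentTimeMillis();
        if (now < nextAllowed) {
            Thread.sleep(nextAllowed - now);
            now = nextAllowed;
        }
        nextAllowed = now + minIntervalMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        RateLimiter limiter = new RateLimiter(100); // at most ~10 requests/second
        long start = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            limiter.acquire();
            // a real crawler would issue the HTTP request here
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(elapsed >= 200); // 3 acquires => at least 2 full intervals
    }
}
```

Each worker thread would call limiter.acquire() just before fetching, instead of sleeping a fixed amount regardless of load.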

Original post: https://www.cnblogs.com/poptest/p/4992234.html