  • A Small Vertical Crawler

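    This small vertical crawler is implemented as follows.

    The idea is simple. Starting from the main function, here is a brief walk through the whole run. Step 1: collect the URLs to crawl. The container I chose is ConcurrentLinkedQueue, a non-blocking queue built on Unsafe under the hood; what I want from it is its thread safety.

    The main function is as follows: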

    
        static String url = "http://www.qlu.edu.cn/38/list.htm";

        // Add the url tasks: list1.htm through list19.htm
        public static ConcurrentLinkedQueue<String> add(ConcurrentLinkedQueue<String> queue) {
            // The prefix never changes, so compute it once outside the loop
            String subString = StringUtils.substringBefore(url, ".htm");
            for (int i = 1; i <= 19; i++) {
                queue.add(subString + i + ".htm");
            }
            return queue;
        }

        public static void main(String[] args) throws IOException {
            ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
            queue.add(url);
            add(queue);
            // Download and parse with multiple threads
            TPoolForDownLoadRootUrl.downLoadRootTaskPool(queue);
        }
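The URL-collection step can also be sketched with the JDK alone. This is a minimal sketch, assuming the list pages follow the `list<N>.htm` pattern used above; the class name `UrlQueueSketch` and the method `buildQueue` are hypothetical and not part of the original project.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

class UrlQueueSketch {
    // Build the queue of list-page URLs: the base page plus list1.htm .. list<pages>.htm.
    static ConcurrentLinkedQueue<String> buildQueue(String baseUrl, int pages) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add(baseUrl);
        // Strip the ".htm" suffix once, outside the loop.
        String prefix = baseUrl.substring(0, baseUrl.lastIndexOf(".htm"));
        for (int i = 1; i <= pages; i++) {
            queue.add(prefix + i + ".htm");
        }
        return queue;
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue =
                buildQueue("http://www.qlu.edu.cn/38/list.htm", 19);
        System.out.println(queue.size() + " URLs queued, first: " + queue.peek());
    }
}
```

Passing the base URL and page count as parameters keeps the hard-coded `19` out of the loop, which may make it easier to point the same code at another list section.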
    

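    Step 2: hand the URL list to a thread pool.

    The thread pool I use is newCachedThreadPool, which allocates threads dynamically according to the number of submitted tasks.

    Inside the pool, several things happen. First, download the source HTML: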

    /**
     * Business logic for downloading html
     * @Author: Changwu
     * @Date: 2019/3/24 11:13
     */
    public class downLoadHtml {
        public static Logger logger = Logger.getLogger(downLoadHtml.class);
        /**
         * Download the page source for the given url
         * @param url address of the page to fetch
         * @return the response body decoded as UTF-8
         */
        public static String downLoadHtmlByUrl(String url) throws IOException {
            HttpGet httpGet = new HttpGet(url);
            // Set the request header
            httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

            // try-with-resources closes the client and the response, so connections are not leaked
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                logger.info("Request " + url + " returned status code " + response.getStatusLine().getStatusCode());
                HttpEntity entity = response.getEntity();
                return EntityUtils.toString(entity, "utf-8");
            }
        }
    

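    Next, parse the root pages. The goal is to get the URL of each news article page, because that is where the article body lives. A RootBean is populated for each item during parsing: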

    
        /**
         * Parse the source html, wrap each item in a first-level bean, and return the list
         *
         * @param sourceHtml
         * @return
         */
        public static List<RootBean> getRootBeanList(String sourceHtml) {
            LinkedList<RootBean> rootBeanList = new LinkedList<>();
            Document doc = Jsoup.parse(sourceHtml);
            Elements elements = doc.select("#wp_news_w6 ul li");
            String rootUrl = "http://www.qlu.edu.cn";

            for (Element element : elements) {
                RootBean rootBean = new RootBean();
                // Get the relative url; it is joined with rootUrl below
                String href = element.child(0).child(0).attr("href");
                // Get the title
                String title = element.text();
                String[] split = title.split("\\s+");
                System.out.println(title);

                if (split.length >= 2) {
                    String s = element.outerHtml();
                    String regex = "class=\"news_meta\">.*";
                    Pattern compile = Pattern.compile(regex);
                    Matcher matcher = compile.matcher(s);
                    if (matcher.find()) {
                        String group = matcher.group(0);
                        // Skip the 18 characters of class="news_meta"> to reach the post time
                        String ss = StringUtils.substring(group, 18);
                        ss = StringUtils.substringBefore(ss, "</span> </li>");
                        rootBean.setPostTime(ss);
                    }
                }

                // Wrap
                rootBean.setTitle(split[0]);
                rootBean.setUrl(rootUrl + href);

                rootBeanList.add(rootBean);
            }
            return rootBeanList;
        }
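The post-time extraction above depends on a fixed character offset. As an alternative, here is a minimal sketch that uses a regex capture group, assuming each list item carries the time in a `<span class="news_meta">` element as the parser above implies. The class `PostTimeSketch`, the method `extractPostTime`, and the sample markup are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostTimeSketch {
    // Capture whatever sits between the news_meta opening tag and the next </span>.
    private static final Pattern NEWS_META =
            Pattern.compile("class=\"news_meta\">(.*?)</span>");

    // Returns the captured text, or null when the item has no news_meta span.
    static String extractPostTime(String itemHtml) {
        Matcher matcher = NEWS_META.matcher(itemHtml);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical list item, shaped like the markup the parser above expects.
        String item = "<li><span class=\"news_title\"><a href=\"/a.htm\">Title</a></span>"
                + "<span class=\"news_meta\">2019-03-24</span> </li>";
        System.out.println(extractPostTime(item));
    }
}
```

With a capture group there is no offset to recount if the class name changes, and a missing span produces `null` rather than a wrongly cut string.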
    

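    The second-level tasks are handled similarly. This part uses regular expressions, which I never studied properly; when I needed them today I was completely lost, but I got there slowly in the end. The work is mainly a matter of inspecting the source HTML and, based on its characteristics, using the selectors jsoup provides to select, cut, and splice together the content we want, then wrapping it in a bean.

    Why call it a vertical crawler? It is only suited to crawling my school's news. Look at the code below: there is no way around the cutting and splicing. The worst part is that out of 100 news items, 99 put the title inside one element, and there is always one that puts it inside a different one. When that happens, the rules that were just written have to be changed.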

        /**
         * Parse and wrap a second-level task
         *
         * @param htmlSouce source html of the article page
         * @param bean      first-level bean that supplies the url, post time and title
         * @return
         */
        public static List<PojoBean> getPojoBeanByHtmlSource(String htmlSouce, RootBean bean) {

            LinkedList<PojoBean> list = new LinkedList<>();

            // Parse
            Document doc = Jsoup.parse(htmlSouce);

            // Article body; it does not depend on the loop below, so select it once
            Element bodyElement = doc.select(".wp_articlecontent").first();
            String body = bodyElement == null ? null : bodyElement.text();

            // Meta line that holds the author, source and editor
            Elements elements1 = doc.select(".arti_metas");

            for (Element element : elements1) {
                // Create a new bean per element; reusing a single instance
                // would add the same object to the list repeatedly
                PojoBean pojoBean = new PojoBean();

                String text = element.text();

                // Editor
                String regex = "(责任编辑:.*)";
                Pattern compile = Pattern.compile(regex);
                Matcher matcher = compile.matcher(text);
                String editor = null;
                if (matcher.find()) {
                    editor = matcher.group(1);
                    editor = StringUtils.substring(editor, 5);
                }

                // Author
                regex = "(作者:.*出处)";
                compile = Pattern.compile(regex);
                matcher = compile.matcher(text);
                String author = null;
                if (matcher.find()) {
                    author = matcher.group(1);
                    author = StringUtils.substring(author, 3);
                    author = StringUtils.substringBefore(author, "出处");
                }

                // Source
                regex = "(出处:.*责任编辑)";
                compile = Pattern.compile(regex);
                matcher = compile.matcher(text);
                String source = null;
                if (matcher.find()) {
                    source = matcher.group(1);
                    source = StringUtils.substring(source, 3);
                    source = StringUtils.substringBefore(source, "责任编辑");
                }

                // Wrap
                pojoBean.setAuthor(author);
                pojoBean.setBody(body);
                pojoBean.setEditor(editor);
                pojoBean.setSource(source);
                pojoBean.setUrl(bean.getUrl());
                pojoBean.setPostTime(bean.getPostTime());
                pojoBean.setTitle(bean.getTitle());
                list.add(pojoBean);
            }
            return list;
        }
    }
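The three regex-then-substring passes above could be collapsed into one pattern with three capture groups. This is only a sketch, assuming the meta line always lists 作者, 出处, and 责任编辑 in that order with the same colon the original patterns use. The class `ArticleMetaSketch`, the method `parseMeta`, and the sample names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleMetaSketch {
    // Lazy groups stop at the next label, so each field is captured without its neighbours.
    private static final Pattern META =
            Pattern.compile("作者:(.*?)出处:(.*?)责任编辑:(.*)");

    // Returns {author, source, editor}, or null when the line does not have all three labels in order.
    static String[] parseMeta(String text) {
        Matcher matcher = META.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] {
            matcher.group(1).trim(), matcher.group(2).trim(), matcher.group(3).trim()
        };
    }

    public static void main(String[] args) {
        // Hypothetical meta line; the names are placeholders.
        String[] meta = parseMeta("作者:Zhang San 出处:News Center 责任编辑:Li Si");
        System.out.println(String.join(" | ", meta));
    }
}
```

The trade-off is that a page missing any one label yields no match at all, whereas the three separate patterns above degrade field by field.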
    

    Persistence uses plain low-level JDBC:

        /**
         * Persist a single pojo
         * @param pojo
         */
        public static void insertOnePojo(PojoBean pojo) throws ClassNotFoundException, SQLException {
            // Register the driver
            Class.forName("com.mysql.jdbc.Driver");
            String sql = "insert into qluspider (title,url,post_time,insert_time,author,source,editor,body) values (?,?,?,?,?,?,?,?)";
            // Connect; try-with-resources closes the statement and the connection even if execution fails
            try (Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/spider", "root", "root");
                 PreparedStatement ps = connection.prepareStatement(sql)) {
                // Fill in the sql parameters
                ps.setString(1, pojo.getTitle());
                ps.setString(2, pojo.getUrl());
                // Convert the post-time string to a date
                ps.setTimestamp(3, new java.sql.Timestamp(SpiderUtil.stringToDate(pojo.getPostTime()).getTime()));
                ps.setTimestamp(4, new java.sql.Timestamp(new Date().getTime()));
                ps.setString(5, pojo.getAuthor());
                ps.setString(6, pojo.getSource());
                ps.setString(7, pojo.getEditor());
                ps.setString(8, pojo.getBody());

                ps.execute();
            }
        }
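`SpiderUtil.stringToDate` is not shown in the post. The following is a minimal sketch of one way the string-to-timestamp conversion could look, assuming the post time is formatted as `yyyy-MM-dd` (the real format on the site is not confirmed here). The class `PostTimeToTimestampSketch` and the method `toTimestamp` are hypothetical.

```java
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class PostTimeToTimestampSketch {
    // Assumed input format; adjust if the site prints dates differently.
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Returns midnight of the given day, or null for a missing post time.
    static Timestamp toTimestamp(String postTime) {
        if (postTime == null || postTime.trim().isEmpty()) {
            return null;
        }
        LocalDate date = LocalDate.parse(postTime.trim(), FORMAT);
        return Timestamp.valueOf(date.atStartOfDay());
    }

    public static void main(String[] args) {
        System.out.println(toTimestamp("2019-03-24"));
    }
}
```

Returning `null` for a missing post time matters because the list parser only sets the post time when the item text splits into at least two parts, and `PreparedStatement.setTimestamp` accepts a `null` value.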
    

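    The new URLs obtained from the root pages are called second-level tasks. The thread pool that drives both levels: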

    
        public static Logger logger = Logger.getLogger(TPoolForDownLoadRootUrl.class);

        /**
         * Thread pool that downloads and parses the root urls
         */
        public static void downLoadRootTaskPool(ConcurrentLinkedQueue<String> queue) {
            ExecutorService executor = Executors.newCachedThreadPool();
            //ExecutorService executor = Executors.newFixedThreadPool(5);
            // Read the size once: workers poll concurrently, so queue.size()
            // shrinks while this loop runs and tasks would otherwise be skipped
            int taskCount = queue.size();
            for (int i = 1; i <= taskCount; i++) {
                executor.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            logger.info("Thread pool 1 started; about to download and parse a root task");
                            // Get a root task url
                            String url = queue.poll();

                            logger.info("root URL==" + url);
                            if (StringUtils.isNotBlank(url)) {
                                // Download the rootHtml for the current url
                                String sourceHtml = downLoadHtml.downLoadHtmlByUrl(url);
                                // Parse all RootBean objects out of the rootHtml
                                List<RootBean> rootBeanList = parseHtmlByJsoup.getRootBeanList(sourceHtml);
                                // Second-level tasks start here
                                for (RootBean rootBean : rootBeanList) {
                                    logger.info(this + " entering second-level task");
                                    String subUrl = rootBean.getUrl();
                                    // Download the second-level html
                                    String htmlSouce = downLoadHtml.downLoadHtmlByUrl(subUrl);
                                    // Parse and wrap
                                    List<PojoBean> pojoList = parseHtmlByJsoup.getPojoBeanByHtmlSource(htmlSouce, rootBean);
                                    // Persist
                                    logger.info(this + " will persist the second-level task from " + subUrl);
                                    Persistence.insertPojoListToDB(pojoList);
                                    logger.info("Persistence finished.......");
                                }
                            }
                        } catch (IOException e) {
                            logger.error("Failed to process a root task", e);
                        }
                    }
                });
            }
            // Stop accepting new tasks so the pool threads can exit once the work is done
            executor.shutdown();
        }
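The pool above submits one task per queued URL. Another common arrangement is a fixed number of workers that each keep polling until the queue is empty. This is a minimal JDK-only sketch of that pattern, not the original project's code; the class `QueueDrainSketch`, the method `drain`, and the `handler` callback are hypothetical, and the handler stands in for the download, parse, and persist steps.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class QueueDrainSketch {
    // Start `workers` threads; each one keeps polling until the queue is empty.
    static void drain(ConcurrentLinkedQueue<String> queue, int workers, Consumer<String> handler) {
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            executor.execute(() -> {
                String url;
                // poll() returns null once the queue is empty, which ends this worker.
                while ((url = queue.poll()) != null) {
                    handler.accept(url);
                }
            });
        }
        executor.shutdown(); // accept no new tasks
        try {
            // Wait for the workers to finish (bounded, so a stuck handler cannot block forever).
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 1; i <= 20; i++) {
            queue.add("page-" + i); // placeholder for real URLs
        }
        List<String> handled = new CopyOnWriteArrayList<>();
        drain(queue, 4, handled::add);
        System.out.println("handled " + handled.size() + " items");
    }
}
```

Because each worker loops on `poll()`, the number of threads is decoupled from the number of URLs, and no URL is handled twice since `poll()` removes the head atomically.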
    
  • Original article: https://www.cnblogs.com/ZhuChangwu/p/11150580.html