  • A Complete Book Information Database Solution (Part 3): Parsing Book Categories

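    After a thorough comparison of the available options (several thousand words omitted here), I settled on HtmlUnit as the tool for parsing web pages.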

    Add the HtmlUnit dependency through Maven:

    <dependency>
      <groupId>net.sourceforge.htmlunit</groupId>
      <artifactId>htmlunit</artifactId>
      <version>2.28</version>
    </dependency>

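    Below is the core logic for parsing the book categories. Most of the effort goes into studying the page source to work out how the nodes nest level by level, and only then extracting the data you need.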

    public List<BookCategory> categoryFromDangdang() {
    
            List<BookCategory> lsCategory = new LinkedList<BookCategory>();
            
            String categoryUrl = "http://category.dangdang.com/?ref=www-0-C";
            
            try {
                HtmlPage page = webClientGetPage(categoryUrl,false, false, null);
                
                List<DomElement> ll =  page.getElementsByTagName("div");
                            
                DomElement bookElement = null;
                for(int i=0;i<ll.size();i++) {
                    DomElement e = ll.get(i);
                    
                    String s = e.getAttribute("class");
    
                    if(s.equalsIgnoreCase("classify_con")) {
                        System.out.println("find book. class="+s);
                        
                        // Found the top-level book node within the whole HTML document
                        bookElement = e;
                        break;
                    }
                }
                
                if(bookElement != null) {
                    DomElement eClassify_books = bookElement.getFirstElementChild().getFirstElementChild();
                    String s = eClassify_books.getAttribute("class");
                    
                    // Locate the region that holds the book categories
                    if(s.equalsIgnoreCase("classify_books")) {
                        System.out.println("find classify_books. class="+s);
    
                        String rootCategory = "";
                        
                        Iterable<DomElement> elementIterable = eClassify_books.getChildElements();
                        for (java.util.Iterator<DomElement> i = elementIterable.iterator(); i.hasNext(); ) {
                            DomElement e = (DomElement) i.next();
                            s = e.getAttribute("class");
                            
                            // Header block of a category group: its <h3> link gives the root category
                            if(s.equalsIgnoreCase("classify_books_detail")) {
                                DomElement eRoot = e.getElementsByTagName("h3").get(0).getFirstElementChild();
                                String url = eRoot.getAttribute("href");
                                String name = eRoot.getTextContent();
                                
                                rootCategory = urlToCategory(url);
                                System.out.println("find book rootCategory." + " name=" + name + " category=" + rootCategory);
                            }
                            // A concrete book category under the current root
                            else if(s.indexOf("classify_kind") != -1) {
                                DomElement eCategory = e.getFirstElementChild().getFirstElementChild();
                                String url = eCategory.getAttribute("href");
                                String name = eCategory.getTextContent();
                                
                                String category = urlToCategory(url);
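                                // The Traditional Chinese name on the page renders incorrectly, so hard-code it
                                if(category.equalsIgnoreCase("cp01.59.00.00.00.00"))
                                    name = "港台图书"; // "Hong Kong and Taiwan books"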
                                System.out.println("find book category. " + " name=" + name + " category=" + category);    
                                
                                BookCategory bookCategory = new BookCategory();
                                bookCategory.setTitle(name);
                                bookCategory.setCategory(category);
                                bookCategory.setCategory_parent(rootCategory);
                                bookCategory.setCache(0);
                                
                                lsCategory.add(bookCategory);
                                
                                // Second-level categories
                                DomElement ul = e.getElementsByTagName("ul").get(0);
                                DomNodeList<HtmlElement> ulList = ul.getElementsByTagName("li");
                                
                                for(int j=0;j<ulList.size();j++) {
                                    HtmlElement he = ulList.get(j);
                                    if(he.getAttribute("name").equalsIgnoreCase("cat_3")) {
                                        DomElement eSubCategory = he.getFirstElementChild();
                                        
                                        url = eSubCategory.getAttribute("href");
                                        name = eSubCategory.getTextContent();
                                        
                                        String subCategory = urlToCategory(url);
                                        System.out.println("===========find book sub category. " + " name=" + name + " category=" + subCategory);    
                                        
                                        BookCategory bookSubCategory = new BookCategory();
                                        bookSubCategory.setTitle(name);
                                        bookSubCategory.setCategory(subCategory);
                                        bookSubCategory.setCategory_parent(category);
                                        bookSubCategory.setCache(1);
                                        
                                        
                                        lsCategory.add(bookSubCategory);
                                    }
                                }
                            }
                        }
                    }
                }
                
                //stringToFile(result,"E:\category.html");
            } catch (Exception e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
                System.out.println("Exception="+e);
            }
            
            System.out.println("find book category finish. ");
            
            return lsCategory;
        }
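
    The method above relies on a urlToCategory helper that the post does not show. The following is only a minimal sketch of what such a helper might look like, not the author's implementation. It assumes that a category link embeds a code of the form cp01.59.00.00.00.00 (the code that appears in the listing above) somewhere in its URL. The class name CategoryUrls and the example URL are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryUrls {

    // Assumed shape of a category code: "cp" plus six dot-separated
    // two-digit groups, e.g. cp01.59.00.00.00.00.
    private static final Pattern CATEGORY_CODE =
            Pattern.compile("cp\\d{2}(?:\\.\\d{2}){5}");

    /**
     * Hypothetical stand-in for the urlToCategory helper used above.
     * Returns the first category code found in the URL, or "" if there is none.
     */
    public static String urlToCategory(String url) {
        if (url == null) {
            return "";
        }
        Matcher m = CATEGORY_CODE.matcher(url);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // The URL shape below is an assumption made for illustration.
        String url = "http://category.dangdang.com/cp01.59.00.00.00.00.html";
        System.out.println(urlToCategory(url));
    }
}
```

    Returning an empty string instead of null keeps calls such as category.equalsIgnoreCase(...) in the parsing method safe when a link carries no category code.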

    The parsed categories are shown in the figure below:

    This gives us every book category on Dangdang. Because all of the category data lives on a single page, this step is relatively simple.
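
    The parsing method returns a flat list in which each entry records its parent's code through setCategory_parent. As a rough illustration of how that flat list encodes the hierarchy, the sketch below groups entries by parent code. The BookCategory class here is a simplified stand-in whose field names are inferred from the setters used in the post, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryTreeSketch {

    // Simplified stand-in for the post's BookCategory (assumed fields only).
    public static class BookCategory {
        public final String title;
        public final String category;
        public final String categoryParent;

        public BookCategory(String title, String category, String categoryParent) {
            this.title = title;
            this.category = category;
            this.categoryParent = categoryParent;
        }
    }

    /** Groups categories by parent code, keeping the order in which they were parsed. */
    public static Map<String, List<BookCategory>> groupByParent(List<BookCategory> flat) {
        Map<String, List<BookCategory>> byParent = new LinkedHashMap<>();
        for (BookCategory c : flat) {
            byParent.computeIfAbsent(c.categoryParent, k -> new ArrayList<>()).add(c);
        }
        return byParent;
    }

    public static void main(String[] args) {
        // Made-up sample rows, only to show the shape of the data.
        List<BookCategory> flat = new ArrayList<>();
        flat.add(new BookCategory("Sample A", "cp01.01.00.00.00.00", "cp01.00.00.00.00.00"));
        flat.add(new BookCategory("Sample A1", "cp01.01.01.00.00.00", "cp01.01.00.00.00.00"));
        flat.add(new BookCategory("Sample A2", "cp01.01.02.00.00.00", "cp01.01.00.00.00.00"));

        // Print each parent code followed by the titles of its direct children.
        for (Map.Entry<String, List<BookCategory>> e : groupByParent(flat).entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey()).append(" ->");
            for (BookCategory c : e.getValue()) {
                line.append(' ').append(c.title).append(';');
            }
            System.out.println(line);
        }
    }
}
```

    A LinkedHashMap is used so that groups come out in page order, which matches the order in which the parsing method adds entries to its list.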

    You can also parse the first listing page of each category to obtain the number of pages and the number of books in that category.

  • Original post: https://www.cnblogs.com/mazhiyong/p/9468107.html