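  • Comprehensive crawler case study

    Comprehensive crawler case study (JD crawler)

    Having learned HttpClient and Jsoup, we know how to fetch data and how to parse it. Next we complete the project case study: crawling mobile phone data from JD (jd.com).

    I. Requirements analysis

    Requirements:

    This case crawls the data of all mobile phone products on JD. The fields are: product name, product price, product id, product image, and the URL of the product detail page.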

    Inspecting the search result page with the browser developer tools (F12) shows where the data to crawl sits in the HTML.

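    For the product detail page, analysis shows that the URL is built by concatenating the product id into a fixed pattern: https://item.jd.com/<id>.html (the original note says spu; the crawler code below uses the data-sku value).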
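As a minimal sketch of that concatenation: the URL pattern is the one used by the crawler code later in this article, while the class name `DetailUrls`, the method name, and the id in `main` are hypothetical placeholders. The helper also rejects ids that are not purely numeric, which the original code does not do.

```java
/** Hypothetical helper, not part of the original article. */
final class DetailUrls {
    private DetailUrls() {
    }

    // Builds the detail-page URL from the value of the data-sku attribute.
    static String detailUrl(String dataSku) {
        if (dataSku == null || !dataSku.matches("\\d+")) {
            throw new IllegalArgumentException("sku must be numeric: " + dataSku);
        }
        return "https://item.jd.com/" + dataSku + ".html";
    }

    public static void main(String[] args) {
        // "123456" is an arbitrary placeholder id
        System.out.println(detailUrl("123456"));
    }
}
```

Failing fast on a malformed id keeps a bad attribute value from producing a URL that silently points nowhere.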


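    1. The difference between SPU and SKU

    • SPU = Standard Product Unit

    An SPU is the smallest unit for aggregating product information: a set of reusable, easily searchable, standardized information that describes the characteristics of one product. Put simply, goods that share the same attribute values and characteristics can be treated as one SPU.

    For example, "iPhone X" identifies one product, so it is one SPU.

    • SKU = Stock Keeping Unit

    An SKU is the unit used to measure stock going in and out, and can be counted in pieces, boxes, pallets, and so on. It is the smallest physically indivisible unit of inventory. How it is applied depends on the line of business and the management model; it is most widely used for clothing and footwear.

    For example, "iPhone X 64G silver" is one SKU.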
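The relationship can be sketched in code: several SKUs (concrete variants) belong to one SPU (the product). The class `SpuSkuDemo`, its field names, and the ids below are made up for illustration and do not come from JD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the SPU/SKU relationship. */
final class SpuSkuDemo {

    /** One sellable variant; it points at the SPU it belongs to. */
    static final class Sku {
        final long skuId;
        final long spuId;
        final String variant;

        Sku(long skuId, long spuId, String variant) {
            this.skuId = skuId;
            this.spuId = spuId;
            this.variant = variant;
        }
    }

    // Groups variants under their product, keeping first-seen order.
    static Map<Long, List<Sku>> groupBySpu(List<Sku> skus) {
        Map<Long, List<Sku>> bySpu = new LinkedHashMap<>();
        for (Sku sku : skus) {
            bySpu.computeIfAbsent(sku.spuId, k -> new ArrayList<>()).add(sku);
        }
        return bySpu;
    }

    public static void main(String[] args) {
        List<Sku> skus = Arrays.asList(
                new Sku(11L, 1L, "iPhone X 64G silver"),
                new Sku(12L, 1L, "iPhone X 256G grey"),
                new Sku(21L, 2L, "another phone, 128G"));
        groupBySpu(skus).forEach((spu, list) ->
                System.out.println("SPU " + spu + " has " + list.size() + " SKU(s)"));
    }
}
```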

    II. Project preparation

    1. Preparing the table structure

    Based on the requirements analysis, we create the following table:

    CREATE DATABASE `day04_jdspider`;

    USE `day04_jdspider`;

    CREATE TABLE `jd_item` (
      `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
      `spu` bigint(15) DEFAULT NULL COMMENT 'product group (SPU) id',
      `sku` bigint(15) DEFAULT NULL COMMENT 'smallest product unit (SKU) id',
      `title` varchar(1000) DEFAULT NULL COMMENT 'product title',
      -- two decimal places, so a price such as 1999.50 is not rounded to a whole number
      `price` double(10,2) DEFAULT NULL COMMENT 'product price',
      `pic` varchar(200) DEFAULT NULL COMMENT 'product image',
      `url` varchar(1500) DEFAULT NULL COMMENT 'product detail URL',
      `created` varchar(100) DEFAULT NULL COMMENT 'creation time',
      `updated` varchar(100) DEFAULT NULL COMMENT 'update time',
      PRIMARY KEY (`id`),
      KEY `sku` (`sku`) USING BTREE
    ) ENGINE=InnoDB AUTO_INCREMENT=1116 DEFAULT CHARSET=utf8 COMMENT='JD products';

     

    2. Project setup

    1) Create the project module

    2) Add the pom dependencies

    <dependencies>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.4</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>
        <dependency>
            <groupId>com.mchange</groupId>
            <artifactId>c3p0</artifactId>
            <version>0.9.5.2</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.8</version>
            <scope>provided</scope>
        </dependency>

    </dependencies>
    <build>
        <plugins>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

        </plugins>

    </build>

    3) Add the C3P0 configuration file: c3p0-config.xml

    <c3p0-config>
        <!-- Default configuration used by the connection pool -->
        <default-config>

            <!-- Connection parameters -->
            <property name="driverClass">com.mysql.jdbc.Driver</property>

            <property name="jdbcUrl">jdbc:mysql://localhost:3306/day04_jdspider</property>
            <property name="user">root</property>
            <property name="password">123456</property>

            <!-- Connection pool parameters -->
            <property name="initialPoolSize">5</property>

            <property name="maxPoolSize">10</property>
            <property name="checkoutTimeout">3000</property>
        </default-config>
    </c3p0-config>

     

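    4) Add the utility class

    import com.mchange.v2.c3p0.ComboPooledDataSource;

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class C3P0Utils {

        // Pool configured from c3p0-config.xml on the classpath
        private static final ComboPooledDataSource dataSource = new ComboPooledDataSource();

        private C3P0Utils() {
        }

        // Borrow a connection from the pool; returns null if none could be obtained
        public static Connection getConnection() {
            Connection connection = null;
            try {
                connection = dataSource.getConnection();
            } catch (SQLException e) {
                e.printStackTrace();
            }
            return connection;
        }

        // Close whichever of the three resources are non-null
        public static void closeAll(ResultSet resultSet, Statement statement, Connection connection) {
            try {
                if (resultSet != null) {
                    resultSet.close();
                }

                if (statement != null) {
                    statement.close();
                }

                if (connection != null) {
                    connection.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

    }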
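One detail of `closeAll` is worth noting: all three `close()` calls share one `try` block, so if an earlier call throws, the later resources are never closed. Java's try-with-resources is an alternative that closes every resource, in reverse order of declaration. The sketch below uses a made-up `Tracked` class instead of real JDBC objects so that it runs without a database; `Connection`, `Statement` and `ResultSet` are all `AutoCloseable` and can be used the same way.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo of try-with-resources close order. */
final class CloseOrderDemo {

    /** Stand-in for a JDBC resource; records its name when closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked connection = new Tracked("connection", log);
             Tracked statement = new Tracked("statement", log);
             Tracked resultSet = new Tracked("resultSet", log)) {
            log.add("work");
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [work, resultSet, statement, connection]
        System.out.println(run());
    }
}
```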

     

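    5) Add the POJO class:

    Note: these annotations only work if the Lombok plugin is installed in IDEA and the Lombok dependency is declared in the pom. Otherwise, write the getters, setters, toString, the no-args constructor and the all-args constructor by hand.

    import lombok.AllArgsConstructor;
    import lombok.Data;
    import lombok.NoArgsConstructor;

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public class Item {
        // primary key
        private Long id;

        // standard product unit (product group)
        private Long spu;

        // stock keeping unit (smallest product unit)
        private Long sku;

        // product title
        private String title;

        // product price
        private Double price;

        // product image (local file path)
        private String pic;

        // product detail URL
        private String url;

        // creation time
        private String created;

        // update time
        private String updated;
    }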

     

     

     

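    3. Project development

    1) Send the request and fetch the data

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class JdSpider {

        public static void main(String[] args) throws Exception {
            //1. Define the first-page URL
            String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

            //2. Send the request and fetch the data with HttpClient
            //2.1: Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();

            //2.2: Create the request object: HttpGet or HttpPost
            HttpGet httpGet = new HttpGet(indexUrl);

            //2.3: Set the request information: request headers
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

            //2.4: Send the request and obtain the response object
            CloseableHttpResponse response = httpClient.execute(httpGet);

            //2.5: Read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();

            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");
                //2.6: Release resources
                response.close();
            }
        }

    }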
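The `keyword=%E6%89%8B%E6%9C%BA` part of the search URL is the percent-encoded UTF-8 form of the Chinese word for "mobile phone". As a small sketch, the JDK's `URLEncoder` produces it; the class name `SearchUrls` and the method name are hypothetical, and the URL layout is simply copied from the code above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the search URL from a plain keyword. */
final class SearchUrls {
    private SearchUrls() {
    }

    static String searchUrl(String keyword, int pageParam) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return "https://search.jd.com/Search?keyword=" + encoded
                    + "&page=" + pageParam + "&click=0";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u624B\u673A" is the keyword used in this article (mobile phone)
        System.out.println(searchUrl("\u624B\u673A", 1));
    }
}
```

Building the URL this way makes it easy to crawl a different keyword without encoding it by hand.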

     

    2) Parse the data (everything from step 3 onward is new compared with the previous listing)

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.UUID;

    public class JdSpider {

        public static void main(String[] args) throws Exception {
            //1. Define the first-page URL
            String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

            //2. Send the request and fetch the data with HttpClient
            //2.1: Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();

            //2.2: Create the request object: HttpGet or HttpPost
            HttpGet httpGet = new HttpGet(indexUrl);

            //2.3: Set the request information: request headers
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

            //2.4: Send the request and obtain the response object
            CloseableHttpResponse response = httpClient.execute(httpGet);

            //2.5: Read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();

            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");

                //2.6: Release resources
                response.close();

                //3. Parse the data with jsoup
                //3.1: Build the Document object from the html
                Document document = Jsoup.parse(html);

                //3.2: Select the product li tags: each li holds the details of one product
                Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

                List<Item> itemList = new ArrayList<>();
                for (Element li : lis) {
                    //3.3: Get the image URL of each product and download the image
                    Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                    String imgUrl = "https:" + imgs.attr("src");

                    //3.3.1: Request the image URL and read the response as a byte stream
                    HttpGet imgGet = new HttpGet(imgUrl);
                    CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                    HttpEntity imgEntity = imgResponse.getEntity();
                    // Do not use EntityUtils here: it is meant for reading text content
                    InputStream inputStream = imgEntity.getContent();

                    //3.3.2: Create a local output stream that writes to a file
                    String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                    FileOutputStream outputStream = new FileOutputStream(imgFileName);

                    //3.3.3: Connect the two streams and write the data to local disk
                    int len;
                    byte[] b = new byte[1024];
                    while ((len = inputStream.read(b)) != -1) {
                        outputStream.write(b, 0, len);
                    }

                    //3.3.4: Release resources
                    outputStream.close();
                    inputStream.close();
                    imgResponse.close();

                    //3.4: Parse spu and sku
                    String skuValue = li.attr("data-sku");
                    String spuValue = li.attr("data-spu");
                    // Fall back to the sku when the li carries no spu
                    if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                    //3.5: Parse the product title
                    Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                    String title = ems.text();

                    //3.6: Parse the product price
                    Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                    String price = priceLiEls.text();

                    //3.7: Build the product detail URL
                    String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                    //3.8: Wrap the data in an Item
                    Item item = new Item(null,
                            Long.parseLong(spuValue),
                            Long.parseLong(skuValue),
                            title,
                            Double.parseDouble(price),
                            imgFileName,
                            itemUrl,
                            new Date().toLocaleString(),
                            new Date().toLocaleString()
                    );
                    //3.9: Add each parsed Item to the list
                    itemList.add(item);
                }

                System.out.println("Items found: " + itemList.size());
            }

        }
    }
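The image step (3.3) copies the stream by hand and takes the file extension from the last dot anywhere in the URL. A possible alternative, sketched below under stated assumptions, uses `Files.copy` from the JDK and only accepts a dot that appears after the last slash. The class `ImageSaver`, its method names, and the `img.example` host are hypothetical; in the crawler, the `InputStream` would come from `imgEntity.getContent()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

/** Hypothetical helper for saving a downloaded image. */
final class ImageSaver {
    private ImageSaver() {
    }

    // Turns a protocol-relative src ("//host/path") into an https URL.
    static String absoluteUrl(String src) {
        return src.startsWith("//") ? "https:" + src : src;
    }

    // Returns ".jpg", ".png", ... or "" when the file name has no extension.
    static String extensionOf(String url) {
        int slash = url.lastIndexOf('/');
        int dot = url.lastIndexOf('.');
        return dot > slash ? url.substring(dot) : "";
    }

    // Copies the stream into dir under a random name and closes the stream.
    static Path save(InputStream in, Path dir, String url) {
        try (InputStream body = in) {
            Files.createDirectories(dir);
            Path target = dir.resolve(UUID.randomUUID() + extensionOf(url));
            Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Creating the directory up front avoids the failure the original code would hit when the target folder does not exist yet.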

     

    3) Save the data

    3.1: First build a JDItemDao that performs the save

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class JDItemDao {

        // Save the crawled items
        public void saveItem(List<Item> itemList) throws Exception {

            //1. Get a connection from the pool
            Connection connection = C3P0Utils.getConnection();
            PreparedStatement statement = null;

            try {
                //2. Create the prepared statement from the connection
                String sql = "insert into jd_item VALUES (null,?,?,?,?,?,?,?,?) ";
                statement = connection.prepareStatement(sql);

                //3. Execute the SQL for every item
                for (Item item : itemList) {

                    //3.1: Bind the ? placeholders first
                    statement.setLong(1, item.getSpu());
                    statement.setLong(2, item.getSku());
                    statement.setString(3, item.getTitle());
                    statement.setDouble(4, item.getPrice());
                    statement.setString(5, item.getPic());
                    statement.setString(6, item.getUrl());
                    statement.setString(7, item.getCreated());
                    statement.setString(8, item.getUpdated());

                    //3.2: Execute the SQL
                    statement.executeUpdate();
                }
            } finally {
                //4. Release resources, even if an insert fails
                C3P0Utils.closeAll(null, statement, connection);
            }

        }
    }

     

     

    3.2: Call the DAO from the crawler (step 4 at the end of main is the new part)

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.UUID;

    public class JdSpider {

        public static void main(String[] args) throws Exception {

            //1. Define the first-page URL
            String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

            //2. Send the request and fetch the data with HttpClient
            //2.1: Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();

            //2.2: Create the request object: HttpGet or HttpPost
            HttpGet httpGet = new HttpGet(indexUrl);

            //2.3: Set the request information: request headers
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

            //2.4: Send the request and obtain the response object
            CloseableHttpResponse response = httpClient.execute(httpGet);

            //2.5: Read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();

            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {

                String html = EntityUtils.toString(response.getEntity(), "UTF-8");

                //2.6: Release resources
                response.close();

                //3. Parse the data with jsoup
                //3.1: Build the Document object from the html
                Document document = Jsoup.parse(html);

                //3.2: Select the product li tags: each li holds the details of one product
                Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

                List<Item> itemList = new ArrayList<>();
                for (Element li : lis) {
                    //3.3: Get the image URL of each product and download the image
                    Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                    String imgUrl = "https:" + imgs.attr("src");

                    //3.3.1: Request the image URL and read the response as a byte stream
                    HttpGet imgGet = new HttpGet(imgUrl);
                    CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                    HttpEntity imgEntity = imgResponse.getEntity();
                    // Do not use EntityUtils here: it is meant for reading text content
                    InputStream inputStream = imgEntity.getContent();

                    //3.3.2: Create a local output stream that writes to a file
                    // e.g. http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                    String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                    FileOutputStream outputStream = new FileOutputStream(imgFileName);

                    //3.3.3: Connect the two streams and write the data to local disk
                    int len;
                    byte[] b = new byte[1024];
                    while ((len = inputStream.read(b)) != -1) {
                        outputStream.write(b, 0, len);
                    }

                    //3.3.4: Release resources
                    outputStream.close();
                    inputStream.close();
                    imgResponse.close();

                    //3.4: Parse spu and sku
                    String skuValue = li.attr("data-sku");
                    String spuValue = li.attr("data-spu");
                    // Fall back to the sku when the li carries no spu
                    if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                    //3.5: Parse the product title
                    Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                    String title = ems.text();

                    //3.6: Parse the product price
                    Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                    String price = priceLiEls.text();

                    //3.7: Build the product detail URL
                    String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                    //3.8: Wrap the data in an Item
                    Item item = new Item(null,
                            Long.parseLong(spuValue),
                            Long.parseLong(skuValue),
                            title,
                            Double.parseDouble(price),
                            imgFileName,
                            itemUrl,
                            new Date().toLocaleString(),
                            new Date().toLocaleString()
                    );
                    //3.9: Add each parsed Item to the list
                    itemList.add(item);
                }

                System.out.println("Items found: " + itemList.size());

                //4. Save the data to MySQL
                JDItemDao jdItemDao = new JDItemDao();
                jdItemDao.saveItem(itemList);
            }

        }
    }
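The `created` and `updated` fields are filled with `new Date().toLocaleString()`, which is deprecated and whose output depends on the JVM's default locale. One alternative, offered here only as a sketch, is a fixed pattern from `java.time`; the class name `Timestamps` and the chosen pattern are assumptions, not something the article prescribes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper producing locale-independent timestamps. */
final class Timestamps {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    private Timestamps() {
    }

    static String format(LocalDateTime time) {
        return time.format(FORMAT);
    }

    static String now() {
        return format(LocalDateTime.now());
    }

    public static void main(String[] args) {
        // prints 2020-11-13 09:05:07
        System.out.println(format(LocalDateTime.of(2020, 11, 13, 9, 5, 7)));
    }
}
```

A fixed pattern also sorts correctly as plain text, which helps because the table stores both columns as varchar.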

     

     

     

    4) Pagination (the page variable, the while loop and step 5 are new)

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.UUID;

    public class JdSpider {

        public static void main(String[] args) throws Exception {
            int page = 1;
            //1. Define the first-page URL
            String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";

            //2. Send the request and fetch the data with HttpClient
            //2.1: Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();

            while (page <= 100) {
                System.out.println("Processing page: " + page);
                System.out.println("Page URL: " + indexUrl);

                //2.2: Create the request object: HttpGet or HttpPost
                HttpGet httpGet = new HttpGet(indexUrl);

                //2.3: Set the request information: request headers
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

                //2.4: Send the request and obtain the response object
                CloseableHttpResponse response = httpClient.execute(httpGet);

                //2.5: Read the data from the response
                int statusCode = response.getStatusLine().getStatusCode();

                System.out.println("Status code: " + statusCode);
                String html = null;
                if (statusCode == 200) {
                    html = EntityUtils.toString(response.getEntity(), "UTF-8");
                }

                //2.6: Release resources, also on a non-200 status, so the pooled connection is returned
                response.close();

                if (html != null) {
                    //3. Parse the data with jsoup
                    //3.1: Build the Document object from the html
                    Document document = Jsoup.parse(html);

                    //3.2: Select the product li tags: each li holds the details of one product
                    Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

                    List<Item> itemList = new ArrayList<>();
                    for (Element li : lis) {
                        //3.3: Get the image URL of each product and download the image
                        Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                        String imgUrl = "https:" + imgs.attr("src");

                        //3.3.1: Request the image URL and read the response as a byte stream
                        HttpGet imgGet = new HttpGet(imgUrl);
                        CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                        HttpEntity imgEntity = imgResponse.getEntity();
                        // Do not use EntityUtils here: it is meant for reading text content
                        InputStream inputStream = imgEntity.getContent();

                        //3.3.2: Create a local output stream that writes to a file
                        // e.g. http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                        String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                        FileOutputStream outputStream = new FileOutputStream(imgFileName);

                        //3.3.3: Connect the two streams and write the data to local disk
                        int len;
                        byte[] b = new byte[1024];
                        while ((len = inputStream.read(b)) != -1) {
                            outputStream.write(b, 0, len);
                        }

                        //3.3.4: Release resources
                        outputStream.close();
                        inputStream.close();
                        imgResponse.close();

                        //3.4: Parse spu and sku
                        String skuValue = li.attr("data-sku");
                        String spuValue = li.attr("data-spu");
                        // Fall back to the sku when the li carries no spu
                        if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                        //3.5: Parse the product title
                        Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                        String title = ems.text();

                        //3.6: Parse the product price
                        Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                        String price = priceLiEls.text();

                        //3.7: Build the product detail URL
                        String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                        //3.8: Wrap the data in an Item
                        Item item = new Item(null,
                                Long.parseLong(spuValue),
                                Long.parseLong(skuValue),
                                title,
                                Double.parseDouble(price),
                                imgFileName,
                                itemUrl,
                                new Date().toLocaleString(),
                                new Date().toLocaleString()
                        );
                        //3.9: Add each parsed Item to the list
                        itemList.add(item);
                    }

                    System.out.println("Items found: " + itemList.size());

                    //4. Save the data to MySQL
                    JDItemDao jdItemDao = new JDItemDao();
                    jdItemDao.saveItem(itemList);
                }

                //5. Move on to the next page; done outside the if so that a failed page cannot loop forever
                page++;
                indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";
            }

            //6. Release resources: never put this inside the while loop
            httpClient.close();

        }
    }
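The pagination code maps the visible page number n to the request parameter 2n - 1, so only odd `page` values are requested (the article does not state the reason). A small sketch of that mapping, with a hypothetical class `Pagination` and the URL layout copied from the listing above:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the page-number mapping used by the crawler. */
final class Pagination {
    private static final String PREFIX =
            "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=";

    private Pagination() {
    }

    // Visible page 1, 2, 3, ... becomes page parameter 1, 3, 5, ...
    static int pageParam(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1: " + page);
        }
        return page * 2 - 1;
    }

    // Search URLs for the visible pages firstPage..lastPage (inclusive).
    static List<String> pageUrls(int firstPage, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(PREFIX + pageParam(page) + "&click=0");
        }
        return urls;
    }

    public static void main(String[] args) {
        pageUrls(1, 3).forEach(System.out::println);
    }
}
```

Generating the list of URLs up front keeps the loop counter and the URL from drifting apart, which the hand-updated `indexUrl` variable makes possible.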

     

    At this point the basic JD crawler case is fully implemented.

    III. Optimizing the crawler project

    Extract the code of each stage into methods.

    • Extract a method that fetches the html for a given url

    public static String getHtml(String indexUrl, CloseableHttpClient httpClient) throws Exception {

        //2.2: Create the request object: HttpGet or HttpPost
        HttpGet httpGet = new HttpGet(indexUrl);

        //2.3: Set the request information: request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        //2.4: Send the request and obtain the response object
        //2.6: try-with-resources releases the response on every path, including non-200 statuses
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {

            //2.5: Read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();

            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            }

            // Any other status: the caller gets null
            return null;
        }

    }
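Because `getHtml` returns null for any status other than 200, a caller that wants to try again should cap the number of attempts rather than retry without limit. The following is a generic sketch of such a cap; the class `Retry` and the method `untilNonNull` are hypothetical and not part of the article's code.

```java
import java.util.function.Supplier;

/** Hypothetical bounded-retry helper. */
final class Retry {
    private Retry() {
    }

    // Calls the action up to maxAttempts times and returns the first
    // non-null result, or null if every attempt returned null.
    static <T> T untilNonNull(Supplier<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = action.get();
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds on the third call
        String value = Retry.<String>untilNonNull(() -> ++calls[0] < 3 ? null : "ok", 5);
        System.out.println(value + " after " + calls[0] + " calls");
    }
}
```

Since `getHtml` declares a checked exception, a real caller would have to wrap it in a lambda that catches the exception and returns null; a short pause between attempts would also be kinder to the server.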

     

     

    • Extract a method that parses the data of one page

    public static List<Item> parseHtmlToListItem(CloseableHttpClient httpClient, String html) throws IOException {
        //3. Parse the data with jsoup
        //3.1: Build the Document object from the html
        Document document = Jsoup.parse(html);

        //3.2: Select the product li tags: each li holds the details of one product
        Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

        List<Item> itemList = new ArrayList<>();
        for (Element li : lis) {
            //3.3: Get the image URL of each product and download the image
            Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
            String imgUrl = "https:" + imgs.attr("src");

            //3.3.1: Request the image URL and read the response as a byte stream
            HttpGet imgGet = new HttpGet(imgUrl);
            CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
            HttpEntity imgEntity = imgResponse.getEntity();
            // Do not use EntityUtils here: it is meant for reading text content
            InputStream inputStream = imgEntity.getContent();

            //3.3.2: Create a local output stream that writes to a file
            // e.g. http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
            String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
            FileOutputStream outputStream = new FileOutputStream(imgFileName);

            //3.3.3: Connect the two streams and write the data to local disk
            int len;
            byte[] b = new byte[1024];
            while ((len = inputStream.read(b)) != -1) {
                outputStream.write(b, 0, len);
            }

            //3.3.4: Release resources
            outputStream.close();
            inputStream.close();
            imgResponse.close();

            //3.4: Parse spu and sku
            String skuValue = li.attr("data-sku");
            String spuValue = li.attr("data-spu");
            // Fall back to the sku when the li carries no spu
            if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

            //3.5: Parse the product title
            Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
            String title = ems.text();

            //3.6: Parse the product price
            Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
            String price = priceLiEls.text();

            //3.7: Build the product detail URL
            String itemUrl = "https://item.jd.com/" + skuValue + ".html";

            //3.8: Wrap the data in an Item
            Item item = new Item(null,
                    Long.parseLong(spuValue),
                    Long.parseLong(skuValue),
                    title,
                    Double.parseDouble(price),
                    imgFileName,
                    itemUrl,
                    new Date().toLocaleString(),
                    new Date().toLocaleString()
            );
            //3.9: Add each parsed Item to the list
            itemList.add(item);
        }
        return itemList;
    }

     

     

    • The complete code after extraction

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.UUID;

    public class JdSpider {

        public static void main(String[] args) throws Exception {
            int page = 1;
            //1. Define the first-page URL
            String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";

            //2. Send the request and fetch the data with HttpClient
            //2.1: Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();

            while (page <= 100) {
                System.out.println("Processing page: " + page);
                System.out.println("Page URL: " + indexUrl);

                String html = getHtml(indexUrl, httpClient);
                if (html != null) {
                    //3. Parse the data with jsoup
                    List<Item> itemList = parseHtmlToListItem(httpClient, html);

                    System.out.println("Items found: " + itemList.size());

                    //4. Save the data to MySQL
                    JDItemDao jdItemDao = new JDItemDao();
                    jdItemDao.saveItem(itemList);
                }

                //5. Move on to the next page; done outside the if so that a failed page cannot loop forever
                page++;
                indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";
            }

            //6. Release resources: never put this inside the while loop
            httpClient.close();

        }


        // Parse the data of one page
        public static List<Item> parseHtmlToListItem(CloseableHttpClient httpClient, String html) throws IOException {

            //3. Parse the data with jsoup
            //3.1: Build the Document object from the html
            Document document = Jsoup.parse(html);

            //3.2: Select the product li tags: each li holds the details of one product
            Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

            List<Item> itemList = new ArrayList<>();
            for (Element li : lis) {
                //3.3: Get the image URL of each product and download the image
                Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                String imgUrl = "https:" + imgs.attr("src");

                //3.3.1: Request the image URL and read the response as a byte stream
                HttpGet imgGet = new HttpGet(imgUrl);
                CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                HttpEntity imgEntity = imgResponse.getEntity();
                // Do not use EntityUtils here: it is meant for reading text content
                InputStream inputStream = imgEntity.getContent();

                //3.3.2: Create a local output stream that writes to a file
                // e.g. http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                FileOutputStream outputStream = new FileOutputStream(imgFileName);

                //3.3.3: Connect the two streams and write the data to local disk
                int len;
                byte[] b = new byte[1024];
                while ((len = inputStream.read(b)) != -1) {
                    outputStream.write(b, 0, len);
                }

                //3.3.4: Release resources
                outputStream.close();
                inputStream.close();
                imgResponse.close();

                //3.4: Parse spu and sku
                String skuValue = li.attr("data-sku");
                String spuValue = li.attr("data-spu");
                // Fall back to the sku when the li carries no spu
                if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                //3.5: Parse the product title
                Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                String title = ems.text();

                //3.6: Parse the product price
                Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                String price = priceLiEls.text();

                //3.7: Build the product detail URL
                String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                //3.8: Wrap the data in an Item
                Item item = new Item(null,
                        Long.parseLong(spuValue),
                        Long.parseLong(skuValue),
                        title,
                        Double.parseDouble(price),
                        imgFileName,
                        itemUrl,
                        new Date().toLocaleString(),
                        new Date().toLocaleString()
                );
                //3.9: Add each parsed Item to the list
                itemList.add(item);
            }
            return itemList;
        }


        // Fetch the html of the given URL; returns null when the status is not 200
        public static String getHtml(String indexUrl, CloseableHttpClient httpClient) throws Exception {

            //2.2: Create the request object: HttpGet or HttpPost
            HttpGet httpGet = new HttpGet(indexUrl);

            //2.3: Set the request information: request headers
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

            //2.4: Send the request and obtain the response object
            //2.6: try-with-resources releases the response on every path, including non-200 statuses
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {

                //2.5: Read the data from the response
                int statusCode = response.getStatusLine().getStatusCode();

                System.out.println("Status code: " + statusCode);
                if (statusCode == 200) {
                    return EntityUtils.toString(response.getEntity(), "UTF-8");
                }

                return null;
            }

        }
    }

  • Original article: https://www.cnblogs.com/shan13936/p/13969718.html