zoukankan      html  css  js  c++  java
  • JAVA爬虫入门学习

    之前分享一个简单的获取数据流的方法,

    今天分享下学习到的SpringBoot+Maven的框架

    首先pom.xml

     1 <!-- 继承父包 -->
     2     <parent>
     3         <groupId>org.springframework.boot</groupId>
     4         <artifactId>spring-boot-starter-parent</artifactId>
     5         <version>1.5.9.RELEASE</version>
     6     </parent>
     7 
     8     <dependencies>
     9         <!-- spring-boot使用jetty容器配置begin -->
    10         <dependency>
    11             <groupId>org.springframework.boot</groupId>
    12             <artifactId>spring-boot-starter-web</artifactId>
    13             <!-- 排除默认的tomcat,引入jetty容器. -->
    14             <exclusions>
    15                 <exclusion>
    16                     <groupId>org.springframework.boot</groupId>
    17                     <artifactId>spring-boot-starter-tomcat</artifactId>
    18                 </exclusion>
    19             </exclusions>
    20         </dependency>
    21         <!-- jetty 容器. -->
    22         <dependency>
    23             <groupId>org.springframework.boot</groupId>
    24             <artifactId>spring-boot-starter-jetty</artifactId>
    25         </dependency>
    26         <dependency>
    27             <groupId>org.springframework.boot</groupId>
    28             <artifactId>spring-boot-starter-test</artifactId>
    29             <scope>test</scope>
    30         </dependency>
    31         <!-- spring-boot使用jetty容器配置end -->
    32 
    33         <dependency>
    34             <groupId>org.jsoup</groupId>
    35             <artifactId>jsoup</artifactId>
    36             <version>1.10.3</version>
    37         </dependency>
    38 
    39     </dependencies>
    其中Jetty 是一个开源的servlet容器,它为基于Java的web容器,例如JSP和servlet提供运行环境。
    Jetty是使用JAVA编写的,它的API以一组JAR包的形式发布。
    开发人员可以将Jetty容器实例化成一个对象,可以迅速为一些独立运行(stand-alone)的Java应用提供网络和web连接。
    Jetty提供了一下HTTPClient的一些发包方法。
    下面是控制器内写的接口
     1 @Controller  /*返回视图页名称*/
     2 //@RestController  返回json数据类型
     3 public class MainController {
     4 
     5     @Autowired//根据类型匹配需要配合Qualifier
     6     //@Qualifier//名称
     7     //@Resource//根据名称自动匹配,区别于@Autowired
     8     public GetCaptcha Captcha;
     9 
    10     @GetMapping("Captcha")//等同于下面
    11     //@RequestMapping(value = "name",method = RequestMethod.GET)
    12     public @ResponseBody String index(HttpServletRequest request){//错误需要另外处理
    13 
    14         //初始化
    15         HttpClient httpClient = Captcha.GetHttpClient();
    16         CaptchaData captchaData = new CaptchaData();
    17         captchaData.setHttpClient(httpClient);
    18 
    19         if(httpClient==null){
    20             return "创建对象错误,检查代理是否到期";
    21         }
    22 
    23         CaptchaData imgValue = Captcha.GetCaptcha(captchaData);
    24 
    25         request.getSession().setAttribute("captchaData",imgValue);
    26 
    27         String htmlText = imgValue.getImg()+"<form action="/json" method="POST">
    " +
    28                 "请输入用户名:<br>
    " +
    29                 "<input type="text" name="name" value="张三">
    " +
    30                 "<br>
    " +
    31                 "身份证号:<br>
    " +
    32                 "<input type="text" name="cardId" value="123456789098765432">
    " +
    33                 "<br>
    " +
    34                 "验证码:<br>
    " +
    35                 "<input type="text" name="captcha" value="">
    " +
    36                 "<br><br>
    " +
    37                 "<input type="submit" value="提交">
    " +
    38                 "</form> ";
    39         return htmlText;//返回视图名称
    40     }
    41 
    42     @PostMapping("json")
    43     public @ResponseBody String resp(Userdb user,HttpServletRequest request){
    44 
    45         CaptchaData captchaData = (CaptchaData) request.getSession().getAttribute("captchaData");
    46         HttpClient httpClient = captchaData.getHttpClient();
    47 
    48         String returnValue = "";
    49         try {
    50             returnValue = Captcha.PostData(user,captchaData);
    51         }catch (Exception e){
    52             returnValue = "数据传输出现错误,请检查数据信息";
    53         }finally {
    54             try{
    55                 httpClient.stop();
    56             }
    57             catch (Exception e){
    58                 returnValue += " 
     HttpClient关闭异常";
    59             }
    60         }
    61         return returnValue;
    62     }
    63 }

    在控制器中调用的方法分别包括:

    获取验证码、发送数据到目标地址、初始化httpClient对象

    代码如下:

     1 public CaptchaData GetCaptcha(CaptchaData captchaData) {
     2         HttpClient httpClient = GetHttpClient();
     3         try {
     4             ContentResponse response1 = httpClient.newRequest("https://www.baidu.com").send();
     5             //System.out.println(response1.getContentAsString());
     6             //首页
     7             String TOKEN;
     8             response1 = httpClient.newRequest("XXXXXXXX/index1.do")
     9                     .method(HttpMethod.GET)
    10                     .header(HttpHeader.REFERER, "https://XXXXXXXX/")
    11                     .send();
    12 
    13             //点击进入页面
    14             response1 = httpClient.newRequest("XXXXXXXX/xxxxx.xxx?xxxx=initLogin")
    15                     .method(HttpMethod.GET)
    16                     .header(HttpHeader.REFERER, "https://XXXXXXXX/index1.do")
    17                     .send();
    18 
    19             //点击注册页面
    20             response1 = httpClient.newRequest("https://XXXXXXXX/xxxxx?method=initReg")
    21                     .method(HttpMethod.POST)
    22                     .header("Referer", "https://XXXXXXXX/loginreg.jsp")
    23                     .send();
    24 
    25             //获取TOKEN
    26             Document documentUrlOne = Jsoup.parse(response1.getContentAsString());
    27             Elements firstNode = documentUrlOne.select("div.white-con");
    28             Elements formNode = firstNode.select("form[action]");
    29             Element inputNode = formNode.select("input").first();
    30             TOKEN = inputNode.attr("value");
    31             //存入
    32             captchaData.setTOKEN(TOKEN);
    33             //TOKEN需要在注册提交页面生成的才有效
    34             //System.out.println(TOKEN+"---获取org.apache.struts.taglib.html.TOKEN----");
    35 
    36             //获取验证码。
    37             String timeString = String.valueOf(new Date().getTime());
    38             String ImgStr = "https://XXXXXXXX/imgrc.do?a=" + timeString;
    39 
    40             response1 = httpClient.newRequest(ImgStr)
    41                     .method(HttpMethod.GET)
    42                     .header("Referer", "https://XXXXXXXXXXXXXXXX.do")
    43                     .send();
    44             //图片类型转换
    45             String encodeBase64 = Base64.getEncoder().encodeToString(response1.getContent());
    46             //<img src=''/>
    47             //String filepath = "XXXXXXXXa.png";
    48             //convertBase64DataToImage(encodeBase64, filepath);
    49             //System.out.printf("图片下载成功")
    50             captchaData.setImg("<img src='data:image/png;base64," + encodeBase64 + "'/>");
    51             captchaData.setResponse1(response1);
    52             return captchaData;
    53         } catch (Exception e) {
    54             e.printStackTrace();
    55         } finally {
    56             return captchaData;
    57         }
    58     }
    public String PostData(Userdb User, CaptchaData captchaData) {
    
            String Name = User.getName();
            String CardId = User.getCardId();
            String CAPTCHA = User.getCaptcha();
            //发送信息提交
            String TOKEN =captchaData.getTOKEN();
            String postdata = "org.apache.struts.taglib.html.TOKEN=" + TOKEN +
                    "&method=checkIdentity" +
                    "&userInfoVO.name=" + Name +
                    "&userInfoVO.certType=0" +
                    "&userInfoVO.certNo=" + CardId +
                    "&_%40IMGRC%40_=" + CAPTCHA +
                    "&1=on";
            HttpClient httpClient = captchaData.getHttpClient();
    
    
            try{
                ContentResponse response1 = httpClient.newRequest("https://xxxxxxxxxdo?" + postdata)
                        .method(HttpMethod.POST)
                        .header("Referer", "https:/xxxxxxxx.do")
                        .send();
    
                //  获取结果。检查是否出现错误
                Document documentPage = Jsoup.parse(response1.getContentAsString());
                Elements ErrorNode = documentPage.select("div.erro_div1");
                Elements ErrorValueNode = ErrorNode.select("span[id=_error_field_]");
                String ErrorValue = ErrorValueNode.text();
                if (ErrorValue.equals("")) {
                    return "身份信息正确---END";
                    //System.out.println("身份信息正确---END");
                } else {
                    return ErrorValue + "cw---END";
                    //System.out.println(ErrorValue+"---END");
                }
            }catch (Exception e){
                return "异常哎";
            }
    
        }
    
    
    public HttpClient GetHttpClient() {
            //创建HTTPClient对象,
            //初始化
            SslContextFactory sslContextFactory = new SslContextFactory();
            sslContextFactory.setTrustAll(true);
            HttpClient httpClient = new HttpClient(sslContextFactory);
            httpClient.setConnectTimeout(10000);
            httpClient.setFollowRedirects(true);
            httpClient.setUserAgentField(new HttpField("User-Agent", "xxxxxxxxxxxxxxxxxxxxxxxxxx"));
    
            ProxyConfiguration proxyConfig = httpClient.getProxyConfiguration();
            HttpProxy proxy = new HttpProxy("xxxxxx", xxx);
            proxyConfig.getProxies().add(proxy);
            try {
                httpClient.start();
                return httpClient;
            } catch (Exception e) {
                return null;
            }
    
        }
    
    
    1 //下载图片
    2     public void convertBase64DataToImage(String base64ImgData, String filePath) throws IOException {
    3         BASE64Decoder d = new BASE64Decoder();
    4         byte[] bs = d.decodeBuffer(base64ImgData);
    5         FileOutputStream os = new FileOutputStream(filePath);
    6         os.write(bs);
    7         os.close();
    8     }
    最后是application.yml配置
     1 server:
     2     port: 8080
     3 
     4 
     5 spring:
     6     http:
     7         encoding:
     8         charset: utf-8
     9         enabled: true
    10         force: true
    11     thymeleaf:
    12         mode: HTML5
    13         encoding: UTF-8
    14         content-type: text/html
    15         cache: false
    16         prefix: classpath:/templates/
    17         suffix: .html
    18     resources:
    19         static-locations: classpath:/static/
    
    
    
    一个完整的爬虫就实现了,使用服务接口的形式
  • 相关阅读:
    GUI 监听事件 (两个按钮,实现同一个监听)
    GUI 监听事件
    GUI 练习
    GUI 之表格布局
    GUI 之边界布局
    GUI 之流布局
    [转帖]Linux 下解压 rar 文件
    Linux 启动、停止、重启jar包脚本
    关于linux下,ls vi等命令失效的解决方法(配置下环境变量出现问题)
    超好用的UnixLinux 命令技巧 大神为你详细解读
  • 原文地址:https://www.cnblogs.com/yishilin/p/8398142.html
Copyright © 2011-2022 走看看