  • HttpClient (Part 2) -- Simulating a Browser to Fetch Web Pages

    I. Setting the User-Agent request header to simulate a browser

       1. When the code from Part 1 is used to request Tuicool (推酷), the site returns the following:

    Page content: <!DOCTYPE html>
    <html>
        <head>
              <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        </head>
        <body>
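            <p>The system has detected that you are not a real person. Because of limited system resources, we can only reject your request. If you have questions, you can contact us via Weibo at http://weibo.com/tuicool2012/ .</p>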
        </body>
    </html>

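      This happens because the site restricts crawlers. One way around it is to set the User-Agent request header so that the request looks like it comes from a browser. The code is as follows: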

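    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class UserAgentDemo {   // class name is arbitrary

        /**
         * Fetch a web page with a GET request.
         * @param args unused
         * @throws IOException if the request fails or the response cannot be read
         */
        public static void main(String[] args) throws IOException {
            // Create the HttpGet instance and present a browser User-Agent
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
            // try-with-resources closes the response and the client even if an exception is thrown
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();   // response body; null if there is none
                if (entity != null) {
                    String result = EntityUtils.toString(entity, "UTF-8");
                    System.out.println("Page content: " + result);
                }
            }
        }
    }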

       Setting this header on the HttpGet request is enough to make the request look like it comes from a browser.
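       The header itself is nothing more than a name/value pair attached to the request, so the same idea carries over to other HTTP APIs. As a minimal sketch, the snippet below uses the JDK's own java.net.HttpURLConnection instead of the HttpClient library from this article; the class name UserAgentSketch and the helper prepare are made up for illustration. It builds the request and reads the header back without sending anything over the network.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, not part of HttpClient.
class UserAgentSketch {
    // Same Firefox User-Agent string as in the article's example.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0";

    // Prepares a GET request that carries the browser User-Agent.
    // openConnection() only builds the request object; nothing is sent
    // until connect() or getInputStream() is called.
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", BROWSER_UA);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://www.tuicool.com");
        // Read the header back from the request that has not been sent yet.
        System.out.println("User-Agent: " + conn.getRequestProperty("User-Agent"));
    }
}
```

       Whether a given site accepts the request still depends on its own checks; a browser-like User-Agent only removes the most obvious sign of an automated client.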

    II. Getting the response Content-Type

       Use entity.getContentType().getValue() to read the Content-Type of the response. The code is as follows:

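    import java.io.IOException;

    import org.apache.http.Header;
    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class ContentTypeDemo {   // class name is arbitrary

        public static void main(String[] args) throws IOException {
            // Create the HttpGet instance and present a browser User-Agent
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
            // try-with-resources closes the response and the client even if an exception is thrown
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();   // response body; null if there is none
                // The Content-Type header may be absent, so check before calling getValue()
                Header contentType = (entity != null) ? entity.getContentType() : null;
                if (contentType != null) {
                    System.out.println("Content-Type: " + contentType.getValue());
                }
            }
        }
    }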
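       The value returned by getValue() is the raw header text, which usually combines a media type with optional parameters, for example text/html; charset=utf-8 as in the meta tag shown earlier. The sketch below shows one way to split such a value using plain string handling; ContentTypeSketch and its two methods are hypothetical names, and the parsing is deliberately simplified (it does not handle every quoting rule the HTTP specification allows).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of HttpClient.
class ContentTypeSketch {
    // Returns the media type without parameters, e.g. "text/html".
    static String mediaType(String headerValue) {
        int semi = headerValue.indexOf(';');
        String type = semi < 0 ? headerValue : headerValue.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the charset named by the "charset" parameter,
    // or the fallback when the parameter is missing or not usable.
    static Charset charset(String headerValue, Charset fallback) {
        String[] parts = headerValue.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                String name = kv[1].trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String value = "text/html; charset=utf-8";
        System.out.println("Media type: " + mediaType(value));
        System.out.println("Charset: " + charset(value, StandardCharsets.ISO_8859_1));
    }
}
```

       Knowing the charset matters when decoding the body, since the earlier example passes "UTF-8" to EntityUtils.toString by hand. HttpClient 4.x also provides the org.apache.http.entity.ContentType class for this purpose, which is likely the better choice in real code.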

    III. Getting the response status

      200 -- OK (request succeeded)

      403 -- Forbidden (request refused)

      500 -- Internal Server Error

      404 -- Not Found (page not found)

      Use response.getStatusLine().getStatusCode() to read the response status. The code is as follows:

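    import java.io.IOException;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class StatusCodeDemo {   // class name is arbitrary

        public static void main(String[] args) throws IOException {
            // Create the HttpGet instance and present a browser User-Agent
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
            // try-with-resources closes the response and the client even if an exception is thrown
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                int state = response.getStatusLine().getStatusCode();   // numeric HTTP status, e.g. 200
                System.out.println("Response status: " + state);
            }
        }
    }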
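      Once the numeric status is available, a crawler typically branches on it, for example reading the body only when the request succeeded. The following is a small sketch of that decision in plain Java; StatusSketch, describe, statusClass and isSuccess are hypothetical names, and the class ranges follow the usual HTTP convention that the first digit identifies the category.

```java
// Hypothetical helper class, not part of HttpClient.
class StatusSketch {
    // True for any 2xx status, the range that signals success.
    static boolean isSuccess(int code) {
        return code >= 200 && code < 300;
    }

    // Standard reason phrase for a few common codes,
    // falling back to the general class of the code.
    static String describe(int code) {
        switch (code) {
            case 200: return "OK";
            case 400: return "Bad Request";
            case 403: return "Forbidden";
            case 404: return "Not Found";
            case 500: return "Internal Server Error";
            default:  return statusClass(code);
        }
    }

    // General category derived from the first digit of the code.
    static String statusClass(int code) {
        switch (code / 100) {
            case 1: return "Informational";
            case 2: return "Success";
            case 3: return "Redirection";
            case 4: return "Client Error";
            case 5: return "Server Error";
            default: return "Unknown";
        }
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            String action = isSuccess(code) ? "read the body" : "skip or retry later";
            System.out.println(code + " " + describe(code) + " -> " + action);
        }
    }
}
```

      In the HttpClient example above, the value passed to these helpers would be the state variable returned by getStatusCode().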

     IV. HttpClient learning resources

      Open-source blog system: HttpClient

  • Original article: https://www.cnblogs.com/xbq8080/p/7507854.html