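  • (2) Simulating a Browser to Fetch Web Pages

    Section 1: Setting the User-Agent Request Header to Simulate a Browser

    With HttpClient, we can set the User-Agent request header so that a request looks like it comes from a browser.

    For example, suppose we request www.tuicool.com

    using the code from the previous part: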

        package com.javaxk.httpclient.chap02;

        import org.apache.http.HttpEntity;
        import org.apache.http.client.methods.CloseableHttpResponse;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClients;
        import org.apache.http.util.EntityUtils;

        public class Demo1 {

            public static void main(String[] args) throws Exception {
                CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
                HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the HttpGet instance
                CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // read and print the page content
                response.close(); // close the response
                httpClient.close(); // close the HttpClient
            }

        }

    Returned content:

    Page content:<!DOCTYPE html>
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <p>The system has detected that this request does not look like it comes from a real person. Because of limited system resources, we have to reject your request. If you have questions, you can contact us via Weibo at http://weibo.com/tuicool2012/.</p>
    </body>
    </html>

    Let's simulate a browser by setting the User-Agent header.

    Add httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header

        package com.javaxk.httpclient.chap02;

        import org.apache.http.HttpEntity;
        import org.apache.http.client.methods.CloseableHttpResponse;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClients;
        import org.apache.http.util.EntityUtils;

        public class Demo1 {

            public static void main(String[] args) throws Exception {
                CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
                HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the HttpGet instance
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
                CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // read and print the page content
                response.close(); // close the response
                httpClient.close(); // close the HttpClient
            }

        }

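    Run it again; with the User-Agent set, the site should return the real page instead of the rejection notice.

    With Firefox's Firebug, we can also see the other request headers that a browser sends.

    All of them can be set as key/value pairs with the setHeader method to simulate a browser request.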
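
    As a rough sketch of the key/value idea, the headers a browser sends can be collected in a map and then applied one by one with setHeader. The class name BrowserHeaders and the Accept and Accept-Language values are illustrative assumptions, not taken from the original post; only the User-Agent string comes from the examples in this section. The block uses only the JDK so that it runs on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects browser-like request headers as key/value pairs.
public class BrowserHeaders {

    // Builds the header map; LinkedHashMap keeps the insertion order.
    public static Map<String, String> firefoxLike() {
        Map<String, String> headers = new LinkedHashMap<>();
        // User-Agent string used in the examples of this section
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // Assumed values, typical of what a browser sends, chosen for illustration
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        return headers;
    }

    public static void main(String[] args) {
        // With HttpClient, each entry would be applied as httpGet.setHeader(name, value).
        firefoxLike().forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

    In the Demo1 code, the same map could be applied with a loop that calls httpGet.setHeader(name, value) for each entry before executing the request.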


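    Section 2: Getting the Response Content Type (Content-Type)

    HttpClient can read the content type of a response.

    Every response body has a type, namely its Content-Type.

    With Firefox's Firebug, we can see it in the response headers.

    We can also get it through the HttpClient API:

    calling getContentType().getValue() on the HttpEntity returns the response content type.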

        package com.javaxk.httpclient.chap02;

        import org.apache.http.HttpEntity;
        import org.apache.http.client.methods.CloseableHttpResponse;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClients;
        import org.apache.http.util.EntityUtils;

        public class Demo2 {

            public static void main(String[] args) throws Exception {
                CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
                HttpGet httpGet = new HttpGet("http://www.javaxk.com"); // create the HttpGet instance
                //HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar"); // create the HttpGet instance
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
                CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
                //System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // read and print the page content
                response.close(); // close the response
                httpClient.close(); // close the HttpClient
            }

        }

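    Output:

    Content-Type:text/html; charset=utf-8

    Ordinary web pages are text/html, and some also include a charset.

    For example, requesting www.tuicool.com outputs:

    Content-Type:text/html; charset=utf-8

    If we request a JS file, for example http://www.javaxk.com/include/dedeajax2.js

    the output is:

    Content-Type:application/javascript

    If we request a file, for example http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar

    the output is:

    Content-Type:application/java-archive

    There are many more Content-Type values. What use is this to a crawler? While crawling, we can use the

    Content-Type to pick out the pages we want to crawl, or to filter out the ones we do not need.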
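
    A minimal sketch of such a filter follows. It strips any parameters such as charset from the Content-Type value and keeps only text/html responses. The class name ContentTypeFilter and the "HTML only" policy are assumptions made for illustration; a real crawler may want to accept other types as well. The block uses only the JDK.

```java
import java.util.Locale;

// Hypothetical helper: decides from a Content-Type value whether a crawler
// should treat the response as a web page.
public class ContentTypeFilter {

    // Returns the media type without parameters,
    // e.g. "text/html; charset=utf-8" -> "text/html".
    public static String mediaType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Assumed policy: only HTML pages are parsed for links and text.
    public static boolean isHtmlPage(String contentType) {
        return mediaType(contentType).equals("text/html");
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=utf-8",
            "application/javascript",
            "application/java-archive"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> crawl as page? " + isHtmlPage(sample));
        }
    }
}
```

    In the Demo2 code, the argument would be the string returned by entity.getContentType().getValue().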


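    Section 3: Getting the Response Status

    200 OK
    403 Forbidden
    500 Server error
    404 Page not found



    HttpClient can read the status of a response.

    When HttpClient sends a request to a server,

    a successful request normally returns status code 200.

    Not every request succeeds, though.

    For example, if the requested address does not exist, the server returns 404.

    If the server hits an internal error, it returns 500.

    Some servers have anti-scraping protection: if you collect data too frequently, they return 403 and reject your request.

    There is a way around this, though: using proxy IPs, which a later part covers.

    To get the status code, we can call getStatusLine().getStatusCode() on the CloseableHttpResponse object: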

        package com.javaxk.httpclient.chap02;

        import org.apache.http.HttpEntity;
        import org.apache.http.client.methods.CloseableHttpResponse;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClients;
        import org.apache.http.util.EntityUtils;

        public class Demo2 {

            public static void main(String[] args) throws Exception {
                CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
                HttpGet httpGet = new HttpGet("http://www.javaxk.com"); // create the HttpGet instance
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
                CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
                System.out.println("Status:" + response.getStatusLine().getStatusCode()); // print the status code
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
                //System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // read and print the page content
                response.close(); // close the response
                httpClient.close(); // close the HttpClient
            }

        }

    Output:

    Status:200

    Content-Type:text/html;charset=UTF-8

    If we switch to another page, http://www.javaxk.com/a.jsp,

    which does not exist,

    the server returns 404:

        package com.javaxk.httpclient.chap02;

        import org.apache.http.HttpEntity;
        import org.apache.http.client.methods.CloseableHttpResponse;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.CloseableHttpClient;
        import org.apache.http.impl.client.HttpClients;
        import org.apache.http.util.EntityUtils;

        public class Demo2 {

            public static void main(String[] args) throws Exception {
                CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
                HttpGet httpGet = new HttpGet("http://www.javaxk.com/a.jsp"); // create the HttpGet instance
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
                CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
                System.out.println("Status:" + response.getStatusLine().getStatusCode()); // print the status code
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
                //System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // read and print the page content
                response.close(); // close the response
                httpClient.close(); // close the HttpClient
            }

        }

    Output:

    Status:404
    Content-Type:text/html
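
    The status codes covered in this section could drive what a crawler does next. The sketch below is one possible mapping, not a fixed rule: the class name StatusPolicy, the Action names, and the choice of reaction for each code are all assumptions made for illustration. The block uses only the JDK.

```java
// Hypothetical helper: maps an HTTP status code to the next step a crawler might take.
public class StatusPolicy {

    public enum Action { PROCESS, SKIP, SLOW_DOWN_OR_USE_PROXY, RETRY_LATER, OTHER }

    // Assumed policy, based on the four status codes discussed in this section.
    public static Action decide(int statusCode) {
        switch (statusCode) {
            case 200: return Action.PROCESS;                // request succeeded, parse the page
            case 404: return Action.SKIP;                   // page does not exist
            case 403: return Action.SLOW_DOWN_OR_USE_PROXY; // server rejected the request
            case 500: return Action.RETRY_LATER;            // server-side error
            default:  return Action.OTHER;                  // anything not covered here
        }
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            System.out.println(code + " -> " + decide(code));
        }
    }
}
```

    In the Demo2 code, the argument would be the value of response.getStatusLine().getStatusCode().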

  • Original article: https://www.cnblogs.com/wishwzp/p/7059040.html