  • Prerequisites for writing a crawler (spider): sending HTTP requests with Java

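      We send HTTP requests with the native JDK API rather than the Apache library, because that third-party library changes too quickly: every release brings sizeable changes. For a programmer this causes a lot of trouble; for example, code written against an earlier version can no longer be reused.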

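    How it works

    The Java code for sending a GET request and a POST request is much the same; only a few parameter settings differ. The steps are:
    1. Create a uniform resource locator (java.net.URL) and open a connection from it (java.net.URLConnection)
    2. Set the request parameters
    3. Send the request (GET and POST differ here)
    4. Read the response as an input stream
    5. Close the input stream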
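
    The five steps above can be sketched in one short method. This is a minimal sketch, not the article's own listing: the class name FetchSketch, the 5-second timeouts, the UTF-8 charset and the default address in main are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class FetchSketch {

    /** Reads the resource at the given address and returns its text, ending every line with a newline. */
    static String fetch(String address) throws IOException {
        // Step 1: build the URL and open a connection from it (no I/O happens yet)
        URLConnection connection = new URL(address).openConnection();
        // Step 2: set request parameters (values here are illustrative assumptions)
        connection.setRequestProperty("Accept", "*/*");
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        // Steps 3 and 4: getInputStream() sends the request and exposes the response
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } // Step 5: try-with-resources closes the stream here
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default address; pass another URL as the first argument
        String address = args.length > 0 ? args[0] : "http://localhost:8080/OneHttpServer/";
        System.out.print(fetch(address));
    }
}
```

    Because URLConnection is protocol-neutral, the same method also reads file: URLs, which is a convenient way to try it without a server.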

    Fetching the Baidu home page

    Here is the code:

    package test;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.List;
    import java.util.Map;
    
    public class HttpRequest {
        /**
         * Sends a GET request to the given URL.
         * 
         * @param url
         *            the URL to send the request to
         * @param param
         *            the request parameters, in the form name1=value1&name2=value2
         * @return the response body of the remote resource the URL refers to
         */
        public static String sendGet(String url, String param) {
            StringBuilder result = new StringBuilder();
            BufferedReader in = null;
            try {
                String urlNameString = url + "?" + param;
                URL realUrl = new URL(urlNameString);
                // Open a connection to the URL
                URLConnection connection = realUrl.openConnection();
                // Set common request headers
                connection.setRequestProperty("accept", "*/*");
                connection.setRequestProperty("connection", "Keep-Alive");
                connection.setRequestProperty("user-agent",
                        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
                // Establish the actual connection
                connection.connect();
                // Get all response header fields
                Map<String, List<String>> map = connection.getHeaderFields();
                // Print every response header field
                for (String key : map.keySet()) {
                    System.out.println(key + "--->" + map.get(key));
                }
                // Wrap the response stream in a BufferedReader to read the response
                in = new BufferedReader(new InputStreamReader(
                        connection.getInputStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    result.append(line);
                    System.out.println("@2" + line);
                }
            } catch (Exception e) {
                System.out.println("Exception while sending GET request: " + e);
                e.printStackTrace();
            }
            // Close the input stream in a finally block
            finally {
                try {
                    if (in != null) {
                        in.close();
                    }
                } catch (Exception e2) {
                    e2.printStackTrace();
                }
            }
            return result.toString();
        }
    
        /**
         * Sends a POST request to the given URL.
         * 
         * @param url
         *            the URL to send the request to
         * @param param
         *            the request parameters, in the form name1=value1&name2=value2
         * @return the response body of the remote resource the URL refers to
         */
        public static String sendPost(String url, String param) {
            PrintWriter out = null;
            BufferedReader in = null;
            StringBuilder result = new StringBuilder();
            try {
                URL realUrl = new URL(url);
                // Open a connection to the URL
                URLConnection conn = realUrl.openConnection();
                // Set common request headers
                conn.setRequestProperty("accept", "*/*");
                conn.setRequestProperty("connection", "Keep-Alive");
                conn.setRequestProperty("user-agent",
                        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
                // setDoOutput(true) is required to send a request body;
                // doInput already defaults to true and is set here only for clarity
                conn.setDoOutput(true);
                conn.setDoInput(true);
                // Get the output stream of the URLConnection
                out = new PrintWriter(conn.getOutputStream());
                // Send the request parameters
                out.print(param);
                // Flush the output buffer
                out.flush();
                // Wrap the response stream in a BufferedReader to read the response
                in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    result.append(line);
                }
            } catch (Exception e) {
                System.out.println("Exception while sending POST request: " + e);
                e.printStackTrace();
            }
            // Close the output and input streams in a finally block
            finally{
                try{
                    if(out!=null){
                        out.close();
                    }
                    if(in!=null){
                        in.close();
                    }
                }
                catch(IOException ex){
                    ex.printStackTrace();
                }
            }
            return result.toString();
        }
        
        public static void main(String[] args) {
            // Send a GET request (commented out)
            //String s=HttpRequest.sendGet("https://www.baidu.com", "key=123&v=456");
            // Send a POST request with an empty body instead
            String s=HttpRequest.sendPost("https://www.baidu.com", "");
            System.out.println("@1" + s);
            
            // Send a POST request with parameters
            //String sr=HttpRequest.sendPost("https://www.baidu.com", "key=123&v=456");
            //System.out.println(sr);
        }    
    }

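      The two methods print different output to the console: in this test only the POST request returned the complete HTML. The reason is that when the request is sent with GET, the server responds with a page asking the client to redirect to the http URL: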

      <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
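
      Note that this is a redirect written into the HTML, not an HTTP 3xx status, so the connection classes will not follow it for you. If a spider wants to follow it, it has to pull the target out of the page itself. The sketch below shows one way to do that with a regular expression; the class name MetaRefresh is hypothetical, and the pattern assumes the attributes appear in the same order as in the snippet above, so treat it as an illustration rather than a general HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefresh {

    // Matches <meta http-equiv="refresh" content="N;url=TARGET"> and captures TARGET
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv=[\"']refresh[\"']\\s+content=[\"']\\s*\\d+\\s*;\\s*url=([^\"'>]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the target of the first meta refresh tag in the page, if there is one. */
    static Optional<String> refreshTarget(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<noscript><meta http-equiv=\"refresh\" "
                + "content=\"0;url=http://www.baidu.com/\"></noscript>";
        // prints http://www.baidu.com/
        System.out.println(refreshTarget(html).orElse("(no meta refresh found)"));
    }
}
```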

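    Tip:

    After httpUrlConnection.setDoOutput(true); you can call conn.getOutputStream().write()
    After httpUrlConnection.setDoInput(true); you can call conn.getInputStream().read();

    A GET request has no use for conn.getOutputStream(), because its parameters are appended directly to the URL; that is why doOutput defaults to false.
    A POST request (for example, a file upload) has to send a large amount of data to the server. That data goes in the HTTP body, so once the connection is opened the client must write it to the server.

    Because conn.getInputStream() is always used to read the server's response, doInput defaults to true.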
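
    The two defaults can be checked without sending anything, since openConnection() only creates the connection object. A small sketch, where the class name DoFlagsDemo is hypothetical and the address is the local test address used later in this post:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class DoFlagsDemo {

    /** Formats the current doInput/doOutput flags of a connection. */
    static String describe(URLConnection conn) {
        return "doInput=" + conn.getDoInput() + ", doOutput=" + conn.getDoOutput();
    }

    public static void main(String[] args) throws IOException {
        // Nothing is sent here: the request only goes out on connect() or getInputStream()
        URLConnection conn = new URL("http://localhost:8080/OneHttpServer/").openConnection();
        System.out.println(describe(conn)); // doInput=true, doOutput=false
        conn.setDoOutput(true);             // must be called before the connection is made
        System.out.println(describe(conn)); // doInput=true, doOutput=true
    }
}
```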

    Both methods are described in detail below.

    Using the GET method:

    package test;
    
    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    
    public class HttpGetRequest {
    
        /**
         * Main
         * @param args
         * @throws Exception 
         */
        public static void main(String[] args) throws Exception {
            System.out.println(doGet());
        }
        
        /**
         * Get Request
         * @return
         * @throws Exception
         */
        public static String doGet() throws Exception {
            URL localURL = new URL("http://localhost:8080/OneHttpServer/");
            URLConnection connection = localURL.openConnection();
            HttpURLConnection httpURLConnection = (HttpURLConnection)connection;
            
            httpURLConnection.setRequestProperty("Accept-Charset", "utf-8");
            httpURLConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            
            InputStream inputStream = null;
            InputStreamReader inputStreamReader = null;
            BufferedReader reader = null;
            StringBuffer resultBuffer = new StringBuffer();
            String tempLine = null;
            
            if (httpURLConnection.getResponseCode() >= 300) {
                throw new Exception("HTTP request failed, response code is " + httpURLConnection.getResponseCode());
            }
            
            try {
                inputStream = httpURLConnection.getInputStream();
                inputStreamReader = new InputStreamReader(inputStream);
                reader = new BufferedReader(inputStreamReader);
                
                while ((tempLine = reader.readLine()) != null) {
                    resultBuffer.append(tempLine);
                }
                
            } finally {
                
                if (reader != null) {
                    reader.close();
                }
                
                if (inputStreamReader != null) {
                    inputStreamReader.close();
                }
                
                if (inputStream != null) {
                    inputStream.close();
                }
                
            }
            
            return resultBuffer.toString();
        }
        
    }
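
    The reading loop and the three-level finally block in the listing above can be written more compactly on Java 7 or later. The helper below is a sketch under that assumption (the name StreamText.readAll is hypothetical): closing the outermost reader also closes the streams it wraps, the charset is passed explicitly instead of relying on the platform default, and reading by character buffer keeps the line breaks that a readLine() loop drops.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {

    /** Decodes the whole stream with the given charset and closes it afterwards. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // Closing the reader also closes the wrapped InputStream
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for httpURLConnection.getInputStream()
        InputStream in = new ByteArrayInputStream(
                "line 1\nline 2".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```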
    

    Using the POST method:

    package test;
    
    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    
    public class HttpPostRequest {
    
        /**
         * Main
         * @param args
         * @throws Exception 
         */
        public static void main(String[] args) throws Exception {
            System.out.println(doPost());
        }
        
        /**
         * Post Request
         * @return
         * @throws Exception
         */
        public static String doPost() throws Exception {
            String parameterData = "username=nickhuang&blog=http://www.cnblogs.com/nick-huang/";
            
            URL localURL = new URL("http://localhost:8080/OneHttpServer/");
            URLConnection connection = localURL.openConnection();
            HttpURLConnection httpURLConnection = (HttpURLConnection)connection;
            
            httpURLConnection.setDoOutput(true);
            httpURLConnection.setRequestMethod("POST");
            httpURLConnection.setRequestProperty("Accept-Charset", "utf-8");
            httpURLConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            // Content-Length counts bytes, not characters
            httpURLConnection.setRequestProperty("Content-Length", String.valueOf(parameterData.getBytes("UTF-8").length));
            
            OutputStream outputStream = null;
            OutputStreamWriter outputStreamWriter = null;
            InputStream inputStream = null;
            InputStreamReader inputStreamReader = null;
            BufferedReader reader = null;
            StringBuffer resultBuffer = new StringBuffer();
            String tempLine = null;
            
            try {
                outputStream = httpURLConnection.getOutputStream();
                outputStreamWriter = new OutputStreamWriter(outputStream);
                
                outputStreamWriter.write(parameterData.toString());
                outputStreamWriter.flush();
                
                if (httpURLConnection.getResponseCode() >= 300) {
                    throw new Exception("HTTP request failed, response code is " + httpURLConnection.getResponseCode());
                }
                
                inputStream = httpURLConnection.getInputStream();
                inputStreamReader = new InputStreamReader(inputStream);
                reader = new BufferedReader(inputStreamReader);
                
                while ((tempLine = reader.readLine()) != null) {
                    resultBuffer.append(tempLine);
                }
                
            } finally {
                
                if (outputStreamWriter != null) {
                    outputStreamWriter.close();
                }
                
                if (outputStream != null) {
                    outputStream.close();
                }
                
                if (reader != null) {
                    reader.close();
                }
                
                if (inputStreamReader != null) {
                    inputStreamReader.close();
                }
                
                if (inputStream != null) {
                    inputStream.close();
                }
                
            }
    
            return resultBuffer.toString();
        }
    
    }
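
    The listing above sends parameterData exactly as written, which works only while names and values contain no characters that are special in the application/x-www-form-urlencoded format. A sketch of encoding the body first with java.net.URLEncoder follows; the class name FormEncoder is hypothetical and UTF-8 is an assumed charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncoder {

    /** Joins the entries as name=value pairs separated by '&', URL-encoding each part. */
    static String encode(Map<String, String> params) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("username", "nickhuang");
        params.put("blog", "http://www.cnblogs.com/nick-huang/");
        // prints username=nickhuang&blog=http%3A%2F%2Fwww.cnblogs.com%2Fnick-huang%2F
        System.out.println(encode(params));
    }
}
```

    The encoded string contains only ASCII characters, so its character count and its byte count are the same.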
    

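    Encapsulation & reuse

    Wrapped up like this, an instance of a class can hold a reference to a requestor that helps a thread carry out its fetching task.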

    package test;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Iterator;
    import java.util.Map;
    
    
    public class HttpRequestor {
        
        private String charset = "utf-8";
        private Integer connectTimeout = null;
        private Integer socketTimeout = null;
        private String proxyHost = null;
        private Integer proxyPort = null;
        
        /**
         * Do GET request
         * @param url
         * @return
         * @throws Exception
         * @throws IOException
         */
        public String doGet(String url) throws Exception {
            
            URL localURL = new URL(url);
            
            URLConnection connection = openConnection(localURL);
            HttpURLConnection httpURLConnection = (HttpURLConnection)connection;
            
            httpURLConnection.setRequestProperty("Accept-Charset", charset);
            httpURLConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            
            InputStream inputStream = null;
            InputStreamReader inputStreamReader = null;
            BufferedReader reader = null;
            StringBuffer resultBuffer = new StringBuffer();
            String tempLine = null;
            
            if (httpURLConnection.getResponseCode() >= 300) {
                throw new Exception("HTTP request failed, response code is " + httpURLConnection.getResponseCode());
            }
            
            try {
                inputStream = httpURLConnection.getInputStream();
                inputStreamReader = new InputStreamReader(inputStream);
                reader = new BufferedReader(inputStreamReader);
                
                while ((tempLine = reader.readLine()) != null) {
                    resultBuffer.append(tempLine);
                }
                
            } finally {
                
                if (reader != null) {
                    reader.close();
                }
                
                if (inputStreamReader != null) {
                    inputStreamReader.close();
                }
                
                if (inputStream != null) {
                    inputStream.close();
                }
                
            }
    
            return resultBuffer.toString();
        }
        
        /**
         * Do POST request
         * @param url
         * @param parameterMap
         * @return
         * @throws Exception 
         */
        public String doPost(String url, Map parameterMap) throws Exception {
            
            /* Translate the parameter map into a URL-encoded parameter data string */
            StringBuffer parameterBuffer = new StringBuffer();
            if (parameterMap != null) {
                Iterator iterator = parameterMap.keySet().iterator();
                String key = null;
                String value = null;
                while (iterator.hasNext()) {
                    key = (String)iterator.next();
                    if (parameterMap.get(key) != null) {
                        value = (String)parameterMap.get(key);
                    } else {
                        value = "";
                    }
                    
                    // URL-encode names and values so spaces, '&', '=' and non-ASCII text survive form encoding
                    parameterBuffer.append(java.net.URLEncoder.encode(key, charset))
                            .append("=").append(java.net.URLEncoder.encode(value, charset));
                    if (iterator.hasNext()) {
                        parameterBuffer.append("&");
                    }
                }
            }
            
            System.out.println("POST parameter : " + parameterBuffer.toString());
            
            URL localURL = new URL(url);
            
            URLConnection connection = openConnection(localURL);
            HttpURLConnection httpURLConnection = (HttpURLConnection)connection;
            
            httpURLConnection.setDoOutput(true);
            httpURLConnection.setRequestMethod("POST");
            httpURLConnection.setRequestProperty("Accept-Charset", charset);
            httpURLConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            httpURLConnection.setRequestProperty("Content-Length", String.valueOf(parameterBuffer.length()));
            
            OutputStream outputStream = null;
            OutputStreamWriter outputStreamWriter = null;
            InputStream inputStream = null;
            InputStreamReader inputStreamReader = null;
            BufferedReader reader = null;
            StringBuffer resultBuffer = new StringBuffer();
            String tempLine = null;
            
            try {
                outputStream = httpURLConnection.getOutputStream();
                outputStreamWriter = new OutputStreamWriter(outputStream);
                
                outputStreamWriter.write(parameterBuffer.toString());
                outputStreamWriter.flush();
                
                if (httpURLConnection.getResponseCode() >= 300) {
                    throw new Exception("HTTP request failed, response code is " + httpURLConnection.getResponseCode());
                }
                
                inputStream = httpURLConnection.getInputStream();
                inputStreamReader = new InputStreamReader(inputStream);
                reader = new BufferedReader(inputStreamReader);
                
                while ((tempLine = reader.readLine()) != null) {
                    resultBuffer.append(tempLine);
                }
                
            } finally {
                
                if (outputStreamWriter != null) {
                    outputStreamWriter.close();
                }
                
                if (outputStream != null) {
                    outputStream.close();
                }
                
                if (reader != null) {
                    reader.close();
                }
                
                if (inputStreamReader != null) {
                    inputStreamReader.close();
                }
                
                if (inputStream != null) {
                    inputStream.close();
                }
                
            }
    
            return resultBuffer.toString();
        }
    
        private URLConnection openConnection(URL localURL) throws IOException {
            URLConnection connection;
            if (proxyHost != null && proxyPort != null) {
                Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
                connection = localURL.openConnection(proxy);
            } else {
                connection = localURL.openConnection();
            }
            // Apply the configured connect and read timeouts
            renderRequest(connection);
            return connection;
        }
        
        /**
         * Apply the timeout settings of this requestor to the connection
         * @param connection
         */
        private void renderRequest(URLConnection connection) {
            
            if (connectTimeout != null) {
                connection.setConnectTimeout(connectTimeout);
            }
            
            if (socketTimeout != null) {
                connection.setReadTimeout(socketTimeout);
            }
            
        }
    
        /*
         * Getter & Setter
         */
        public Integer getConnectTimeout() {
            return connectTimeout;
        }
    
        public void setConnectTimeout(Integer connectTimeout) {
            this.connectTimeout = connectTimeout;
        }
    
        public Integer getSocketTimeout() {
            return socketTimeout;
        }
    
        public void setSocketTimeout(Integer socketTimeout) {
            this.socketTimeout = socketTimeout;
        }
    
        public String getProxyHost() {
            return proxyHost;
        }
    
        public void setProxyHost(String proxyHost) {
            this.proxyHost = proxyHost;
        }
    
        public Integer getProxyPort() {
            return proxyPort;
        }
    
        public void setProxyPort(Integer proxyPort) {
            this.proxyPort = proxyPort;
        }
    
        public String getCharset() {
            return charset;
        }
    
        public void setCharset(String charset) {
            this.charset = charset;
        }
        
    }
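
    The connectTimeout and socketTimeout fields only take effect once they are applied to the connection with URLConnection.setConnectTimeout and setReadTimeout, both in milliseconds, where a value of 0 means wait indefinitely. A stand-alone sketch of that step is shown below; the class name TimeoutSketch and the timeout values are illustrative assumptions.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class TimeoutSketch {

    /** Opens (but does not connect) a connection and applies whichever timeouts are given. */
    static URLConnection open(String address, Integer connectTimeoutMs, Integer readTimeoutMs)
            throws IOException {
        URLConnection connection = new URL(address).openConnection();
        if (connectTimeoutMs != null) {
            connection.setConnectTimeout(connectTimeoutMs); // limit for establishing the connection
        }
        if (readTimeoutMs != null) {
            connection.setReadTimeout(readTimeoutMs);       // limit for each blocking read
        }
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection c = open("http://localhost:8080/OneHttpServer/", 3000, 5000);
        System.out.println(c.getConnectTimeout() + " / " + c.getReadTimeout()); // 3000 / 5000
    }
}
```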
    

    Test code for HttpRequestor

    Client code:

    package test;
    
    import java.util.HashMap;
    import java.util.Map;
    
    public class Call {
    
        public static void main(String[] args) throws Exception {
            
            /* Post Request */
            Map dataMap = new HashMap();
            dataMap.put("username", "Nick Huang");
            dataMap.put("blog", "IT");
            System.out.println(new HttpRequestor().doPost("http://localhost:8080/OneHttpServer/", dataMap));
            
            /* Get Request */
            System.out.println(new HttpRequestor().doGet("http://localhost:8080/OneHttpServer/"));
        }
    
    }
    

    Server code:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    
    public class LoginServlet extends HttpServlet {
        private static final long serialVersionUID = 1L;
           
        public LoginServlet() {
            super();
        }
    
        protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
            this.doPost(request, response);
        }
    
        protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
            String username = request.getParameter("username");
            String blog = request.getParameter("blog");
            
            System.out.println(username);
            System.out.println(blog);
            
            response.setContentType("text/plain; charset=UTF-8");
            response.setCharacterEncoding("UTF-8");
            response.getWriter().write("It is ok!");
        }
    
    }
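
    If no servlet container is at hand, a stand-in for this servlet can be run with the HTTP server bundled in the JDK (com.sun.net.httpserver). This is a sketch, not part of the original setup: the class name LocalTestServer is hypothetical, it answers every request under /OneHttpServer/ with the same text as the servlet, and it does not parse or print the parameters.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class LocalTestServer {

    /** Starts a server on the given port (0 picks a free port) and returns it. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/OneHttpServer/", exchange -> {
            // Drain the request body so POST clients can finish sending
            try (InputStream in = exchange.getRequestBody()) {
                byte[] buffer = new byte[1024];
                while (in.read(buffer) != -1) {
                    // discard
                }
            }
            byte[] body = "It is ok!".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080);
        System.out.println("Listening on http://localhost:8080/OneHttpServer/");
    }
}
```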
    

    web.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xmlns:web="http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" id="WebApp_ID" version="2.5">
      <display-name>OneHttpServer</display-name>
      <welcome-file-list>
        <welcome-file>LoginServlet</welcome-file>
      </welcome-file-list>
      
      <servlet>
        <description></description>
        <display-name>LoginServlet</display-name>
        <servlet-name>LoginServlet</servlet-name>
        <servlet-class>LoginServlet</servlet-class>
      </servlet>
      <servlet-mapping>
        <servlet-name>LoginServlet</servlet-name>
        <url-pattern>/LoginServlet</url-pattern>
      </servlet-mapping>
      
    </web-app>
    
  • Original post: https://www.cnblogs.com/xinchrome/p/4929061.html