  • webmagic: downloading a page

    Below is the download method from HttpClientDownloader, webmagic's official default downloader implementation.

        @Override
        public Page download(Request request, Task task) {
            Site site = null;
            if (task != null) {
                site = task.getSite();
            }
            Set<Integer> acceptStatCode;
            String charset = null;
            Map<String, String> headers = null;
            if (site != null) {
                acceptStatCode = site.getAcceptStatCode();
                charset = site.getCharset();
                headers = site.getHeaders();
            } else {
                acceptStatCode = Sets.newHashSet(200);
            }
            logger.info("downloading page {}", request.getUrl());
            CloseableHttpResponse httpResponse = null;
            int statusCode = 0;
            try {
                HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers);
                httpResponse = getHttpClient(site).execute(httpUriRequest);
                statusCode = httpResponse.getStatusLine().getStatusCode();
                request.putExtra(Request.STATUS_CODE, statusCode);
                if (statusAccept(acceptStatCode, statusCode)) {
                    Page page = handleResponse(request, charset, httpResponse, task);
                    onSuccess(request);
                    return page;
                } else {
                    logger.warn("code error " + statusCode + "\t" + request.getUrl());
                    return null;
                }
            } catch (IOException e) {
                logger.warn("download page " + request.getUrl() + " error", e);
                // site stays null when task is null, so guard before dereferencing it
                if (site != null && site.getCycleRetryTimes() > 0) {
                    return addToCycleRetry(request, site);
                }
                onError(request);
                return null;
            } finally {
                request.putExtra(Request.STATUS_CODE, statusCode);
                try {
                    if (httpResponse != null) {
                        //ensure the connection is released back to pool
                        EntityUtils.consume(httpResponse.getEntity());
                    }
                } catch (IOException e) {
                    logger.warn("close response fail", e);
                }
            }
        }
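
    To make the status check in download concrete, here is a minimal sketch. It assumes statusAccept is a plain set-membership test (its body is not shown in this post); StatusCodeCheck and acceptedCodes are hypothetical names used only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in; not part of webmagic.
class StatusCodeCheck {

    // Mirrors the fallback in download(): without a Site, only 200 is accepted.
    static Set<Integer> acceptedCodes(Set<Integer> siteCodes) {
        if (siteCodes != null) {
            return siteCodes;
        }
        Set<Integer> defaults = new HashSet<>();
        defaults.add(200);
        return defaults;
    }

    // Assumption: the real statusAccept is a simple membership test like this.
    static boolean statusAccept(Set<Integer> acceptStatCode, int statusCode) {
        return acceptStatCode.contains(statusCode);
    }

    public static void main(String[] args) {
        Set<Integer> codes = acceptedCodes(null);
        System.out.println(statusAccept(codes, 200));
        System.out.println(statusAccept(codes, 404));
    }
}
```

    With this shape, a crawler that also wants to process 404 pages would only need to put 404 into the set returned by the Site.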

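    The first method highlighted in download above, getHttpUriRequest, builds an org.apache.http.client.methods.HttpUriRequest. It is an important method: this is where the request headers and the rest of the per-request configuration are set.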

    Most importantly, this is also where the proxy IP is applied at the lowest level.

        protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers) {
            RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl());
            if (headers != null) {
                for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
                    requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
                }
            }
            RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
                    .setConnectionRequestTimeout(site.getTimeOut())
                    .setSocketTimeout(site.getTimeOut())
                    .setConnectTimeout(site.getTimeOut())
                    .setCookieSpec(CookieSpecs.BEST_MATCH);
            if (site.getHttpProxyPool() != null && site.getHttpProxyPool().isEnable()) {
                HttpHost host = site.getHttpProxyFromPool();
                requestConfigBuilder.setProxy(host);
                request.putExtra(Request.PROXY, host);
            } else if (site.getHttpProxy() != null) {
                HttpHost host = site.getHttpProxy();
                requestConfigBuilder.setProxy(host);
                request.putExtra(Request.PROXY, host);
            }
            requestBuilder.setConfig(requestConfigBuilder.build());
            return requestBuilder.build();
        }
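
    The proxy branch in getHttpUriRequest gives an enabled proxy pool priority over a single fixed proxy, and uses no proxy when neither is set. The sketch below isolates that precedence rule with plain JDK types. ProxyChoice, ProxyPool, and chooseProxy are hypothetical names, and proxies are represented as strings rather than HttpHost objects.

```java
import java.util.Optional;

// Illustrative stand-in; not part of webmagic.
class ProxyChoice {

    // Hypothetical minimal view of a proxy pool.
    interface ProxyPool {
        boolean isEnable();
        String next();
    }

    // Same order as the if / else-if above: enabled pool first, then the fixed proxy.
    static Optional<String> chooseProxy(ProxyPool pool, String fixedProxy) {
        if (pool != null && pool.isEnable()) {
            return Optional.ofNullable(pool.next());
        }
        if (fixedProxy != null) {
            return Optional.of(fixedProxy);
        }
        return Optional.empty(); // direct connection
    }
}
```

    Note that a disabled pool falls through to the fixed proxy, which matches the else-if in the original method.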

    Next is the second method highlighted in download, getHttpClient, which returns an org.apache.http.impl.client.CloseableHttpClient:

        private CloseableHttpClient getHttpClient(Site site) {
            if (site == null) {
                return httpClientGenerator.getClient(null);
            }
            String domain = site.getDomain();
            // httpClients is a Map<String, CloseableHttpClient>
            CloseableHttpClient httpClient = httpClients.get(domain);
            if (httpClient == null) {
                synchronized (this) {
                    httpClient = httpClients.get(domain);
                    if (httpClient == null) {
                        httpClient = httpClientGenerator.getClient(site);
                        httpClients.put(domain, httpClient);
                    }
                }
            }
            return httpClient;
        }
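
    getHttpClient caches one client per domain using double-checked locking. As an alternative sketch (not what webmagic does), the same "create once per key" behaviour can be expressed with ConcurrentHashMap.computeIfAbsent. PerDomainCache is a hypothetical class, and the factory function stands in for httpClientGenerator.getClient.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative stand-in; not part of webmagic.
class PerDomainCache<V> {
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> factory;

    PerDomainCache(Function<String, V> factory) {
        this.factory = factory;
    }

    // computeIfAbsent runs the factory at most once per key, even under contention.
    // ConcurrentHashMap rejects null keys, so the domain must not be null.
    V get(String domain) {
        return cache.computeIfAbsent(domain, factory);
    }
}
```

    This variant avoids an unsynchronized first read of the map, which is only safe in the double-checked version if the underlying map is itself thread-safe.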

    The third method highlighted in download, handleResponse, returns a us.codecraft.webmagic.Page, which is webmagic's own wrapper object for a downloaded page:

        protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
            String content = getContent(charset, httpResponse);
            Page page = new Page();
            page.setRawText(content);
            page.setUrl(new PlainText(request.getUrl()));
            page.setRequest(request);
            page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
            return page;
        }
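
    handleResponse delegates decoding to getContent(charset, httpResponse), whose body is not shown in this post. The sketch below only illustrates the general idea of turning response bytes into text with a configured charset. BodyDecoder is a hypothetical class, and the UTF-8 fallback for a null charset is an assumption made here, not necessarily what webmagic does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in; not part of webmagic.
class BodyDecoder {

    // Decode raw response bytes with the site's charset.
    // Assumption: fall back to UTF-8 when no charset is configured.
    static String decode(byte[] body, String charset) {
        Charset cs = (charset != null) ? Charset.forName(charset) : StandardCharsets.UTF_8;
        return new String(body, cs);
    }
}
```

    Decoding with the wrong charset yields garbled text without raising an error, which is why setting the charset on the Site matters for pages that are not UTF-8.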
  • Original article: https://www.cnblogs.com/guazi/p/6676260.html