  • Extracting web page content with HttpClient and jsoup (Jikexueyuan video links)

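         I recently got a three-month trial membership at Jikexueyuan (极客学院), so I had a look around and found the courses quite good. I happen to be learning Android at the moment, so I went looking for some videos there. It turned out that only paid members can download videos. On a whim I looked at the page source and happened to find the video URL right there. Remembering how convenient jsoup is for extracting page elements, I wrote a small demo in my spare time.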

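        jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.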

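        jsoup's main features are:

       1. Parse HTML from a URL, a file, or a string;
       2. Find and extract data using DOM traversal or CSS selectors;
       3. Manipulate HTML elements, attributes, and text.

       Chinese documentation for jsoup: http://www.open-open.com/jsoup/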
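The three features listed above can be sketched in a few lines. This is only an illustration: the class name JsoupBasics, the sample markup, and the example.com base URI are made up for this sketch and do not come from the project described here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupBasics {

    // Hypothetical sample markup, used only for this illustration.
    public static final String HTML =
            "<div id=\"box\"><a class=\"item\" href=\"/a.html\">First</a>"
          + "<a class=\"item\" href=\"/b.html\">Second</a></div>";

    /** Feature 1: parse HTML from a string; the base URI lets jsoup resolve relative links. */
    public static Document parse() {
        return Jsoup.parse(HTML, "http://example.com/");
    }

    /** Feature 2: find an element with a CSS selector and read its text and an attribute. */
    public static String firstLink(Document doc) {
        Element a = doc.select("#box a.item").first();
        return a.text() + " -> " + a.absUrl("href");
    }

    /** Feature 3: change an element's attribute and text in place. */
    public static String rewriteSecond(Document doc) {
        Element second = doc.select("a.item").get(1);
        second.attr("href", "/c.html").text("Third");
        return second.attr("href") + " " + second.text();
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(firstLink(doc));      // First -> http://example.com/a.html
        System.out.println(rewriteSecond(doc));  // /c.html Third
    }
}
```

Parsing from a URL or a file works the same way once the Document has been obtained; only the way the Document is created differs.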
     
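         Extracting specific content from a page with jsoup requires analyzing the page first. I opened the source of a course page on Jikexueyuan and quickly found the part that contains the video link, shown below. The <source> tag holds the video URL, and with that URL the video can be downloaded with Xunlei (迅雷).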
      <source src="http://cv3.jikexueyuan.com/201508081934/f8f3f9f8088f1ba0a6c75594448d96ab/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4" type="video/mp4"></source>

         We fetch the full HTML source and then pull the <source> element out of it, which gives us the download link easily.

        Next, by analyzing the page further, we can get the information for every video in a course. The relevant page source is:

       

    <dl class="lessonvideo-list"> 
       <dd class="playing"> 
        <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_1.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=1.1">1.编写自己的自定义 View(上)</a> <span class="lesson-time">00:10:24</span> </h2> 
        <blockquote>
         本课时主要讲解最简单的自定义 View,然后加入绘制元素(文字、图形等),并且可以像使用系统控件一样在布局中使用。
        </blockquote> 
       </dd> 
       <dd> 
        <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_2.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=2.2">2.编写自己的自定义 View(下)</a> <span class="lesson-time">00:12:05</span> </h2> 
        <blockquote>
         本课时主要讲解最简单的自定义 View,然后加入绘制元素(文字、图形等),并且可以像使用系统控件一样在布局中使用。
        </blockquote> 
       </dd> 
       <dd> 
        <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_3.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=3.3">3.加入逻辑线程</a> <span class="lesson-time">00:20:34</span> </h2> 
        <blockquote>
         本课时需要让绘制的元素动起来,但是又不阻塞主线程,所以引入逻辑线程。在子线程更新 UI 是不被允许的,但是 View 提供了方法。让我们来看看吧。
        </blockquote> 
       </dd> 
       <dd> 
        <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_4.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=4.4">4.提取和封装自定义 View</a> <span class="lesson-time">00:15:41</span> </h2> 
        <blockquote>
         本课时主要讲解在上个课程的基础上,进行提取代码来构造自定义 View 的基类,主要目的是:创建新的自定义 View 时,只需继承此类并只关心绘制和逻辑,其他工作由父类完成。这样既减少重复编码,也简化了逻辑。
        </blockquote> 
       </dd> 
       <dd> 
        <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_5.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=5.5">5.在 xml 中定义样式来影响显示效果</a> <span class="lesson-time">00:14:05</span> </h2> 
        <blockquote>
         本课时主要讲解的是在 xml 中定义样式及其属性,怎么来影响自定义 View 中的显示效果的过程和步骤。
        </blockquote> 
       </dd> 
      </dl>

      With Elements results1 = doc.getElementsByClass("lessonvideo-list"); we get the video list. We then use jsoup to iterate over the links in that list, open each lesson page, and extract the video URL from it.
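As a rough sketch of that traversal, the lesson list can also be walked with a CSS selector instead of getElementsByClass. The class name LessonListSketch, the SAMPLE constant, and the "Lesson 1" / "Lesson 2" titles are placeholders invented for this example; the markup is a trimmed copy of the list shown above. Reading the lesson-time span is an extra that the project code does not do.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LessonListSketch {

    // Trimmed copy of the lesson list markup, with placeholder titles.
    public static final String SAMPLE = "<dl class=\"lessonvideo-list\">"
            + "<dd class=\"playing\"><h2><a href=\"http://www.jikexueyuan.com/course/1748_1.html?ss=1\">Lesson 1</a>"
            + "<span class=\"lesson-time\">00:10:24</span></h2></dd>"
            + "<dd><h2><a href=\"http://www.jikexueyuan.com/course/1748_2.html?ss=1\">Lesson 2</a>"
            + "<span class=\"lesson-time\">00:12:05</span></h2></dd></dl>";

    /** Returns one "title | url | duration" line per lesson in the list. */
    public static List<String> lessons(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<String>();
        // Each <dd> directly under dl.lessonvideo-list is one lesson.
        for (Element dd : doc.select("dl.lessonvideo-list > dd")) {
            Element link = dd.select("h2 a[href]").first();
            if (link == null) {
                continue; // skip entries that have no lesson link
            }
            String duration = dd.select("span.lesson-time").text();
            out.add(link.text() + " | " + link.attr("href") + " | " + duration);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : lessons(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Selecting per <dd> keeps the title, URL, and duration of one lesson together, which is harder to do when all <a> elements are collected in a single flat pass.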

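    That is the main idea. One more point: a plain jsoup get() call fetches the Document without any cookie state by default, and some videos only expose their playback URL to logged-in VIP members. So we use HttpClient to simulate the logged-in user.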
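To make the cookie idea concrete: the server hands out cookies in Set-Cookie response headers, and a client that wants to look logged in sends the name=value pairs back in a single Cookie request header. The following is a minimal, dependency-free sketch of that conversion. The class name CookieHeaderSketch and the header values are invented for illustration, and this is not how HttpClient handles cookies internally (it has its own cookie store).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CookieHeaderSketch {

    /** Builds a Cookie request-header value from raw Set-Cookie response-header values. */
    public static String toCookieHeader(List<String> setCookieValues) {
        Map<String, String> jar = new LinkedHashMap<String, String>();
        for (String raw : setCookieValues) {
            // Only the leading "name=value" pair is sent back; Path, Expires, HttpOnly etc. are dropped.
            int semi = raw.indexOf(';');
            String pair = (semi < 0 ? raw : raw.substring(0, semi)).trim();
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // not a name=value pair
            }
            // A later Set-Cookie with the same name overrides the earlier value.
            jar.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> e : jar.entrySet()) {
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(e.getKey()).append('=').append(e.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        // Made-up header values for illustration.
        List<String> headers = Arrays.asList(
                "uid=42; Path=/; HttpOnly",
                "sid=abc; Path=/",
                "uid=43; Path=/");
        System.out.println(toCookieHeader(headers)); // prints "uid=43; sid=abc"
    }
}
```

The resulting string is the kind of value that can be passed to setHeader("Cookie", ...) on a request.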

     Below is the full source of the project.

    1. The Course class, which stores the title and URL of each lesson in a course.

    package com.debughao.bean;

    /** One lesson of a course: the title of its link and the URL of its page. */
    public class Course {

        /** URL the link points to */
        private String linkHref;
        /** Title (text) of the link */
        private String linkText;

        public String getLinkHref() {
            return linkHref;
        }

        public void setLinkHref(String linkHref) {
            this.linkHref = linkHref;
        }

        public String getLinkText() {
            return linkText;
        }

        public void setLinkText(String linkText) {
            this.linkText = linkText;
        }

        @Override
        public String toString() {
            return "Course [linkHref=" + linkHref + ", linkText=" + linkText + "]";
        }

    }

    2. The HttpUtils class, which simulates the user's logged-in state.

    package com.debughao.down;

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.http.Header;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpHeaders;
    import org.apache.http.HttpResponse;
    import org.apache.http.HttpStatus;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    /**
     * Small wrapper around Apache HttpClient that sends a stored cookie string
     * with every request, so the requests look like they come from a logged-in user.
     */
    @SuppressWarnings("deprecation")
    public class HttpUtils {
        private static final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36";

        String cookieStr = "";
        CloseableHttpResponse response = null;

        public HttpUtils() {
        }

        public HttpUtils(String cookieStr) {
            this.cookieStr = cookieStr;
        }

        public String getCookieStr() {
            return cookieStr;
        }

        public CloseableHttpResponse getResponse() {
            return response;
        }

        /** Sends a GET request with the stored cookies; returns the body, or null on error. */
        public String Get(String url) {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpGet httpget = new HttpGet(url);
            httpget.setHeader("Cookie", cookieStr);
            httpget.setHeader(HttpHeaders.USER_AGENT, USER_AGENT);
            try {
                response = httpclient.execute(httpget);
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            } catch (Exception e) {
                System.err.println(String.format("HTTP GET error %s", e.getMessage()));
            } finally {
                closeQuietly(httpclient);
            }
            return null;
        }

        /**
         * Sends a POST request. The part of the URL after "?" is sent as the
         * form-encoded body; cookies set by the response are appended to cookieStr.
         */
        public String Post(String url) {
            // "?" is a regex metacharacter, so it must be escaped for split().
            String[] parts = url.split("\\?", 2);
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpPost httppost = new HttpPost(parts[0]);
            StringEntity reqEntity = new StringEntity(parts.length > 1 ? parts[1] : "", StandardCharsets.UTF_8);
            reqEntity.setContentType("application/x-www-form-urlencoded;charset=UTF-8");
            httppost.setEntity(reqEntity);
            httppost.setHeader("Cookie", cookieStr);
            httppost.setHeader(HttpHeaders.USER_AGENT, USER_AGENT);
            try {
                response = httpclient.execute(httppost);
                for (Header h : response.getAllHeaders()) {
                    if ("Set-Cookie".equalsIgnoreCase(h.getName())) {
                        // Keep the cookie string well-formed before appending.
                        if (!cookieStr.isEmpty() && !cookieStr.trim().endsWith(";")) {
                            cookieStr += "; ";
                        }
                        cookieStr += subCookie(h.getValue());
                    }
                }
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            } catch (Exception e) {
                System.err.println(String.format("HTTP POST error %s", e.getMessage()));
            } finally {
                closeQuietly(httpclient);
            }
            return null;
        }

        /** Requests the URL and stores the first cookie the server sets; returns "4" on failure. */
        public String GetLoginCookie(String url) {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpGet httpget = new HttpGet(url);
            httpget.setHeader("Cookie", cookieStr);
            httpget.setHeader(HttpHeaders.USER_AGENT, USER_AGENT);
            try {
                response = httpclient.execute(httpget);
                for (Header h : response.getAllHeaders()) {
                    if ("Set-Cookie".equalsIgnoreCase(h.getName())) {
                        cookieStr = subCookie(h.getValue());
                        return cookieStr;
                    }
                }
            } catch (Exception e) {
                System.err.println(String.format("HTTP GET error %s", e.getMessage()));
            } finally {
                closeQuietly(httpclient);
            }
            return "4"; // error code
        }

        /** Returns the leading "name=value;" part of a Set-Cookie header value. */
        public String subCookie(String value) {
            int end = value.indexOf(";");
            // A cookie without attributes contains no ";", so use the whole value.
            return (end < 0 ? value : value.substring(0, end)) + ";";
        }

        /**
         * Requests an image and returns its content stream, or null if the request
         * fails. The caller must read and close the stream, for example to copy
         * the data into a file.
         */
        public InputStream GetImage(String url) {
            // The client is not shut down here because the returned stream is still in use.
            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpGet = new HttpGet(url);
            if (cookieStr != null) {
                httpGet.setHeader("Cookie", cookieStr);
            }
            try {
                HttpResponse resp = httpclient.execute(httpGet);
                if (HttpStatus.SC_OK == resp.getStatusLine().getStatusCode()) {
                    HttpEntity entity = resp.getEntity();
                    if (entity != null) {
                        return entity.getContent();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }

        /** Closes the client and ignores any error raised while closing. */
        private static void closeQuietly(CloseableHttpClient client) {
            try {
                client.close();
            } catch (IOException e) {
                // nothing useful can be done here
            }
        }
    }

    3. A simple Test class.

    package com.debughao.down;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import com.debughao.bean.Course;

    public class Test {

        public static void main(String[] args) {
            // Cookie string of a logged-in session.
            HttpUtils http = new HttpUtils("stat_uuid=1436867409341663197461; uname=qq_rwe4zg5t; uid=3812752; code=LZ8XF1; "
                    + "authcode=b809MIxLGp8syQcnuAAdIT9PuCEH2%2FuiyvRuuLALSxb6z6iGoM3xcihNJKzHK%2BAZWzVIGFAW0QrBYiSLmHN1qnhi0YQLmBeWeqkJHXh5xsoylWuRCFmRDJZyUtAGr3U; "
                    + "level_id=3; is_expire=0; domain=debughao; stat_fromWebUrl=; stat_ssid=1439813138264;"
                    + " connect.sid=s%3A5xux57xcLyCBheevR40DUa0beJD_ok-S.0aTnwfjSvm7A49zydLGbtXy7vdCGfH7lB7MwmZURppQ; "
                    + "QINGCLOUDELB=37e16e60f0cd051b754b0acf9bdfd4b5d562b81daa2a899c46d3a1e304c7eb2b|VcWiq|VcWiq; "
                    + "_ga=GA1.2.889563867.1436867384; _gat=1; Hm_lvt_f3c68d41bda15331608595c98e9c3915=1438945833,1438947627,1438995076,1438995133;"
                    + " Hm_lpvt_f3c68d41bda15331608595c98e9c3915=1439015591; MECHAT_LVTime=1439015591174; MECHAT_CKID=cookieVal=006600143686858016573509; "
                    + "undefined=; stat_isNew=0");
            // Read the course page URL from standard input.
            Scanner sc = new Scanner(System.in);
            String url = sc.nextLine();
            sc.close();
            String res = http.Get(url);
            if (res == null) {
                return; // the request failed; HttpUtils already logged the error
            }
            Document doc = getDocByRes(res);
            List<Course> videos = getVideoList(doc);
            // First print all lesson titles.
            for (Course video : videos) {
                System.out.println(video.getLinkText());
            }
            // Then open each lesson page and print its video URL.
            for (Course video : videos) {
                String res2 = http.Get(video.getLinkHref());
                if (res2 == null) {
                    continue;
                }
                getVideoLink(getDocByRes(res2));
            }
        }

        /** Parses an HTML string into a jsoup Document. */
        private static Document getDocByRes(String res) {
            return Jsoup.parse(res);
        }

        /** Collects the title and URL of every lesson link inside the lessonvideo-list element. */
        public static List<Course> getVideoList(Element doc) {
            List<Course> courses = new ArrayList<Course>();
            Elements results1 = doc.getElementsByClass("lessonvideo-list");
            String title = doc.getElementsByTag("title").text();
            System.out.println(title);
            for (Element element : results1) {
                Elements links = element.getElementsByTag("a");
                for (Element link : links) {
                    Course course = new Course();
                    course.setLinkHref(link.attr("href"));
                    course.setLinkText(link.text());
                    courses.add(course);
                }
            }
            return courses;
        }

        /** Prints the src attribute of the first <source> element on a lesson page. */
        public static void getVideoLink(Document doc) {
            Elements results2 = doc.select("source");
            String mp4Links = results2.attr("src");
            System.out.println(mp4Links);
        }
    }

    4. The output of a run (the first line is the course URL typed as input):

    http://www.jikexueyuan.com/course/1748.html
    自定义 View 基础和原理-极客学院
    1.编写自己的自定义 View(上)
    2.编写自己的自定义 View(下)
    3.加入逻辑线程
    4.提取和封装自定义 View
    5.在 xml 中定义样式来影响显示效果
    http://cv3.jikexueyuan.com/201508082007/99549fa37069a39a2e128278ee60768c/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4
    http://cv3.jikexueyuan.com/201508082007/a068be74f7f31900e128f109523b0925/course/1501-1600/1557/video/4279_b_h264_sd_960_540.mp4
    http://cv3.jikexueyuan.com/201508082008/bf216e06770e9a9b0adda34ea4d01dfc/course/1501-1600/1557/video/4280_b_h264_sd_960_540.mp4
    http://cv3.jikexueyuan.com/201508082008/75b51573a75458848136e61e848d1ae7/course/1501-1600/1557/video/4281_b_h264_sd_960_540.mp4
    http://cv3.jikexueyuan.com/201508082008/ca20fad3e1bc622aa64bbfa7d2b768dd/course/1501-1600/1557/video/5159_b_h264_sd_960_540.mp4

     Open Xunlei, create a new task with one of these URLs, and the download starts.

        

                               

  • Original article: https://www.cnblogs.com/deBug-hao/p/4713780.html