  • [Java] A resource-scraping example

    I'm not sure why people like to call this a "crawler".

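    Once you understand the principle, it turns out to be nothing more than parsing a page's source code and pulling out the key strings.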

    Many pages today turn out to be mostly JS once you fetch them, and never expose the resource address in the HTML at all.
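
To make the "extract a key string" idea concrete, here is a minimal sketch using only the standard library. It is a regex shortcut for illustration, not an HTML parser, and it assumes the tag shape used later in this post (an `id="webmSource"` attribute followed by a double-quoted `src`). The class name, method name, and both sample pages are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcExtractSketch {

    // Matches a <source> tag with id="webmSource" followed later by src="...".
    // Assumption: this attribute order and double quotes; real pages may differ.
    private static final Pattern SOURCE_TAG = Pattern.compile(
            "<source\\b[^>]*\\bid=\"webmSource\"[^>]*\\bsrc=\"([^\"]*)\"");

    // Returns the src value if the tag is present in the raw HTML text.
    static Optional<String> findSrc(String html) {
        Matcher matcher = SOURCE_TAG.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical static page: the address is right there in the markup
        String staticPage = "<video><source id=\"webmSource\" "
                + "src=\"https://xxx.com/xxx.webm\" type=\"video/webm\"></video>";
        // Hypothetical script-driven page: the markup contains no address
        String scriptOnlyPage = "<div id=\"player\"></div><script>loadPlayer();</script>";

        System.out.println(findSrc(staticPage));     // Optional[https://xxx.com/xxx.webm]
        System.out.println(findSrc(scriptOnlyPage)); // Optional.empty
    }
}
```

The second case is the JS situation described above: if the address is only produced by a script at runtime, there is nothing in the fetched source to extract, whichever extraction method is used.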

    The only dependency is jsoup; everything else was pieced together from Baidu searches and copy-paste.

            <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.13.1</version>
            </dependency>

    The pages I scrape all share a fixed structure, so the program is relatively simple to design.

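    Viewing the page source shows that the tag carries an id directly. An id is supposed to be unique within a document, so it can be used to locate the DOM element precisely.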

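    Once we have the element, we read the value of its src attribute, then hand that resource address to a file download routine to fetch the file.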

    <source id="webmSource" src="https://xxx.com/xxx.webm" type="video/webm">

    It so happens that the resource I wanted is exposed in exactly this way.
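
As a small worked example of the id lookup described above, the sketch below parses a literal snippet with jsoup and reads `src` from the element with id `webmSource`. It assumes jsoup is on the classpath. The class name, the `srcById` helper, the relative `src` in the sample, and the base URI are hypothetical and chosen only to show how a relative address could be resolved.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SourceLookupSketch {

    // Returns the absolute src of the element with the given id,
    // or null when the page has no such element.
    static String srcById(String html, String baseUri, String id) {
        Document document = Jsoup.parse(html, baseUri);
        Element element = document.getElementById(id);
        if (element == null) {
            return null;
        }
        // absUrl resolves a relative src against baseUri;
        // an already absolute src is returned unchanged
        return element.absUrl("src");
    }

    public static void main(String[] args) {
        // Hypothetical snippet with a relative src
        String html = "<video><source id=\"webmSource\" "
                + "src=\"/media/xxx.webm\" type=\"video/webm\"></video>";

        System.out.println(srcById(html, "https://xxx.com/page/", "webmSource"));
        // https://xxx.com/media/xxx.webm
        System.out.println(srcById(html, "https://xxx.com/page/", "missing"));
        // null
    }
}
```

Checking for a missing element before reading the attribute avoids an exception on pages that do not follow the expected structure.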

    Here is the code:

    package cn.dzz;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.BufferedReader;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    
    public class Main {
    
        private static String resolving(String urlStr) {
            StringBuilder html = new StringBuilder();
    
            try {
                // Wrap the given address in a URL object
                URL url = new URL(urlStr);
                // Many sites reject requests that carry no browser-like headers,
                // so set at least a User-Agent to imitate a browser
                HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
                httpURLConnection.setRequestProperty("User-Agent", "Mozilla/4.0");
    
                // Read the response as UTF-8 text; try-with-resources closes the streams
                try (BufferedReader bufferedReader = new BufferedReader(
                        new InputStreamReader(httpURLConnection.getInputStream(), "UTF-8"))) {
                    String htmlCodeLine;
                    // Append line by line until the end of the stream,
                    // restoring the line break that readLine() strips
                    while ((htmlCodeLine = bufferedReader.readLine()) != null) {
                        html.append(htmlCodeLine).append('\n');
                    }
                }
                // The complete page source
                return html.toString();
    
            } catch (Exception e) {
                e.printStackTrace();
            }
            return null;
        }
    
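        private static String getSourceAddress(String completeHtmlCode) {
            // Let jsoup parse the HTML into a Document object
            Document document = Jsoup.parse(completeHtmlCode);
            Elements elementList = document.select("#webmSource");
            System.out.println(elementList);
            // Fail with a clear message if the page has no such element
            if (elementList.isEmpty()) {
                throw new IllegalStateException("No element with id webmSource found");
            }
            Element element = elementList.get(0);
    
            return element.attr("src");
        }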
    
        private static void downloadWebmVideo(String sourceAddress) {
            final String DIR_PATH = "D:/Porn/";
            long byteSum = 0;
            int byteRead;
            try {
                URL url = new URL(sourceAddress);
    
                // Use the last path segment of the address as the file name
                String fileName = sourceAddress.substring(sourceAddress.lastIndexOf("/") + 1);
                System.out.println(fileName);
    
                URLConnection urlConnection = url.openConnection();
                byte[] bufferBytes = new byte[1024];
    
                // try-with-resources closes both streams even if the copy fails
                try (InputStream inputStream = urlConnection.getInputStream();
                     FileOutputStream fileOutputStream = new FileOutputStream(DIR_PATH + fileName)) {
                    while ((byteRead = inputStream.read(bufferBytes)) != -1) {
                        byteSum += byteRead;
                        fileOutputStream.write(bufferBytes, 0, byteRead);
                    }
                }
                System.out.println("Downloaded " + byteSum + " bytes");
    
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    
    
    
        public static void main(String[] args) {
            // downloadWebmVideo(getSourceAddress(resolving(args[0])));
            String url = "https://xxx/xxx/";
            downloadWebmVideo(getSourceAddress(resolving(url)));
        }
    }
    

    This does manage to fetch the file, but it is fairly crude.
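
One possible way to make the download step less crude, sketched with the standard library only. Everything named here is an assumption for illustration and not part of the original program: the class and method names, the `download.bin` fallback, the timeout values, and the User-Agent string. The file name is taken from the URL path, so a query string does not end up in it, and `Files.copy` replaces the manual read loop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class DownloadSketch {

    // Last path segment of the address, ignoring any query string or fragment.
    // Falls back to a placeholder name when the path ends with a slash.
    static String fileNameFromUrl(String address) {
        String path = URI.create(address).getPath();
        String name = (path == null) ? "" : path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "download.bin" : name;
    }

    // Streams the resource into targetDir and returns the path of the saved file.
    static Path download(String address, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameFromUrl(address));

        URLConnection connection = URI.create(address).toURL().openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0"); // assumed value
        connection.setConnectTimeout(10_000); // milliseconds, assumed
        connection.setReadTimeout(30_000);    // milliseconds, assumed

        // The stream is closed even if the copy fails part-way
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: DownloadSketch <resource-url> <target-dir>");
            return;
        }
        System.out.println("Saved to " + download(args[0], Paths.get(args[1])));
    }
}
```

Letting `download` throw `IOException` instead of printing a stack trace leaves it to the caller to decide whether to retry, skip, or stop. Note that `URI.create` rejects addresses containing characters such as spaces, so an address scraped from a page may need encoding first.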

  • Original article: https://www.cnblogs.com/mindzone/p/14450136.html