  • [Java] A resource-scraping example

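    I'm not sure why people like to call this a "crawler".

    Once the principle is clear, it turns out to be nothing more than parsing a page's source code and pulling out the key strings.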
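
    To make that principle concrete, here is a minimal sketch that treats the page source as plain text and pulls one attribute value out of it with a standard-library regular expression. The class name `SrcExtractor`, the method `extractSrc`, and the sample markup in the usage note are assumptions made for illustration. A pattern like this is brittle against real-world HTML (it only handles double-quoted attributes with the id written before the src), which is a good reason to leave the parsing to a library such as jsoup.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SrcExtractor {

    // Builds a pattern for a single tag whose id attribute equals the given id
    // and which has a double-quoted src attribute somewhere after the id.
    private static Pattern patternFor(String id) {
        return Pattern.compile(
                "<[^>]*\\bid\\s*=\\s*\"" + Pattern.quote(id) + "\"[^>]*\\bsrc\\s*=\\s*\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns the src value of the first matching tag, or an empty Optional
    // when the page source contains no such tag.
    static Optional<String> extractSrc(String html, String id) {
        Matcher matcher = patternFor(id).matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

    Calling `SrcExtractor.extractSrc(html, "webmSource")` on a page containing `<source id="webmSource" src="https://example.com/clip.webm" type="video/webm">` yields the address inside the src attribute; a page without that id yields an empty `Optional`.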

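    Many pages today resolve to little more than JavaScript once parsed and never expose the resource address in their HTML.

    The only dependency is jsoup; everything else was pieced together from Baidu searches and copy-paste.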

            <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.13.1</version>
            </dependency>

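    The pages I scrape all share a fixed markup structure, so the program is relatively simple to design.

    Viewing the page source shows that the tag carries an id directly. An id is meant to be unique within a document, so it can be used to pick out the DOM element precisely.

    Once we have the element, we read the value of its src attribute and hand that resource address to the file-download step to fetch the file.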

    <source id="webmSource" src="https://xxx.com/xxx.webm" type="video/webm">

    As it happens, the resource I wanted is embedded in exactly this way.

    Here is the code:

    package cn.dzz;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.BufferedReader;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    
    public class Main {
    
        private static String resolving(String urlStr) {
            StringBuilder pageSource = new StringBuilder();

            try {
                // Wrap the given address in a URL object
                URL url = new URL(urlStr);

                // Open the connection. Many sites reject requests that carry no HTTP headers,
                // so set at least a User-Agent to look like a browser.
                HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
                httpURLConnection.addRequestProperty("User-Agent", "Mozilla/4.0");

                // Decode the response body as UTF-8 and read it through a buffered reader;
                // try-with-resources closes the whole stream chain when done
                try (BufferedReader bufferedReader = new BufferedReader(
                        new InputStreamReader(httpURLConnection.getInputStream(), "utf-8"))) {
                    String htmlCodeLine;

                    // Append line by line until the stream ends. readLine() strips the line
                    // terminator, so put a newline back to keep lines from running together.
                    while ((htmlCodeLine = bufferedReader.readLine()) != null) {
                        pageSource.append(htmlCodeLine).append('\n');
                    }
                }
                // The complete page source
                return pageSource.toString();

            } catch (Exception e) {
                e.printStackTrace();
            }
            // Reached only when fetching or reading the page failed
            return null;
        }
    
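        private static String getSourceAddress(String completeHtmlCode) {
            // resolving() returns null when the page could not be fetched
            if (completeHtmlCode == null) {
                return null;
            }
            // Let jsoup parse the page source into a Document
            Document document = Jsoup.parse(completeHtmlCode);
            // Select by id; a well-formed page has at most one match
            Elements elementList = document.select("#webmSource");
            System.out.println(elementList);
            if (elementList.isEmpty()) {
                // No such element, e.g. the page builds its player with JavaScript
                return null;
            }
            Element element = elementList.get(0);

            return element.attr("src");
        }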
    
        private static void downloadWebmVideo(String sourceAddress) {
            final String DIR_PATH = "D:/Porn/";
            if (sourceAddress == null || sourceAddress.isEmpty()) {
                System.err.println("No source address found, nothing to download");
                return;
            }
            long byteSum = 0;
            int byteRead;
            try {
                URL url = new URL(sourceAddress);

                // Use the last path segment of the address as the file name
                String fileName = sourceAddress.substring(sourceAddress.lastIndexOf("/") + 1);
                System.out.println(fileName);

                URLConnection urlConnection = url.openConnection();
                // try-with-resources closes both streams even if the copy fails
                try (InputStream inputStream = urlConnection.getInputStream();
                     FileOutputStream fileOutputStream = new FileOutputStream(DIR_PATH + fileName)) {

                    byte[] bufferBytes = new byte[1024];

                    // Copy the response to the file one buffer at a time
                    while ((byteRead = inputStream.read(bufferBytes)) != -1) {
                        byteSum += byteRead;
                        fileOutputStream.write(bufferBytes, 0, byteRead);
                    }
                }
                System.out.println("Downloaded " + byteSum + " bytes");

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    
    
    
        public static void main(String[] args) {
            // downloadWebmVideo(getSourceAddress(resolving(args[0])));
            String url = "https://xxx/xxx/";
            downloadWebmVideo(getSourceAddress(resolving(url)));
        }
    }
    

    This does fetch the file, but it is fairly crude.
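
    One way to make the download step less crude is sketched below, using only the standard library. The file name is taken from the URL path so that a query string does not end up in it, and `Files.copy` inside try-with-resources replaces the manual buffer loop so the stream is always closed. The class `Downloader`, its method names, the `download.bin` fallback name, and the target-directory parameter are hypothetical and not part of the original program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class Downloader {

    // Derives a file name from the last segment of the URI path,
    // ignoring any query string; falls back to a fixed name when the path has none.
    static String fileNameOf(URI source) {
        String path = source.getPath();
        String name = path == null ? "" : path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "download.bin" : name;
    }

    // Streams the resource into targetDir and returns the path of the written file.
    static Path download(URI source, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameOf(source));

        URLConnection connection = source.toURL().openConnection();
        // Assumption: the file server accepts the same User-Agent used for the page request.
        connection.setRequestProperty("User-Agent", "Mozilla/4.0");

        // Files.copy does the buffered transfer; the stream is closed even on failure.
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

    A call such as `Downloader.download(URI.create(sourceAddress), Paths.get(dirPath))` could stand in for `downloadWebmVideo(sourceAddress)`. Because failures surface as an `IOException` rather than a printed stack trace, the caller decides whether to retry or give up.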

  • Original article: https://www.cnblogs.com/mindzone/p/14450136.html