I've never quite understood why everyone likes calling these programs "crawlers".
Once you work out the principle, it really is just parsing a page's HTML and extracting the key strings.
Many pages nowadays render through JavaScript, though, so the parsed source never exposes the resource address at all.
The only dependency is Jsoup; the rest was pieced together by copy-pasting from Baidu search results.
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
The structure of the pages I crawl is fixed, which keeps the program fairly simple to design.
Looking at the page source, the target tag carries an id attribute directly. Since id values must be unique within a document, that attribute lets us locate the DOM element precisely.
Once we have the element, we read the value of its src attribute, then hand that resource address to a file download to fetch the file.
<source id="webmSource" src="https://xxx.com/xxx.webm" type="video/webm">
Conveniently, the resource I wanted is served in exactly this way.
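Before the full program, here is the extraction step in isolation. This is a minimal sketch, assuming the page source is already held in a String; the #webmSource id matches the tag shown above, and the html value is a made-up stand-in:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractDemo {
    public static void main(String[] args) {
        // Hypothetical page source containing the tag from above
        String html = "<video><source id=\"webmSource\" "
                + "src=\"https://xxx.com/xxx.webm\" type=\"video/webm\"></video>";
        Document document = Jsoup.parse(html);
        // selectFirst returns the first match for the CSS selector, or null if none
        Element source = document.selectFirst("#webmSource");
        if (source != null) {
            System.out.println(source.attr("src")); // https://xxx.com/xxx.webm
        }
    }
}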
Here is the full code:
package cn.dzz;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class Main {

    private static String resolving(String urlStr) {
        StringBuffer stringBuffer = new StringBuffer();
        try {
            // Wrap the given address in a URL object
            URL url = new URL(urlStr);
            // Open the connection. Most sites now reject requests without any HTTP
            // headers, so set at least a User-Agent to imitate a browser.
            // URLConnection urlConnection = url.openConnection();
            HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
            httpURLConnection.addRequestProperty("User-Agent", "Mozilla/4.0");
            // Read the response line by line, appending each line to the buffer
            InputStream inputStream = httpURLConnection.getInputStream();
            InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "utf-8");
            BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
            String htmlCodeLine;
            while ((htmlCodeLine = bufferedReader.readLine()) != null) {
                stringBuffer.append(htmlCodeLine);
            }
            bufferedReader.close();
            // Return the complete page source
            return stringBuffer.toString();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    private static String getSourceAddress(String completeHtmlCode) {
        // Have Jsoup parse the page into a Document
        Document document = Jsoup.parse(completeHtmlCode);
        Elements elementList = document.select("#webmSource");
        System.out.println(elementList);
        Element element = elementList.get(0);
        // The src attribute holds the resource address
        return element.attr("src");
    }

    private static void downloadWebmVideo(String sourceAddress) {
        final String DIR_PATH = "D:/Porn/";
        int byteSum = 0;
        int byteRead;
        try {
            URL url = new URL(sourceAddress);
            // Use everything after the last slash as the file name
            String fileName = sourceAddress.substring(sourceAddress.lastIndexOf("/") + 1);
            System.out.println(fileName);
            URLConnection urlConnection = url.openConnection();
            InputStream inputStream = urlConnection.getInputStream();
            FileOutputStream fileOutputStream = new FileOutputStream(DIR_PATH + fileName);
            byte[] bufferBytes = new byte[1024];
            while ((byteRead = inputStream.read(bufferBytes)) != -1) {
                byteSum += byteRead;
                fileOutputStream.write(bufferBytes, 0, byteRead);
            }
            System.out.println("Downloaded " + byteSum + " bytes");
            // Close the streams so the file is fully flushed to disk
            fileOutputStream.close();
            inputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // downloadWebmVideo(getSourceAddress(resolving(args[0])));
        String url = "https://xxx/xxx/";
        downloadWebmVideo(getSourceAddress(resolving(url)));
    }
}
This is enough to fetch the file, but it is fairly crude.
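Two obvious refinements, sketched below as one possible cleanup rather than a definitive fix: Jsoup.connect() can fetch and parse the page in a single call, replacing the hand-rolled resolving(), and try-with-resources with Files.copy() performs the download without leaking streams. The class name TidyMain is my own; the placeholder URL and save directory are the same as above.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class TidyMain {
    public static void main(String[] args) throws Exception {
        String pageUrl = "https://xxx/xxx/"; // placeholder, as in the original
        // Jsoup fetches and parses in one step; userAgent replaces the manual header
        Document document = Jsoup.connect(pageUrl)
                .userAgent("Mozilla/4.0")
                .get();
        Element source = document.selectFirst("#webmSource");
        if (source == null) {
            System.err.println("No #webmSource element on the page");
            return;
        }
        // absUrl resolves a relative src against the page's base URI
        String src = source.absUrl("src");
        String fileName = src.substring(src.lastIndexOf('/') + 1);
        Path target = Paths.get("D:/Porn/", fileName);

        HttpURLConnection connection = (HttpURLConnection) new URL(src).openConnection();
        connection.addRequestProperty("User-Agent", "Mozilla/4.0");
        // try-with-resources closes the stream even if the copy fails
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("Saved to " + target);
    }
}

Note that absUrl("src") returns an absolute URL even when the page writes a relative one, which plain attr("src") does not handle.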