  • Jsoup: A Simple Crawl of Zhihu's Recommendations Page (with get_agent())

    Overview

    Today we put Jsoup to work and look at a crawler from an end-to-end perspective.

    A basic crawler framework includes the following modules (a minimal outline tying them together is sketched right after this list):

    • [x] Parsing the page
    • [x] Retrying on failure
    • [x] Saving the crawled content locally
    • [x] Multithreaded crawling
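
    The code in this post is Scala. As a hedged sketch that is not part of the original post, here is one way the fragments below could sit together in a single object, along with the imports they rely on; the object name ZhihuCrawler is a placeholder of my own:

        import java.io.{File, PrintWriter}
        import java.text.SimpleDateFormat
        import java.util.Date
        import java.util.concurrent.atomic.AtomicInteger
        import java.util.concurrent.{ConcurrentHashMap, ForkJoinPool}

        import org.jsoup.Jsoup
        import org.jsoup.nodes.Document

        import scala.collection.JavaConverters._              // .asScala over Jsoup's Elements (Scala 2.11/2.12)
        import scala.collection.parallel.ForkJoinTaskSupport  // Scala 2.12; on 2.13 add the scala-parallel-collections module
        import scala.util.{Failure, Random, Success, Try}

        object ZhihuCrawler {  // placeholder name, not from the original post
          val Url = "https://www.zhihu.com/explore/recommendations"

          // The shared state (item counter, success/failure counters, result map)
          // is introduced piece by piece in the fragments below.

          def get_agent(): String = ???                                        // pick a random user-agent
          def requestGetUrl(times: Int = 100, delay: Long = 1000): Unit = ???  // fetch the page with retries
          def parseDoc(doc: Document): Unit = ???                              // extract title/author/votes/summary
          def output(zone: String): Unit = ???                                 // write the results to a local file
          def concurrentCrawler(zone: String, maxPage: Int, threadNum: Int): Unit = ???  // parallel driver
        }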

    Module-by-module walkthrough

    We will go through the modules of the framework above in logical order, reproducing the implementation step by step.

    • Retrying on failure

    A well-written module needs exception catching and handling.

    In an earlier post we covered a simple way of handling exceptions; do you remember it?
    Simple version

        // URL to crawl
        val url = "https://www.zhihu.com/explore/recommendations"
        // Wrap the request in Try so failures are handled explicitly
        Try(Jsoup.connect(url).get()) match {
          case Failure(e) =>
            // Print the error message
            println(e.getMessage)
          case Success(doc: Document) =>
            // On success we get a Document and can extract what we need from it
            println(doc.body())
        }
    

    Today we build on that and wrap it into something a bit more robust.
    Enhanced version

      var count = 0  // number of items extracted while parsing (shared with parseDoc below)
      // Totals: successful requests and failed attempts
      val sum, fail: AtomicInteger = new AtomicInteger(0)
      // On an exception, wait `delay` ms (1 s by default) and retry, up to `times` (100) attempts
      def requestGetUrl(times: Int = 100, delay: Long = 1000): Unit = {
        Try(Jsoup.connect(Url).userAgent(get_agent()).get()) match {
          case Failure(e) =>
            if (times != 0) {
              println(e.getMessage)            // print the error message
              Thread.sleep(delay)              // wait before retrying
              fail.addAndGet(1)                // failure count +1
              requestGetUrl(times - 1, delay)  // retry with one fewer attempt left
            } else throw e
          case Success(doc) =>
            parseDoc(doc)
            if (count == 0) {                  // nothing was extracted, so pause and fetch again
              Thread.sleep(delay)
              requestGetUrl(times - 1, delay)
            }
            sum.addAndGet(1)                   // success count +1
        }
      }
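
    As a quick usage sketch (assuming the fields above live in the same object as the other fragments), you can fetch the page once and then inspect the counters:

        // Try the page with only a few retries, then check how the counters moved.
        requestGetUrl(times = 3, delay = 2000)
        println(s"successful fetches: ${sum.get()}, failed attempts: ${fail.get()}")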
    
    • About get_agent()

      // Set a user-agent yourself or, better still, pick a valid one at random from a pool of user-agents
      def get_agent(): String = {
        // Mimic the user-agent header field by returning a random user-agent string
        val agents = Array("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv,2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
          "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11")
        val ran = new Random().nextInt(agents.length)
        agents(ran)
      }
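
    Besides userAgent, Jsoup's Connection exposes a few other options that are worth setting on a real crawl. A small sketch (the Referer value is just an example; in the crawler above this call would sit inside the Try):

        // A slightly more defensive request: random user-agent, explicit timeout, extra header.
        val doc: Document = Jsoup.connect(Url)
          .userAgent(get_agent())
          .timeout(5000)                                // fail fast instead of hanging on a slow response
          .header("Referer", "https://www.zhihu.com/")
          .get()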
    

    • Parsing the page

    We can simply reuse the method written in the previous post.

      // Parse the Document (this reuses the `count` declared in the retry fragment above)
      // Use a ConcurrentHashMap to hold the crawled content
      val text = new ConcurrentHashMap[String, String]()
      def parseDoc(doc: Document): Unit = {
        // On success we have a Document; extract the pieces we need from it
        val links = doc.select("div.zm-item")  // select every div whose class is "zm-item"
        for (link <- links.asScala) {          // iterate over each such div
          val title = link.select("h2").text()                  // the "h2" tags inside the div hold the question title
          val approve = link.select("div.zm-item-vote").text()  // locate the upvote element and read its text
          // Drill down level by level to a uniquely identifying tag, then select it (uniqueness is the key)
          val author = link.select("div.answer-head").select("span.author-link-line").select("a").text()
          val content = link.select("div.zh-summary.summary.clearfix").text()  // for several classes just chain them with dots, e.g. .A.B.C

          text.put(title, author + "\t" + approve + "\t" + content)
          count += 1
        }
      }
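
    If you want to sanity-check the selectors without hitting Zhihu, you can feed parseDoc a hand-written fragment through Jsoup.parse. The HTML below is a made-up stand-in for the real page structure:

        // Hypothetical HTML mimicking the structure parseDoc expects.
        val sampleHtml =
          """<div class="zm-item">
            |  <h2>Sample question title</h2>
            |  <div class="zm-item-vote">42</div>
            |  <div class="answer-head"><span class="author-link-line"><a>Some Author</a></span></div>
            |  <div class="zh-summary summary clearfix">A short answer summary...</div>
            |</div>""".stripMargin

        parseDoc(Jsoup.parse(sampleHtml))
        println(text)  // should now hold one entry keyed by the sample title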
    

    • Saving the crawled content locally

      // Get the current date
      def getNowDate(): String = {
        new SimpleDateFormat("yyMMdd").format(new Date())
      }

      // Write the crawled content to a file
      def output(zone: String): Unit = {
        val writer = new PrintWriter(new File(getNowDate() + "_" + zone + ".txt"))
        for ((title, value) <- text.asScala) {  // .asScala lets us iterate the Java map with a Scala for-comprehension
          writer.println(title + "\t" + value)
        }
        writer.flush()
        writer.close()
      }
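
    One caveat: the PrintWriter above writes with the platform default charset, and the crawled text is Chinese, so it is safer to pin the encoding explicitly, for example:

        // Write the file as UTF-8 regardless of the platform default encoding.
        val writer = new PrintWriter(new File(getNowDate() + "_" + zone + ".txt"), "UTF-8")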
    

    Sample of the crawled content (shown as a screenshot in the original post)


    • Multithreaded crawling

      // Crawl with multiple threads
      def concurrentCrawler(zone: String, maxPage: Int, threadNum: Int): Unit = {
        val loopPar = (1 to maxPage).par
        loopPar.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threadNum))  // cap the pool at threadNum worker threads
        loopPar.foreach(_ => requestGetUrl())  // the page index is unused here because the recommendations URL is fixed
        output(zone)
      }
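
    Putting the pieces together, a minimal driver could look like the sketch below; the zone name, page count and thread count are arbitrary example values:

        def main(args: Array[String]): Unit = {
          // Crawl with 5 worker threads; output() is called inside concurrentCrawler.
          concurrentCrawler("recommendations", maxPage = 10, threadNum = 5)
          println(s"done: ${sum.get()} successful requests, ${fail.get()} failed attempts")
        }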
    

    • get_agent(): additional notes and a bonus pool of user-agents

    This fuller pool can simply replace the shorter array shown earlier.

    def get_agent(): String = {
        // Mimic the user-agent header field by returning a random user-agent string
        val agents=Array("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
          "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
          "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
          "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
          "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
          "Mozilla/5.0 (Macintosh; U; Mac OS X Mach-O; en-US; rv:2.0a) Gecko/20040614 Firefox/3.0.0 ",
          "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.0.3) Gecko/2008092414 Firefox/3.0.3",
          "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5",
          "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.14) Gecko/20110218 AlexaToolbar/alxf-2.0 Firefox/3.6.14",
          "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0(Macintosh;U;IntelMacOSX10_6_8;en-us)AppleWebKit/534.50(KHTML,likeGecko)Version/5.1Safari/534.50")
        val ran = new Random().nextInt(agents.length)
        agents(ran)
      }
    

    A few closing words

    If you find my articles interesting, feel free to open the next one; in the posts that follow I will walk you through small cases one by one. And if you have good ideas of your own, I'd love to hear from you.
    Today we mainly worked through crawling Zhihu together as a hands-on exercise; practice makes perfect!

  • Original article: https://www.cnblogs.com/wxplmm/p/10308774.html