  • [Focused Study] Writing a Web Crawler with TypeScript

    1. Crawler overview and obtaining the official secret key

    Page to crawl: www.dell-lee.com/typescript/demo.html?secret=secretKey

    The secretKey value (it changes from time to time) is available at: https://git.imooc.com/coding-412/source-code

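    2. Setting up the basic TypeScript environment

    ① Download and install Node.js

    ② In VSCode, click the gear icon in the lower-left corner and choose [Settings] from the menu to open the settings window

        quote: choose single

        tab: set Tab Size to 2

        save: check Format On Save

    ③ In VSCode, click the Extensions (squares) icon on the left, search for the Prettier plugin, then install and enable it

    ④ Install TypeScript

        In VSCode, click Terminal > New Terminal in the top menu to open the TERMINAL panel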

    npm install typescript@3.6.4 -g  

    ⑤ Compile demo.ts with the TypeScript compiler (tsc) to generate demo.js

    tsc demo.ts

    Run demo.js

    node demo.js  

    ⑥ To shorten the previous step, install the ts-node tool

    npm install -g ts-node@8.4.1  

    Run demo.ts directly

    ts-node demo.ts
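
    The original post does not show the contents of demo.ts. A minimal sketch of such a file, assuming a hypothetical `Point` interface and `distanceFromOrigin` function, might look like this. Either `tsc demo.ts` followed by `node demo.js`, or `ts-node demo.ts`, should run it.

```typescript
// demo.ts: a hypothetical file, not shown in the original post.
interface Point {
  x: number;
  y: number;
}

// The parameter and return annotations are checked at compile time
// and erased from the generated demo.js.
function distanceFromOrigin(point: Point): number {
  return Math.sqrt(point.x ** 2 + point.y ** 2);
}

console.log(distanceFromOrigin({ x: 3, y: 4 })); // prints 5
```

    Passing something like `{ x: '3' }` would be rejected by the compiler before any JavaScript is produced.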
    

    3. Fetching page content with SuperAgent and type definition files

    ① Generate package.json

    npm init -y  

    ② Generate tsconfig.json

    tsc --init  

    ③ Uninstall the globally installed ts-node and install it in the project instead

    npm uninstall ts-node -g
    
    npm install -D ts-node@8.4.1 (-D = --save-dev)  

    ④ Install typescript in the project

    npm install typescript@3.6.4 -D  

    ⑤ Create a src directory containing crowller.ts, then change the scripts in package.json

    "scripts": {
         "dev": "ts-node ./src/crowller.ts"
    },  

    Run npm run dev

    ⑥ Install superagent, which sends HTTP requests from Node to fetch data

    npm install superagent@5.1.1 --save  

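    superagent is written in JavaScript, so importing it directly into TypeScript is flagged as an error: the compiler has no type information for it

    ts -> .d.ts (translation file, i.e. type definition file) -> js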

    npm install @types/superagent@4.1.4 -D
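
    The role of a type definition file can be illustrated without superagent itself. In the sketch below, `legacyGet` stands in for an untyped JavaScript function, and `ResponseLike` and `TypedGet` are hypothetical names playing the part that `@types/superagent` plays for the real library. It is an analogy, not the actual contents of that package.

```typescript
// Stand-in for a plain JavaScript function: typed as `any`,
// so the compiler knows nothing about its parameters or result.
const legacyGet: any = function (url: string) {
  return { text: '<html>' + url + '</html>' };
};

// Hypothetical declarations, analogous to what a .d.ts file supplies.
interface ResponseLike {
  text: string;
}
type TypedGet = (url: string) => ResponseLike;

// Once the declaration is attached, a call such as typedGet(42)
// becomes a compile-time error instead of a runtime surprise.
const typedGet: TypedGet = legacyGet;

const response = typedGet('demo');
console.log(response.text);
```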
    

    ⑦ Fetch the page content  

    import superagent from 'superagent';
    
    class Crowller {
         private secret = 'secretKey';
         private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;
         private rawHtml = '';
    
         async getRawHtml() {
              const result = await superagent.get(this.url);
              this.rawHtml = result.text;
         }
    
         constructor() {
              this.getRawHtml();
         }
    }
    
    const crowller = new Crowller()
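
    In the class above, the constructor calls `getRawHtml()` without handling its promise, so a failed request would surface as an unhandled rejection. The following sketch shows one possible way to catch the failure and to escape the key in the URL. `Fetcher`, `buildUrl`, and `fakeFetcher` are hypothetical names introduced so the example runs without a network; in the real crawler `superagent.get` would take the fetcher's place.

```typescript
// Hypothetical fetcher type; superagent.get would fill this role in the crawler.
type Fetcher = (url: string) => Promise<{ text: string }>;

function buildUrl(secret: string): string {
  // encodeURIComponent keeps a key containing '&' or spaces
  // from corrupting the query string.
  return 'http://www.dell-lee.com/typescript/demo.html?secret=' + encodeURIComponent(secret);
}

async function getRawHtml(fetcher: Fetcher, secret: string): Promise<string> {
  try {
    const result = await fetcher(buildUrl(secret));
    return result.text;
  } catch (error) {
    // Report the failure instead of leaving an unhandled rejection.
    console.error('request failed:', error);
    return '';
  }
}

// Usage with a fake fetcher that echoes the URL it was given.
const fakeFetcher: Fetcher = (url) => Promise.resolve({ text: 'fetched ' + url });
getRawHtml(fakeFetcher, 'a b').then((html) => console.log(html));
```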
    

       

    4. Extracting data with cheerio

    ① Install cheerio, which lets you select parts of the page with jQuery-style syntax

    npm install cheerio --save  

    ② Install the type definitions @types/cheerio

    npm install @types/cheerio -D
    

    ③ Extract the data  

    import superagent from 'superagent';
    import cheerio from 'cheerio';
    
    interface Course {
      title: string
    }
    
    class Crowller {
      private secret = 'secretKey';
      private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;
    
      getCourseInfo(html: string) {
         const $ = cheerio.load(html);
         const courseItems = $('.course-item');
         const courseInfos: Course[] = [];
         courseItems.map((index, element) => {
           const descs = $(element).find('.course-desc');
           const title = descs.eq(0).text();
           courseInfos.push({
             title
           })
         })
         const result = {
           time: new Date().getTime(),
           data: courseInfos
         }
         console.log(result);
      }
    
      async getRawHtml() {
        const result = await superagent.get(this.url);
        this.getCourseInfo(result.text);
      }
    
      constructor() {
         this.getRawHtml();
      }
    }
    
    const crowller = new Crowller()
    

    5. Designing the structure of the crawled data and storing it

    //ts -> .d.ts (type definition file) -> js
    import fs from 'fs';
    import path from 'path';
    import superagent from 'superagent';
    import cheerio from 'cheerio';
    
    interface Course {
        title: string
    }
    
    interface CourseResult {
        time: number;
        data: Course[];
    }
    
    interface Content {
        [propName: number]: Course[];
    }
    
    class Crowller {
          private secret = 'secretKey';
          private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;
    
          getCourseInfo(html: string) {
                const $ = cheerio.load(html);
                const courseItems = $('.course-item');
                const courseInfos: Course[] = [];
                courseItems.map((index, element) => {
                      const descs = $(element).find('.course-desc');
                      const title = descs.eq(0).text();
                      courseInfos.push({
                              title
                      })
                })
                return {
                      time: new Date().getTime(),
                      data: courseInfos
                }
          }
    
          async getRawHtml() {
                const result = await superagent.get(this.url);
                return result.text;
          }
    
          generateJsonContent(courseInfo: CourseResult) {
               const filePath = path.resolve(__dirname, '../data/course.json')
               let fileContent: Content = {};
               //If the file exists, read its previous content
               if(fs.existsSync(filePath)){
                  fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
               }
               fileContent[courseInfo.time] = courseInfo.data;
               return fileContent;
          }
    
          async initSpiderProcess() {
                const filePath = path.resolve(__dirname, '../data/course.json')
                const html = await this.getRawHtml();
                const courseInfo = this.getCourseInfo(html);
                const fileContent = this.generateJsonContent(courseInfo);
                //Write the updated content (the data directory must already exist)
                fs.writeFileSync(filePath, JSON.stringify(fileContent));
          }
    
          constructor() {
                this.initSpiderProcess();
          }
    }
    
    const crowller = new Crowller()
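
    The shape of course.json follows from the `Content` interface: each crawl adds one entry keyed by its timestamp. A minimal sketch of that merge step, with the file I/O left out, is shown below. `mergeContent` is a hypothetical helper, and the timestamps and titles are made up for illustration.

```typescript
interface Course {
  title: string;
}

interface CourseResult {
  time: number;
  data: Course[];
}

interface Content {
  [propName: number]: Course[];
}

// Pure version of the merge step: reading and writing the file are omitted.
function mergeContent(previous: Content, result: CourseResult): Content {
  const merged: Content = { ...previous };
  merged[result.time] = result.data;
  return merged;
}

// Hypothetical timestamps and titles, for illustration only.
const firstRun = mergeContent({}, { time: 1000, data: [{ title: 'Course A' }] });
const secondRun = mergeContent(firstRun, { time: 2000, data: [{ title: 'Course B' }] });

console.log(JSON.stringify(secondRun));
// prints {"1000":[{"title":"Course A"}],"2000":[{"title":"Course B"}]}
```

    Because earlier entries are kept, the file grows by one key per run and records how the page changed over time.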
    

     6. Refactoring with the composition design pattern

    ① Generic crawler class - crowller.ts

    //ts -> .d.ts (type definition file) -> js
    import fs from 'fs';
    import path from 'path';
    import superagent from 'superagent';
    import DellAnalyzer from './dellAnaiyzer';
    
    export interface Analyzer {
          analyze: (html: string, filePath: string) => string
    }
    
    class Crowller {
          private filePath = path.resolve(__dirname, '../data/course.json');
    
          async getRawHtml() {
                const result = await superagent.get(this.url);
                return result.text;
          }
    
          writeFile(content: string){
                fs.writeFileSync(this.filePath, content); 
          }
    
          async initSpiderProcess() {
                const html = await this.getRawHtml();
                const fileContent = this.analyzer.analyze(html, this.filePath);
                //Write the updated content
                this.writeFile(fileContent)
          }
    
          constructor(private url: string, private analyzer: Analyzer) {
                this.initSpiderProcess();
          }
    }
    
    const secret = 'secretKey';
    const url = `http://www.dell-lee.com/typescript/demo.html?secret=${secret}`;
    
    const analyzer = new DellAnalyzer();
    new Crowller(url, analyzer)  

    ② Page-specific analysis strategy - dellAnaiyzer.ts

    import cheerio from 'cheerio';
    import fs from 'fs';
    import { Analyzer } from './crowller';
    
    interface Course {
      title: string
    }
    
    interface CourseResult {
      time: number;
      data: Course[];
    }
    
    interface Content {
      [propName: number]: Course[];
    }
    
    //Analyzer
    export default class DellAnalyzer implements Analyzer{
        private getCourseInfo(html: string) {
          const $ = cheerio.load(html);
          const courseItems = $('.course-item');
          const courseInfos: Course[] = [];
          courseItems.map((index, element) => {
                const descs = $(element).find('.course-desc');
                const title = descs.eq(0).text();
                courseInfos.push({
                        title
                })
          })
          return {
                time: new Date().getTime(),
                data: courseInfos
          }
        }
    
        generateJsonContent(courseInfo: CourseResult, filePath: string) {
            let fileContent: Content = {};
            //If the file exists, read its previous content
            if(fs.existsSync(filePath)){
              fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
            }
            fileContent[courseInfo.time] = courseInfo.data;
            return fileContent;
        }
    
        public analyze(html: string, filePath: string) {
          const courseInfo = this.getCourseInfo(html);
          const fileContent = this.generateJsonContent(courseInfo, filePath);
          return JSON.stringify(fileContent)
        }
    }
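
    The benefit of this split is that the crawler depends only on the `Analyzer` interface, so a different page can be handled by passing in a different analyzer. The sketch below illustrates the idea with a hypothetical `LengthAnalyzer` and a trimmed-down `MiniCrowller` that receives the HTML directly, so it needs neither network nor disk. Neither class appears in the original project.

```typescript
interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

// Hypothetical second strategy: reports the page size instead of parsing courses.
class LengthAnalyzer implements Analyzer {
  analyze(html: string, filePath: string): string {
    return JSON.stringify({ filePath, length: html.length });
  }
}

// Trimmed-down crawler: it only delegates to whichever analyzer it was given.
class MiniCrowller {
  constructor(private filePath: string, private analyzer: Analyzer) {}

  process(html: string): string {
    return this.analyzer.analyze(html, this.filePath);
  }
}

const output = new MiniCrowller('data/length.json', new LengthAnalyzer()).process('<p>hi</p>');
console.log(output); // prints {"filePath":"data/length.json","length":9}
```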
    

    7. Singleton pattern in practice

    Convert DellAnalyzer to a singleton 

     private static instance: DellAnalyzer;
    
     static getInstance() {
          if(!DellAnalyzer.instance) {
            DellAnalyzer.instance = new DellAnalyzer();
          }
          return DellAnalyzer.instance;
     }
    
    …… ……
    
    private constructor() {}
    

    ②crowller.ts  

    const analyzer = DellAnalyzer.getInstance();
    new Crowller(url, analyzer)
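
    Put together, the singleton fragments above form a class like the following. This is a minimal, self-contained sketch with the analysis methods omitted; it only demonstrates that repeated calls to `getInstance()` return the same object.

```typescript
class DellAnalyzer {
  private static instance: DellAnalyzer;

  static getInstance(): DellAnalyzer {
    // Create the instance lazily on first use, then reuse it.
    if (!DellAnalyzer.instance) {
      DellAnalyzer.instance = new DellAnalyzer();
    }
    return DellAnalyzer.instance;
  }

  // Private, so `new DellAnalyzer()` outside the class is a compile error.
  private constructor() {}
}

const first = DellAnalyzer.getInstance();
const second = DellAnalyzer.getInstance();
console.log(first === second); // prints true
```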
    

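    8. A closer look at how TypeScript is compiled and run  

    ① Add a script to package.json

    "build": "tsc -w"   //compiles all ts files; -w watches them and recompiles on change

    ② In tsconfig.json, send the compiled output to the build directory

    "outDir": "./build"  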

    ③ Install nodemon, which watches project files and reacts when they change

    npm install nodemon -D

    ④ Add a script to package.json

    "start": "nodemon node ./build/crowller.js"  

    ⑤ Add nodemonConfig to package.json so that changes under data do not trigger a restart

    "nodemonConfig": {
              "ignore": [
                 "data/*"
              ]
    }

    ⑥ Install concurrently to run the build and start commands in parallel

    npm install concurrently -D

    ⑦ Update the scripts in package.json

    "scripts": {
          "dev:build": "tsc -w",
          "dev:start":  "nodemon node ./build/crowller.js",
          "dev": "concurrently npm:dev:*"
    }
    

    Note: this project comes from imooc (慕课网) 

  • Original article: https://www.cnblogs.com/ljq66/p/14512920.html