  • I wrote a little crawler too ^_^

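    After work today I spent some time going through Evernote and organized a week's worth of notes on the Node API. Then I went to imooc (慕课网) and watched Scott's Node video tutorial series. Afterwards I wrote a tiny crawler that scrapes the post list of my own blog. I won't go through the ES6 syntax and the APIs it uses one by one; you can look them up in the docs at http://nodeapi.ucdok.com/#/api/. Without further ado, here is the code: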

    'use strict';
    {
        const http = require(`http`);
        
        const cheerio = require(`cheerio`);
    
        const fs = require(`fs`);
    
        let url = `http://www.cnblogs.com/tween`;
    
        http.get(url, (res) => {
    
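            // Decode chunks as UTF-8 so multi-byte characters split across chunks stay intact
            res.setEncoding(`utf8`);

            let content = ``;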
    
            res.on(`data`, (data) => {
    
                content += data;
    
            }).on(`end`, () => {
    
                let html = getContent(content);
    
                creatTxt(html);
    
            });
    
        }).on(`error`, (err) => console.log(`Failed to fetch data: ${err.message}`));
    
        let creatTxt = content => {
    
            let txt = ``;
    
            for(let v of content){
                txt += v.time;
                let blog = v.blog;
                for(let v of blog){
                    let json = v;
                    for(let name in json){
                        txt += json[name];
                    }
                    txt += `\n`;
                }
            }
            
            fs.writeFile(`blog.txt`,txt,'utf-8',(err) => {
    
                err ? console.log(err) : console.log(`File written successfully`);
    
            });
    
        };
        let getContent = content => {
    
            let $ = cheerio.load(content);
    
            let blogs = $(`.day`);
    
            let arr = [];
    
            blogs.each( (index, item) => {
    
                let _this = $(item);
    
                let time = _this.find(`.dayTitle`).text();
    
                let indexBlog = [];
    
                _this.find(`.postTitle`).each((index, item) => {
    
                    let title = $(item).text().trim();
    
                    let list = _this.find(`.postDesc`).eq(index).text();
    
                    // The first digit run is taken as the read count, the second as the comment count
                    let read = list.match(/(\d+)/g)[0].trim();

                    let comment = list.match(/(\d+)/g)[1].trim();
    
                    indexBlog[index] = {
                        title:`\t${title}\n`,
                        read:`\tReads: ${read} Comments: ${comment}\n`,
                    };
                });
                arr[index] = {
                    time:`${index+1}. ${time.trim()}\n`,
                    blog:indexBlog
                };
    
            });
            return arr;
        };
    }
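
The two `match` calls in `getContent` take the first two digit runs found in the description text. If that text also carries a date or time, those digits would be picked up first. Below is a minimal sketch of a narrower extraction that only accepts digits inside parentheses. The helper name `parseCounts` and the sample description string are assumptions made for illustration; the real page markup may differ.

```javascript
'use strict';

// Hypothetical helper: pull the read and comment counts out of a post description.
// The sample text format used below is an assumption, not taken from the live page.
const parseCounts = (desc) => {
    // Only match digits that sit inside parentheses, so dates and times are skipped
    const counts = [...desc.matchAll(/\((\d+)\)/g)].map((m) => Number(m[1]));

    // Fall back to null when a count is missing instead of throwing
    const [read = null, comment = null] = counts;
    return { read, comment };
};

const sample = `posted @ 2016-04-21 23:02 tween 阅读(128) 评论(3) 编辑`;
console.log(parseCounts(sample));           // { read: 128, comment: 3 }
console.log(parseCounts(`no counts here`)); // { read: null, comment: null }
```

Returning `null` for a missing count keeps one malformed entry from aborting the whole run, which the unguarded `match(...)[0]` would do.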

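    After running it, a blog.txt file is created in the current working directory (the script's own directory if you run it from there), and its contents are the scraped data.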
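
A relative path such as `blog.txt` is resolved against the process's current working directory, so the file can end up somewhere unexpected when the script is started from another folder. The following is a rough sketch of building the text and writing it to an explicit location. The helper `renderDays` and the sample data are made up for illustration, and a temporary directory stands in for the real target.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: turn the scraped structure into plain text.
// The shape ({ time, blog: [{ title, read }] }) mirrors what getContent returns.
const renderDays = (days) =>
    days
        .map((day) => day.time + day.blog.map((post) => post.title + post.read + `\n`).join(``))
        .join(``);

// Sample data invented for this sketch
const days = [
    {
        time: `1. 2016-04-21\n`,
        blog: [{ title: `\tA small crawler\n`, read: `\tReads: 10 Comments: 2\n` }],
    },
];

const text = renderDays(days);

// Assumption: a fresh temp directory is used here; use __dirname instead
// to write next to the script regardless of where it is started from
const dir = fs.mkdtempSync(path.join(os.tmpdir(), `crawler-`));
const target = path.join(dir, `blog.txt`);

fs.writeFileSync(target, text, `utf-8`);
console.log(`Wrote ${text.length} characters to ${target}`);
```

Keeping the rendering in a pure function separate from the file write also makes the output format easy to check without touching the network or the disk.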

  • Original post: https://www.cnblogs.com/tween/p/5419490.html