  • A Crawler Project Based on the Scrapy Framework (Part 1)

    ['skræpi:]

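    I. References

    1. Official Chinese documentation: https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

    2. A simple, easy-to-use crawler framework (simplified-scrapy)

    3. Installing and getting started with the Scrapy crawler framework: https://www.jianshu.com/p/6bc5a4641629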

    II. How to use simplified-scrapy

    1. Install the simplified-scrapy package

    pip install simplified-scrapy

    2. Write and run a Python file

    from simplified_scrapy.core.spider import Spider


    class ScrapydSpider(Spider):
        name = 'scrapyd-spider'  # spider name
        start_urls = ['http://www.scrapyd.cn/']  # entry URLs
        # models = ['auto_main','auto_obj']  # extraction models

        def urlFilter(self, url):
            # Crawl filter: only collect tutorial pages
            return url.find('/jiaocheng/') > 0

        # from simplified_scrapy.core.mongo_objstore import MongoObjStore
        # obj_store = MongoObjStore(name,{'host':'127.0.0.1','port':27017})

        # from simplified_scrapy.core.mongo_urlstore import MongoUrlStore
        # url_store = MongoUrlStore(name,{"multiQueue":True})

        # from simplified_scrapy.core.mongo_htmlstore import MongoHtmlStore
        # html_store = MongoHtmlStore(name)

        # Custom data extraction method
        def extract(self, url, html, models, modelNames):
            try:
                html = self.removeScripts(html)  # strip script content (optional)
                lstA = self.listA(html, url["url"])  # extract the links on the page
                data = []
                ele = self.getElementByTag("h1", html)  # title
                if ele:
                    title = ele.text
                    ele = self.getElementByClass("cont", html, "</h1>")  # body
                    if ele:
                        content = ele.innerHtml
                        # author and time
                        ele = self.getElementsByTag("span", html, 'class="title-2"', 'class="cont"')
                        author = None
                        time = None
                        if ele and len(ele) > 1:
                            time = ele[0].text
                            author = ele[1].text
                        data.append({"Url": url["url"], "Title": title, "Content": content,
                                     "Author": author, "Time": time})

                # Return the links and data to the framework for processing
                return [{"Urls": lstA, "Data": data}]
            except Exception as e:
                print(e)


    from simplified_scrapy.simplified_main import SimplifiedMain  # main entry point
    SimplifiedMain.startThread(ScrapydSpider())  # start the spider
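
The `urlFilter` check in the spider relies on `str.find`, which returns the index of the first match, or -1 when the substring is absent. The sketch below isolates that predicate so it can be tried without the framework. The function name `is_tutorial_url` and the sample URLs are illustrative assumptions, not part of simplified-scrapy.

```python
def is_tutorial_url(url: str) -> bool:
    """Mirror the spider's filter: keep URLs that contain '/jiaocheng/'."""
    # str.find returns -1 when the substring is absent. Comparing with > 0
    # also rejects a match at index 0, which cannot happen for absolute
    # URLs because they start with a scheme such as 'http://'.
    return url.find('/jiaocheng/') > 0


# Hypothetical URLs, used only to exercise the predicate
sample_urls = [
    'http://www.scrapyd.cn/jiaocheng/140.html',
    'http://www.scrapyd.cn/doc/',
]
kept = [u for u in sample_urls if is_tutorial_url(u)]
print(kept)
```

If relative links such as `/jiaocheng/...` should also pass, `'/jiaocheng/' in url` would be the more forgiving test.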

    3. By default, the extracted data is saved as JSON in a folder named data, located in the same directory as the script
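
To inspect the saved results, something like the following could be used. It is only a sketch: the source does not say how files inside `data` are named or laid out, so the code assumes files ending in `.json` that hold either a single JSON document or one JSON object per line. The helper name `load_records` is hypothetical.

```python
import json
from pathlib import Path


def load_records(data_dir="data"):
    """Collect JSON records from every *.json file directly under data_dir.

    Assumption: each file is either one JSON document or JSON Lines
    (one object per line). The framework's real layout may differ.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    records = []
    for path in sorted(root.glob("*.json")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue
        try:
            parsed = json.loads(text)  # whole file is one JSON document
            records.extend(parsed if isinstance(parsed, list) else [parsed])
        except json.JSONDecodeError:
            # Fall back to JSON Lines: parse each non-empty line separately
            records.extend(json.loads(line) for line in text.splitlines() if line.strip())
    return records


if __name__ == "__main__":
    for record in load_records():
        if isinstance(record, dict):
            # "Title" and "Url" are the keys used by the spider's extract method
            print(record.get("Title"), record.get("Url"))
```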

  • Original article: https://www.cnblogs.com/StarZhai/p/12120848.html