  • Full-stack web application [scrape data (scrapy) -> save to database through a RESTful API -> push to the front end over websocket for display]

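    This post covers the core of one feature of a subproject of

    https://github.com/fanqingsong/web_full_stack_application

    Scrapy crawls and parses the data, the Python requests library pushes the parsed data to a webservice API, and the webservice API saves it to a MongoDB database.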

    Implementation steps:

    1. Use the requests library to talk to the webservice API.

    2. Use scrapy to crawl the data.

    3. Combine steps 1 and 2 into the complete feature.

    Requests library (save to DB through RESTful API)

    For installation and a quick start, see:

    http://docs.python-requests.org/en/master/user/quickstart/#response-content

    Sample code that passed testing:

    insert_to_db.py

    import requests


    class ApiError(Exception):
        """Raised when the webservice returns an unexpected status code."""


    # ------------- GET --------------
    resp = requests.get('http://localhost:3000/api/v1/summary')
    if resp.status_code != 200:
        # This means something went wrong.
        raise ApiError('GET /api/v1/summary {}'.format(resp.status_code))

    for todo_item in resp.json():
        print('{} {}'.format(todo_item['Technology'], todo_item['Count']))

    # ------------- POST --------------
    technology = {"Technology": "Django", "Count": "50"}

    resp = requests.post('http://localhost:3000/api/v1/summary', json=technology)
    if resp.status_code != 201:
        raise ApiError('POST /api/v1/summary {}'.format(resp.status_code))

    print("-------------------")
    print(resp.text)

    print('Created Technology. ID: {}'.format(resp.json()["_id"]))
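To try insert_to_db.py without the real webservice and MongoDB running, a small stand-in server can help. The sketch below is not the project's webservice: it is a hypothetical stub built on Python's standard http.server that keeps records in memory and mirrors only what the script relies on (GET returns 200 with a JSON list, POST returns 201 with the stored record plus an `_id`). The names `RECORDS`, `StubSummaryHandler` and `start_stub_server` are made up for illustration.

```python
import json
import threading
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory stand-in for the MongoDB collection (hypothetical, local testing only).
RECORDS = []


class StubSummaryHandler(BaseHTTPRequestHandler):
    """Answers the two calls insert_to_db.py makes on /api/v1/summary."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path != "/api/v1/summary":
            self._send_json(404, {"error": "not found"})
            return
        self._send_json(200, RECORDS)

    def do_POST(self):
        if self.path != "/api/v1/summary":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        record = json.loads(self.rfile.read(length))
        # Assumption: the real service returns the stored document with its "_id".
        record["_id"] = uuid.uuid4().hex
        RECORDS.append(record)
        self._send_json(201, record)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_stub_server(port=0):
    """Start the stub in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), StubSummaryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_stub_server(3000)` before running the script lets both requests succeed against localhost:3000; call `server.shutdown()` on the returned object when done.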

    Python virtualenv runtime environment

    https://realpython.com/python-virtual-environments-a-primer/

    Create a new virtual environment inside the directory:

    # Python 2:
    $ virtualenv env
    
    # Python 3
    $ python3 -m venv env
    

    Note: By default, this will not include any of your existing site packages.

    Activate on Windows:

    env\Scripts\activate
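After activation, it can be worth confirming that Python really runs inside the virtual environment before installing scrapy and requests. A minimal check, assuming Python 3 (the helper name `in_virtualenv` is made up for illustration):

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment while sys.base_prefix
    # points at the base installation. Older virtualenv releases set
    # sys.real_prefix instead, so that attribute is checked first.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


print("virtualenv active:", in_virtualenv())
```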

    Scrapy (scrape data)

    https://scrapy.org/

    An open source and collaborative framework for extracting the data you need from websites.

    In a fast, simple, yet extensible way.

    https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/architecture.html

    Scrapy architecture

    Installation and usage reference:

    https://www.cnblogs.com/lightsong/p/8732537.html

    Fixes for errors hit during installation and running:

    1. Scrapy fails at runtime with ImportError: No module named win32api

    https://blog.csdn.net/u013687632/article/details/57075514

    pip install pypiwin32

    2. error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    https://www.cnblogs.com/baxianhua/p/8996715.html

    1. Download the Twisted whl file that matches your setup from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (mine was Twisted‑17.5.0‑cp36‑cp36m‑win_amd64.whl). The number after cp is the Python version, and amd64 means 64-bit.

    2. Run the command:

    pip install C:\Users\CR\Downloads\Twisted-17.5.0-cp36-cp36m-win_amd64.whl

    Sample code:

    quotes_spider.py

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/tag/humor/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }

            next_page = response.css('li.next a::attr("href")').extract_first()
            if next_page is not None:
                yield response.follow(next_page, self.parse)

    In this directory, run

    scrapy runspider quotes_spider.py -o quotes.json

    Output:

    [
    {"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
    {"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
    {"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
    {"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
    {"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
    {"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
    {"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
    {"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
    {"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
    {"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
    {"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
    {"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
    ]
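The exported file is ordinary JSON, so it can be loaded back for a quick sanity check before wiring up the database step. The sketch below parses two of the records listed above from an inline string and counts quotes per author. The helper name `quotes_per_author` is made up for illustration.

```python
import json
from collections import Counter


def quotes_per_author(quotes):
    """Count how many scraped quotes belong to each author."""
    return Counter(quote["author"] for quote in quotes)


# Two records from the output above. The raw string keeps the \uXXXX escapes
# literal so that json.loads decodes them into curly quotation marks.
sample = r'''[
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"}
]'''

quotes = json.loads(sample)
print(quotes[0]["text"])
print(quotes_per_author(quotes))
```

For the real file, replace the inline string with `json.load(open("quotes.json", encoding="utf-8"))`.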

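    End-to-end example

    https://github.com/fanqingsong/web_data_visualization

    Because the zhipin website has anti-crawler measures, this example uses quotes, scrapy's official crawling example, as the subject.

    The flow is:

    1. Crawl the data with two scrapy components: spider and item pipeline.

    2. Save to the database: the post method of the requests library pushes the data to the API of the webservice_quotes server.

    3. webservice_quotes saves the data to MongoDB.

    4. The browser opens the Vue page and connects to the websocket_quotes server.

    5. websocket_quotes periodically (every 1 s) reads the data from MongoDB and pushes it to the browser, where it is cached as the Vue application's data, and data is bound to the template view.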
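Step 5 is a poll-and-push loop. In the project it runs in the Node-based websocket_quotes server; the sketch below only illustrates the same idea in Python with asyncio and is not the project's code. The database read and the websocket send are replaced by hypothetical in-memory stand-ins (`fake_db`, `fake_browser`) so the loop can run on its own.

```python
import asyncio


async def push_loop(read_quotes, clients, interval=1.0, rounds=None):
    """Every `interval` seconds, read all quotes and send them to each client.

    read_quotes: callable returning the current list of quotes
                 (stand-in for a MongoDB query).
    clients:     collection of async callables, one per connected browser
                 (stand-in for websocket sends).
    rounds:      stop after this many pushes; None means run forever.
    """
    pushed = 0
    while rounds is None or pushed < rounds:
        snapshot = read_quotes()
        for send in list(clients):
            await send(snapshot)
        pushed += 1
        await asyncio.sleep(interval)


# Hypothetical stand-ins so the sketch runs without MongoDB or a browser.
fake_db = [{"text": "example quote", "author": "example author"}]
received = []


async def fake_browser(snapshot):
    # A real client would replace the Vue data with this snapshot.
    received.append(list(snapshot))


asyncio.run(push_loop(lambda: fake_db, [fake_browser], interval=0.01, rounds=3))
print(len(received))
```

Sending the full snapshot on every tick is simple but re-sends unchanged data; that matches the flow described above and is acceptable for a small demo collection.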

    Scrapy item pipeline pushing data to the webservice API

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


    import requests


    class ApiError(Exception):
        """Raised when the webservice returns an unexpected status code."""


    class ScratchZhipinPipeline(object):
        def process_item(self, item, spider):

            print("--------------------")
            print(item['text'])
            print(item['author'])
            print("--------------------")

            # save to db through web service
            # dict(item) keeps the payload JSON-serializable for dicts and scrapy Items alike
            resp = requests.post('http://localhost:3001/api/v1/quote', json=dict(item))
            if resp.status_code != 201:
                raise ApiError('POST /api/v1/quote {}'.format(resp.status_code))
            print(resp.text)
            print('Created quote. ID: {}'.format(resp.json()["_id"]))

            return item
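As the comment at the top of the pipeline file notes, the pipeline only runs once it is listed in the ITEM_PIPELINES setting. A minimal settings.py excerpt might look like the following. The package name `scratch_zhipin` is an assumption inferred from the class name, so substitute the package that actually contains pipelines.py.

```python
# settings.py (excerpt)
# "scratch_zhipin" is an assumed project package name; use the package that
# actually contains pipelines.py in your project.
ITEM_PIPELINES = {
    "scratch_zhipin.pipelines.ScratchZhipinPipeline": 300,
}
```

The integer sets the order in which pipelines run (lower values first, conventionally in the 0 to 1000 range); with a single pipeline the exact value does not matter.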

    Run the crawler: scrapy crawl quotes

    Run the webservice: npm run webservice_quotes

    Run the websocket server: npm run websocket_quotes

    Run the Vue dev environment: npm run dev

    [Screenshots of the result in Chrome and in the db]

    Generating a requirements.txt file in Python

    http://www.cnblogs.com/zhaoyingjie/p/6645811.html

    Quickly generate a requirements.txt install file:
    (CenterDesigner) xinghe@xinghe:~/PycharmProjects/CenterDesigner$ pip freeze > requirements.txt
    Install the required packages:

    pip install -r requirements.txt


  • Original article: https://www.cnblogs.com/lightsong/p/9624433.html