  • Learning Druid (4): Druid's Data Ingestion Formats

    Author: Syn良子 Source: https://www.cnblogs.com/cssdongl/p/9715735.html Please credit the source when reposting.

    Druid's data ingestion formats


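    Druid can ingest denormalized data in JSON, CSV, or a delimited format such as TSV, and it also supports custom formats. Most of the documentation uses JSON, but configuring Druid to ingest the other delimited formats is not difficult.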

    Currently supported data formats



    JSON

    {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
    

    CSV

    2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143
    

    TSV

    2013-08-31T01:02:33Z    "Gypsy Danger"  "en"    "nuclear"   "true"  "true"  "false" "false" "article"   "North America" "United States" "Bay Area"  "San Francisco" 57  200 -143
    

    Note that CSV and TSV data must not include a header row; keep this in mind when ingesting data.
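    Because the rows carry no header, the column names have to come from somewhere else. The following is a minimal Python sketch (not Druid code) that parses the three sample rows above with a hand-written column list and checks that they describe the same record. The helper name parse_delimited is hypothetical, and the TSV row is rebuilt from the CSV row on the assumption that the original separators were tabs.

```python
import csv
import io
import json

# Column names supplied by hand, since the CSV/TSV rows have no header line.
COLUMNS = ["timestamp", "page", "language", "user", "unpatrolled", "newPage",
           "robot", "anonymous", "namespace", "continent", "country", "region",
           "city", "added", "deleted", "delta"]

JSON_ROW = (
    '{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", '
    '"language": "en", "user": "nuclear", "unpatrolled": "true", '
    '"newPage": "true", "robot": "false", "anonymous": "false", '
    '"namespace": "article", "continent": "North America", '
    '"country": "United States", "region": "Bay Area", '
    '"city": "San Francisco", "added": 57, "deleted": 200, "delta": -143}'
)
CSV_ROW = (
    '2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true",'
    '"false","false","article","North America","United States",'
    '"Bay Area","San Francisco",57,200,-143'
)
# Assumption: the TSV sample uses tab separators. Replacing commas is safe
# here only because no field in this row contains a comma.
TSV_ROW = CSV_ROW.replace(",", "\t")


def parse_delimited(line, delimiter):
    """Parse one header-less delimited row into a dict keyed by COLUMNS."""
    values = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


csv_record = parse_delimited(CSV_ROW, ",")
tsv_record = parse_delimited(TSV_ROW, "\t")
# JSON keeps numbers as ints; convert to strings so the records are comparable.
json_record = {key: str(value) for key, value in json.loads(JSON_ROW).items()}
```

    Delimited rows yield only strings, so the numeric fields (added, deleted, delta) are compared as text here; a real ingestion pipeline would apply types later.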

    Custom formats


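    Druid supports custom data formats through regex parsing and JavaScript parsing. However, these parsers are less efficient than writing your own Java parser or preprocessing the data with an external stream-processing tool.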
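    To illustrate the idea behind regex-based parsing, here is a small Python sketch in which each capture group of a pattern is mapped to a column name. It is not Druid's parser; the line format, the pattern, the column list, and the function name parse_custom are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical custom line format (not from the article): fields joined by " | ".
LINE = "2013-08-31T01:02:33Z | Gypsy Danger | en | 57"

# One capture group per column, in the same order as COLUMNS.
PATTERN = re.compile(r"^(\S+) \| (.+?) \| (\w+) \| (-?\d+)$")
COLUMNS = ["timestamp", "page", "language", "added"]


def parse_custom(line):
    """Map the regex capture groups of one line onto the column names."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"line does not match the expected format: {line!r}")
    return dict(zip(COLUMNS, match.groups()))


record = parse_custom(LINE)
```

    Every line passes through the regex engine, which gives a feel for why this route costs more than a purpose-built parser.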

    Configuring the ingestion schema


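    What is a data schema? It is the metadata that a Druid indexing (ingestion) task needs in order to describe its data source. It mainly specifies the type of data to ingest, which columns the data consists of, which columns are metrics, which are dimensions, the time granularity, and so on.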

    Taking the CSV format as an example:

    "parseSpec": {
      "format" : "csv",
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
      "dimensionsSpec" : {
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      }
    }
    

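    The parseSpec declares the format of the source data: here format states that it is CSV, timestampSpec names timestamp as the timestamp column, columns lists the field names in the order they appear in each row, and dimensionsSpec states which fields can be used as dimensions.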
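    To make the roles of these fields concrete, the following Python sketch applies the same parseSpec, written as a dict, to the CSV sample row. It is only an illustration of how the pieces fit together, not how Druid works internally. The function name apply_parse_spec is hypothetical, and the timestamp is assumed to be in the ISO 8601 UTC form shown in the sample.

```python
import csv
import io
from datetime import datetime, timezone

# The parseSpec from the article, expressed as a Python dict.
PARSE_SPEC = {
    "format": "csv",
    "timestampSpec": {"column": "timestamp"},
    "columns": ["timestamp", "page", "language", "user", "unpatrolled",
                "newPage", "robot", "anonymous", "namespace", "continent",
                "country", "region", "city", "added", "deleted", "delta"],
    "dimensionsSpec": {
        "dimensions": ["page", "language", "user", "unpatrolled", "newPage",
                       "robot", "anonymous", "namespace", "continent",
                       "country", "region", "city"]
    },
}

CSV_ROW = (
    '2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true",'
    '"false","false","article","North America","United States",'
    '"Bay Area","San Francisco",57,200,-143'
)


def apply_parse_spec(line, spec):
    """Split one CSV row into (timestamp, dimensions, other fields)."""
    values = next(csv.reader(io.StringIO(line)))
    row = dict(zip(spec["columns"], values))

    # Assumption: timestamps look like 2013-08-31T01:02:33Z (UTC).
    ts_column = spec["timestampSpec"]["column"]
    timestamp = datetime.strptime(
        row[ts_column], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)

    dimensions = {name: row[name] for name in spec["dimensionsSpec"]["dimensions"]}
    # Whatever is neither the timestamp nor a dimension is left over.
    others = {name: value for name, value in row.items()
              if name != ts_column and name not in dimensions}
    return timestamp, dimensions, others


timestamp, dimensions, others = apply_parse_spec(CSV_ROW, PARSE_SPEC)
```

    In this sample the leftover fields are added, deleted, and delta, the numeric columns that are not listed as dimensions.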

    Reference: Druid data formats
