zoukankan      html  css  js  c++  java
  • Druid 0.17入门(4)—— 数据查询方式大全

    本文介绍Druid查询数据的方式,首先我们保证数据已经成功载入。

    Druid查询基于HTTP,Druid提供了查询视图,并对结果进行了格式化。

    Druid提供了三种查询方式,SQL,原生JSON,CURL。

    一、SQL查询

    我们用wiki的数据为例

    查询10条最多的页面编辑

    SELECT page, COUNT(*) AS Edits
    FROM wikipedia
    WHERE TIMESTAMP '2015-09-12 00:00:00' <= "__time" AND "__time" < TIMESTAMP '2015-09-13 00:00:00'
    GROUP BY page
    ORDER BY Edits DESC
    LIMIT 10
    

    我们在Query视图中操作

    会有提示

    选择Smart query limit会自动限制行数

    Druid还提供了命令行查询sql 可以运行bin/dsql进行操作

    Welcome to dsql, the command-line client for Druid SQL.
    Type "h" for help.
    dsql>
    

    提交sql

    dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
    ┌──────────────────────────────────────────────────────────┬───────┐
    │ page                                                     │ Edits │
    ├──────────────────────────────────────────────────────────┼───────┤
    │ Wikipedia:Vandalismusmeldung                             │    33 │
    │ User:Cyde/List of candidates for speedy deletion/Subpage │    28 │
    │ Jeremy Corbyn                                            │    27 │
    │ Wikipedia:Administrators' noticeboard/Incidents          │    21 │
    │ Flavia Pennetta                                          │    20 │
    │ Total Drama Presents: The Ridonculous Race               │    18 │
    │ User talk:Dudeperson176123                               │    18 │
    │ Wikipédia:Le Bistro/12 septembre 2015                    │    18 │
    │ Wikipedia:In the news/Candidates                         │    17 │
    │ Wikipedia:Requests for page protection                   │    17 │
    └──────────────────────────────────────────────────────────┴───────┘
    Retrieved 10 rows in 0.06s.
    

    还可以通过Http发送SQL

    curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8888/druid/v2/sql
    

    可以得到如下结果

    [
      {
        "page": "Wikipedia:Vandalismusmeldung",
        "Edits": 33
      },
      {
        "page": "User:Cyde/List of candidates for speedy deletion/Subpage",
        "Edits": 28
      },
      {
        "page": "Jeremy Corbyn",
        "Edits": 27
      },
      {
        "page": "Wikipedia:Administrators' noticeboard/Incidents",
        "Edits": 21
      },
      {
        "page": "Flavia Pennetta",
        "Edits": 20
      },
      {
        "page": "Total Drama Presents: The Ridonculous Race",
        "Edits": 18
      },
      {
        "page": "User talk:Dudeperson176123",
        "Edits": 18
      },
      {
        "page": "Wikipédia:Le Bistro/12 septembre 2015",
        "Edits": 18
      },
      {
        "page": "Wikipedia:In the news/Candidates",
        "Edits": 17
      },
      {
        "page": "Wikipedia:Requests for page protection",
        "Edits": 17
      }
    ]
    

    更多SQL示例

    时间查询

    SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted
    FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
    GROUP BY 1
    

    分组查询

    SELECT channel, page, SUM(added)
    FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
    GROUP BY channel, page
    ORDER BY SUM(added) DESC
    

    查询原始数据

    SELECT user, page
    FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 02:00:00' AND TIMESTAMP '2015-09-12 03:00:00'
    LIMIT 5
    

    定时查询

    也可以在dsql里操作

    dsql> EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
    
    │ DruidQueryRel(query=[{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"d0","outputType":"STRING"},"metric":{"type":"numeric","metric":"a0"},"threshold":10,"intervals":{"type":"intervals","intervals":["2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.001Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"count","name":"a0"}],"postAggregations":[],"context":{},"descending":false}], signature=[{d0:STRING, a0:LONG}]) │
    └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
    Retrieved 1 row in 0.03s.
    

    二、原生JSON查询

    Druid支持基于Json的查询

    {
      "queryType" : "topN",
      "dataSource" : "wikipedia",
      "intervals" : ["2015-09-12/2015-09-13"],
      "granularity" : "all",
      "dimension" : "page",
      "metric" : "count",
      "threshold" : 10,
      "aggregations" : [
        {
          "type" : "count",
          "name" : "count"
        }
      ]
    }
    

    把json粘贴到json 查询模式窗口

    Json查询是通过向router和broker发送请求

    curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d @<query_json_file>
    

    Druid提供了丰富的查询方式

    Aggregation查询

    Timeseries查询
    {
      "queryType": "timeseries",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "descending": "true",
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
          { "type": "or",
            "fields": [
              { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
              { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
            ]
          }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
        { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
      ],
      "postAggregations": [
        { "type": "arithmetic",
          "name": "sample_divide",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
            { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
          ]
        }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
    }
    
    TopN查询
    {
      "queryType": "topN",
      "dataSource": "sample_data",
      "dimension": "sample_dim",
      "threshold": 5,
      "metric": "count",
      "granularity": "all",
      "filter": {
        "type": "and",
        "fields": [
          {
            "type": "selector",
            "dimension": "dim1",
            "value": "some_value"
          },
          {
            "type": "selector",
            "dimension": "dim2",
            "value": "some_other_val"
          }
        ]
      },
      "aggregations": [
        {
          "type": "longSum",
          "name": "count",
          "fieldName": "count"
        },
        {
          "type": "doubleSum",
          "name": "some_metric",
          "fieldName": "some_metric"
        }
      ],
      "postAggregations": [
        {
          "type": "arithmetic",
          "name": "average",
          "fn": "/",
          "fields": [
            {
              "type": "fieldAccess",
              "name": "some_metric",
              "fieldName": "some_metric"
            },
            {
              "type": "fieldAccess",
              "name": "count",
              "fieldName": "count"
            }
          ]
        }
      ],
      "intervals": [
        "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
      ]
    }
    
    GroupBy查询
    {
      "queryType": "groupBy",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "dimensions": ["country", "device"],
      "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "carrier", "value": "AT&T" },
          { "type": "or",
            "fields": [
              { "type": "selector", "dimension": "make", "value": "Apple" },
              { "type": "selector", "dimension": "make", "value": "Samsung" }
            ]
          }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
        { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
      ],
      "postAggregations": [
        { "type": "arithmetic",
          "name": "avg_usage",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "fieldName": "data_transfer" },
            { "type": "fieldAccess", "fieldName": "total_usage" }
          ]
        }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
      "having": {
        "type": "greaterThan",
        "aggregation": "total_usage",
        "value": 100
      }
    }
    

    Metadata查询

    TimeBoundary 查询
    {
        "queryType" : "timeBoundary",
        "dataSource": "sample_datasource",
        "bound"     : < "maxTime" | "minTime" > # optional, defaults to returning both timestamps if not set
        "filter"    : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
    }
    
    SegmentMetadata查询
    {
      "queryType":"segmentMetadata",
      "dataSource":"sample_datasource",
      "intervals":["2013-01-01/2014-01-01"]
    }
    
    DatasourceMetadata查询
    {
        "queryType" : "dataSourceMetadata",
        "dataSource": "sample_datasource"
    }
    

    Search查询

    {
      "queryType": "search",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "searchDimensions": [
        "dim1",
        "dim2"
      ],
      "query": {
        "type": "insensitive_contains",
        "value": "Ke"
      },
      "sort" : {
        "type": "lexicographic"
      },
      "intervals": [
        "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
      ]
    }
    

    查询建议

    用Timeseries和TopN替代GroupBy

    取消查询

    DELETE /druid/v2/{queryId}
    
    curl -X DELETE "http://host:port/druid/v2/abc123"
    

    查询失败

    {
      "error" : "Query timeout",
      "errorMessage" : "Timeout waiting for task.",
      "errorClass" : "java.util.concurrent.TimeoutException",
      "host" : "druid1.example.com:8083"
    }
    

    三、CURL

    基于Http的查询

    curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8888/druid/v2?pretty
    

    四、客户端查询

    客户端查询是基于json的

    具体查看 https://druid.apache.org/libraries.html

    比如python查询的pydruid

    from pydruid.client import *
    from pylab import plt
    
    query = PyDruid(druid_url_goes_here, 'druid/v2')
    
    ts = query.timeseries(
        datasource='twitterstream',
        granularity='day',
        intervals='2014-02-02/p4w',
        aggregations={'length': doublesum('tweet_length'), 'count': doublesum('count')},
        post_aggregations={'avg_tweet_length': (Field('length') / Field('count'))},
        filter=Dimension('first_hashtag') == 'sochi2014'
    )
    df = query.export_pandas()
    df['timestamp'] = df['timestamp'].map(lambda x: x.split('T')[0])
    df.plot(x='timestamp', y='avg_tweet_length', ylim=(80, 140), rot=20,
            title='Sochi 2014')
    plt.ylabel('avg tweet length (chars)')
    plt.show()
    

    实时流式计算整理了Druid入门指南
    持续更新中~

    更多实时数据分析相关博文与科技资讯,欢迎关注 “实时流式计算”

    获取《Druid实时大数据分析》电子书,请在公号后台回复 “Druid”

  • 相关阅读:
    up_modembin.sh
    cpu主频信息
    计算机网络中通信协议都有哪些
    可导与连续的关系
    linux块设备驱动之实例
    CentOs 设置静态IP 方法
    phalcon:非空字段不能在beforeCreate赋值,可以改用beforeValidationOnCreate
    phalcon: crypt-encrypt/decrypt用法
    phalcon: 缓存片段,文件缓存,memcache缓存
    phalcon: 视图分层渲染,或包含其他页面
  • 原文地址:https://www.cnblogs.com/tree1123/p/12892928.html
Copyright © 2011-2022 走看看