  • Amazon Redshift and Massively Parallel Processing

    Today, Yelp held a tech talk at Columbia University about the data warehouse that Yelp has adopted.

    Yelp uses Amazon Redshift as its data warehouse.

    Redshift has several key features:

    1. Massively Parallel Processing

    2. SQL access

    3. Column-based Datastore

    Benefits are:

    1. Data is structured, accessible and well documented.
    2. Architecture allows for easy extensibility and sharing across teams.
    3. Allows use of the entire SQL-compatible tool ecosystem.

    Details:

    Massively Parallel Processing (MPP)

    Traditional big-data stacks typically rely on Hadoop + MapReduce. MapReduce's native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL (Structured Query Language). You can refer to the details here.
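    To make the contrast concrete, here is a minimal sketch that computes the same per-key total twice: once with hand-written map and reduce functions, and once with a single SQL statement. It uses Python and the single-node `sqlite3` module purely for illustration, so it shows the difference in programming model, not an actual Hadoop or MPP deployment. The `reviews` table and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (city, review_count) pairs.
rows = [("NYC", 3), ("SF", 5), ("NYC", 4), ("SF", 1), ("LA", 2)]


# MapReduce style: the programmer writes the map and reduce logic by hand.
def map_phase(records):
    for city, count in records:
        yield city, count  # emit a (key, value) pair


def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value  # combine all values that share a key
    return dict(totals)


mapreduce_result = reduce_phase(map_phase(rows))

# SQL style: the aggregation is declared and the engine decides how to run it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (city TEXT, review_count INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute("SELECT city, SUM(review_count) FROM reviews GROUP BY city")
)
conn.close()

print(mapreduce_result == sql_result)  # True
```

    Both approaches give the same totals. The SQL version states only what result is wanted, which is what lets existing SQL tools work against an MPP warehouse.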

    Below is the structure for implementing MPP.

    Similarly, data is distributed across the segment databases to achieve both data and processing parallelism. This is done by creating a database table with a DISTRIBUTED BY clause; with this clause, rows are automatically spread across the segment databases. DISTRIBUTED BY is the syntax used by Greenplum-style MPP databases; Redshift expresses the same idea with DISTSTYLE and DISTKEY. (reference: Introduction to MPP)
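    The idea of distributing rows by a key and then aggregating in parallel can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular MPP product is implemented: the number of segments, the CRC32 hash, and the thread pool that stands in for independent segment databases are all illustrative choices.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical number of segment databases


def segment_for(key: str) -> int:
    # A stable hash, so the same key always lands on the same segment.
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows, key_index=0):
    # Route each row to a segment based on its distribution key.
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


def local_sum(segment_rows):
    # Each segment aggregates only the rows it holds.
    return sum(amount for _, amount in segment_rows)


rows = [(f"user{i}", i) for i in range(100)]
segments = distribute(rows)

# Threads stand in for segments that would each run on their own node.
with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
    partial_sums = list(pool.map(local_sum, segments))

total = sum(partial_sums)  # the coordinator merges the partial results
print(total)  # 4950
```

    Because every row with a given key lives on exactly one segment, each segment can work on its share without coordinating with the others, and only the small partial results need to be merged.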

    Typical query statement in MPP

    Column-based Datastore

    Enables sparse table definitions
    Enables compact storage
    Improves scanning/filtering

    (Benefits: wiki)
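    One reason a column store can be compact is that the values of a single column sit next to each other and often repeat, so they compress well. The sketch below uses run-length encoding as an example of such a scheme. The `state_column` data is hypothetical, and the talk did not say which encodings Yelp uses.

```python
from itertools import groupby


def rle_encode(column):
    # Collapse runs of equal adjacent values into (value, run_length) pairs.
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(encoded):
    # Expand each (value, run_length) pair back into repeated values.
    return [value for value, length in encoded for _ in range(length)]


# Hypothetical "state" column, sorted so that equal values are adjacent.
state_column = ["CA"] * 5 + ["NY"] * 3 + ["TX"] * 2
encoded = rle_encode(state_column)

print(encoded)  # [('CA', 5), ('NY', 3), ('TX', 2)]
```

    Ten stored values shrink to three pairs here. A row store cannot apply the same trick as easily, because adjacent values on disk belong to different columns.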

    Column-based Datastore

    1. Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
    2. Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
    3. Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
    4. Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
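    The trade-offs in the list above can be illustrated with a small in-memory sketch. The table, its column names, and the values are hypothetical. Python lists stand in for on-disk layouts, so the example shows only which data each access pattern has to touch, not real I/O cost.

```python
# Hypothetical table with three columns: id, city, stars.
row_store = [
    (1, "NYC", 4),
    (2, "SF", 5),
    (3, "NYC", 3),
]

# The same data laid out by column: one contiguous list per column.
column_store = {
    "id": [r[0] for r in row_store],
    "city": [r[1] for r in row_store],
    "stars": [r[2] for r in row_store],
}

# Aggregate over a single column.
# Row layout: every full row is visited even though only "stars" is needed.
avg_from_rows = sum(row[2] for row in row_store) / len(row_store)
# Column layout: only the "stars" list is read.
avg_from_columns = sum(column_store["stars"]) / len(column_store["stars"])

# Fetch one whole record.
# Row layout: a single lookup returns all fields at once.
record_from_rows = row_store[1]
# Column layout: one lookup per column is needed to reassemble the record.
record_from_columns = tuple(
    column_store[name][1] for name in ("id", "city", "stars")
)

print(avg_from_rows, avg_from_columns)  # 4.0 4.0
print(record_from_rows == record_from_columns)  # True
```

    The aggregate favors the column layout and the single-record fetch favors the row layout, matching the points in the list.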

    In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).

  • Original article: https://www.cnblogs.com/ireneyanglan/p/4856666.html