zoukankan      html  css  js  c++  java
  • My journey introducing the data build tool (dbt) in project’s analytical stacks

    转自:https://www.lantrns.co/my-journey-introducing-the-data-build-tool-dbt-in-projects-analytical-stacks/

    Not sure I remember how, but I had the good luck a few weeks ago to stumble upon posts from Tristan Handy where he mentioned a tool his team built, simply called the Data Build Tool (dbt). I immediately took interest, as I was struggling with ETL (Extract, Transform, Load) at that moment, and had the extreme good fortune of having 2 clients who were willing to let me experiment with it, and even eventually use it for their own data warehouse.

    Starting out with such a tool can sometimes be a bit confusing, as you’re not sure how and where to start. You can envision how dbt will fit in your overall analytical stack, but you might need a little bit of guidance to get things started. Although the documentation is really good, I feel like it’s always a good thing to read about someone else’s experimenting before starting my own.

    So I decided to put together this little article that just goes over my own experience working with dbt, what I’ve picked up along the way, and how it all fits in our overall data engineering journey. I’m no power user of dbt and this may seem superficial to more experienced users, but the goal here is to help new users by sharing my own journey so far.

    Getting started

    As I mentioned, I have the good fortune to be working with clients who are really keen on adopting the technologies that will make their analytical stack more robust and scalable. And that was actually one problem I was facing with one project. What had been deployed so far worked perfectly for the needs that were specified, but as new requirements came in, it was obvious that our solution wasn’t scalable. We were using python scripts to do the ETL stuff, and that was neither agile, nor DRY (Don’t Repeat Yourself), nor fun. So we wanted to improve our ETL approach, and dbt just fell on our lap right there and then.

    Another project is with a startup that just started investing into their analytical infrastructure. This is pretty rare, but here I was with the opportunity to propose and build an analytical stack from scratch. Of course, I wanted dbt to be part of it.

    So this is how I started looking more seriously at what dbt had to offer, how it could be used in a production environment, how we could test it and how it could handle future needs.

    What’s interesting with dbt is how instead of doing ETL, you’re really taking advantage of the power of cloud-based data warehouses to do ETL instead.

    What we are doing is that instead of adopting a traditional ETL approach that requires the maintenance of scripts that does all 3 steps (extract, transform and load data towards the data warehouse), we are moving to a ELT approach where data is first moved (extract + load) to a data warehouse and transformation is done on the data warehouse directly.

    That has the benefit of reducing the complexity of scripts (we are now only dealing with pure SQL scripts), maintenance (all models are simple select statements that are ran through a tool called dbt – data build tool) and integrity (once business logic is added within a model, all further models down the chain will always reuse that same business logic).

    What is dbt and how does it fit

    I should first mention that dbt is the work of the good folks at Fishtown Analytics. I believe that dbt was built while the co-founders were working for another company and that they kinda kept dbt with them as they built their new company (I might be completely wrong on this one). But the point is that they developed that tool, made it publicly available and are building a great community around it. My hat’s off to them – we should all aspire to bring such excellent tools to our data engineering community.

    So, what is dbt?

    I like to think of it as SQL on steroids.

    Essentially, what we do is that we are building SQL models on top of source tables (coming from your production database, cloud services, website events, etc.) which are then referred to by further models, acting as building blocks towards a final model that’s part of a mart within your data warehouse.

    For example, let’s say I want to build a fact table for transactions. But transactions may be coming from different sources. I would build models that refer to one another until you get to that final table. Here’s an example taken from a project I’m working on…

    dbt - an exemple of data modeling

    dbt – an exemple of data modeling

    That graphic above might seem like a nightmare, but there is method to that madness (I think, lol). Essentially, you’re looking at models on the left that refer to tables in staging area, a bunch of transformations in the middle, and the final entity at the far right. We’ll talk more about this below, as well as how to generate that graph.

     

    HAVE YOU SEEN OUR PRODUCT ANALYTICS NEWSLETTER?

    OUR MISSION IS TO SHARE DATA ANALYTICS’ BEST PRACTICES AND NEW IDEAS WITH PRODUCT OWNERS, SO THEY CAN IMPROVE PRODUCT-MARKET FIT AND ACCELERATE THEIR DEVELOPMENT CYCLE.

    HAVE A LOOK HERE.

    Community

    Now that we have a high-level understanding of what dbt is and how it fits within your analytical stack, I think we should talk a little bit about its community, because it does make working with dbt such a positive experience.

    We should mention first that dbt is totally open source and you can go play inside the beast if that’s your thing.

    Once you’ve satisfied your curiosity, you should definitely consider joining the dbt community on Slack. This is a pretty active community composed of members who are really helpful to one another. And the creators of dbt are also always present and ready to help whenever you stumble upon a problem. But what’s pretty cool also is that the discussions will definitely make you a better data engineer, as some of the questions/answers and the resources shared do open up your horizons to many other facets on how to build a modern analytical stack.

    One other resource that’s starting to pick up steam is dbt’s Discourse environment. This is where you can find some pretty thorough answers to commonly asked questions. Or at least that’s what the idea of it is. This is still pretty new, but I think there’s a lot of potential with this and there are already some gems that are living there.

    And last but not least, dbt’s documentation. This of course will be a resource you refer to often as you start working with dbt, but also as you get more sophisticated with the use of the tool. There are solid documentations here, which pretty much answers all the questions you might have in regards to using the different features of this tool.

    So, all in all, I think that this community is as important to the success of dbt as the tool itself. You can always count on the documentation, the already answered questions in Discourse, or the availability of other community members on Slack, to point you in the right direction.

    First Steps

    Alright, now that we completed that grand tour of dbt, it’s about time we get our hands dirty.

    To get things started, I would definitely recommend going through that dbt tutorial provided in the documentation. You’ll learn how to install dbt and create a project.

    There are also useful guides, which are to be consulted as you discover some of those aspects of dbt. But I would definitely recommend going through the “Configuring models” guide, as this is essential skill to work with dbt.

    I would also recommend that you take 20 minutes of your time and listen to the “Project Walkthrough Video”, which actually confirmed to me that dbt was actually as easy as I suspected it to be. Up to that video, I was convinced that I was missing something that would make dbt complex to use, but nope. If you know SQL, you’ll be confortable with dbt right from the start.

    Hopefully, by this point you have a good understanding of dbt, how it fits within your architecture, how to set it up and have created your first project. Now what?

    Well, you would need some data of course. Without going over the details, because this is out of scope for this article, let’s assume you have a few source tables in a Redshift/PostgreSQL database.

    An example for me was to transform an export of Eventbrite registrations into a clean DW structure to be used in Tableau. My staging model would look something like this…

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    SELECT
    CAST('eventbrite' AS VARCHARAS source,
    CAST("Order #"AS integerAS registration_natural_key,
    CAST("Event ID"AS integerAS event_natural_key,
    CAST("Event Name"AS varchar(64)) AS event_name,
    CAST("Attendee #"AS integerAS attendee_natural_key,
    CAST("Email"AS varchar(128)) AS attendee_email,
    CAST("First Name"AS varchar(32)) AS attendee_first_name,
    CAST("Last Name"AS varchar(32)) AS attendee_last_name,
    CAST("Job Title"AS varchar(128)) AS attendee_job_title,
    CAST("Company"AS varchar(32)) AS company_name,
    CAST("Ticket Type"AS varchar(32)) AS tickets_type,
    CAST("Quantity"AS smallintAS tickets_quantity,
    CAST("Total Paid"AS decimal(6, 2)) AS transaction_amount,
    CAST("Currency"AS char(3)) AS currency,
    CAST("Order Date"AS timestampAS registration_datetime</pre>
     
    FROM
    eventbrite.registrations

    I could then derive entities such as dim_event…

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    WITH registration AS (
     
    SELECT FROM {{ ref('stg_eventbrite_registrations') }}
     
    )
    SELECT DISTINCT
    {{ dbt_utils.surrogate_key('registration.source''registration.event_natural_key''registration.event_name') }} AS surrogate_key,
    registration.source,
    registration.event_natural_key AS natural_key,
    registration.event_name AS name
     
    FROM registration

    dim_attendee…

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    WITH registration AS (
     
    SELECT FROM {{ ref('stg_eventbrite_registrations') }}
     
    )
     
    SELECT DISTINCT
    {{ dbt_utils.surrogate_key('registration.source''registration.attendee_natural_key''registration.attendee_email') }} AS surrogate_key,
    registration.source,
    registration.attendee_natural_key AS natural_key,
    registration.attendee_email AS email,
    registration.attendee_first_name AS first_name,
    registration.attendee_last_name AS last_name,
    registration.attendee_job_title AS job_title,
    registration.company_name
     
    FROM registration

    and fct_registration…

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    WITH registration AS (
     
    SELECT FROM {{ ref('stg_eventbrite_registrations') }}
     
    )
     
    SELECT
    {{ dbt_utils.surrogate_key('registration.source''registration.registration_natural_key''registration.attendee_natural_key''registration.event_natural_key') }} AS surrogate_key,
    registration.source,
    registration.registration_natural_key AS natural_key,
    registration.attendee_natural_key,
    registration.event_natural_key,
    registration.tickets_quantity,
    registration.tickets_type,
    registration.transaction_amount,
    registration.currency,
    registration.registration_datetime
     
    FROM registration

    SinterData provides a tool to view the relationships between our models, which when our data warehouse becomes complex, is a great tool to have at our disposal. For our current example, we end up with the following DAG (directed acyclic graph)

    dbt-an-exemple-of-data-modeling2

    dbt – another exemple of data modeling

    And that’s it, we just created a first version of our data warehouse.

    Good practices

    The example above is pretty basic, but as things become more complex, you’ll need to start picking up some best practices. Some of them, you can introduce them right away, some of them you’ll learn of as you go, and some of them will impose themselves upon you.

    Here’s a non-exhaustive list of “good practices” I picked up during my first few projects.

    Folder Structure

    How you structure your data warehouse and how you structure your dbt/models folder are 2 entirely different things. There was some good discussions around folder organization in dbt, but I think in the end it all depends on how you would like to work.

    Here’s what works for me. I have 4 main folders:

    • Staging – This is where I create my foundational models on top of the source tables that were exported to the data warehouse. I do a selection of fields, typecast, filter and rename the fields if necessary. You know, basic maintenance.
    • Transform – This is where the bulk of the work happens. It’s where staging models are being used and transformed to become the final entities. You would join sources, add aggregates, filter, etc.
    • Entities – This is where I have my final fact and dimension models.
    • FinalModels (I usually have a more business-focused name, such as ‘Marketplace’) – This is where I create the main queries that are being used by BI tools. There are discussions in regards to building your final model within dbt or directly in your BI tool, but I personally like to have the “big ones” here in dbt with all other queries.

    In regards to the Transform table, even here I have a specific folder structure underneath. Here’s how it looks:

    • EntityName
      • Source1
        • EntityName_Source1_1_FirstTransformation.sql
        • EntityName_Source1_2_SecondTransformation.sql
      • Source2
      • EntityName_1_MergeSources.sql

    As you can see, I like to keep it structured by source and sequence. This really helps whenever I have to introduce new features or when debugging.

    Testing Suite

    Here’s something that I didn’t use first, but that was always in the back of my mind.

    I actually was looking for a solution outside of dbt, because I had a specific test scenario that I wanted to do and didn’t know much about the testing framework in dbt to get it working with dbt itself. So I had asked my question on Slack directly and Tristan pointed me out to custom schema tests.

    And that kinda propelled me in the world of testing within the confines of dbt. The Data Build Tool already has a few tests that can be used right off the box, but there’s also the dbt-utils package which provides other tests that can be used in your own suite of testing.

    Right now, for one project, I have over 60 tests that I run in my test environment after doing changes and that I run automatically every hour in the prod environment. Whenever a test fails in prod, an alert is being sent to our Slack channel.

    Amazing! 

    Data Quality Exploration

    Once I’ve made changes in my dev environment, built my models and tested them, I like to use Tableau to actually visualize my Data Warehouse before it gets built in prod. Some sort of Quality Assessment.

    I’m not doing anything fancy at this point, but I do like to plot my main entities’ data sets to make sure that no weird anomalies were introduced during development.

    The testing suite that was discussed previously safeguards my data at the atomic level, but my Tableau QA workbook safeguards data quality at the aggregated level.

    Work Flow

    That brings me to my overall work flow. Here are the steps I take when introducing a new feature within a data warehouse:

    1. Research/Exploration into required fields and their sources
    2. Write up tests as acceptance criterias
    3. Build data models in dbt
    4. Testing in development environment
    5. Quality assessment with Tableau
    6. Commit code to git repository
    7. Pull new code in prod
    8. Build new ETL in prod
    9. Testing in prod environment
    10. Integrate new feature in final BI report
    11. Push to Sandbox for client approval

    Extras

    I did briefly touch on those previously, but there are a few extras worth knowing about:

    • dbt-utils – This package is provided by the creators of dbt and packs in a few extras that are worth looking at. I personally use only a few of what’s provided (such as the ‘surrogate_key’ SQL helper, as seen in the models above), but there is a lot to explore here and that might also be useful.
    • Sinter – Not sure how that service is associated to dbt and Fishtown Analytics, but their service is all about dbt. Essentially, they provide the means to run, schedule and manage your dbt jobs. Simple, yet really well made.
    • dbt Graph Viewer – Actually provided by the folks at Sinter, this little tool is something I use frequently. Not only can you see the whole DAG of your project, but you can also focus on only one model and look at their parent and/or children models.

    I should also quickly talk about macros, which are potentially a big thing, but that I haven’t gotten around to. Although I did just read that article by Claire Carole on design patterns which talks about maintaining your code DRY, using Jinja’s templating language and the introduction of macros. Anyways, still a whole lot of stuff to discover

    The Big Picture

    As you can see, dbt is more than just a simple tool. I like to think of it as a core tool within your data warehouse building arsenal.

    You can think of it as following the principles of the Unix Tools Philosophy, where the goal is to have an ensemble of tools that do one job only, but do it really well.

    But something else that is important in that philosophy. It is for those tools to easily interact together (by the use of the pipe “|” symbol which pipes output from one tool as input to another tool). That dbt keeps really close to SQL is what makes it so easy to adopt, but also so easy to interact with many other tools. It’s a common language that allows for outputs to become inputs for another tool.

    But also, the choice of SQL is one that assures interoperability within a modular BI architecture. As is argued by Ajay Kulkarni, co-founder of TimeScaleDB:

    “We believe that SQL has become the universal interface for data analysis.”
    (Source: Why SQL is beating NoSQL, and what this means for the future of data)

    There is a lot that’s been said and written about the modern BI architecture, with cloud-based services as their underlying infrastructure and modularity as a core principle. Having tools such as dbt adds to that philosophy by being itself modular (and replacable), but to actually be very flexible so that it can work with many different inputs (data sources) and outputs (data warehouses).

    Here’s to your dbt journey, to the future development of dbt and the community that is being fostered around that project!

  • 相关阅读:
    程序员创业第二步:五个角度打造企业世界级竞争力
    开源题材征集 + MVC&EF Core 完整教程小结
    MVC+EF Core 完整教程20--tag helper详解
    MVC5+EF6 入门完整教程13 -- 动态生成多级菜单
    加载驱动三种的方法
    Caused by: javax.el.PropertyNotFoundException: Property [userName] not found on type [java.lang.String]
    Eclipse 中的 insert spaces for tabs 设置方法
    Windows中mysql的配置文件,解决字符集编码问题,统一使用utf8字符集
    ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated ......问题报错解决办法!
    Java中的类三种类加载器+双气委派模型
  • 原文地址:https://www.cnblogs.com/rongfengliang/p/10981250.html
Copyright © 2011-2022 走看看