Guide to Data Warehousing
A short but comprehensive overview of different data modeling techniques

In this guide, I'll try to cover several methodologies, explain their differences and when and why (in my view) one is better than the other, and maybe introduce some tools you can use when modeling a DWH (Data Warehouse) or EDW (Enterprise Data Warehouse).
My story
I started my BI (Business Intelligence) career in a small company that consulted other companies on how to improve their processes or just helped them build a BI system so that they could make the decisions themselves.
How I imagined working during that time (almost straight out of university): I would come to work, get my equipment, someone would explain what I had to do, and I would start doing it. I was quite surprised when I came to work on the first day: I got the book "The Data Warehouse Toolkit" by Ralph Kimball, and my manager told me to read it and learn. So, I started reading and trying to understand it. It took me some time. I don't remember how long exactly, but maybe a week or so to get a proper understanding (at least in my view).
Glossary
Terms you will find in this article that might need an explanation:
Staging area: a copy of raw/source data, stored on the same database/machine as our DWH
Transformation area: staging data after transformation, prepared to be loaded into the DWH
Fact table: transactional or event-based data with some measures, e.g., sale information, warehouse product movements.
Dimension table: all information about a particular entity in one place, e.g., all product-related information, or customer information.
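To make the fact/dimension split concrete, here is a tiny, hypothetical illustration (the column names are mine, not from any specific system): a fact row carries measures plus keys pointing into dimensions, while a dimension row carries the descriptive attributes of one entity.

```python
# Hypothetical example: a fact row vs. a dimension row.
fact_sale = {
    "product_key": 42,   # foreign key to the product dimension
    "customer_key": 7,   # foreign key to the customer dimension
    "quantity": 3,       # measure
    "amount": 59.97,     # measure
}

dim_product = {
    "product_key": 42,
    "product_name": "Espresso beans 1kg",
    "category": "Coffee",
}
```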
Methodologies covered
Kimball Methodology
I am starting with the technique that I learned first, mostly because it's easy to comprehend.
It was created by Ralph Kimball and his colleagues (hence the name). This approach is considered a bottom-up design approach. To a more general audience, it may be more familiar under the name dimensional modeling. Personally, what I liked about this modeling is that it is easy to design and analyze. It was designed to answer specific questions or help understand particular areas (e.g., HR, Sales). This approach allows for fast development, but we lose some flexibility: usually, we need to rebuild the DWH (or some of its parts) to apply changes. Most BI reporting tools understand this model, and you can quickly drag and drop reports (e.g., MS SQL Server Analysis Services, Tableau).
Kimball flow:
Load data to staging from source systems
Transform the data
Load to a Star/Snowflake schema (see the sketch right after this list).
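Here is a minimal sketch of that flow, assuming a single in-memory SQLite database and made-up table names; it is only meant to show the shape of the steps, not a production pipeline.

```python
import sqlite3

# A toy version of the Kimball flow in one in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# 1. Load raw source data into the staging area (a 1:1 copy of the source).
conn.execute("CREATE TABLE stg_sales (sale_id INT, product_name TEXT, amount REAL, sold_at TEXT)")
conn.execute("INSERT INTO stg_sales VALUES (1, 'Espresso beans', 19.99, '2020-05-01')")

# 2. Transform: build a product dimension with generated surrogate keys.
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT)")
conn.execute("INSERT INTO dim_product (product_name) SELECT DISTINCT product_name FROM stg_sales")

# 3. Load the star schema: the fact table resolves surrogate keys via the dimension.
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT s.sale_id, d.product_key, s.amount, s.sold_at
    FROM stg_sales s
    JOIN dim_product d ON d.product_name = s.product_name
""")
print(conn.execute("SELECT * FROM fact_sales").fetchall())  # [(1, 1, 19.99, '2020-05-01')]
```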
I'll cover Star and Snowflake schemas in more detail in the sections below.
Star Schema
In a Star schema, we have a fact table and dimension tables, with all of the dimension foreign keys sitting in the fact table. You can find more details in the Wikipedia article on star schemas. In a nutshell, a single fact table sits in the middle and references each dimension table directly.
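As a rough sketch of what that can look like (all table and column names here are hypothetical, not taken from any real system), here is a small sales star schema in Python with SQLite:

```python
import sqlite3

# Hypothetical sales star schema: one central fact table with a foreign key
# to each dimension; every dimension is reachable with a single join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_shop    (shop_key INTEGER PRIMARY KEY, shop_name TEXT, city TEXT);

    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        shop_key    INTEGER REFERENCES dim_shop(shop_key),
        quantity    INTEGER,
        amount      REAL
    );
""")

# Filtering or grouping by a dimension attribute needs only one join.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
""").fetchall()
```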
Pros:
Faster if we want to filter on some dimensions (no need for multiple chained joins)
Simple modeling (no normalization needed)
Cons:
We need more space to store all foreign dimension keys in the fact table.
A bigger hassle to add an additional dimension key to the fact table (if it's enormous, updates will take more time)
Snowflake Schema
A Snowflake schema is a Star schema with more layers. For example, we have an address in our shop dimension. We can extract it into an address dimension with an address_PK, which dim_shop would then reference. You can read up on the snowflake schema in the Wikipedia article. In a simplified view, the dimensions themselves branch out into further normalized tables.
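Continuing the hypothetical example from the Star schema sketch, snowflaking the shop dimension could look roughly like this:

```python
import sqlite3

# Hypothetical snowflaked shop dimension: the address attributes live in
# their own table, and dim_shop points to it through address_PK.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_address (
        address_PK INTEGER PRIMARY KEY,
        street     TEXT,
        city       TEXT,
        country    TEXT
    );
    CREATE TABLE dim_shop (
        shop_key   INTEGER PRIMARY KEY,
        shop_name  TEXT,
        address_PK INTEGER REFERENCES dim_address(address_PK)
    );
    CREATE TABLE fact_sales (
        shop_key INTEGER REFERENCES dim_shop(shop_key),
        amount   REAL
    );
""")

# Filtering sales by country now needs two joins instead of one.
sales_by_country = conn.execute("""
    SELECT a.country, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_shop    s ON s.shop_key   = f.shop_key
    JOIN dim_address a ON a.address_PK = s.address_PK
    GROUP BY a.country
""").fetchall()
```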
Pros:
Less hassle to add additional layers or groupings on top of existing dimensions
Less storage
Cons:
Harder to filter values (more joins might be needed)
I'm not saying that one model is better than the other; it all depends on the use cases, available resources, and the end goal. Weigh all of the options and think about whether you'll be adding more groupings to the dimensions, and whether you want to add more atomic layers later (which means updating the fact table with foreign keys to new dimension tables).
Inmon Methodology
I haven't used it much in practice, so this will be a more theoretical overview.
This methodology was created by Bill Inmon and is considered a top-down approach. We have to have the whole picture and model it according to 3NF (Third Normal Form), which makes this approach more complicated than Kimball's. The downside is that you need skillful people to design this data model and to integrate all of the subject areas into it. Compared to Kimball's, it takes more time to get it running, but it's way easier to maintain, and it's more of an enterprise-grade approach.
Inmon Flow:
Load data from source to stage
Add data to a 3NF-compliant EDW (Enterprise Data Warehouse)
Build Data marts on top of the EDW (a sketch follows below)
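To give a feel for the 3NF part, here is a minimal, hypothetical sketch (the table and column names are mine): normalized customer and order data in the EDW, with a small data mart derived from it as a view.

```python
import sqlite3

# Hypothetical fragment of a 3NF EDW plus a data mart built on top of it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 3NF: each entity lives in its own table, attributes are not duplicated.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address  (address_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           city TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           amount REAL);

    -- A simple data mart for the sales subject area, derived from the EDW.
    CREATE VIEW mart_sales_by_city AS
    SELECT a.city, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    JOIN address  a ON a.customer_id = c.customer_id
    GROUP BY a.city;
""")
```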
To compare these two, it all comes down to the company/business area data we want to model. From my personal experience, my first project was for a retail client. We built the model the Kimball way, with a Star schema, because we knew the requirements and their data questions. If they had wanted more integrations in their DWH, like joining in the employees who were working on those days or inventory management data, the Inmon approach would have been more appropriate.
In my opinion, if a company is small and wants to track and improve only particular areas, it's usually easier and faster to go with Kimball's approach.
Data Vault
Moving from a small company to a big corporation took its toll on me as well. I came to understand that sometimes we need a better and more simplified EDW. That's when I started to work with Data Vault. As far as I see it, it's a combination of Kimball's Star schema and Inmon's methodology: the best of both worlds. If you want a more detailed view, check out Dan Linstedt's site; he's the creator of this approach.
I'm going to cover the most essential components of a Data Vault.
Hubs
A hub is a collection of all of the distinct business entities. For example, in an account hub we'd have account_ID (the business key), load_date, and src_name, so we can track where the record originally came from and when it was loaded; if we need one, we can also add a surrogate key generated from the business key.
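A minimal sketch of such a hub (the hashed surrogate key column is my assumption; it is common in Data Vault but not required by the description above):

```python
import sqlite3

# Hypothetical account hub: one row per distinct business key,
# plus audit columns telling us when and from where it was loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_account (
        account_hash_key TEXT PRIMARY KEY,  -- surrogate key derived from the business key
        account_id       TEXT UNIQUE,       -- the business key from the source system
        load_date        TEXT,              -- when the record was first loaded
        src_name         TEXT               -- which source it originally came from
    )
""")
```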
Links
Not to sound funny, but a link is a link between different hubs. For example, we have employees, and they belong to teams. Teams and employees have separate hubs, so we can have a team_employee_link, which would have its own team_employee_link key, team_id, employee_id, load_date, and src_name.
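Roughly, under the same assumptions as the hub sketch:

```python
import sqlite3

# Hypothetical team_employee_link: its own key plus references to the
# team and employee hubs, with the same audit columns as the hubs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE team_employee_link (
        team_employee_link_key TEXT PRIMARY KEY,
        team_id     TEXT,   -- business key of the team hub
        employee_id TEXT,   -- business key of the employee hub
        load_date   TEXT,
        src_name    TEXT
    )
""")
```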
Satellites
A satellite is essentially a slowly changing dimension of a particular entity's attributes. For example, we have a product as an entity, with multiple product information columns such as name and price. We load this information as a slowly changing dimension with product_id, open_date, close_date, is_deleted, product_name, and product_price. Capturing all the changes lets us re-create snapshots of the data and see how it evolved.
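A hypothetical sketch of that satellite, with one price change captured as a new row:

```python
import sqlite3

# Hypothetical product satellite: every attribute change gets its own row,
# so the full history can be replayed (slowly changing dimension style).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_product (
        product_id    TEXT,
        open_date     TEXT,    -- when this version became valid
        close_date    TEXT,    -- when it was superseded (NULL for the current version)
        is_deleted    INTEGER,
        product_name  TEXT,
        product_price REAL
    )
""")

# A price change closes the old version and opens a new one.
conn.execute("INSERT INTO sat_product VALUES ('P1', '2020-01-01', '2020-03-01', 0, 'Espresso beans', 18.99)")
conn.execute("INSERT INTO sat_product VALUES ('P1', '2020-03-01', NULL,         0, 'Espresso beans', 19.99)")
```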
Besides these basic entities, we also have Satellite Links and Transactional Links. I won't go into details for these; if you want to know more, check Dan Linstedt's site or Wikipedia.
Data Lake
Coming from a simple data background, I eventually came across the term Data Lake: a store for all the information (structured and unstructured) we have. Nowadays, terms like Big Data and Data Lake are getting tons of attention. I didn't understand why we needed to store that much data of different types until I started working with vast amounts of it. Data is today's and the future's gold. In my experience, for data-driven companies a data lake is almost a must: store as much as you can, then do the analysis and look for insights.
Storing data from multiple sources in raw formats comes with its own cost.
If you don't keep tabs on your data lake and manage it properly, it might become a data swamp.
From my point of view, it's an additional layer before creating your EDW: data engineers work on bringing raw data into the data lake and building an EDW on top of it, while analysts can work with and rely on the pre-processed and cleansed data.
Lakehouse
Databricks introduced this term at the end of January 2020. The idea here is that we do everything straight on the source data. Many ML/AI tools are designed for unstructured data (texts, images, videos, sound), and processing it into a DWH or Data Lake takes time and, for sure, will not be near real-time. I'm a bit skeptical about this approach. It feels like creating a vast data swamp and letting people drown in it. Too much data that is not managed and cleansed might lead to false assumptions and will not serve as one source of truth in a big company. At least for now, while streaming is not that big of a thing, I think it's worth waiting for this methodology to mature, unless you're a tech startup that wants to be way ahead of the competition and use bleeding-edge technology. But that's just my assumption.
Summary
In my opinion, all of these methodologies will co-exist for a long time. It all depends on the company and its use case!
If it's just a small or medium business, and the old-school Data Warehousing approach satisfies your needs, why use something that might not bring you any more profit?
If it's a large enterprise, a data lake is probably a must if you want to stay competitive and provide excellent service to your customers. In my opinion, you will still have to create a pre-processed layer with more aggregated/cleansed data, which would serve as some sort of single source of truth for reports. The best fit here (in my opinion) is a Star or Snowflake schema: it lets us look for general patterns and trends more quickly. In case we need to do a deep dive and the DWH/Data mart is too aggregated, we can always go to the data lake and check the raw data.
Or maybe your employees are tech-savvy, you're a next-gen tech startup, and you want to beat your competition by delivering insights with bleeding-edge technologies; perhaps you need a lakehouse? Unfortunately, I haven't seen real use cases yet, so I can't say how it would work together with the old-school methods for cleaner and smoother analysis.
Reference List
[2] https://en.wikipedia.org/wiki/Star_schema
[3] https://en.wikipedia.org/wiki/Snowflake_schema
[4] https://danlinstedt.com/solutions-2/data-vault-basics/
[5] https://en.wikipedia.org/wiki/Data_vault_modeling
[6] https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html