Guide to Data Warehousing
A short but comprehensive overview of different data modeling techniques

In this guide, I'll try to cover several methodologies, explain their differences and when and why (in my view) one is better than the other, and maybe introduce some tools you can use when modeling a DWH (Data Warehouse) or EDW (Enterprise Data Warehouse).
My story
I started my BI (Business Intelligence) career in a small company that consulted other companies on how to improve their processes or just helped them build a BI system so that they could make the decisions themselves.
How I imagined working during that time (almost straight out of university): I would come to work, get my equipment, someone would explain what I had to do, and I would start doing it. I was quite surprised when I came to work on the first day: I got the book "The Data Warehouse Toolkit" by Ralph Kimball, and my manager told me to read it and learn. So, I started reading and trying to understand it. It took me some time. I don't remember how long exactly, but maybe a week or so to get a proper understanding (at least in my view).
Glossary
Terms you will find in this article that might need an explanation:
Staging area: a copy of raw/source data, stored on the same database/machine as our DWH
Transformation area: staging data after transformation, prepared to be loaded into the DWH
Fact table: transactional or event-based data with some measures, e.g., sale information, warehouse product movements.
Dimension table: all information about a particular entity in one place, e.g., all product-related information, or customer information.
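To make the fact/dimension split concrete, here is a tiny, hypothetical illustration (the column names are mine, not from any specific system): a fact row carries measures plus keys pointing into dimensions, while a dimension row carries the descriptive attributes of one entity.

```python
# Hypothetical example: a fact row vs. a dimension row.
fact_sale = {
    "product_key": 42,   # foreign key to the product dimension
    "customer_key": 7,   # foreign key to the customer dimension
    "quantity": 3,       # measure
    "amount": 59.97,     # measure
}

dim_product = {
    "product_key": 42,
    "product_name": "Espresso beans 1kg",
    "category": "Coffee",
}
```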
Methodologies covered
Kimball Methodology
I am starting with the technique that I learned first, mostly because it's easy to comprehend.
It was created by Ralph Kimball and his colleagues (hence the name). This approach is considered a bottom-up design approach. To a more general audience, it may be more familiar under the name dimensional modeling. Personally, what I liked about this modeling is that it is easy to design and analyze. It was designed to answer specific questions or help understand particular areas (e.g., HR, Sales). This approach allows for fast development, but we lose some flexibility: usually, we need to rebuild the DWH (or some of its parts) to apply changes. Most BI reporting tools understand this model, and you can quickly drag and drop reports (e.g., MS SQL Server Analysis Services, Tableau).
Kimball flow:
Load data to staging from source systems
Transform the data
Load to a Star/Snowflake schema (see the sketch right after this list).
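Here is a minimal sketch of that flow, assuming a single in-memory SQLite database and made-up table names; it is only meant to show the shape of the steps, not a production pipeline.

```python
import sqlite3

# A toy version of the Kimball flow in one in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# 1. Load raw source data into the staging area (a 1:1 copy of the source).
conn.execute("CREATE TABLE stg_sales (sale_id INT, product_name TEXT, amount REAL, sold_at TEXT)")
conn.execute("INSERT INTO stg_sales VALUES (1, 'Espresso beans', 19.99, '2020-05-01')")

# 2. Transform: build a product dimension with generated surrogate keys.
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT)")
conn.execute("INSERT INTO dim_product (product_name) SELECT DISTINCT product_name FROM stg_sales")

# 3. Load the star schema: the fact table resolves surrogate keys via the dimension.
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT s.sale_id, d.product_key, s.amount, s.sold_at
    FROM stg_sales s
    JOIN dim_product d ON d.product_name = s.product_name
""")
print(conn.execute("SELECT * FROM fact_sales").fetchall())  # [(1, 1, 19.99, '2020-05-01')]
```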
I'll cover Star and Snowflake schemas in more detail in the sections below.
Star Schema
In a Star schema, we have a fact table and dimension tables, with all of the dimension foreign keys sitting in the fact table. You can find more details in the Wikipedia article on star schemas. In a nutshell, a single fact table sits in the middle and references each dimension table directly.
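As a rough sketch of what that can look like (all table and column names here are hypothetical, not taken from any real system), here is a small sales star schema in Python with SQLite:

```python
import sqlite3

# Hypothetical sales star schema: one central fact table with a foreign key
# to each dimension; every dimension is reachable with a single join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_shop    (shop_key INTEGER PRIMARY KEY, shop_name TEXT, city TEXT);

    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        shop_key    INTEGER REFERENCES dim_shop(shop_key),
        quantity    INTEGER,
        amount      REAL
    );
""")

# Filtering or grouping by a dimension attribute needs only one join.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
""").fetchall()
```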
Pros:
Faster if we want to filter on some dimensions (no need for multiple chained joins)
Simple modeling (no normalization needed)
Cons:
We need more space to store all foreign dimension keys in the fact table.
A bigger hassle to add an additional dimension key to the fact table (if it's enormous, updates will take more time)
Snowflake Schema
A Snowflake schema is a Star schema with more layers. For example, we have an address in our shop dimension. We can extract it into an address dimension with an address_PK, which dim_shop would then reference. You can read up on the snowflake schema in the Wikipedia article. In a simplified view, the dimensions themselves branch out into further normalized tables.
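Continuing the hypothetical example from the Star schema sketch, snowflaking the shop dimension could look roughly like this:

```python
import sqlite3

# Hypothetical snowflaked shop dimension: the address attributes live in
# their own table, and dim_shop points to it through address_PK.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_address (
        address_PK INTEGER PRIMARY KEY,
        street     TEXT,
        city       TEXT,
        country    TEXT
    );
    CREATE TABLE dim_shop (
        shop_key   INTEGER PRIMARY KEY,
        shop_name  TEXT,
        address_PK INTEGER REFERENCES dim_address(address_PK)
    );
    CREATE TABLE fact_sales (
        shop_key INTEGER REFERENCES dim_shop(shop_key),
        amount   REAL
    );
""")

# Filtering sales by country now needs two joins instead of one.
sales_by_country = conn.execute("""
    SELECT a.country, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_shop    s ON s.shop_key   = f.shop_key
    JOIN dim_address a ON a.address_PK = s.address_PK
    GROUP BY a.country
""").fetchall()
```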
Pros:
Less hassle to add additional layers or groupings on top of existing dimensions
Less storage
Cons:
Harder to filter values (more joins might be needed)
I'm not saying that one model is better than the other; it all depends on the use cases, available resources, and the end goal. Weigh all of the options and think about whether you'll be adding more groupings to the dimensions, and whether you want to add more atomic layers later (which means updating the fact table with foreign keys to new dimension tables).
Inmon Methodology
I haven't used it much in practice, so this will be a more theoretical overview.
This methodology was created by Bill Inmon and is considered a top-down approach. We have to have the whole picture and model it according to 3NF (Third Normal Form), which makes this approach more complicated than Kimball's. The downside is that you need skillful people to design this data model and to integrate all of the subject areas into it. Compared to Kimball's, it takes more time to get it running, but it's way easier to maintain, and it's more of an enterprise-grade approach.
Inmon Flow:
Load data from source to stage
Add data to a 3NF-compliant EDW (Enterprise Data Warehouse)
Build Data marts on top of the EDW (a sketch follows below)
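To give a feel for the 3NF part, here is a minimal, hypothetical sketch (the table and column names are mine): normalized customer and order data in the EDW, with a small data mart derived from it as a view.

```python
import sqlite3

# Hypothetical fragment of a 3NF EDW plus a data mart built on top of it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 3NF: each entity lives in its own table, attributes are not duplicated.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address  (address_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           city TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           amount REAL);

    -- A simple data mart for the sales subject area, derived from the EDW.
    CREATE VIEW mart_sales_by_city AS
    SELECT a.city, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    JOIN address  a ON a.customer_id = c.customer_id
    GROUP BY a.city;
""")
```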
To compare these two, it all comes down to the company/business area data we want to model. From my personal experience, my first project was for a retail client. We built the model the Kimball way, with a Star schema, because we knew the requirements and their data questions. If they had wanted more integrations in their DWH, like joining in the employees who were working on those days or inventory management data, the Inmon approach would have been more appropriate.
In my opinion, if a company is small and wants to track and improve only particular areas, it's usually easier and faster to go with Kimball's approach.
Data Vault
Moving from a small company to a big corporation took its toll on me as well. I came to understand that sometimes we need a better and more simplified EDW. That's when I started to work with Data Vault. As far as I see it, it's a combination of Kimball's Star schema and Inmon's methodology: the best of both worlds. If you want a more detailed view, check out Dan Linstedt's site; he's the creator of this approach.
I'm going to cover the most essential components of a Data Vault.
Hubs
A hub is a collection of all of the distinct business entities. For example, in an account hub we'd have account_ID (the business key), load_date, and src_name, so we can track where the record originally came from and when it was loaded; if we need one, we can also add a surrogate key generated from the business key.
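A minimal sketch of such a hub (the hashed surrogate key column is my assumption; it is common in Data Vault but not required by the description above):

```python
import sqlite3

# Hypothetical account hub: one row per distinct business key,
# plus audit columns telling us when and from where it was loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_account (
        account_hash_key TEXT PRIMARY KEY,  -- surrogate key derived from the business key
        account_id       TEXT UNIQUE,       -- the business key from the source system
        load_date        TEXT,              -- when the record was first loaded
        src_name         TEXT               -- which source it originally came from
    )
""")
```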
Links
Not to sound funny, but a link is a link between different hubs. For example, we have employees, and they belong to teams. Teams and employees have separate hubs, so we can have a team_employee_link, which would have its own team_employee_link key, team_id, employee_id, load_date, and src_name.
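Roughly, under the same assumptions as the hub sketch:

```python
import sqlite3

# Hypothetical team_employee_link: its own key plus references to the
# team and employee hubs, with the same audit columns as the hubs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE team_employee_link (
        team_employee_link_key TEXT PRIMARY KEY,
        team_id     TEXT,   -- business key of the team hub
        employee_id TEXT,   -- business key of the employee hub
        load_date   TEXT,
        src_name    TEXT
    )
""")
```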
Satellites
A satellite is essentially a slowly changing dimension of a particular entity's attributes. For example, we have a product as an entity, with multiple product information columns such as name and price. We load this information as a slowly changing dimension with product_id, open_date, close_date, is_deleted, product_name, and product_price. Capturing all the changes lets us re-create snapshots of the data and see how it evolved.
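A hypothetical sketch of that satellite, with one price change captured as a new row:

```python
import sqlite3

# Hypothetical product satellite: every attribute change gets its own row,
# so the full history can be replayed (slowly changing dimension style).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_product (
        product_id    TEXT,
        open_date     TEXT,    -- when this version became valid
        close_date    TEXT,    -- when it was superseded (NULL for the current version)
        is_deleted    INTEGER,
        product_name  TEXT,
        product_price REAL
    )
""")

# A price change closes the old version and opens a new one.
conn.execute("INSERT INTO sat_product VALUES ('P1', '2020-01-01', '2020-03-01', 0, 'Espresso beans', 18.99)")
conn.execute("INSERT INTO sat_product VALUES ('P1', '2020-03-01', NULL,         0, 'Espresso beans', 19.99)")
```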
Besides these basic entities, we also have Satellite Links and Transactional Links. I won't go into details for these; if you want to know more, check Dan Linstedt's site or Wikipedia.
Data Lake
Coming from a simple data background, I eventually came across the term Data Lake: a store for all the information (structured and unstructured) we have. Nowadays, terms like Big Data and Data Lake are getting tons of attention. I didn't understand why we needed to store that much data of different types until I started working with vast amounts of it. Data is today's and the future's gold. In my experience, for data-driven companies a data lake is almost a must: store as much as you can, then do the analysis and look for insights.
Storing data from multiple sources in raw formats comes with its own cost.
If you don't keep tabs on your data lake and manage it properly, it might become a data swamp.
From my point of view, it's an additional layer before creating your EDW: data engineers work on bringing raw data into the data lake and building an EDW on top of it, while analysts can work with and rely on the pre-processed and cleansed data.
Lakehouse
Databricks introduced this term at the end of January 2020. The idea here is that we do everything straight on the source data. Many ML/AI tools are designed for unstructured data (texts, images, videos, sound), and processing it into a DWH or Data Lake takes time and, for sure, will not be near real-time. I'm a bit skeptical about this approach. It feels like creating a vast data swamp and letting people drown in it. Too much data that is not managed and cleansed might lead to false assumptions and will not serve as one source of truth in a big company. At least for now, while streaming is not that big of a thing, I think it's worth waiting for this methodology to mature, unless you're a tech startup that wants to be way ahead of the competition and use bleeding-edge technology. But that's just my assumption.
Summary
In my opinion, all of these methodologies will co-exist for a long time. It all depends on the company and its use case!
If it's just a small or medium business, and the old-school Data Warehousing approach satisfies your needs, why use something that might not bring you any more profit?
If it's a large enterprise, a data lake is probably a must if you want to stay competitive and provide excellent service to your customers. In my opinion, you will still have to create a pre-processed layer with more aggregated/cleansed data, which would serve as some sort of single source of truth for reports. The best fit here (in my opinion) is a Star or Snowflake schema: it lets us look for general patterns and trends more quickly. In case we need to do a deep dive and the DWH/Data mart is too aggregated, we can always go to the data lake and check the raw data.
Or maybe your employees are tech-savvy, you're a next-gen tech startup, and you want to beat your competition by delivering insights with bleeding-edge technologies; perhaps you need a lakehouse? Unfortunately, I haven't seen real use cases yet, so I can't say how it would work together with the old-school methods for cleaner and smoother analysis.
Reference List
[2] https://en.wikipedia.org/wiki/Star_schema
[3] https://en.wikipedia.org/wiki/Snowflake_schema
[4] https://danlinstedt.com/solutions-2/data-vault-basics/
[5] https://en.wikipedia.org/wiki/Data_vault_modeling
[6] https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html