Data Modelling. It just looks easy—practical examples
Modelling the same dataset in different styles (3NF, Kimball's, Data Vault)
While working on a post for more extensive Dagster usage, I sought a more interesting data set to play around with (I got tired of Yellow Taxi 😅).
I used to love playing board games, but now I have limited time and wait for my kids to grow so we can play more complicated ones. It was an excellent area to look around for datasets. If you’re into board games, you’re most likely familiar with BGG (Board Game Geek). Amazingly, they have an API for their games, and the response data I saw got me into this medium post.
You can read my introductory post I wrote about data warehousing a long time ago:
Guide to Data Warehousing
In this guide, I’ll try to cover several methodologies, explain their differences and when and why (in my point of view) one is better than the other, and maybe introduce some tools you can use when modeling DWH (Data Warehouse) or EDW (Enterprise Data Warehouse).
Let’s look at the API response. By the way, I’m using the second one: https://www.boardgamegeek.com/xmlapi2/thing?type={}&id={}
First of all, it’s a freaking XML.
<items termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
<item type="boardgame" id="2356">
<thumbnail>https://cf.geekdo-images.com/XvsI7muV5ZJsNrKa_PhiFw__thumb/img/UCrkDmKbxbJaVB4V9YCvV3lkvTg=/fit-in/200x150/filters:strip_icc()/pic275151.jpg</thumbnail>
<image>https://cf.geekdo-images.com/XvsI7muV5ZJsNrKa_PhiFw__original/img/XsGMhGkcpPranJnvJ4VrbmsZXCs=/0x0/filters:format(jpeg)/pic275151.jpg</image>
<name type="primary" sortindex="1" value="Scrabble Sentence Cube Game"/>
<name type="alternate" sortindex="1" value="Kampen om tiden"/>
<name type="alternate" sortindex="1" value="Super bogstavrebus"/>
<name type="alternate" sortindex="1" value="Super bokstavsrebus"/>
<description>Players in turn roll the 21 word cubes, set the timer and form sentences with the words appearing on the tops of the cubes in a crossword puzzle-like fashion (one word may be part of two sentences - one running vertically and one running horizontally). Turns end when the egg timer runs out. Players score 50 points for each sentence of 7 words or more. All other sentences score the square of the number of words in the sentence (i.e. a 4-word sentence scores 4 X 4, or 16 points). A bonus of 50 points is awarded if all 21 cubes are used. Two points are deducted for each unused cube. The 1983 version of the game seems to differ in the number of cubes. The rules on the inside of the box lid still have a 1971 copyright date, but among other differences, they state that the game contains 27 cubes. Two of the new cubes are marked with stars on all six sides, making them wild. From the 1983 rules: "When playing a wild cube, a player must state what word it represents, after which it cannot be changed during that turn." The 1988 version has 24 cubes - two of which are the starred wild dice. The box states that this is the 2nd edition of the game from Coleco Games. Selchow & Righter is credited in the directions as a subsidiary of Coleco Industries, Inc. The majority of the outer box is blue. </description>
<yearpublished value="1971"/>
<minplayers value="0"/>
<maxplayers value="0"/>
<poll name="suggested_numplayers" title="User Suggested Number of Players" totalvotes="2">
<results numplayers="0+">
<result value="Best" numvotes="0"/>
<result value="Recommended" numvotes="0"/>
<result value="Not Recommended" numvotes="0"/>
</results>
</poll>
<playingtime value="10"/>
<minplaytime value="10"/>
<maxplaytime value="10"/>
<minage value="10"/>
<poll name="suggested_playerage" title="User Suggested Player Age" totalvotes="1">
<results>
<result value="2" numvotes="0"/>
<result value="3" numvotes="0"/>
<result value="4" numvotes="0"/>
<result value="5" numvotes="0"/>
<result value="6" numvotes="0"/>
<result value="8" numvotes="1"/>
<result value="10" numvotes="0"/>
<result value="12" numvotes="0"/>
<result value="14" numvotes="0"/>
<result value="16" numvotes="0"/>
<result value="18" numvotes="0"/>
<result value="21 and up" numvotes="0"/>
</results>
</poll>
<poll name="language_dependence" title="Language Dependence" totalvotes="1">
<results>
<result level="1" value="No necessary in-game text" numvotes="0"/>
<result level="2" value="Some necessary text - easily memorized or small crib sheet" numvotes="0"/>
<result level="3" value="Moderate in-game text - needs crib sheet or paste ups" numvotes="0"/>
<result level="4" value="Extensive use of text - massive conversion needed to be playable" numvotes="1"/>
<result level="5" value="Unplayable in another language" numvotes="0"/>
</results>
</poll>
<link type="boardgamecategory" id="1017" value="Dice"/>
<link type="boardgamecategory" id="1025" value="Word Game"/>
<link type="boardgamemechanic" id="2072" value="Dice Rolling"/>
<link type="boardgamefamily" id="75031" value="Components: Sand Timers"/>
<link type="boardgamefamily" id="4711" value="Game: Scrabble"/>
<link type="boardgamefamily" id="62039" value="Word Games: Spelling / Letters"/>
<link type="boardgamedesigner" id="3" value="(Uncredited)"/>
<link type="boardgamepublisher" id="496" value="danspil"/>
<link type="boardgamepublisher" id="359" value="Selchow & Righter"/>
<link type="boardgamepublisher" id="5047" value="Top-Toy"/>
</item>
</items>
And I can already see some exciting lists and combinations. Since I'm not too fond of XML, let’s put it to JSON and view it. It’s a better form, anyway.
{
"items":{
"@termsofuse":"https://boardgamegeek.com/xmlapi/termsofuse",
"item":{
"@type":"boardgame",
"@id":"2356",
"thumbnail":"https://cf.geekdo-images.com/XvsI7muV5ZJsNrKa_PhiFw__thumb/img/UCrkDmKbxbJaVB4V9YCvV3lkvTg=/fit-in/200x150/filters:strip_icc()/pic275151.jpg",
"image":"https://cf.geekdo-images.com/XvsI7muV5ZJsNrKa_PhiFw__original/img/XsGMhGkcpPranJnvJ4VrbmsZXCs=/0x0/filters:format(jpeg)/pic275151.jpg",
"name":[
{
"@type":"primary",
"@sortindex":"1",
"@value":"Scrabble Sentence Cube Game"
},
{
"@type":"alternate",
"@sortindex":"1",
"@value":"Kampen om tiden"
},
{
"@type":"alternate",
"@sortindex":"1",
"@value":"Super bogstavrebus"
},
{
"@type":"alternate",
"@sortindex":"1",
"@value":"Super bokstavsrebus"
}
],
"description":"Players in turn roll the 21 word cubes, set the timer and form sentences with the words appearing on the tops of the cubes in a crossword puzzle-like fashion (one word may be part of two sentences - one running vertically and one running horizontally). Turns end when the egg timer runs out. Players score 50 points for each sentence of 7 words or more. All other sentences score the square of the number of words in the sentence (i.e. a 4-word sentence scores 4 X 4, or 16 points). A bonus of 50 points is awarded if all 21 cubes are used. Two points are deducted for each unused cube. The 1983 version of the game seems to differ in the number of cubes. The rules on the inside of the box lid still have a 1971 copyright date, but among other differences, they state that the game contains 27 cubes. Two of the new cubes are marked with stars on all six sides, making them wild. From the 1983 rules: "When playing a wild cube, a player must state what word it represents, after which it cannot be changed during that turn." The 1988 version has 24 cubes - two of which are the starred wild dice. The box states that this is the 2nd edition of the game from Coleco Games. Selchow & Righter is credited in the directions as a subsidiary of Coleco Industries, Inc. The majority of the outer box is blue. ",
"yearpublished":{
"@value":"1971"
},
"minplayers":{
"@value":"0"
},
"maxplayers":{
"@value":"0"
},
"poll":[
{
"@name":"suggested_numplayers",
"@title":"User Suggested Number of Players",
"@totalvotes":"2",
"results":{
"@numplayers":"0+",
"result":[
{
"@value":"Best",
"@numvotes":"0"
},
{
"@value":"Recommended",
"@numvotes":"0"
},
{
"@value":"Not Recommended",
"@numvotes":"0"
}
]
}
},
{
"@name":"suggested_playerage",
"@title":"User Suggested Player Age",
"@totalvotes":"1",
"results":{
"result":[
{
"@value":"2",
"@numvotes":"0"
},
{
"@value":"3",
"@numvotes":"0"
},
{
"@value":"4",
"@numvotes":"0"
},
{
"@value":"5",
"@numvotes":"0"
},
{
"@value":"6",
"@numvotes":"0"
},
{
"@value":"8",
"@numvotes":"1"
},
{
"@value":"10",
"@numvotes":"0"
},
{
"@value":"12",
"@numvotes":"0"
},
{
"@value":"14",
"@numvotes":"0"
},
{
"@value":"16",
"@numvotes":"0"
},
{
"@value":"18",
"@numvotes":"0"
},
{
"@value":"21 and up",
"@numvotes":"0"
}
]
}
},
{
"@name":"language_dependence",
"@title":"Language Dependence",
"@totalvotes":"1",
"results":{
"result":[
{
"@level":"1",
"@value":"No necessary in-game text",
"@numvotes":"0"
},
{
"@level":"2",
"@value":"Some necessary text - easily memorized or small crib sheet",
"@numvotes":"0"
},
{
"@level":"3",
"@value":"Moderate in-game text - needs crib sheet or paste ups",
"@numvotes":"0"
},
{
"@level":"4",
"@value":"Extensive use of text - massive conversion needed to be playable",
"@numvotes":"1"
},
{
"@level":"5",
"@value":"Unplayable in another language",
"@numvotes":"0"
}
]
}
}
],
"playingtime":{
"@value":"10"
},
"minplaytime":{
"@value":"10"
},
"maxplaytime":{
"@value":"10"
},
"minage":{
"@value":"10"
},
"link":[
{
"@type":"boardgamecategory",
"@id":"1017",
"@value":"Dice"
},
{
"@type":"boardgamecategory",
"@id":"1025",
"@value":"Word Game"
},
{
"@type":"boardgamemechanic",
"@id":"2072",
"@value":"Dice Rolling"
},
{
"@type":"boardgamefamily",
"@id":"75031",
"@value":"Components: Sand Timers"
},
{
"@type":"boardgamefamily",
"@id":"4711",
"@value":"Game: Scrabble"
},
{
"@type":"boardgamefamily",
"@id":"62039",
"@value":"Word Games: Spelling / Letters"
},
{
"@type":"boardgamedesigner",
"@id":"3",
"@value":"(Uncredited)"
},
{
"@type":"boardgamepublisher",
"@id":"496",
"@value":"danspil"
},
{
"@type":"boardgamepublisher",
"@id":"359",
"@value":"Selchow & Righter"
},
{
"@type":"boardgamepublisher",
"@id":"5047",
"@value":"Top-Toy"
}
]
}
}
}
We can simplify this JSON by leaving the item object and removing unnecessary dictionary objects. Nonetheless, it doesn’t simplify the whole object to a simple table.
Let’s get down to business and try to model this data in multiple ways and explain things as we go. But before, let’s imagine this JSON as a table in the database.
So by List, I mean here an Array/List/JSON object, leaving it to your imagination, want to outline that it’s a complex object and not a simple data type.
Inmon methodology
1st Normal Form
By definition, 1st Normal Form:
If a relation contains a composite or multi-valued attribute, it violates the first normal form, or the relation is in first normal form if it does not contain any composite or multi-valued attribute. A relation is in first normal form if every attribute in that relation is singled valued attribute.
A table is in 1 NF if:
* There are only Single Valued Attributes.
* Attribute Domain does not change.
* There is a unique name for every Attribute/Column.
* The order in which data is stored does not matter.
What we can do is extract complex objects and make new rows for each different object inside of it.
One place where I still have doubts is that in polls, results have value and level. I was thinking of having a composite field of them, but let’s leave it as is for now.
If we check the 1NF rules — we comply. Hooray, one down, two more to go. However, we lost the Primary key of ID since multiple pieces of information are repeated per row level.
2nd Normal Form
Definition:
A relation that is in First Normal Form and every non-primary-key attribute is fully functionally dependent on the primary key, then the relation is in Second Normal Form (2NF).
Now, here comes the most complex and headache-causing part. We have to separate entities and define their relationships in another table.
In this scenario, we’d first have to separate links and polls to separate tables and have a table that connects them to the board game table. I’ve created a surrogate key for link_id and link_type and named it link_unique_id. I’ve done this looking from a software engineering perspective that I can easily add new components without heavily modifying existing structure. I’ll talk a bit more about it in Kimball's methodology section!
However, there is one issue: there will be multiple polls with different results, so poll_name is not a Primary key in the poll table. To fix it, we have to separate the results into an other table.
Now I’ve separated everything to confide to 2NF.
3rd Normal Form
A relation is in third normal form, if there is no transitive dependency for non-prime attributes as well as it is in second normal form
I don't see any transitive dependencies since poll_total_votes and result_number_of_votes were moved to the linking tables. I might have skipped something as I went. Feel free to comment under the post if I skipped/missed something!
Kimball methodology
Now, this is where the fun part begins. At this point, we should know what Issue we’re tackling and what’s an area of interest in our analytical solution.
I can see multiple ideas.
We want to identify how poll results look based on different board game dimensions.
Let’s say we get our hands on transactional data of sold goods in some shops and want to analyze it.
Again, it highly depends on the question you’re trying to answer. Because of this, your data model might look completely different.
Sales model
I am starting with this one since it’s the most common and easiest to grasp for everyone.
We have our fact table and our board game dimension, but we also have more information about the board game itself in the link table. Since the link part can contain multiple IDs matched to the same one boardgame, we introduced a so-called bridge table.
A logical data model may contain one or more many-to-many relationships. Physical data modelling techniques transform a many-to-many many-relationships into one-to many-relationships by adding additional tables. These are referred to as bridge tables.
It's good that the boardgame_link table already serves that purpose, and we don’t have to make any adjustments! That’s, of course, if you went over the hell and modelled data in 3rd NF that helped you get to this point without breaking a sweat 😅
Polls model
Let’s say we don’t have any sales data and care about different polls we executed on other games. The exciting thing is that in tables, we don’t have any date fields, so I’m just winging it and modelling it out as I see fit here. Most likely, in the backend, they have event-driven architecture or even a simple transaction log from where you could extract the information about when some vote was cast and by whom.
So very similarly to the Sales model, we have a fact table with one more dimension key for polls and the same bridge table from multi-attributed values.
Note: Link table
Now, looking at both models, you might ask me why I’m not separating link entities as separate tables, i.e.
So my answer is that I can, though I’ve, from this one board game, noticed five different potential dimensions. Remember that the majority (if not all) would have a many-to-many relationship with the boardgame table. That means five more tables. How do we proceed if developers add one more? Will our processes crash? Will we notice that there were added new ones? Data contracts are still a dream (at least when I write this post). Being a person who hates doing a lot of maintenance work (adding new things constantly) instead of automating what I can, this solution is a no-brainer. People might be thinking differently, and like everywhere — everyone is allowed to have their opinion. Do what suits you or the actual situation best!
Data Vault
Here is a meme I’ve just created:
It’s just my opinion, but you might relate to it. I was a big fan of Data Vault modelling at some point in my career. Now, what I see is that there are some use cases for more niche modelling approaches (and yes, in my eyes, Data Vault, One Big Table or 3NF are niche ones). The fun thing is that this dataset suits more the niche approaches than the Star Schema! If you’re keen on a quick recap
This might be a controversial saying, but looking now, I think that the Data Vault model is good for companies/people who either have tons of things to cover and want to scale horizontally fast or are lazy to analyze data and create relevant data models upfront, spending time before implementing.
So, what would change from our 3rd NF model? Not much. We must add hubs and make our dimensions/entity tables SCD type 2. I will use natural keys instead of creating surrogate ones for simplicity and will rename the link table to attribute to avoid confusion.
Another thing is that if we’re using the DV approach for links with attributes (previously named boardgame_poll, poll_result), we have to create separate satellites. Most likely, I missed a couple of connections between hubs/sats/links/sat_links, etc., but I am not going to lie. It isn’t straightforward…
Summary
When choosing the approach, think carefully about:
Speed for the delivery
Complexity of the created model
Usability of the developed model
Speed for adjustments
Documentation and other people's understanding of the model
All of the modelling techniques have their perfect use case. If you can — try, fail, iterate until it works. Have your lessons learned, and try not to do the same ones later.