The Data Modeling Decisions That Make or Break Your Stack
What I learned the hard way about Kimball, One Big Table, and modern ELT after a decade in data

I've been doing data work for over a decade now, and I've learned one thing the hard way: the modelling decisions you make in those first few weeks will either set your team up for success or create technical debt that haunts you for years.
The problem isn't usually the big architectural choices - those get scrutinised and debated. It's the seemingly innocent decisions that compound over time. That "flexible" JSON field for storing product attributes. The dimension table that grows from 10 columns to 50 because "it's easier just to add it here." The fact table that quietly starts handling multiple business processes because "they're related."
I've inherited systems where simple customer reports required 15+ table joins. I've seen data pipelines that worked fine with 1000 records but fell apart at 1 million. I've debugged transformations that took longer to run than the business cycles they were measuring.
These aren't infrastructure problems - they're modelling problems. And in my experience across banking, ERP, and various data platforms, the difference between a data stack that enables your business and one that constrains it often comes down to a few key decisions made early on.
The Three Modelling Approaches That Actually Matter
Despite what the consultants tell you, there are only three modelling approaches that matter in practice. I've used all of them, and each has its place depending on your situation.
Kimball: When You Need Results Yesterday
Kimball's dimensional modelling is the Honda Civic of data modelling - reliable, straightforward, and gets the job done. If you're building analytics for a specific department and need to show value quickly, this is your best bet.
The star schema pattern works beautifully with modern cloud warehouses. Here's what a clean implementation looks like:
-- Fact table - keep it simple
CREATE TABLE fact_sales (
date_key INT,
product_key INT,
customer_key INT,
sales_amount DECIMAL(10,2),
quantity_sold INT
-- Cluster on your most common filter columns
) CLUSTER BY (date_key, customer_key);
-- Dimension - all the context you need
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
subcategory VARCHAR(50),
brand VARCHAR(50)
);
I've used Kimball for retail analytics, marketing reporting, and financial dashboards. It's fast to implement, and business users understand it. The downside? It doesn't handle source system changes gracefully, and you'll find yourself rebuilding dimensions when business rules evolve.
As I mentioned in my data warehousing guide, Kimball shines when you know your business questions upfront and need to deliver value in 2-3 week sprints.
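Before moving on, here's the kind of query this schema exists for - one fact, one join, and an answer business users can trust. A minimal sketch; the YYYYMMDD smart-key format for date_key is my assumption here:
-- Monthly sales by product category - a typical star-schema query
-- (assumes date_key is a YYYYMMDD integer smart key)
SELECT
p.category,
SUM(f.sales_amount) AS total_sales,
SUM(f.quantity_sold) AS total_units
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
WHERE f.date_key BETWEEN 20240101 AND 20240131
GROUP BY p.category
ORDER BY total_sales DESC;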
One Big Table: The Pragmatic Choice
Let me be honest - sometimes the best data model is the one that gets out of your way. One Big Table (OBT) approaches get dismissed as "unsophisticated," but I've seen them solve real business problems faster than any perfectly normalised schema.
The idea is simple: denormalise everything into wide tables optimised for specific analytical workloads. Here's what it looks like in practice:
-- Customer analytics OBT
CREATE TABLE customer_analytics_obt (
-- Customer info
customer_id VARCHAR(50),
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_signup_date DATE,
customer_tier VARCHAR(20),
-- Latest order info
last_order_date DATE,
last_order_amount DECIMAL(10,2),
days_since_last_order INT,
-- Lifetime metrics
total_orders INT,
total_spent DECIMAL(12,2),
avg_order_value DECIMAL(10,2),
-- Behavioral segments
purchase_frequency VARCHAR(20),
product_affinity VARCHAR(50),
churn_risk_score DECIMAL(3,2),
-- Update metadata
_updated_at TIMESTAMP
) CLUSTER BY (customer_tier, last_order_date);
The beauty of OBT is its simplicity for end users. No joins to remember, no complex relationships to understand. Everything they need is in one place. The trade-off? You're duplicating data and creating maintenance overhead. But when query performance and user experience matter more than storage costs, it's often the right choice.
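To give you a feel for that maintenance overhead, here's a minimal sketch of the scheduled rebuild behind a table like this - assuming hypothetical stg_customers and stg_orders staging tables, and leaving out the segmentation columns:
-- Rebuild the OBT in one pass (stg_customers / stg_orders are placeholders)
CREATE OR REPLACE TABLE customer_analytics_obt AS
SELECT
c.customer_id,
c.customer_name,
c.customer_email,
c.signup_date AS customer_signup_date,
c.tier AS customer_tier,
MAX(o.order_date) AS last_order_date,
DATEDIFF('day', MAX(o.order_date), CURRENT_DATE()) AS days_since_last_order,
COUNT(o.order_id) AS total_orders,
SUM(o.order_amount) AS total_spent,
AVG(o.order_amount) AS avg_order_value,
CURRENT_TIMESTAMP() AS _updated_at
FROM stg_customers c
LEFT JOIN stg_orders o ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4, 5;
Every new column you add to the OBT means another change to this rebuild logic - that's the overhead you're signing up for.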
When to use OBT:
Self-service analytics where users have varying SQL skills
Reporting workloads with predictable access patterns
Real-time dashboards that need sub-second response times
Machine learning feature stores
When to avoid OBT:
Highly normalised operational systems
Data that changes frequently (you'll spend all your time maintaining it)
When storage costs are a primary concern
Modern ELT: The New Default
The shift from ETL to ELT isn't just a technical change - it's a fundamental rethinking of how we approach data modelling. Instead of transforming data before loading it, we load raw data and transform it using the warehouse's compute power.
This approach, popularised by tools like dbt, has become my default for most new projects:
-- Raw data layer - exactly as it comes from source
CREATE TABLE raw_orders (
order_id VARCHAR(50),
customer_id VARCHAR(50),
order_date VARCHAR(20), -- Yes, even dates come as strings
order_amount VARCHAR(20),
status VARCHAR(50),
_loaded_at TIMESTAMP
);
-- Staging layer - clean and type-cast
{{ config(materialized='view') }}
SELECT
order_id,
customer_id,
CAST(order_date AS DATE) AS order_date,
CAST(order_amount AS DECIMAL(10,2)) AS order_amount,
UPPER(TRIM(status)) AS order_status,
_loaded_at
FROM {{ source('ecommerce', 'raw_orders') }}
WHERE order_date IS NOT NULL
The beauty of ELT is that you can iterate quickly. Made a mistake in your transformation logic? Fix it and re-run. Need to add a new metric? Write a new model. The raw data is always there, unchanged.
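And adding that new metric really is just another model. A minimal sketch, assuming the staging model above is named stg_orders and that 'COMPLETED' is a valid status value in your data:
{{ config(materialized='table') }}
-- Daily revenue mart built on the staging layer
SELECT
order_date,
COUNT(*) AS order_count,
SUM(order_amount) AS daily_revenue
FROM {{ ref('stg_orders') }}
WHERE order_status = 'COMPLETED' -- assumed status value
GROUP BY order_date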
But here's the thing - there's never a one-size-fits-all answer in data modelling. You might end up with a data warehouse where you have Kimball-style star schemas for everyday reporting needs, OBT structures for high-performance analytics, and various specialised marts for specific use cases. Each serves a purpose, and you'll join them together for different business scenarios.
The key is being intentional about these choices. Don't create different modelling approaches just because you can - make them because they solve specific problems better than a unified approach would. Otherwise, you'll end up with a confusing mess that nobody understands.
The Anti-Patterns That Will Destroy Your Performance
I've seen these patterns destroy data stacks more times than I care to count. Avoid them at all costs.
The EAV Death Trap
Entity-Attribute-Value patterns seem flexible until you need to query them. I inherited a system where customer properties were stored like this:
-- Don't do this. Seriously.
CREATE TABLE customer_attributes (
customer_id INT,
attribute_name VARCHAR(50),
attribute_value VARCHAR(255)
);
-- To get basic customer info, you need this horror
SELECT
c.customer_id,
MAX(CASE WHEN ca.attribute_name = 'name' THEN ca.attribute_value END) AS name,
MAX(CASE WHEN ca.attribute_name = 'email' THEN ca.attribute_value END) AS email,
MAX(CASE WHEN ca.attribute_name = 'city' THEN ca.attribute_value END) AS city
FROM customers c
LEFT JOIN customer_attributes ca ON c.customer_id = ca.customer_id
GROUP BY c.customer_id;
This pattern makes simple queries complex and complex queries impossible. Every analytics request becomes a performance nightmare.
A modern variant of this problem happens when software engineers store everything in JSON fields. You end up with tables full of JSON blobs that need to be parsed row by row, deduplicated, and cleaned before they're usable. The queries run slowly because the database has to parse JSON for every single row instead of using proper indexing and column statistics.
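If you inherit JSON blobs, the pragmatic fix is to pay the parsing cost once instead of on every query. A Snowflake sketch - the raw_customer_events table and the payload field names are hypothetical:
-- Flatten VARIANT payloads into typed columns once, then query the result
CREATE OR REPLACE TABLE customers_flat AS
SELECT
payload:customer_id::VARCHAR AS customer_id,
payload:name::VARCHAR AS name,
payload:email::VARCHAR AS email,
payload:city::VARCHAR AS city,
_loaded_at
FROM raw_customer_events;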
The God Table Problem
I once worked with a "customer" table that had 147 columns. Finding the right column was like playing Where's Waldo, and every query scanned unnecessary data. Break large tables into logical components - it's better for performance and maintainability.
Wide Tables Without Clustering
Cloud warehouses are incredibly powerful, but they're not magic. If you don't tell them how to organise your data, they'll scan everything. Use clustering keys wisely:
-- Snowflake optimization for time-series data
CREATE TABLE events (
event_date DATE,
user_id VARCHAR(50),
event_type VARCHAR(100),
event_data VARIANT
) CLUSTER BY (event_date, user_id);
As I detailed in my practical data modelling examples, getting clustering right can mean the difference between 2-second and 2-minute queries.
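The mechanism behind that difference is partition pruning: filter on your clustering keys and Snowflake skips whole micro-partitions instead of scanning them. For example:
-- Prunes on event_date, so only ~7 days of micro-partitions are scanned
SELECT
event_type,
COUNT(*) AS event_count
FROM events
WHERE event_date >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY event_type;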
A Framework for Evolving Your Data Model
One of the biggest challenges in data modelling is knowing when to evolve your approach. Here's the framework I use to guide these decisions without breaking the bank or driving myself to exhaustion.
Start Simple, Plan for Complexity
Begin with the most straightforward modelling approach that meets your current needs. If you're doing basic reporting for one department, start with Kimball. If you need high-performance analytics with predictable access patterns, consider OBT from the beginning.
The key is understanding your growth trajectory:
Volume growth: How fast is your data growing?
Source complexity: Do you frequently add new data sources?
Analytical complexity: Are users asking for more sophisticated analyses over time?
Performance requirements: Do you need sub-second queries, or are minutes acceptable?
When to Add Data Lake Layers
I add data lake patterns when I hit one of these thresholds:
Source diversity: When you're pulling from 5+ different source systems with different schemas
Transformation complexity: When your staging layer starts requiring complex business logic
Historical analysis: When users need to analyse data "as it was" at different points in time
Exploration needs: When analysts need to experiment with data that doesn't fit existing models
Governance Evolution Path
Start with basic governance and evolve as your organisation matures:
Phase 1: Basic documentation and naming conventions
Document table purposes and column definitions
Establish consistent naming patterns
Define data refresh schedules
Phase 2: Formal definitions and lineage
Create business glossaries for key metrics
Implement data lineage tracking
Establish data quality tests
Phase 3: Advanced governance and self-service
Implement semantic layers for business users
Create certified datasets and metrics
Enable self-service analytics with guardrails
The mistake I see teams make is trying to implement Phase 3 governance when they're still struggling with Phase 1 problems. Build your foundation first.
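As a concrete starting point for the Phase 2 quality tests, dbt singular tests are just SQL files that fail when they return rows. A minimal sketch against the stg_orders model assumed earlier:
-- tests/assert_no_negative_order_amounts.sql
-- dbt treats any returned rows as a test failure
SELECT
order_id,
order_amount
FROM {{ ref('stg_orders') }}
WHERE order_amount < 0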
Migration Strategies That Actually Work
Start with Raw Data
Whether you're moving from legacy systems or modernising an existing warehouse, always start by preserving raw data in your cloud warehouse's native tables.
I've learned this lesson multiple times: external tables and data lake storage seem appealing for cost reasons, but they introduce performance bottlenecks and complexity. Snowflake works best when data resides in its native table format, allowing it to leverage micro-partitioning, clustering, and caching. The same principle applies to BigQuery, Redshift, and other cloud warehouses - they're optimised for their own storage formats.
-- Create a landing zone for all source data in native tables
CREATE TABLE raw_legacy_customers (
extracted_at TIMESTAMP,
source_system VARCHAR(50),
customer_data VARIANT, -- JSON blob of the original record
_metadata VARIANT -- Capture extraction metadata
) CLUSTER BY (extracted_at, source_system);
-- For structured sources, preserve the original structure
CREATE TABLE raw_crm_customers (
extracted_at TIMESTAMP,
customer_id VARCHAR(50),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
-- Include ALL original columns, even ones you don't think you need
original_create_date TIMESTAMP,
last_modified_date TIMESTAMP
) CLUSTER BY (extracted_at, customer_id);
This approach provides flexibility to rebuild models without losing historical context, and you reap the full performance benefits of your cloud warehouse. Yes, storage costs more than external files, but the query performance and simplicity gains far outweigh the cost difference in most cases.
Migrate by Business Domain
Don't try to migrate everything at once. Pick one business domain (such as customer data or product catalogue) and migrate it completely before moving on to the next—this limits the blast radius when things go wrong.
Parallel Processing During Migration
Run old and new systems in parallel for at least one complete business cycle. This enables you to validate data accuracy and provides business users with confidence in the new system.
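The validation itself can be simple - daily row counts and summed measures catch the vast majority of discrepancies. A sketch with placeholder table names:
-- Surface days where the legacy and new pipelines disagree
SELECT
COALESCE(l.order_date, n.order_date) AS order_date,
l.order_count AS legacy_count,
n.order_count AS new_count,
l.total_amount AS legacy_amount,
n.total_amount AS new_amount
FROM (
SELECT order_date, COUNT(*) AS order_count, SUM(order_amount) AS total_amount
FROM legacy_orders GROUP BY order_date
) l
FULL OUTER JOIN (
SELECT order_date, COUNT(*) AS order_count, SUM(order_amount) AS total_amount
FROM new_orders GROUP BY order_date
) n ON l.order_date = n.order_date
WHERE l.order_count IS DISTINCT FROM n.order_count
OR l.total_amount IS DISTINCT FROM n.total_amount;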
The Snowflake Performance Audit Template
Here's the performance audit framework I use when inheriting a new data stack. Since I primarily work with Snowflake, these queries are Snowflake-specific; however, the concepts apply to other platforms as well.
Query Performance Analysis
-- Find your slowest queries
-- (result_limit raises the function's default cap of 100 rows)
SELECT
query_text,
execution_time/1000 as execution_time_seconds,
warehouse_name,
user_name,
start_time,
total_elapsed_time/1000 as total_time_seconds
FROM table(information_schema.query_history(result_limit => 10000))
WHERE execution_time > 30000 -- 30+ seconds
ORDER BY execution_time DESC
LIMIT 20;
-- Identify queries scanning too much data
SELECT
query_text,
bytes_scanned,
warehouse_name,
execution_time/1000 as execution_time_seconds
FROM table(information_schema.query_history(result_limit => 10000))
WHERE bytes_scanned > 1073741824 -- 1+ GB
ORDER BY bytes_scanned DESC
LIMIT 20;
Table Health Check
-- Check for tables that might benefit from clustering
SELECT
table_name,
row_count,
bytes,
clustering_key,
CASE
WHEN clustering_key IS NULL AND row_count > 1000000 THEN 'Consider clustering'
ELSE 'OK'
END as recommendation
FROM information_schema.tables
WHERE table_schema != 'INFORMATION_SCHEMA'
ORDER BY bytes DESC;
-- Find tables with poor clustering - list the clustered tables first
SELECT
table_name,
clustering_key
FROM information_schema.tables
WHERE clustering_key IS NOT NULL
AND table_schema != 'INFORMATION_SCHEMA';
-- Then inspect each one individually - SYSTEM$CLUSTERING_INFORMATION
-- expects a fully qualified table name passed as a string literal
SELECT SYSTEM$CLUSTERING_INFORMATION('my_db.my_schema.events') AS clustering_info;
Warehouse Utilisation Review
-- Check warehouse usage patterns over the last 30 days
-- (the information_schema function only keeps 7 days of history,
-- so use account_usage here; it lags real time by up to 45 minutes)
SELECT
warehouse_name,
DATE(start_time) as usage_date,
COUNT(*) as query_count,
SUM(execution_time)/1000 as total_execution_seconds,
AVG(execution_time)/1000 as avg_execution_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name, DATE(start_time)
ORDER BY usage_date DESC, total_execution_seconds DESC;
Cost Optimisation Queries
-- Find expensive queries by compute cost
SELECT
query_text,
warehouse_name,
execution_time/1000 as execution_seconds,
warehouse_size,
(execution_time/1000/3600) *
CASE warehouse_size
WHEN 'X-SMALL' THEN 1
WHEN 'SMALL' THEN 2
WHEN 'MEDIUM' THEN 4
WHEN 'LARGE' THEN 8
WHEN 'X-LARGE' THEN 16
ELSE 32
END as estimated_credit_cost
FROM table(information_schema.query_history(result_limit => 10000))
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY estimated_credit_cost DESC
LIMIT 20;
What I'd Do Differently Today
If I were starting a new data stack today, here's exactly what I'd do:
Start with dbt Core and Snowflake/BigQuery - The ecosystem is mature, the learning curve is manageable, and you'll have community support when things break.
Implement data quality testing from day one - Use dbt-expectations or Great Expectations. Trust me, catching data quality issues early is infinitely easier than debugging them in production.
Build raw → staging → marts layers - Even if you think you don't need the complexity initially, you'll thank yourself later when requirements change.
Version control everything - your dbt models, seed files, and documentation. Treat your data transformations like the code they are.
Monitor costs obsessively - Cloud warehouses can get expensive fast. Set up cost alerts and review usage monthly (a resource monitor sketch follows below).
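On that last point, Snowflake resource monitors are the cheapest insurance you can buy. A sketch - the quota and warehouse name are placeholders:
-- Cap monthly spend and get alerted before hitting it
CREATE RESOURCE MONITOR monthly_budget
WITH CREDIT_QUOTA = 100
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 80 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND;
-- Attach it to the warehouse you want to govern
ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_budget;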
The Bottom Line
Data modelling isn't glamorous work, but it's the foundation on which everything else builds. The decisions you make in those first few weeks will echo through every query, every dashboard, and every analytics request for years to come.
I've learned this the hard way: there's no such thing as a temporary data model. That "quick fix" you implement to unblock the marketing team will still be running two years later, probably supporting mission-critical reports you never imagined.
Take the time to think through your modelling approach. Understand your business requirements. Start simple but build for change. And for the love of all that is holy, document your decisions so that the next person (probably future you) understands why you made the choices you did.
The data stack that serves your business well isn't the one with the most cutting-edge technology. It's the one that makes data accessible, reliable, and fast - no matter how your business evolves.

