
Data Modeling Patterns: Choosing the Right Architecture

#data-modeling #data-engineering #analytics #data-warehouse

Data modeling is the structural foundation of every analytics platform. The wrong model slows queries, confuses analysts, and makes schema evolution painful. The right model accelerates insight delivery and scales with the business.

Modeling Approach Comparison

| Approach | Structure | Query Performance | Flexibility | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Star Schema | Fact tables + denormalized dimensions | Excellent — few joins | Medium | Low | BI dashboards, known queries |
| Snowflake Schema | Fact tables + normalized dimensions | Good — more joins | High | Medium | Storage-sensitive, normalized sources |
| Data Vault 2.0 | Hubs + Links + Satellites | Variable — needs marts | Very high | High | Regulated industries, auditability |
| One Big Table (OBT) | Single wide denormalized table | Excellent — zero joins | Low | Very low | Small teams, single-domain analytics |
| Activity Schema | Entity + activity stream | Good | High | Medium | Event-driven analytics |
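To make the star-schema row concrete, here is a minimal sketch using Python's built-in sqlite3: one fact table, one denormalized dimension, and the kind of single-join aggregate a BI dashboard would issue. All table and column names are illustrative, not taken from any particular warehouse.

```python
import sqlite3

# Minimal star schema sketch: a fact table plus a denormalized
# dimension, queried with a single join. Names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_name TEXT,
        region TEXT                         -- denormalized attribute
    );
    CREATE TABLE fct_orders (
        order_key INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        order_amount REAL                   -- grain: one row per order
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO fct_orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
""")

# Typical dashboard query: aggregate the fact, sliced by a dimension
# attribute, with exactly one join.
rows = con.execute("""
    SELECT d.region, SUM(f.order_amount)
    FROM fct_orders f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('APAC', 50.0), ('EMEA', 200.0)]
```

Because the dimension is denormalized, `region` lives directly on `dim_customer`; a snowflake variant would push it into a separate `dim_region` table at the cost of an extra join.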

Kimball vs Inmon: Philosophical Divide

| Dimension | Kimball (Bottom-Up) | Inmon (Top-Down) |
| --- | --- | --- |
| Starting point | Business process / department | Enterprise-wide data model |
| Central structure | Dimensional star schemas | 3NF enterprise data warehouse |
| Data marts | Built first, conformed over time | Derived from the central warehouse |
| Time to first value | Weeks | Months |
| Schema evolution | Moderate — add dimensions/facts | Complex — ripple effects |
| Team size needed | Small to medium | Large, dedicated modeling team |
| Governance model | Decentralized, conformed dimensions | Centralized, strict standards |
| Modern relevance | High — aligns with dbt/lakehouse | Medium — suits large enterprises |

dbt Modeling Layers

┌───────────────────────────────────────────────────────┐
│                 dbt Project Structure                 │
├───────────────────────────────────────────────────────┤
│                                                       │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐        │
│  │  Sources  │──▶│  Staging  │──▶│  Inter-   │        │
│  │           │   │  (stg_)   │   │  mediate  │        │
│  │ Raw data  │   │ 1:1 clean │   │  (int_)   │        │
│  └───────────┘   └───────────┘   └─────┬─────┘        │
│                                        │              │
│                                  ┌─────▼─────┐        │
│                                  │   Marts   │        │
│                                  │  (fct_,   │        │
│                                  │   dim_)   │        │
│                                  └─────┬─────┘        │
│                                        │              │
│                                  ┌─────▼─────┐        │
│                                  │  Metrics  │        │
│                                  │  (mtc_)   │        │
│                                  └───────────┘        │
└───────────────────────────────────────────────────────┘

Layer Responsibilities:
  Sources      → External system references, freshness checks
  Staging      → Renaming, casting, deduplication. One model per source table.
  Intermediate → Business logic joins, aggregations across staging models
  Marts        → Consumer-facing: fact and dimension tables
  Metrics      → Semantic layer definitions for BI tools
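As a sketch of what the staging layer's responsibilities look like in practice, here is the rename/cast/deduplicate step written in plain Python rather than dbt SQL; the raw source field names are hypothetical.

```python
# Sketch of a staging (stg_) model's job, in plain Python rather than
# dbt SQL: rename source columns, cast types, and deduplicate.
# The raw field names below are hypothetical.
raw_orders = [
    {"ORD_ID": "10", "AMT": "120.50", "CUST": "acme"},
    {"ORD_ID": "10", "AMT": "120.50", "CUST": "acme"},   # source duplicate
    {"ORD_ID": "11", "AMT": "80.00",  "CUST": "globex"},
]

def stg_orders(rows):
    seen, out = set(), []
    for r in rows:
        order_id = int(r["ORD_ID"])          # rename + cast
        if order_id in seen:                 # deduplicate on the key
            continue
        seen.add(order_id)
        out.append({
            "order_id": order_id,
            "order_amount": float(r["AMT"]),
            "customer_name": r["CUST"],
        })
    return out

print(stg_orders(raw_orders))
```

Note that no business logic appears here: joins and aggregations belong one layer up, in intermediate models.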

Pattern Selection Decision Tree

Start: What is your primary constraint?

├── "Speed to delivery"
│   ├── Small dataset (< 50 GB)? → One Big Table (OBT)
│   └── Larger dataset? → Star Schema (Kimball)
│
├── "Regulatory / auditability"
│   └── Need full history + lineage? → Data Vault 2.0
│
├── "Query flexibility"
│   ├── Known, stable queries? → Star Schema
│   └── Ad-hoc, exploratory? → Snowflake Schema or Data Vault + Marts
│
└── "Team size & skill"
    ├── < 3 data engineers? → OBT or Star Schema
    └── Dedicated modeling team? → Data Vault or Inmon 3NF
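The decision tree above can be codified as a small helper. The constraint labels and branches mirror the tree; this is a rule of thumb for illustration, not a prescriptive algorithm.

```python
# The decision tree above, codified. Branch labels and thresholds mirror
# the text; treat this as a rule of thumb, not a prescription.
def pick_model(constraint, small_data=False, stable_queries=False,
               small_team=False):
    if constraint == "speed":
        # < 50 GB is the dataset-size cutoff used in the tree
        return "OBT" if small_data else "Star Schema"
    if constraint == "audit":
        return "Data Vault 2.0"
    if constraint == "flexibility":
        return "Star Schema" if stable_queries else "Snowflake or Data Vault + Marts"
    if constraint == "team":
        # < 3 data engineers counts as a small team in the tree
        return "OBT or Star Schema" if small_team else "Data Vault or Inmon 3NF"
    raise ValueError(f"unknown constraint: {constraint}")

print(pick_model("speed", small_data=True))   # OBT
print(pick_model("audit"))                    # Data Vault 2.0
```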

Modeling Anti-Patterns to Avoid

| Anti-Pattern | Symptom | Fix |
| --- | --- | --- |
| God table | 200+ columns, mixed granularity | Decompose into proper fact/dim tables |
| Soft deletes everywhere | is_deleted flags bloat tables | Use SCD Type 2 or a separate archive |
| Business logic in BI tool | Metrics differ across dashboards | Push logic into dbt marts / semantic layer |
| No grain statement | Analysts unsure what a row represents | Document the grain explicitly per model |
| Premature normalization | Excessive joins kill dashboard performance | Denormalize for read-heavy patterns |
| Missing surrogate keys | Natural keys change, breaking joins | Always add stable surrogate keys |
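The surrogate-key fix is worth seeing end to end. In this sqlite3 sketch (names are illustrative), facts join on a stable surrogate integer, so the natural key, here an email address, can change without breaking any historical join.

```python
import sqlite3

# Surrogate-key sketch: facts reference a stable integer key, so the
# mutable natural key (an email here) can change freely. Names are
# illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_user (
        user_key  INTEGER PRIMARY KEY,  -- stable surrogate key
        email     TEXT                  -- mutable natural key
    );
    CREATE TABLE fct_logins (
        user_key  INTEGER,              -- facts join on the surrogate
        login_at  TEXT
    );
    INSERT INTO dim_user   VALUES (1, 'a@old.com');
    INSERT INTO fct_logins VALUES (1, '2024-01-01');
""")

# The natural key changes; the surrogate, and every fact row, is untouched.
con.execute("UPDATE dim_user SET email = 'a@new.com' WHERE user_key = 1")

rows = con.execute("""
    SELECT d.email, COUNT(*)
    FROM fct_logins f
    JOIN dim_user d USING (user_key)
    GROUP BY d.email
""").fetchall()
print(rows)  # [('a@new.com', 1)]
```

Had `fct_logins` stored the email directly, the UPDATE would have orphaned the fact row.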

Slowly Changing Dimensions (SCD) Strategy

| SCD Type | Behavior | Use Case | Storage Impact |
| --- | --- | --- | --- |
| Type 0 | Never changes | Reference codes, constants | Minimal |
| Type 1 | Overwrite | Corrections, non-historical attributes | Minimal |
| Type 2 | Add new row with validity dates | Full history required | High |
| Type 3 | Add column for previous value | Limited history (current + previous) | Low |
| Type 6 | Hybrid (1 + 2 + 3) | Current + historical with easy access | Medium-High |
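Type 2 is the workhorse, so here is a minimal sketch of its merge step in plain Python: when a tracked attribute changes, close the current row by setting its validity end date and append a new open-ended row. The field names and the far-future sentinel date are assumptions, not a standard.

```python
from datetime import date

# Sketch of an SCD Type 2 merge. Field names and the open-ended
# sentinel date are assumptions for illustration.
OPEN_END = date(9999, 12, 31)

def scd2_upsert(history, key, attrs, as_of):
    """history: list of dicts with key, attrs, valid_from, valid_to."""
    current = next((r for r in history
                    if r["key"] == key and r["valid_to"] == OPEN_END), None)
    if current is not None:
        if current["attrs"] == attrs:
            return history                  # no change, nothing to do
        current["valid_to"] = as_of         # close the old version
    history.append({"key": key, "attrs": attrs,
                    "valid_from": as_of, "valid_to": OPEN_END})
    return history

hist = []
scd2_upsert(hist, "cust-1", {"tier": "silver"}, date(2024, 1, 1))
scd2_upsert(hist, "cust-1", {"tier": "gold"},   date(2024, 6, 1))
print(len(hist))            # 2: full history preserved
print(hist[0]["valid_to"])  # 2024-06-01: old row closed
```

The high storage cost in the table above comes from exactly this behavior: every attribute change adds a row, and nothing is ever overwritten.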
