Data warehouse design is the process of structuring a centralised database to consolidate data from multiple systems for consistent reporting. Three decisions largely determine success: the schema model (such as Star or Snowflake), the architectural layers, and the design methodology.
Get those three right and everything else – ETL pipelines, dashboards, query performance – sits on a solid foundation. Get them wrong and you spend the next two years rebuilding.
Schema Design: Star, Snowflake, or Galaxy?
The schema is how your tables relate to each other. Most warehouses use one of three patterns:
| Schema Type | Structure | Query Speed | Complexity | Best Use Case |
| --- | --- | --- | --- | --- |
| Star Schema | One central fact table, denormalised dimension tables directly connected | Fast – fewer joins | Low – easy to understand and maintain | Most BI and reporting workloads; recommended starting point |
| Snowflake Schema | Fact table + normalised dimension tables (dimensions split into sub-dimensions) | Slower – more joins required | Medium – more tables, cleaner data model | Storage-sensitive environments; complex dimension hierarchies |
| Galaxy / Constellation | Multiple fact tables sharing dimension tables | Variable – depends on query | High – complex to maintain | Enterprise warehouses with multiple business processes |
For most teams building their first warehouse: start with Star Schema. It’s easier to query, easier to explain to stakeholders, and easier to refactor later. Snowflake adds storage efficiency at the cost of query complexity – a trade-off that rarely makes sense until you’re managing hundreds of millions of rows.
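To make the Star Schema shape concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table names (fact_sales, dim_date, dim_product), the columns, and the sample rows are all hypothetical, chosen only for illustration. The point is that a typical report needs just one join per dimension.

```python
import sqlite3

def build_star_schema(conn):
    """Create one fact table and two denormalised dimensions (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key  INTEGER PRIMARY KEY,
            full_date TEXT,
            month     TEXT,
            year      INTEGER
        );
        CREATE TABLE dim_product (
            product_key  INTEGER PRIMARY KEY,
            product_name TEXT,
            category     TEXT  -- kept inline rather than split into its own table
        );
        CREATE TABLE fact_sales (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity    INTEGER,
            revenue     REAL
        );
    """)
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
        (20240101, "2024-01-01", "2024-01", 2024),
        (20240201, "2024-02-01", "2024-02", 2024),
    ])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
        (1, "Laptop", "Hardware"),
        (2, "Mouse", "Hardware"),
        (3, "Antivirus", "Software"),
    ])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
        (20240101, 1, 2, 2000.0),
        (20240101, 3, 5, 250.0),
        (20240201, 2, 10, 200.0),
        (20240201, 3, 1, 50.0),
    ])

def revenue_by_category_and_month(conn):
    """One join per dimension: the fact table sits in the middle of the star."""
    return conn.execute("""
        SELECT p.category, d.month, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d    ON d.date_key = f.date_key
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """).fetchall()

star_conn = sqlite3.connect(":memory:")
build_star_schema(star_conn)
star_report = revenue_by_category_and_month(star_conn)
for row in star_report:
    print(row)
```

In a Snowflake version of the same model, category would move into its own table and the report would need a third join, which is the trade-off described above.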
The Three-Layer Architecture
A well-designed warehouse separates concerns across three layers. Collapsing these into one is one of the most common design mistakes:
- Staging Layer: Raw, unmodified copies of source data – the archive. Nothing transforms here. If something goes wrong downstream, you re-process from this layer
- Integration / Core Layer: Cleaned, transformed, and joined data. Business rules applied here. This is where Kimball’s dimensional model or Inmon’s normalised model lives
- Access / Presentation Layer: Aggregated, pre-joined views optimised for reporting tools. What your BI dashboards actually query. Rebuilding this layer doesn’t require touching the core
The separation matters because it gives you a clean re-run path when source systems change, and it keeps your reporting layer fast without polluting the integration logic.
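The three layers can be sketched in a few lines, again using Python's sqlite3 as a stand-in warehouse. Everything named here is an assumption for illustration: the stg_, core_, and rpt_ prefixes, the sample orders, and the cleaning rules (trim and lowercase names, drop non-numeric amounts, remove duplicate deliveries). What matters is that staging is never modified and each downstream layer can be rebuilt from the one before it.

```python
import sqlite3

layer_conn = sqlite3.connect(":memory:")

# Staging: raw copies, stored as text exactly as the source delivered them.
layer_conn.execute(
    "CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount TEXT, order_date TEXT)"
)
layer_conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", [
    ("1001", " alice ", "120.50", "2024-03-01"),
    ("1002", "BOB", "80", "2024-03-01"),
    ("1002", "BOB", "80", "2024-03-01"),     # duplicate delivery from the source
    ("1003", "Alice", "n/a", "2024-03-02"),  # unusable amount
])

def rebuild_core(conn):
    """Core: apply the (assumed) business rules, reading only from staging."""
    conn.executescript("""
        DROP TABLE IF EXISTS core_orders;
        CREATE TABLE core_orders AS
        SELECT DISTINCT
            CAST(order_id AS INTEGER) AS order_id,
            LOWER(TRIM(customer))     AS customer,
            CAST(amount AS REAL)      AS amount,
            order_date
        FROM stg_orders
        WHERE amount GLOB '[0-9]*';  -- keep rows whose amount starts with a digit
    """)

def rebuild_presentation(conn):
    """Presentation: pre-aggregated table for dashboards, reading only from core."""
    conn.executescript("""
        DROP TABLE IF EXISTS rpt_daily_revenue;
        CREATE TABLE rpt_daily_revenue AS
        SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM core_orders
        GROUP BY order_date;
    """)

# Running the rebuild twice shows the re-run path is safe: staging is untouched.
for _ in range(2):
    rebuild_core(layer_conn)
    rebuild_presentation(layer_conn)

daily_revenue = layer_conn.execute(
    "SELECT order_date, orders, revenue FROM rpt_daily_revenue ORDER BY order_date"
).fetchall()
staging_rows = layer_conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
print(daily_revenue, staging_rows)
```

If a business rule changes, only rebuild_core needs editing; the raw rows in staging are still there to re-process.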
Kimball vs. Inmon: The Two Design Philosophies
| Factor | Kimball (Bottom-Up) | Inmon (Top-Down) |
| --- | --- | --- |
| Philosophy | Build dimensional data marts first; enterprise view emerges from marts | Build enterprise data model first; data marts are subsets |
| Design Direction | Business process → dimensional model → data mart | Enterprise model → ETL → data mart |
| Time to First Value | Faster – first mart can deliver in weeks | Slower – requires enterprise model upfront |
| Best For | Agile teams, departmental projects, faster ROI needed | Large enterprises, regulated industries, long-term consistency |
| Key Strength | Pragmatic, business-aligned, fast delivery | Single source of truth, highly consistent |
| Key Weakness | Integration across marts can be complex later | Upfront investment is significant; slow to first delivery |
Most modern data teams lean Kimball – the faster time-to-value and business-process alignment fit the way analytics teams actually operate. Inmon is more common in financial services and healthcare, where data governance and consistency across the enterprise justify the upfront cost.
Slowly Changing Dimensions (SCD): Types 1, 2, and 3
SCDs handle the problem of dimension data that changes over time – a customer changes their address, an employee changes department, a product changes category. How you record that change matters:
- Type 1 – Overwrite: Simply update the record. No history kept. Use when history genuinely doesn’t matter (e.g. correcting a typo)
- Type 2 – Add New Row: Keep the old record, insert a new one with a validity date range. Preserves full history. The most common choice for anything where historical accuracy matters – customer addresses, product pricing
- Type 3 – Add Column: Add a “previous value” column. Tracks one level of change only. Rarely used – too limited for most real scenarios
In practice: default to Type 2 for any dimension where you’ll want to ask “what was the value at the time of this transaction?” That covers most business-critical dimensions.
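A Type 2 dimension can be sketched as follows, using Python's sqlite3. The table layout is one common convention rather than the only one, and every name here is an assumption for illustration: the dim_customer table, the valid_from and valid_to columns, the "9999-12-31" sentinel for the current row, and the helpers apply_scd2 and city_as_of. Validity ranges are treated as half-open, so a row is valid from valid_from up to but not including valid_to.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel meaning "this row is still current"

scd_conn = sqlite3.connect(":memory:")
scd_conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT,  -- business key from the source system
        city         TEXT,
        valid_from   TEXT,  -- ISO dates, so string comparison orders correctly
        valid_to     TEXT
    )
""")

def apply_scd2(conn, customer_id, city, change_date):
    """Close the current row and insert a new one if the tracked value changed."""
    current = conn.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND valid_to = ?",
        (customer_id, OPEN_END),
    ).fetchone()
    if current and current[1] == city:
        return  # value unchanged: no new version needed
    if current:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ? WHERE customer_key = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to) "
        "VALUES (?, ?, ?, ?)",
        (customer_id, city, change_date, OPEN_END),
    )

def city_as_of(conn, customer_id, as_of_date):
    """Answer 'what was the value at the time of this transaction?'"""
    row = conn.execute(
        "SELECT city FROM dim_customer "
        "WHERE customer_id = ? AND valid_from <= ? AND ? < valid_to",
        (customer_id, as_of_date, as_of_date),
    ).fetchone()
    return row[0] if row else None

apply_scd2(scd_conn, "C1", "Leeds", "2023-01-01")
apply_scd2(scd_conn, "C1", "Leeds", "2023-06-01")    # no change, no new row
apply_scd2(scd_conn, "C1", "Bristol", "2024-02-15")  # address change

print(city_as_of(scd_conn, "C1", "2023-09-10"))  # Leeds
print(city_as_of(scd_conn, "C1", "2024-03-01"))  # Bristol
```

A Type 1 change would instead be a plain UPDATE of city on the existing row, which is why it cannot answer the point-in-time question.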
Cloud Data Warehouses: How Design Adapts
Modern cloud warehouses – Snowflake, BigQuery, Amazon Redshift, Azure Synapse – change some traditional design constraints:
- Separation of compute and storage means you don’t pre-optimise for storage the way on-premise design required
- Columnar storage makes wide denormalised tables (Star Schema) even faster – less reason to normalise for performance
- Serverless options (BigQuery, Amazon Athena) remove the cluster sizing problem entirely for many teams
- ELT instead of ETL: load raw data first, transform inside the warehouse using dbt or similar – the staging layer becomes even more important
5 Common Design Mistakes to Avoid
- Building the semantic layer before the data model is stable – dashboards built on shifting foundations need constant rework
- No staging layer – loading transformed data directly with no raw copy makes re-processing from source impossible
- Treating every data source as its own mart – no shared dimensions means no consistent metrics across the business
- Over-normalising in a cloud warehouse – in modern columnar systems the cost of extra joins usually outweighs the storage savings
- Designing for today’s questions only – a warehouse that can’t accommodate new dimensions without structural rework will become the bottleneck within 18 months
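The shared-dimension point in the list above can be illustrated with a small sketch, again in Python's sqlite3. Two fact tables from different business processes reference the same customer dimension, so "region" means the same thing in both reports. The tables, columns, and sample data are hypothetical. Each fact table is aggregated separately before the results are combined, which avoids double counting when the two facts have different numbers of rows per customer.

```python
import sqlite3

shared_conn = sqlite3.connect(":memory:")
shared_conn.executescript("""
    -- One dimension shared by both business processes (hypothetical data).
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, revenue REAL);
    CREATE TABLE fact_support_tickets (customer_key INTEGER, tickets INTEGER);

    INSERT INTO dim_customer VALUES
        (1, 'Acme', 'North'), (2, 'Globex', 'South'), (3, 'Initech', 'North');
    INSERT INTO fact_sales VALUES (1, 500.0), (2, 300.0), (3, 200.0), (1, 100.0);
    INSERT INTO fact_support_tickets VALUES (1, 2), (3, 1), (3, 4), (2, 1);
""")

# Aggregate each fact table to the shared grain (region) first, then join.
region_report = shared_conn.execute("""
    WITH sales_by_region AS (
        SELECT c.region, SUM(f.revenue) AS revenue
        FROM fact_sales f
        JOIN dim_customer c ON c.customer_key = f.customer_key
        GROUP BY c.region
    ),
    tickets_by_region AS (
        SELECT c.region, SUM(f.tickets) AS tickets
        FROM fact_support_tickets f
        JOIN dim_customer c ON c.customer_key = f.customer_key
        GROUP BY c.region
    )
    SELECT s.region, s.revenue, t.tickets
    FROM sales_by_region s
    JOIN tickets_by_region t ON t.region = s.region
    ORDER BY s.region
""").fetchall()

for row in region_report:
    print(row)
```

If sales and support each kept their own customer table with their own region definitions, the two halves of this report could not be lined up reliably.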
Final Thought
Good data warehouse design is invisible. When it’s working, analysts just notice that their reports run fast, the numbers are consistent across departments, and adding a new data source doesn’t require a two-week project. That invisibility is the goal. Most of the visible problems in analytics – conflicting metrics, slow dashboards, failed pipeline runs – trace back to architectural decisions made early that nobody questioned at the time.
