Data Warehousing
  • Data Warehousing
  • Readme
  • Fundamentals
    • Terms to Know
    • Jobs
    • Skills needed for DW developer
    • Application Tiers
    • Operational Database
    • What is a Data Warehouse
      • Typical Data Architecture
      • Problem Statement
      • Features of Data Warehouse
      • Need for Data Warehouse
      • Current State of the Art
    • Activities of Data Science
    • Types of Data
    • Data Storage Systems
    • Data Warehouse 1980 - Current
    • Data Warehouse vs Data Mart
    • Data Warehouse Architecture
      • Top-Down Approach
      • Bottom-Up Approach
    • Data Warehouse Characteristic
      • Subject Oriented
      • Integrated
      • Time Variant
      • Non Volatile
    • Tools
    • Cloud vs On-Premise
    • Steps to design a Data Warehouse
      • Gather Requirements
      • Environment
      • Data Modeling
      • Choosing ETL / ELT Solution
      • Online Analytic Processing
      • Front End
      • Query Optimization
    • Dataset Examples
    • Thoughts on some data
  • RDBMS
    • Data Model
      • Entity Relationship Model
      • Attributes
      • Keys
      • Transaction
      • ACID
    • Online vs Batch
    • DSL vs GPL
    • Connect to Elvis
    • SQL Concepts
      • Basic Select - 1
      • Basic Select - 2
      • UNION Operators
      • Wild Cards & Distinct
      • Group By & Having
      • Sub Queries
      • Derived Tables
      • Views
    • Practice using SQLBolt
  • Cloud
    • Overview
    • Types of Cloud Services
    • Challenges of Cloud Computing
    • AWS
      • AWS Global Infrastructure
      • EC2
      • S3
      • IAM
    • Terraform
  • Spark - Databricks
    • Storage Formats
    • File Formats
    • Medallion Architecture
    • Delta
  • Data Warehousing Concepts
    • Dimensional Modelling
      • Star Schema
      • Galaxy Schema
      • Snowflake Schema
      • Starflake Schema
      • Star vs Snowflake
      • GRAIN
      • Multi-Fact Star Schema
      • Vertabelo Tool
    • Dimension - Fact
    • Sample Excercise
    • Keys
      • Why Surrogate Keys are Important
    • More Examples
    • Master Data Management
    • Steps of Dimensional Modeling
    • Types of Dimensions
      • Date Dimension Table
      • Degenerate Dimension
      • Junk Dimension
      • Static Dimension
      • Conformed Dimensions
      • Slowly Changing Dimensions
        • SCD - Type 0
        • SCD - Type 1
        • SCD - Type 2
        • SCD - Type 3
        • SCD - Type 4
        • SCD - Type 6
        • SCD - Type 5 - Fun Fact
      • Role Playing Dimension
      • Conformed vs Role Playing
      • Shrunken Dimension
      • Swappable Dimension
      • Step Dimension
    • Types of Facts
      • Factless Fact Table
      • Transaction Fact
      • Periodic Fact
      • Accumulating Snapshot Fact Table
      • Transaction vs Periodic vs Accumulating
      • Additive, Semi-Additive, Non-Additive
      • Periodic Snapshot vs Additive
      • Conformed Fact
    • Sample Data Architecture Diagram
    • Data Pipeline Models
    • New DW Concepts
Powered by GitBook
On this page
  1. Fundamentals

Thoughts on some data

Don't remove NULL columns or Bad data from the Source. Let's learn and handle that in Spark.

Sample 1

Serial Number
List Year

20093

2020

10 - A Will

200192

2020

14 - Foreclosure

190871

2019

18 - In Lieu Of Foreclosure

dim_nonusecode

id

code

code_name

Sample 2

TripID
Source Lat
Souce Long
Des Lat
Des Long

1

-73.9903717

40.73469543

-73.98184204

40.73240662

2

-73.98078156

40.7299118

-73.94447327

40.71667862

3

-73.98455048

40.67956543

-73.95027161

40.78892517

dim_location

location_id

lat

long

location_name (If exists)

fact

trip_id

source_location_id

destination_location_id

Another Variation

Location

POINT (-72.98492 41.64753)

POINT (-72.96445 41.25722)

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.

Sample 3

Fiscal Year
disbursement Date
Vendor Invoice Date
Vendor Invoice Week
Check Clearance Date

2023

06-Oct-23

08-Aug-23

08-06-2023

2023

06-Oct-23

16-Aug-23

08/13/2023

2023

06-Oct-23

22-Sep-23

09/17/2023

10-08-2023

Create a single Date Dimension and map all these dates to that table.

See the reference diagram given in canvas

Sample 4

Another DateTime

date_attending
ip_location

2017-12-23 12:00:00

Reseda, CA, United States

2017-12-23 12:00:00

Los Angeles, CA, United States

2018-01-05 14:00:00

Mission Viejo, CA, United States

you can create a date time dimension like this

DateTimeID
FullDateTime
Year
Month
Day
Hour
Minute
Weekday
IsWeekend
IsHoliday

1

2017-12-23 12:00:00

2017

12

23

12

0

6

False

False

2

2018-01-05 14:00:00

2018

1

5

14

0

5

False

False

Sample 5

Job Title
Experience
Qualifications
Salary Range
Age_Group

Digital Marketing Specialist

5 to 15 Years

M.Tech

$59K-$99K

Youth (<25)

Web Developer

2 to 12 Years

BCA

$56K-$116K

Adults (35-64)

Operations Manager

0 to 12 Years

PhD

$61K-$104K

Young Adults (25-34)

Network Engineer

4 to 11 Years

PhD

$65K-$91K

Young Adults (25-34)

Event Manager

1 to 12 Years

MBA

$64K-$87K

Adults (35-64)

Dimension will look like this

Experience Dimension Table

ExperienceID
ExperienceRange
MinExperience
MaxExperience

1

5 to 15 Years

5

15

2

2 to 12 Years

2

12

3

0 to 12 Years

0

12

4

4 to 11 Years

4

11

5

1 to 12 Years

1

12

Salary Range Dimension Table

SalaryID
SalaryRange
MinSalary
MaxSalary

1

$59K-$99K

$59K

$99K

2

$56K-$116K

$56K

$116K

3

$61K-$104K

$61K

$104K

4

$65K-$91K

$65K

$91K

5

$64K-$87K

$64K

$87K

Age Group Dimension Table

AgeGroupID
AgeGroupLabel
AgeGroupRange
MinAge
MaxAge

1

Youth (<25)

<25

NULL

24

2

Adults (35-64)

35-64

35

64

3

Young Adults (25-34)

25-34

25

34

Sample 6

ITEM CODE
ITEM DESCRIPTION

100293

SANTORINI GAVALA WHITE - 750ML

100641

CORTENOVA VENETO P/GRIG - 750ML

100749

SANTA MARGHERITA P/GRIG ALTO - 375ML

The Dimension will turn out to be like this

ItemID
ItemCode
ItemDescription
Quantity

1

100293

SANTORINI GAVALA WHITE

750ML

2

100641

CORTENOVA VENETO P/GRIG

750ML

3

100749

SANTA MARGHERITA P/GRIG ALTO

375ML

Sample 7

production_countries
spoken_languages

United Kingdom, United States of America

English, French, Japanese, Swahili

United Kingdom, United States of America

English

United Kingdom, United States of America

English, Mandarin

Let's create the Dimension table first

CountryID
CountryName

1

United Kingdom

2

United States of America

LanguageID
LanguageName

1

English

2

French

3

Japanese

4

Swahili

5

Mandarin

There are two approaches

Creating Many to Many

FactID
CountryID
LanguageID

1

1

1

1

2

1

1

1

2

1

2

2

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.

Fact Table with Array Datatypes

FactID
ProductionCountryIDs
SpokenLanguageIDs

1

[1, 2]

[1, 2, 3, 4]

2

[1, 2]

[1]

3

[1, 2]

[1, 5]

With newer Systems like Spark SQL which supports Arrays

  • Simplicity: Easier to understand and less complex to set up.

  • Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.

PreviousDataset ExamplesNextData Model

Last updated 1 year ago