Thoughts on some data

Don't remove NULL columns or Bad data from the Source. Let's learn and handle that in Spark.

Sample 1

Serial Number
List Year

20093

2020

10 - A Will

200192

2020

14 - Foreclosure

190871

2019

18 - In Lieu Of Foreclosure

dim_nonusecode

id

code

code_name

Sample 2

TripID
Source Lat
Souce Long
Des Lat
Des Long

1

-73.9903717

40.73469543

-73.98184204

40.73240662

2

-73.98078156

40.7299118

-73.94447327

40.71667862

3

-73.98455048

40.67956543

-73.95027161

40.78892517

dim_location

location_id

lat

long

location_name (If exists)

fact

trip_id

source_location_id

destination_location_id

Another Variation

Location

POINT (-72.98492 41.64753)

POINT (-72.96445 41.25722)

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.

Sample 3

Fiscal Year
disbursement Date
Vendor Invoice Date
Vendor Invoice Week
Check Clearance Date

2023

06-Oct-23

08-Aug-23

08-06-2023

2023

06-Oct-23

16-Aug-23

08/13/2023

2023

06-Oct-23

22-Sep-23

09/17/2023

10-08-2023

Create a single Date Dimension and map all these dates to that table.

See the reference diagram given in canvas

Sample 4

Another DateTime

date_attending
ip_location

2017-12-23 12:00:00

Reseda, CA, United States

2017-12-23 12:00:00

Los Angeles, CA, United States

2018-01-05 14:00:00

Mission Viejo, CA, United States

you can create a date time dimension like this

DateTimeID
FullDateTime
Year
Month
Day
Hour
Minute
Weekday
IsWeekend
IsHoliday

1

2017-12-23 12:00:00

2017

12

23

12

0

6

False

False

2

2018-01-05 14:00:00

2018

1

5

14

0

5

False

False

Sample 5

Job Title
Experience
Qualifications
Salary Range
Age_Group

Digital Marketing Specialist

5 to 15 Years

M.Tech

$59K-$99K

Youth (<25)

Web Developer

2 to 12 Years

BCA

$56K-$116K

Adults (35-64)

Operations Manager

0 to 12 Years

PhD

$61K-$104K

Young Adults (25-34)

Network Engineer

4 to 11 Years

PhD

$65K-$91K

Young Adults (25-34)

Event Manager

1 to 12 Years

MBA

$64K-$87K

Adults (35-64)

Dimension will look like this

Experience Dimension Table

ExperienceID
ExperienceRange
MinExperience
MaxExperience

1

5 to 15 Years

5

15

2

2 to 12 Years

2

12

3

0 to 12 Years

0

12

4

4 to 11 Years

4

11

5

1 to 12 Years

1

12

Salary Range Dimension Table

SalaryID
SalaryRange
MinSalary
MaxSalary

1

$59K-$99K

$59K

$99K

2

$56K-$116K

$56K

$116K

3

$61K-$104K

$61K

$104K

4

$65K-$91K

$65K

$91K

5

$64K-$87K

$64K

$87K

Age Group Dimension Table

AgeGroupID
AgeGroupLabel
AgeGroupRange
MinAge
MaxAge

1

Youth (<25)

<25

NULL

24

2

Adults (35-64)

35-64

35

64

3

Young Adults (25-34)

25-34

25

34

Sample 6

ITEM CODE
ITEM DESCRIPTION

100293

SANTORINI GAVALA WHITE - 750ML

100641

CORTENOVA VENETO P/GRIG - 750ML

100749

SANTA MARGHERITA P/GRIG ALTO - 375ML

The Dimension will turn out to be like this

ItemID
ItemCode
ItemDescription
Quantity

1

100293

SANTORINI GAVALA WHITE

750ML

2

100641

CORTENOVA VENETO P/GRIG

750ML

3

100749

SANTA MARGHERITA P/GRIG ALTO

375ML

Sample 7

production_countries
spoken_languages

United Kingdom, United States of America

English, French, Japanese, Swahili

United Kingdom, United States of America

English

United Kingdom, United States of America

English, Mandarin

Let's create the Dimension table first

CountryID
CountryName

1

United Kingdom

2

United States of America

LanguageID
LanguageName

1

English

2

French

3

Japanese

4

Swahili

5

Mandarin

There are two approaches

Creating Many to Many

FactID
CountryID
LanguageID

1

1

1

1

2

1

1

1

2

1

2

2

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.

Fact Table with Array Datatypes

FactID
ProductionCountryIDs
SpokenLanguageIDs

1

[1, 2]

[1, 2, 3, 4]

2

[1, 2]

[1]

3

[1, 2]

[1, 5]

With newer Systems like Spark SQL which supports Arrays

  • Simplicity: Easier to understand and less complex to set up.

  • Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.

Last updated