Thoughts on some data

Don't remove NULL columns or Bad data from the Source. Let's learn and handle that in Spark.

Sample 1

Serial Number

List Year

20093

2020

10 - A Will

200192

2020

14 - Foreclosure

190871

2019

18 - In Lieu Of Foreclosure

dim_nonusecode

code

code_name

Sample 2

TripID

Source Lat

Souce Long

Des Lat

Des Long

-73.9903717

40.73469543

-73.98184204

40.73240662

-73.98078156

40.7299118

-73.94447327

40.71667862

-73.98455048

40.67956543

-73.95027161

40.78892517

dim_location

location_id

lat

long

location_name (If exists)

fact

trip_id

source_location_id

destination_location_id

Another Variation

Location

POINT (-72.98492 41.64753)

POINT (-72.96445 41.25722)

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.

Sample 3

Fiscal Year

disbursement Date

Vendor Invoice Date

Vendor Invoice Week

Check Clearance Date

2023

06-Oct-23

08-Aug-23

08-06-2023

2023

06-Oct-23

16-Aug-23

08/13/2023

2023

06-Oct-23

22-Sep-23

09/17/2023

10-08-2023

Create a single Date Dimension and map all these dates to that table.

See the reference diagram given in canvas

Sample 4

Another DateTime

date_attending

ip_location

2017-12-23 12:00:00

Reseda, CA, United States

2017-12-23 12:00:00

Los Angeles, CA, United States

2018-01-05 14:00:00

Mission Viejo, CA, United States

you can create a date time dimension like this

DateTimeID

FullDateTime

Year

Month

Day

Hour

Minute

Weekday

IsWeekend

IsHoliday

2017-12-23 12:00:00

2017

False

2018-01-05 14:00:00

2018

False

Sample 5

Job Title

Experience

Qualifications

Salary Range

Age_Group

Digital Marketing Specialist

5 to 15 Years

M.Tech

$59K-$99K

Youth (<25)

Web Developer

2 to 12 Years

BCA

$56K-$116K

Adults (35-64)

Operations Manager

0 to 12 Years

PhD

$61K-$104K

Young Adults (25-34)

Network Engineer

4 to 11 Years

PhD

$65K-$91K

Young Adults (25-34)

Event Manager

1 to 12 Years

MBA

$64K-$87K

Adults (35-64)

Dimension will look like this

Experience Dimension Table

ExperienceID

ExperienceRange

MinExperience

MaxExperience

5 to 15 Years

2 to 12 Years

0 to 12 Years

4 to 11 Years

1 to 12 Years

Salary Range Dimension Table

SalaryID

SalaryRange

MinSalary

MaxSalary

$59K-$99K

$59K

$99K

$56K-$116K

$56K

$116K

$61K-$104K

$61K

$104K

$65K-$91K

$65K

$91K

$64K-$87K

$64K

$87K

Age Group Dimension Table

AgeGroupID

AgeGroupLabel

AgeGroupRange

MinAge

MaxAge

Youth (<25)

<25

NULL

Adults (35-64)

35-64

Young Adults (25-34)

25-34

Sample 6

ITEM CODE

ITEM DESCRIPTION

100293

SANTORINI GAVALA WHITE - 750ML

100641

CORTENOVA VENETO P/GRIG - 750ML

100749

SANTA MARGHERITA P/GRIG ALTO - 375ML

The Dimension will turn out to be like this

ItemID

ItemCode

ItemDescription

Quantity

100293

SANTORINI GAVALA WHITE

750ML

100641

CORTENOVA VENETO P/GRIG

750ML

100749

SANTA MARGHERITA P/GRIG ALTO

375ML

Sample 7

production_countries

spoken_languages

United Kingdom, United States of America

English, French, Japanese, Swahili

United Kingdom, United States of America

English

United Kingdom, United States of America

English, Mandarin

Let's create the Dimension table first

CountryID

CountryName

United Kingdom

United States of America

LanguageID

LanguageName

English

French

Japanese

Swahili

Mandarin

There are two approaches

Creating Many to Many

FactID

CountryID

LanguageID

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.

Fact Table with Array Datatypes

FactID

ProductionCountryIDs

SpokenLanguageIDs

[1, 2]

[1, 2, 3, 4]

[1, 2]

[1]

[1, 2]

[1, 5]

With newer Systems like Spark SQL which supports Arrays

Simplicity: Easier to understand and less complex to set up.
Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.

PreviousDataset Examples NextData Model

Last updated 1 year ago