Don't remove NULL columns or Bad data from the Source. Let's learn and handle that in Spark.
Sample 1
18 - In Lieu Of Foreclosure
dim_nonusecode
id
code
code_name
Sample 2
TripID
Source Lat
Souce Long
Des Lat
Des Long
location_name (If exists)
Another Variation
POINT (-72.98492 41.64753)
POINT (-72.96445 41.25722)
Notes:
DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.
Ease of Update: If you need to update a location's details, you do it in one place.
Flexibility: Easier to add more attributes to locations in the future.
5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.
4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.
3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.
Sample 3
Fiscal Year
disbursement Date
Vendor Invoice Date
Vendor Invoice Week
Check Clearance Date
Create a single Date Dimension and map all these dates to that table.
See the reference diagram given in canvas
Sample 4
Another DateTime
date_attending
ip_location
Reseda, CA, United States
Los Angeles, CA, United States
Mission Viejo, CA, United States
you can create a date time dimension like this
DateTimeID
FullDateTime
Year
Month
Day
Hour
Minute
Weekday
IsWeekend
IsHoliday
Sample 5
Job Title
Experience
Qualifications
Salary Range
Age_Group
Digital Marketing Specialist
Dimension will look like this
Experience Dimension Table
ExperienceID
ExperienceRange
MinExperience
MaxExperience
Salary Range Dimension Table
SalaryID
SalaryRange
MinSalary
MaxSalary
Age Group Dimension Table
AgeGroupID
AgeGroupLabel
AgeGroupRange
MinAge
MaxAge
Sample 6
ITEM CODE
ITEM DESCRIPTION
SANTORINI GAVALA WHITE - 750ML
CORTENOVA VENETO P/GRIG - 750ML
SANTA MARGHERITA P/GRIG ALTO - 375ML
The Dimension will turn out to be like this
ItemID
ItemCode
ItemDescription
Quantity
SANTA MARGHERITA P/GRIG ALTO
Sample 7
production_countries
spoken_languages
United Kingdom, United States of America
English, French, Japanese, Swahili
United Kingdom, United States of America
United Kingdom, United States of America
Let's create the Dimension table first
There are two approaches
Creating Many to Many
FactID
CountryID
LanguageID
Pros
Normalization: Easier to update and maintain data.
Cons
Storage: May require more storage for the additional tables and keys.
Fact Table with Array Datatypes
FactID
ProductionCountryIDs
SpokenLanguageIDs
With newer Systems like Spark SQL which supports Arrays
Simplicity: Easier to understand and less complex to set up.
Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.