Thoughts on some data
Don't remove NULL columns or bad data at the source. Instead, learn to handle them in Spark.
Sample 1
dim_nonusecode
id
code
code_name
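The dim_nonusecode dimension above (id, code, code_name) can be sketched as a small lookup table. A minimal sqlite sketch; the sample codes and names are made up, since the source doesn't list any values:

```python
import sqlite3

# Build the dim_nonusecode dimension described above:
# a surrogate key plus the raw code and its readable name.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_nonusecode (
        id        INTEGER PRIMARY KEY,  -- surrogate key
        code      TEXT NOT NULL,        -- raw code from the source
        code_name TEXT                  -- human-readable description
    )
""")
# Sample rows; codes/names are hypothetical examples.
conn.executemany(
    "INSERT INTO dim_nonusecode (id, code, code_name) VALUES (?, ?, ?)",
    [(1, "NU01", "Seasonal closure"), (2, "NU02", "Under repair")],
)
rows = conn.execute(
    "SELECT code, code_name FROM dim_nonusecode ORDER BY id"
).fetchall()
print(rows)
```

Fact tables then store only the id, never the code text itself.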
Sample 2
Another Variation
Notes:
DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.
Ease of Update: If you need to update a location's details, you do it in one place.
Flexibility: Easier to add more attributes to locations in the future.
5 decimal places: accurate to ~1.1 meters; good enough for most applications, including vehicle navigation.
4 decimal places: accurate to ~11 meters; may suit some applications, but not vehicle-level precision.
3 decimal places: accurate to ~111 meters; generally too coarse for vehicle navigation, but may be fine for city-level analytics.
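The notes above combine in one pattern: round coordinates to the chosen precision, then store each unique lat-long pair once in a location dimension. A minimal Python sketch, assuming 5 decimal places (~1.1 m) and made-up coordinates:

```python
# Dedupe rounded lat-long pairs into a location dimension:
# (lat, long) -> surrogate location_id, stored once (DRY).
locations = {}

def location_key(lat, lon, places=5):
    """Round to `places` decimals and return the location's surrogate key."""
    key = (round(lat, places), round(lon, places))
    if key not in locations:
        locations[key] = len(locations) + 1
    return locations[key]

# Two raw readings of the same spot differ only past the 5th decimal,
# so both map to the same dimension row.
a = location_key(12.971598, 77.594561)
b = location_key(12.971601, 77.594563)
print(a, b, len(locations))
```

Updating a location's details then touches exactly one row, as the "Ease of Update" note says.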
Sample 3
Create a single Date Dimension and map all these dates to that table.
See the reference diagram in the canvas.
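The single shared date dimension can be generated once, one row per calendar day; every date column in every fact table then joins to it. A sketch with stdlib datetime; the column names (date_key, full_date, year, month, day, weekday) are typical choices, not from the source:

```python
from datetime import date, timedelta

def build_date_dim(start: date, end: date):
    """One row per day from start to end, inclusive."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # e.g. 20240101
            "full_date": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "weekday": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows

dim = build_date_dim(date(2024, 1, 1), date(2024, 1, 3))
print([r["date_key"] for r in dim])
```

Facts carry only the integer date_key, so all dates map to this one table.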
Sample 4
Another DateTime
You can create a date-time dimension like this:
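One common way to build this (an assumption, since the source shows no detail) is to split each timestamp into a date key plus a separate time-of-day key, so the time dimension stays small instead of having one row per second. A sketch at minute grain:

```python
from datetime import datetime

def datetime_keys(ts: datetime):
    """Split a timestamp into (date_key, time_of_day_key) at minute grain."""
    date_key = int(ts.strftime("%Y%m%d"))   # joins to the date dimension
    time_key = ts.hour * 100 + ts.minute    # e.g. 09:30 -> 930
    return date_key, time_key

dk, tk = datetime_keys(datetime(2024, 3, 15, 9, 30))
print(dk, tk)
```

The time dimension then needs at most 1,440 rows (one per minute of the day), regardless of how many years the date dimension covers.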
Sample 5
The dimensions will look like this:
Experience Dimension Table
Salary Range Dimension Table
Age Group Dimension Table
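All three tables above are banded dimensions: each row defines a range plus a label, and raw values are mapped to the matching band's key. A sketch of the Age Group case; the band boundaries are made-up examples, and the same shape works for salary ranges and experience:

```python
# Banded "Age Group" dimension: (surrogate key, low, high, label).
# Boundaries are hypothetical; pick bands that suit the analysis.
age_groups = [
    (1, 0, 17, "Under 18"),
    (2, 18, 34, "18-34"),
    (3, 35, 54, "35-54"),
    (4, 55, 200, "55+"),
]

def age_group_key(age: int) -> int:
    """Map a raw age to its age-group surrogate key."""
    for key, lo, hi, _label in age_groups:
        if lo <= age <= hi:
            return key
    raise ValueError(f"age out of range: {age}")

print(age_group_key(42))
```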
Sample 6
The dimension will turn out like this:
Sample 7
Let's create the dimension table first. There are two approaches:
Creating a many-to-many relationship
Pros
Normalization: Easier to update and maintain data.
Cons
Storage: May require more storage for the additional tables and keys.
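The many-to-many approach above uses a bridge table between the fact and the dimension. A minimal sqlite sketch; the table and column names (fact_event, dim_tag, bridge_event_tag) are assumptions for illustration:

```python
import sqlite3

# Bridge table: each fact row can link to any number of dimension rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_event (event_id INTEGER PRIMARY KEY);
    CREATE TABLE dim_tag    (tag_id   INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE bridge_event_tag (
        event_id INTEGER REFERENCES fact_event(event_id),
        tag_id   INTEGER REFERENCES dim_tag(tag_id),
        PRIMARY KEY (event_id, tag_id)
    );
    INSERT INTO fact_event VALUES (1);
    INSERT INTO dim_tag VALUES (10, 'urgent'), (20, 'billing');
    INSERT INTO bridge_event_tag VALUES (1, 10), (1, 20);
""")
# Resolving one fact's dimension members costs two joins.
tags = [t for (t,) in conn.execute("""
    SELECT d.tag
    FROM fact_event f
    JOIN bridge_event_tag b ON b.event_id = f.event_id
    JOIN dim_tag d          ON d.tag_id   = b.tag_id
    ORDER BY d.tag
""")]
print(tags)
```

The extra bridge table and keys are exactly the storage cost the "Cons" note refers to.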
Fact Table with Array Datatypes
With newer systems like Spark SQL, which support array types, you can store the values directly in the fact table.
Simplicity: Easier to understand and less complex to set up.
Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.
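The array approach keeps the values inline on the fact row, the way a Spark SQL ARRAY&lt;STRING&gt; column would. A plain-Python sketch, with lists standing in for the array column and unpack() mimicking what Spark's explode() does when a query does need the individual elements:

```python
# Fact rows carry their values inline, emulating an array-typed column.
fact_events = [
    {"event_id": 1, "tags": ["urgent", "billing"]},
    {"event_id": 2, "tags": ["billing"]},
]

def unpack(rows, array_col):
    """Yield one (event_id, element) pair per array element,
    like Spark SQL's explode() on an array column."""
    for row in rows:
        for item in row[array_col]:
            yield row["event_id"], item

# Queries that don't need individual elements skip this step entirely,
# which is where the performance win comes from.
pairs = list(unpack(fact_events, "tags"))
print(pairs)
```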