Thoughts on some data

Don't remove NULL columns or bad data at the source. Let's learn to handle that in Spark.
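As a sketch of that idea, the cleaning rules can live in the pipeline instead of the source. The logic below is plain Python for illustration; in Spark the same rules map to `df.na.drop(subset=[...])` and `df.filter(...)`. The column names and sample rows are hypothetical.

```python
# Plain-Python sketch: keep bad rows out of the warehouse, not out of the source.
# In Spark the same logic maps to df.na.drop(subset=["code"]) and df.filter(...).
raw_rows = [
    {"id": 1, "code": "NU01", "code_name": "Not in use"},
    {"id": 2, "code": None,   "code_name": "Broken row"},   # NULL key
    {"id": 3, "code": "NU02", "code_name": ""},             # bad data
]

def is_clean(row):
    """Keep rows with a non-NULL code and a non-empty code_name."""
    return row["code"] is not None and bool(row["code_name"].strip())

clean = [r for r in raw_rows if is_clean(r)]
quarantine = [r for r in raw_rows if not is_clean(r)]  # keep these for auditing
```

Quarantining (rather than silently dropping) the bad rows preserves the ability to audit what the source sent.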

Sample 1

dim_nonusecode

  • id

  • code

  • code_name
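A minimal sketch of how such a dimension could be built: assign a surrogate `id` to each distinct code coming from the source. The codes and names below are made-up sample values.

```python
# Build dim_nonusecode rows (id, code, code_name) from distinct source codes.
# The codes and names here are hypothetical sample values.
source_codes = [("NU01", "Seasonal closure"), ("NU02", "Under repair"),
                ("NU01", "Seasonal closure")]  # duplicates in the source

seen = {}
dim_nonusecode = []
for code, code_name in source_codes:
    if code not in seen:
        seen[code] = len(seen) + 1          # surrogate key, 1-based
        dim_nonusecode.append({"id": seen[code], "code": code,
                               "code_name": code_name})
```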

Sample 2

Another Variation

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.
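The accuracies above follow from one degree of latitude being roughly 111 km, so each decimal place divides that by ten. A quick check:

```python
# One degree of latitude is ~111 km, so precision at d decimal places
# is roughly 111_000 / 10**d metres.
def lat_precision_m(decimals):
    return 111_000 / 10 ** decimals

# 5 decimals ≈ 1.11 m, 4 decimals ≈ 11.1 m, 3 decimals ≈ 111 m
precisions = {d: lat_precision_m(d) for d in (3, 4, 5)}
```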

Sample 3

Create a single Date Dimension and map all these dates to that table.

See the reference diagram given in the canvas.
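A minimal sketch of generating one shared date dimension that every date-valued fact column can key into; the column names and the `YYYYMMDD` key format are common conventions, not taken from the source.

```python
from datetime import date, timedelta

def build_dim_date(start, end):
    """One row per calendar day; date_key is YYYYMMDD, usable as a surrogate key."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": d.year * 10000 + d.month * 100 + d.day,
            "full_date": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "day_of_week": d.strftime("%A"),
            "is_weekend": d.weekday() >= 5,
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 1, 7))
```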

Sample 4

Another DateTime

You can create a date-time dimension like this:
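One way to sketch that: keep a separate time-of-day dimension at minute grain, so the date dimension stays small and the two keys together identify any timestamp. The key format and grain here are assumptions.

```python
def build_dim_time():
    """One row per minute of the day (1440 rows); time_key is HHMM."""
    rows = []
    for hour in range(24):
        for minute in range(60):
            rows.append({
                "time_key": hour * 100 + minute,
                "hour": hour,
                "minute": minute,
                "am_pm": "AM" if hour < 12 else "PM",
            })
    return rows

dim_time = build_dim_time()
```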

Sample 5

The dimension will look like this:

Experience Dimension Table

Salary Range Dimension Table

Age Group Dimension Table
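A sketch of how a fact row can be mapped into banded dimensions like these, using the age-group case; the band boundaries below are made-up examples, and the same lookup pattern applies to experience and salary ranges.

```python
# Hypothetical band boundaries for an age-group dimension.
dim_age_group = [
    {"age_group_key": 1, "label": "18-25", "min_age": 18, "max_age": 25},
    {"age_group_key": 2, "label": "26-40", "min_age": 26, "max_age": 40},
    {"age_group_key": 3, "label": "41+",   "min_age": 41, "max_age": 200},
]

def age_group_key(age):
    """Resolve an age to its banded dimension key (None if out of range)."""
    for row in dim_age_group:
        if row["min_age"] <= age <= row["max_age"]:
            return row["age_group_key"]
    return None
```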

Sample 6

The dimension will turn out like this:

Sample 7

Let's create the Dimension table first

There are two approaches

Creating a Many-to-Many Relationship

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.
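The many-to-many approach introduces a bridge table between the fact and the dimension. A plain-Python sketch of the join path (all table and key names are hypothetical):

```python
# Fact -> bridge -> dimension: the classic many-to-many resolution.
dim_tag    = {1: "urgent", 2: "refund", 3: "vip"}   # dimension: tag_id -> name
bridge     = [(101, 1), (101, 3), (102, 2)]          # (fact_id, tag_id) pairs
fact_sales = {101: 50.0, 102: 75.0}                  # fact: fact_id -> amount

def tags_for_fact(fact_id):
    """Walk the bridge table to collect all dimension values for one fact row."""
    return sorted(dim_tag[t] for f, t in bridge if f == fact_id)
```

The extra hop through `bridge` is exactly the storage and join cost noted above; in exchange, a tag's name lives in one place.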

Fact Table with Array Datatypes

Newer systems like Spark SQL support array datatypes:

  • Simplicity: Easier to understand and less complex to set up.

  • Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.
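With the array approach the keys live directly on the fact row. This sketch uses plain Python lists to stand in for a Spark `array<int>` column; the Spark SQL equivalent of the membership check would be `array_contains(tag_ids, 3)`. The table and column names are hypothetical.

```python
# Fact rows carry their dimension keys as an array column -- no bridge table.
dim_tag = {1: "urgent", 2: "refund", 3: "vip"}
facts = [
    {"fact_id": 101, "amount": 50.0, "tag_ids": [1, 3]},
    {"fact_id": 102, "amount": 75.0, "tag_ids": [2]},
]

# Membership query without unpacking the array: in Spark SQL this would be
# SELECT fact_id FROM facts WHERE array_contains(tag_ids, 3).
vip_facts = [f["fact_id"] for f in facts if 3 in f["tag_ids"]]
```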
