Thoughts on some data
Don't remove NULL columns or Bad data from the Source. Let's learn and handle that in Spark.
Sample 1
20093
2020
10 - A Will
200192
2020
14 - Foreclosure
190871
2019
18 - In Lieu Of Foreclosure
dim_nonusecode
id
code
code_name
Sample 2
1
-73.9903717
40.73469543
-73.98184204
40.73240662
2
-73.98078156
40.7299118
-73.94447327
40.71667862
3
-73.98455048
40.67956543
-73.95027161
40.78892517
location_id
lat
long
location_name (If exists)
trip_id
source_location_id
destination_location_id
Another Variation
POINT (-72.98492 41.64753)
POINT (-72.96445 41.25722)
Notes:
DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.
Ease of Update: If you need to update a location's details, you do it in one place.
Flexibility: Easier to add more attributes to locations in the future.
5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.
4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.
3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.
Sample 3
2023
06-Oct-23
08-Aug-23
08-06-2023
2023
06-Oct-23
16-Aug-23
08/13/2023
2023
06-Oct-23
22-Sep-23
09/17/2023
10-08-2023
Create a single Date Dimension and map all these dates to that table.
See the reference diagram given in canvas
Sample 4
Another DateTime
2017-12-23 12:00:00
Reseda, CA, United States
2017-12-23 12:00:00
Los Angeles, CA, United States
2018-01-05 14:00:00
Mission Viejo, CA, United States
you can create a date time dimension like this
1
2017-12-23 12:00:00
2017
12
23
12
0
6
False
False
2
2018-01-05 14:00:00
2018
1
5
14
0
5
False
False
Sample 5
Digital Marketing Specialist
5 to 15 Years
M.Tech
$59K-$99K
Youth (<25)
Web Developer
2 to 12 Years
BCA
$56K-$116K
Adults (35-64)
Operations Manager
0 to 12 Years
PhD
$61K-$104K
Young Adults (25-34)
Network Engineer
4 to 11 Years
PhD
$65K-$91K
Young Adults (25-34)
Event Manager
1 to 12 Years
MBA
$64K-$87K
Adults (35-64)
Dimension will look like this
Experience Dimension Table
1
5 to 15 Years
5
15
2
2 to 12 Years
2
12
3
0 to 12 Years
0
12
4
4 to 11 Years
4
11
5
1 to 12 Years
1
12
Salary Range Dimension Table
1
$59K-$99K
$59K
$99K
2
$56K-$116K
$56K
$116K
3
$61K-$104K
$61K
$104K
4
$65K-$91K
$65K
$91K
5
$64K-$87K
$64K
$87K
Age Group Dimension Table
1
Youth (<25)
<25
NULL
24
2
Adults (35-64)
35-64
35
64
3
Young Adults (25-34)
25-34
25
34
Sample 6
100293
SANTORINI GAVALA WHITE - 750ML
100641
CORTENOVA VENETO P/GRIG - 750ML
100749
SANTA MARGHERITA P/GRIG ALTO - 375ML
The Dimension will turn out to be like this
1
100293
SANTORINI GAVALA WHITE
750ML
2
100641
CORTENOVA VENETO P/GRIG
750ML
3
100749
SANTA MARGHERITA P/GRIG ALTO
375ML
Sample 7
United Kingdom, United States of America
English, French, Japanese, Swahili
United Kingdom, United States of America
English
United Kingdom, United States of America
English, Mandarin
Let's create the Dimension table first
1
United Kingdom
2
United States of America
1
English
2
French
3
Japanese
4
Swahili
5
Mandarin
There are two approaches
Creating Many to Many
1
1
1
1
2
1
1
1
2
1
2
2
Pros
Normalization: Easier to update and maintain data.
Cons
Storage: May require more storage for the additional tables and keys.
Fact Table with Array Datatypes
1
[1, 2]
[1, 2, 3, 4]
2
[1, 2]
[1]
3
[1, 2]
[1, 5]
With newer Systems like Spark SQL which supports Arrays
Simplicity: Easier to understand and less complex to set up.
Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.
Last updated