
File Formats

CSV/TSV

Pros

  • Tabular, row-based storage.

  • Human-readable and easy to edit manually.

  • Simple schema.

  • Easy to implement and parse.

Cons

  • No standard way to represent binary data.

  • No complex data types.

  • Large in size (plain text with no built-in compression).
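
The "simple schema" and "easy to parse" points above can be made concrete with a minimal PySpark sketch for reading a CSV or TSV file. The path, column names, and types below are illustrative assumptions; declaring the schema up front avoids a full inference pass over a large file.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Hypothetical schema for an orders extract; declaring it explicitly skips
# schema inference and flags malformed rows early.
schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

orders = (
    spark.read
    .option("header", "true")   # first row holds column names
    .option("sep", ",")         # use "\t" for TSV
    .schema(schema)
    .csv("/data/raw/orders.csv")
)
orders.show(5)
```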

JSON

Pros

  • Supports hierarchical structure.

  • Most languages support them.

  • Widely used on the web.

Cons

  • Larger size and higher memory usage because field names are repeated in every record.

  • Not very splittable.

  • Lacks indexing.
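
A similar sketch for JSON, assuming newline-delimited records and a hypothetical nested user field: Spark infers the hierarchical schema, and nested fields are addressed with dot notation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("json-example").getOrCreate()

# Spark expects newline-delimited JSON by default; add
# .option("multiLine", "true") for a single large JSON array.
events = spark.read.json("/data/raw/events.json")

# Pull fields out of the hierarchical structure with dot notation.
events.select(col("user.id").alias("user_id"), col("event_type")).show(5)
```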

AVRO

Avro is a row-based data serialization system often used in big data processing and storage frameworks like Apache Hadoop.

Typical use cases for Avro files include RPC, message queues, log files, and data pipelines.
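
A short sketch of Avro in a Spark job, assuming the external spark-avro package is on the classpath; the paths and package coordinates are illustrative.

```python
from pyspark.sql import SparkSession

# spark-avro is not bundled by default; it is typically added via
# --packages org.apache.spark:spark-avro_2.12:<spark version>.
spark = SparkSession.builder.appName("avro-example").getOrCreate()

events = spark.read.format("avro").load("/data/raw/events.avro")
events.write.format("avro").mode("overwrite").save("/data/staged/events_avro")
```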

Parquet

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Twitter and Cloudera developed it to provide a compact and efficient way of storing large, flat datasets.

Best suited for WORM (Write Once, Read Many) workloads.

The key features of Parquet are:

  1. Columnar Storage: Unlike row-based formats such as CSV or TSV, Parquet stores data column by column. Values of the same type sit together, so they compress and encode efficiently, and queries that touch only a few columns read only those columns, which makes it a good fit for storing data frames.

  2. Schema Evolution: Parquet supports complex nested data structures, and the schema can be modified over time. This provides much flexibility when dealing with data that may evolve.

  3. Compression and Encoding: Parquet allows for highly efficient compression and encoding schemes. Because each column holds values of a single type, techniques such as dictionary and run-length encoding work well, which can lead to significant storage savings.

  4. Language Agnostic: Parquet is built from the ground up for use in many languages. Official libraries are available for reading and writing Parquet files in many languages, including Java, C++, Python, and more.

  5. Integration: Parquet is designed to integrate well with various big data frameworks. It has deep support in Apache Hadoop, Apache Spark, and Apache Hive and works well with other data processing frameworks.

In short, Parquet is a powerful tool in the big data ecosystem due to its efficiency, flexibility, and compatibility with a wide range of tools and languages.
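
The sketch below shows the typical write-once-read-many round trip in PySpark with hypothetical store/sales columns: write Parquet partitioned by a commonly filtered column, then read back only the columns and partitions a query needs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# Tiny hypothetical sales dataset; in practice this would be a large DataFrame.
sales = spark.createDataFrame(
    [("s1", 12.50, "2024-01-01"), ("s2", 8.75, "2024-01-02")],
    ["store_id", "amount", "sale_date"],
)

# Write once, partitioned by a column queries commonly filter on.
sales.write.mode("overwrite").partitionBy("sale_date").parquet("/tmp/sales_parquet")

# Read many: because Parquet is columnar, selecting two columns and filtering
# on the partition column reads only the needed column chunks and directories.
(
    spark.read.parquet("/tmp/sales_parquet")
    .where("sale_date = '2024-01-01'")
    .select("store_id", "amount")
    .show()
)
```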

Difference between CSV and Parquet

| Dataset | Size | Query Run Time | Data Scanned | Cost |
| --- | --- | --- | --- | --- |
| CSV | 1 TB | 236 seconds | 1 TB | $5.75 |
| Parquet | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less space | 34x faster | 99% less data scanned | 99.7% savings |
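
Savings like these are usually realized with a one-time conversion of the raw CSV extract into Parquet, along the lines of this sketch (paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# One-time conversion of a large raw CSV extract into Parquet, so that
# repeated analytical queries scan the compact columnar copy instead.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/big_extract.csv")
)
raw.write.mode("overwrite").parquet("/data/curated/big_extract_parquet")
```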

Parquet Compression

  • Snappy (default)

  • Gzip

Snappy

  • Low CPU utilization.

  • Lower compression ratio.

  • Splittable.

  • Use case: hot layer (frequently queried data).

  • Favors compute: prioritizes read/write speed over file size.

Gzip

  • High CPU utilization.

  • Higher compression ratio.

  • Splittable (within Parquet, compression is applied per column chunk).

  • Use case: cold layer (archival or infrequently read data).

  • Favors storage: trades CPU time for a smaller footprint.
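
In PySpark the codec is chosen per write with the compression option. A minimal sketch, reusing the hypothetical sales_parquet path from the earlier example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compression").getOrCreate()
df = spark.read.parquet("/tmp/sales_parquet")  # hypothetical input

# Hot layer: Snappy (Spark's default for Parquet) favors read/write speed.
df.write.mode("overwrite").option("compression", "snappy").parquet("/data/hot/sales")

# Cold layer: gzip burns more CPU but shrinks the files further.
df.write.mode("overwrite").option("compression", "gzip").parquet("/data/cold/sales")
```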
