CSV/TSV/Parquet

CSV/TSV

CSV stands for Comma-Separated Values.

TSV stands for Tab-Separated Values.

Pros

Tabular, row-oriented storage.
Human-readable and easy to edit manually.
Simple schema.
Easy to implement and parse.

Cons

There is no standard way to represent binary data.
No complex data types; every value is stored as text.
Large in size, because the data is stored as uncompressed text.
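The trade-offs above can be seen with Python's standard csv module. This is a minimal sketch, not a recommended pipeline: the column names and sample values are made up for illustration, and base64 is only one ad hoc convention for binary data, since neither format defines one.

```python
import base64
import csv
import io

# Hypothetical sample rows: an integer, a string, and some raw bytes.
rows = [
    {"id": 1, "name": "alice", "avatar": b"\x00\x01\x02"},
    {"id": 2, "name": "bob", "avatar": b"\xff\xfe"},
]


def dump(rows, delimiter):
    """Serialize rows as delimited text: a comma gives CSV, a tab gives TSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "name", "avatar"], delimiter=delimiter
    )
    writer.writeheader()
    for row in rows:
        # No standard binary representation exists, so base64 is used by convention.
        encoded = base64.b64encode(row["avatar"]).decode("ascii")
        writer.writerow({"id": row["id"], "name": row["name"], "avatar": encoded})
    return buf.getvalue()


def load(text, delimiter):
    """Parse delimited text into a list of dicts; every value comes back as str."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_text = dump(rows, ",")
tsv_text = dump(rows, "\t")

parsed = load(tsv_text, "\t")

# Types are not preserved, so the reader has to convert them back explicitly.
restored_id = int(parsed[0]["id"])
restored_avatar = base64.b64decode(parsed[0]["avatar"])
```

The only difference between the two outputs is the delimiter. In both cases the reader must already know that id is an integer and that avatar is base64-encoded, because the file itself carries no type information.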

Parquet

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Twitter and Cloudera developed it to provide a compact and efficient way of storing large datasets.

Best suited to WORM (write once, read many) workloads.

The key features of Parquet are:

Columnar Storage: Parquet stores data column by column, unlike row-based formats such as CSV or TSV. This lets it compress and encode each column efficiently and read only the columns a query needs, making it a good fit for storing data frames.
Schema Evolution: Parquet supports complex nested data structures, and the schema can be modified over time, for example by adding new columns. This provides flexibility when dealing with data that may evolve.
Compression and Encoding: Because the values in a column share one type and are often similar to each other, Parquet can apply efficient per-column compression and encoding schemes, such as dictionary and run-length encoding. This can lead to significant storage savings.
Language Agnostic: Parquet is built from the ground up for use in many languages. Official libraries are available for reading and writing Parquet files in many languages, including Java, C++, Python, and more.
Integration: Parquet is designed to integrate well with various big data frameworks. It has deep support in Apache Hadoop, Apache Spark, and Apache Hive and works well with other data processing frameworks.
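To make the columnar storage and encoding points above more concrete, here is a small pure-Python sketch. It is an illustration of the general idea only, not Parquet's actual on-disk format or API: the sample rows and the helper names (to_columns, dictionary_encode, run_length_encode, run_length_decode) are hypothetical.

```python
from itertools import groupby

# Hypothetical rows of (country, status). Similar values sit next to each
# other, as is common in sorted or partitioned data.
rows = [
    ("US", "active"),
    ("US", "active"),
    ("US", "inactive"),
    ("DE", "inactive"),
    ("DE", "inactive"),
]


def to_columns(rows):
    """Transpose row-oriented records into one list per column."""
    return [list(col) for col in zip(*rows)]


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(group))) for code, group in groupby(codes)]


def run_length_decode(runs):
    """Expand (code, run_length) pairs back into the full list of codes."""
    return [code for code, length in runs for _ in range(length)]


country, status = to_columns(rows)
country_dict, country_codes = dictionary_encode(country)
country_runs = run_length_encode(country_codes)

# Decoding reverses both steps and recovers the original column.
decoded = [country_dict[code] for code in run_length_decode(country_runs)]

print(country_dict)  # ['US', 'DE']
print(country_runs)  # [(0, 3), (1, 2)]
```

Here the five country values shrink to a two-entry dictionary plus two runs. In a row layout the same values are interleaved with the other fields of each record, so the runs are not contiguous and this kind of encoding is much less effective.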

In short, Parquet is a powerful tool in the big data ecosystem due to its efficiency, flexibility, and compatibility with a wide range of tools and languages.

Difference between CSV and Parquet

| Dataset | Size | Query Run Time | Data Scanned | Cost |
|---|---|---|---|---|
| CSV | 1 TB | 236 seconds | 1 TB | $5.75 |
| Parquet | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less space | 34x faster | 99% less data scanned | 99.7% savings |
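The Savings figures can be recomputed from the CSV and Parquet figures. The sketch below copies the numbers from the comparison and assumes 1 TB = 1000 GB, which the source does not state. Because the inputs are already rounded, the recomputed values may differ slightly from the published ones.

```python
# Figures copied from the comparison above.
# Assumption: 1 TB is treated as 1000 GB.
GB_PER_TB = 1000

csv_stats = {
    "size_gb": 1 * GB_PER_TB,
    "seconds": 236,
    "scanned_gb": 1 * GB_PER_TB,
    "cost_usd": 5.75,
}
parquet_stats = {
    "size_gb": 130,
    "seconds": 6.78,
    "scanned_gb": 2.51,
    "cost_usd": 0.01,
}


def percent_saved(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100


space_saved = percent_saved(csv_stats["size_gb"], parquet_stats["size_gb"])
speedup = csv_stats["seconds"] / parquet_stats["seconds"]
scan_saved = percent_saved(csv_stats["scanned_gb"], parquet_stats["scanned_gb"])
cost_saved = percent_saved(csv_stats["cost_usd"], parquet_stats["cost_usd"])

print(f"space saved:       {space_saved:.1f}%")
print(f"speedup:           {speedup:.1f}x")
print(f"less data scanned: {scan_saved:.2f}%")
print(f"cost saved:        {cost_saved:.2f}%")
```

The drop in cost tracks the drop in data scanned far more closely than the drop in file size, which is consistent with a columnar format letting a query read only the columns it needs.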


