DUCK DB

DuckDB is a single file built with no dependencies.

All the great features can be read here https://duckdb.org/

Automatic Parallelism: DuckDB has improved its automatic parallelism capabilities, meaning it can more effectively utilize multiple CPU cores without requiring manual tuning. This results in faster query execution for large datasets.

Parquet File Improvements: DuckDB has improved its handling of Parquet files, both in terms of reading speed and support for more complex data types and compression codecs. This makes DuckDB an even better choice for working with large datasets stored in Parquet format.

Query Caching: Improves the performance of repeated queries by caching the results of previous executions. This can be a game-changer for analytics workloads with similar queries being run multiple times.

How to use DuckDB?

Download the CLI Client (Windows / Mac / Linux). For other programming languages, visit https://duckdb.org/docs/installation/
Unzip the file.
Open Command / Terminal and run the Executable.

How does DuckDB help in Data Engineering?

Download orders. parquet from

https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet

More files are available here https://github.com/cwida/duckdb-data/releases/

Terminal

./duckdb

Create / Open a database

.open ordersdb

Duckdb allows you to read the contents of orders.parquet as is without needing a table. Double quotes around the file name orders.parquet is essential.

describe table  "orders.parquet"

Not only this, but it also allows you to query the file as-is. (This feature is similar to one data bricks supports)

select * from "orders.parquet" limit 3;

DuckDB supports CTAS syntax and helps to create tables from the actual file.

show tables;

create table orders  as select * from "orders.parquet";

select count(*) from orders;

DuckDB supports parallel query processing, and queries run fast.

This table has 1.5 million rows, and aggregation happens in less than a second.

select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();

DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.

COPY "orders.parquet" to 'orders.csv'  (FORMAT "CSV", HEADER 1);Select * from "orders.csv" limit 3;

It also supports exporting existing Tables to Parquet files.

COPY "orders" to  'neworder.parquet' (FORMAT "PARQUET");

DuckDB supports Programming languages such as Python, R, JAVA, node.js, C/C++.

DuckDB ably supports Higher-level SQL programming such as Macros, Sequences, Window Functions.

Get sample data from Yellow Cab

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Copy yellow cabs data into yellowcabs folder

create table taxi_trips as select * from "yellowcabs/*.parquet";

SELECT
    PULocationID,
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
    AVG(fare_amount) AS avg_fare
FROM
    taxi_trips
GROUP BY
    PULocationID,
    hour_of_day;

Extensions

https://duckdb.org/docs/extensions/overview

INSTALL json;
LOAD json;

select * from demo.json;

describe demo.json;

Load directly from HTTP location

select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'

PreviousJQ NextCICD Intro

Last updated 1 year ago