Delta Lake for reading and writing Parquet files

What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to big data workloads.

Key features of Delta Lake

Feature

Description

ACID Transactions

Data lakes are typically populated through multiple processes and pipelines, some of which write data concurrently with reads. Prior to Delta Lake and the addition of transactions, data engineers had to go through a manual, error-prone process to ensure data integrity. Delta Lake brings familiar ACID transactions to data lakes. It provides serializability, the strongest isolation level.
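To make the concurrent-writer guarantee concrete, here is a minimal sketch of optimistic concurrency over a versioned commit log, using only the Python standard library. It is a toy model and not Delta Lake's implementation: `ToyLog`, `CommitConflict`, and the action strings are hypothetical names chosen for illustration. The idea is that a writer records the version it read, and its commit is accepted only if no other writer has committed in the meantime.

```python
import threading
from typing import List


class CommitConflict(Exception):
    """Raised when another writer committed first (hypothetical name)."""


class ToyLog:
    """Toy in-memory commit log; list index == table version."""

    def __init__(self) -> None:
        self._commits: List[List[str]] = []
        self._lock = threading.Lock()

    def latest_version(self) -> int:
        # -1 means the table has no commits yet.
        with self._lock:
            return len(self._commits) - 1

    def commit(self, read_version: int, actions: List[str]) -> int:
        """Atomically append a commit if nobody else wrote since read_version."""
        with self._lock:
            next_version = read_version + 1
            if next_version != len(self._commits):
                raise CommitConflict(
                    f"version {next_version} already exists; re-read and retry"
                )
            self._commits.append(list(actions))
            return next_version
```

A writer that loses the race would re-read the latest version, check whether its change still applies, and try again.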

Scalable Metadata Handling

In big data, even the metadata itself can be "big data." Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.

Time Travel (data versioning)

The ability to "undo" a change or go back to a previous version is one of the key features of transactions. Delta Lake provides snapshots of data, enabling you to revert to earlier versions for audits, rollbacks, or reproducing experiments.
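Versioned snapshots can be pictured as replaying a commit log up to a chosen version. The sketch below is a toy model under that assumption, written with only the Python standard library. It does not use the Delta Lake API: the `commit_log` contents, the file names, and the `snapshot` helper are all hypothetical.

```python
from typing import Dict, List, Set

# Toy commit log: each entry lists the data files a commit added and removed.
# The list index is the table version. File names are made up for illustration.
commit_log: List[Dict[str, List[str]]] = [
    {"add": ["part-0.parquet"], "remove": []},                  # version 0
    {"add": ["part-1.parquet"], "remove": []},                  # version 1
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # version 2
]


def snapshot(log: List[Dict[str, List[str]]], version: int) -> Set[str]:
    """Replay commits 0..version and return the files visible at that version."""
    if not 0 <= version < len(log):
        raise ValueError(f"unknown version {version}")
    live: Set[str] = set()
    for commit in log[: version + 1]:
        live.update(commit["add"])
        live.difference_update(commit["remove"])
    return live
```

Because older commits are never rewritten in this model, reading version 0 still returns the original file even after version 2 has removed it.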

Open Format

Apache Parquet is the baseline format for Delta Lake, enabling you to leverage the efficient compression and encoding schemes that are native to the format.

Unified Batch and Streaming Source and Sink

A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.

Schema Enforcement

Schema enforcement helps ensure that the data types are correct and required columns are present, preventing bad data from causing data inconsistency.
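As an illustration of what enforcement means in practice, the following sketch rejects a row before it is written if a column is missing, unexpected, or of the wrong type. It is a simplified stand-in using only the Python standard library, not Delta Lake's own validation logic. `SchemaMismatch`, `enforce_schema`, and the example schema are hypothetical.

```python
from typing import Any, Dict


class SchemaMismatch(Exception):
    """Raised when a row does not match the table schema (hypothetical name)."""


def enforce_schema(row: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Raise SchemaMismatch unless row has exactly the schema's columns and types."""
    missing = schema.keys() - row.keys()
    extra = row.keys() - schema.keys()
    if missing:
        raise SchemaMismatch(f"missing columns: {sorted(missing)}")
    if extra:
        raise SchemaMismatch(f"unexpected columns: {sorted(extra)}")
    for name, expected in schema.items():
        if not isinstance(row[name], expected):
            raise SchemaMismatch(
                f"column {name!r}: expected {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )


# Example schema, assumed for illustration only.
table_schema = {"id": int, "name": str}
```

Calling `enforce_schema({"id": "1", "name": "a"}, table_schema)` raises, because `id` arrives as a string rather than an integer.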

Schema Evolution

Delta Lake enables you to make changes to a table schema that can be applied automatically, without having to write migration DDL.
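One way to picture automatic schema changes: new columns in incoming data are appended to the table schema, and rows written before the change read back with an empty value for those columns. The sketch below models that behavior with plain Python dictionaries. It is an assumption-laden toy and not the Delta Lake API. `evolve_schema` and `read_row` are hypothetical helpers.

```python
from typing import Any, Dict


def evolve_schema(schema: Dict[str, type], incoming: Dict[str, type]) -> Dict[str, type]:
    """Return schema extended with any new columns found in incoming.

    Changing the type of an existing column is rejected in this toy model.
    """
    merged = dict(schema)
    for name, typ in incoming.items():
        if name in merged and merged[name] is not typ:
            raise TypeError(f"column {name!r} changes type; not handled here")
        merged.setdefault(name, typ)
    return merged


def read_row(row: Dict[str, Any], schema: Dict[str, type]) -> Dict[str, Any]:
    """Project a stored row onto schema, filling absent columns with None."""
    return {name: row.get(name) for name in schema}
```

For example, evolving `{"id": int}` with `{"id": int, "email": str}` yields a two-column schema, and an old row `{"id": 1}` then reads back as `{"id": 1, "email": None}`.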

Audit History

The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.

Updates and Deletes

Delta Lake provides Scala, Java, Python, and SQL APIs for a variety of functionality. Support for merge, update, and delete operations helps you meet compliance requirements. For more information, see "Announcing the Delta Lake 0.6.1 Release", "Announcing the Delta Lake 0.7 Release", and "Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs", which includes code snippets for merge, update, and delete DML commands.
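The semantics of a merge (upsert) and a predicate delete can be sketched on plain lists of dictionaries. This is a toy illustration using only the Python standard library, not the Delta Lake merge API. The `merge` and `delete_where` helpers and the sample rows in the note below are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def merge(target: List[Row], source: List[Row], key: str) -> List[Row]:
    """Upsert source into target: update rows whose key matches, insert the rest."""
    by_key: Dict[Any, Row] = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched: overwrite the columns present in the source row.
        # Not matched: insert the source row as a new row.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())


def delete_where(target: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Return target without the rows for which predicate is true."""
    return [row for row in target if not predicate(row)]
```

Merging a source that contains keys 2 and 3 into a target with keys 1 and 2 updates row 2 and inserts row 3. Row 1 is left untouched. Neither helper mutates its input, which loosely mirrors writing a new table version rather than editing data in place.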

What is the idea?

  • SQL data reader for Parquet files
  • SQL bulk insert for Parquet files
  • ROLAP cubes for Parquet files

Learn more at https://delta.io/

7 votes
