Delta Lake 101

Anmol Kale
5 min read · Jun 28, 2022


What is Delta Lake? A lake? A storage layer? A repository?

Delta Lake is an open-source data storage layer built on top of Apache Spark that holds data in its natural format, whether structured, semi-structured, or unstructured. Fundamentally, Delta Lake maintains a transaction log alongside the data. This gives every Delta Lake table ACID-compliant reads and writes.

  • Delta Lake brings full ACID transactions to Apache Spark. That means a job either completes in full or leaves no changes behind at all (a minimal sketch follows this list).
  • Delta Lake is open source (open-sourced by Databricks, now a Linux Foundation project). You can store a large amount of data without worrying about locking.
  • Delta Lake is deeply powered by Apache Spark, which means existing Spark jobs (batch or streaming) can be converted to Delta without rewriting them from scratch.
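As a minimal sketch of the first point (the path, column names, and session setup are illustrative assumptions, and the delta-spark package is assumed to be configured on the cluster), a Delta write commits atomically through the transaction log:

from pyspark.sql import SparkSession

# Assumes a Spark session with delta-spark configured
spark = SparkSession.builder.appName("delta-101").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The write either commits fully via the transaction log or not at all
df.write.format("delta").mode("overwrite").save("/mnt/delta/demo")

# Readers always see a consistent snapshot of the table
spark.read.format("delta").load("/mnt/delta/demo").show()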

Delta Lake Architecture

Delta Lake allows us to incrementally improve data quality until it is ready for consumption. Data flows from one stage to the next, like water through a delta.

Bronze Tables:
Data may come from various sources and can be dirty. This layer is thus a dumping ground for raw data.

Silver Tables:
Consists of facts and dimensions. DML operations and data cleaning are easy at this stage, which also makes debugging easier.

Gold Tables:
Consists of clean, aggregated tables that are ready for consumption. These can also be created in Synapse.

1. We will use Synapse notebooks to create and load data into the Bronze and Silver databases.
2. We can use Spark SQL to load the data into the Gold layer (see the sketch after this list).
3. Once the data is loaded into the Gold layer, we can use the on-demand SQL pool to create external tables so that analysts can access them.
4. The on-demand pool can also be accessed from Power BI.
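For step 2, a hedged sketch of loading a Gold table with Spark SQL from a notebook; the database and table names (silver.sales, gold.daily_sales) and the columns are hypothetical:

# Aggregate Silver data into a Gold Delta table (names are assumptions)
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.daily_sales
    USING DELTA
    AS SELECT order_date, SUM(amount) AS total_amount
    FROM silver.sales
    GROUP BY order_date
""")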

Why is Delta Lake Different?

  • Metadata Handling: Delta Lake can handle even petabytes of data with ease.
  • Schema Enforcement: It reads the schema as part of the metadata and validates every column, data type, etc.
  • Unified Batch and Streaming: Delta Lake provides a single architecture for reading both streaming and batch data.
  • Upserts and Merges: Delta allows you to perform upserts and merges easily, similar to SQL MERGE INTO against a Delta table. It lets you merge data from another DataFrame into your table and apply updates, inserts, and deletes (a sketch follows this list).
  • Time Travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
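A hedged sketch of the last two points using the delta-spark Python API; the table path, the updates_df DataFrame, and the id column are assumptions for illustration:

from delta.tables import DeltaTable

# updates_df is an assumed DataFrame of incoming changes
target = DeltaTable.forPath(spark, "/mnt/delta/events")

# Upsert: update rows that match on id, insert the rest
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")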

How Does Delta Work?

Delta Lake is based on Parquet; it adds transactional awareness to Parquet using a transaction log, which is maintained in an additional folder (_delta_log) under the table directory.
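For intuition, a Delta table directory typically looks something like this (the path and file names are illustrative):

/mnt/delta/events/
  _delta_log/
    00000000000000000000.json   <- commit 0: initial write
    00000000000000000001.json   <- commit 1: e.g. a merge
  part-00000-...snappy.parquet  <- data files (plain Parquet)
  part-00001-...snappy.parquet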

Which Language to Prefer?

Delta Lake supports Scala, Java, Python, and SQL APIs for a variety of functionality. Support for merge, update, and delete operations helps you meet compliance requirements, such as deleting a single user's records on request.
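As a small hedged example, the same delete can be expressed through the Python API or plain SQL; the table path and the user_id column are hypothetical:

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/delta/user_events")

# Python API: remove one user's records (e.g., for a compliance request)
dt.delete("user_id = 'u-123'")

# Equivalent Spark SQL
spark.sql("DELETE FROM delta.`/mnt/delta/user_events` WHERE user_id = 'u-123'")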

Tables in Delta Lake

A Delta table records version changes and modifications made to it in Delta Lake. Unlike traditional tables that only store data in rows and columns, a Delta table also stores metadata that enables ACID transactions and time travel, allowing for quicker data ingestion. The data in a Delta table is stored in the Parquet file format, with the transaction log acting as an encoded layer over the data.
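To see the versions a table has recorded, you can inspect its history; the path below is an assumption:

# Show the commit history: version, timestamp, operation, etc.
spark.sql("DESCRIBE HISTORY delta.`/mnt/delta/events`").show(truncate=False)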

Types of Delta Tables:

1. Managed Tables

• A managed table is a Spark SQL table for which Spark manages both the data and the metadata.
• Managed tables are the default when creating a table.
• The data for a managed table resides in the LOCATION of the database it is registered to.
• In order to move a managed table to a new database, you must rewrite all data to the new location.
• Available for all clusters.
• df.write.saveAsTable("table_name")

Creating a managed table:

"Here the location/load path of the table will be that of the default Databricks database."
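A minimal sketch, assuming a DataFrame df already exists; Spark decides where the files live, and the table name is made up:

# Managed table: Spark controls both the data and the metadata.
# Files land under the database's default location.
df.write.format("delta").saveAsTable("managed_events")

# Dropping a managed table deletes the data files along with the metadata
spark.sql("DROP TABLE managed_events")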

2. Unmanaged Tables

• Databricks only manages the metadata for unmanaged (external) tables.
• Unmanaged tables always specify a LOCATION during table creation; you can either register an existing directory of data files as a table or provide a path when the table is first defined.
• Data and metadata are managed independently.
• Any user who has access to the workspace can query the data in the file location.
• Different systems can read the data without altering it.
• df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")

"Databricks Runtime 8.0 and above uses Delta Lake as the default table format."

Creating an unmanaged table:

"Here we can control the location of the table."
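A minimal sketch of an unmanaged (external) table, again assuming df exists; the path and table name are made up:

# Unmanaged (external) table: we choose where the data lives
df.write.format("delta").option("path", "/mnt/datalake/events").saveAsTable("external_events")

# Dropping an unmanaged table removes only the metadata; the files at the path remain
spark.sql("DROP TABLE external_events")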

Comparison Between Both Tables:

• Managed Delta tables execute faster than unmanaged Delta tables.
• Managed tables are easily accessible in Databricks and can be dropped when no longer required.
• Unmanaged Delta tables store previous versions of files, which can be retrieved when data is deleted or updated by mistake.
• Anyone who has access to Databricks can read the data of an unmanaged Delta table, since it sits in the file location.
