08 May 2024 / 11:33 AM

Databricks Freaky Friday Pills #3: DLT & DQ framework

SDG Blog

Welcome aboard another riveting journey into the realms of Databricks! If you’ve been following our previous articles, you’re well on your way to building a solid foundation in Databricks. But fear not if you’ve just joined us: we’re here to catch you up on what you’ve missed. In our earlier stories, we delved into the intricacies of Databricks’ architecture, uncovering the key components that pave the way for creating comprehensive end-to-end machine learning solutions. If you need a refresher, you can find that discussion in [#1]. Additionally, we took a deep dive into workspace capabilities and workflow definitions, with a particular focus on Jobs [#2].

Now, as promised in the previous article, it’s time to plunge into one of the cornerstone components of Databricks’ workflows: Delta Live Tables (DLTs). Think of DLT as a declarative framework for building robust, maintainable, and thoroughly testable data processing pipelines with data quality controls built in.

Throughout this article, we’ll guide you through the development journey of a DLT, highlighting its core features, and shedding light on their respective benefits and limitations. We’ll also introduce you to a critical aspect of every business solution: data quality. With DLTs’ expectation feature, we’ll set up barriers within our pipelines to ensure that only pristine data flow through the DLT and are stored in the data lake. So, buckle up for our third adventure into the captivating world of Databricks!

1. DLT syntax

This section covers the basics of Delta Live Tables (DLT) syntax, giving us the tools to build our first DLT pipeline. Everything starts by importing the “dlt” module with a plain import dlt statement.

Once the module is imported, note that both materialized views and streaming tables are declared with the same @dlt.table decorator; what differs is the read performed inside the decorated function. To declare a streaming table, have the function return a streaming read (spark.readStream). Conversely, return a static read (spark.read) to declare a materialized view.

Below you can find the general syntax for declaring DLT materialized views and streaming tables:
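As a minimal sketch of what such declarations typically look like, consider the following. The dataset and source names (orders_snapshot, orders_stream, sales.raw_orders) are hypothetical, and the declarations are wrapped in a function that receives the dlt module and the Spark session as arguments, which is an assumption made here so that the sketch does not rely on notebook globals.

```python
def declare_tables(dlt, spark):
    """Register one materialized view and one streaming table.

    `dlt` is the Delta Live Tables module and `spark` the active SparkSession;
    both are passed in explicitly. All table names are hypothetical.
    """

    @dlt.table(
        name="orders_snapshot",  # optional; defaults to the function name
        comment="Materialized view: built from a static (batch) read.",
    )
    def orders_snapshot():
        return spark.read.table("sales.raw_orders")

    @dlt.table(
        name="orders_stream",
        comment="Streaming table: built from a streaming read.",
    )
    def orders_stream():
        return spark.readStream.table("sales.raw_orders")
```

In a pipeline notebook this would be invoked once as declare_tables(dlt, spark) after importing dlt; writing the decorated functions directly at the top level of the notebook should have the same effect.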

For views, Databricks provides the analogous @dlt.view decorator:
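A view declaration might look like the sketch below. The view name, the source table and the order_date column are hypothetical, and, as an assumption for the sake of a self-contained example, the dlt module and the Spark session are passed in as arguments.

```python
def declare_view(dlt, spark):
    """Register an intermediate view inside the pipeline.

    `dlt` and `spark` are passed in explicitly; all names are hypothetical.
    """

    @dlt.view(
        name="recent_orders",
        comment="Intermediate view; its result is not materialized.",
    )
    def recent_orders():
        # The filter is a SQL expression given as a string.
        return spark.read.table("sales.raw_orders").where(
            "order_date >= '2024-01-01'"
        )
```

Apart from the decorator, the shape is the same as for a table: a function that returns a DataFrame.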

Declaring a table therefore takes only a decorator, a few optional parameters, and a function that returns a table; the result is stored in the target schema defined when the DLT pipeline is created. For a detailed reference of DLT parameters, visit [link]. Additionally, various decorators related to Data Quality facilitate handling “bad” data, as discussed later in this article.

Now that we’ve covered the general DLT syntax, let’s dive into an example illustrating data loading in such pipelines. Typically, the pipeline’s first table retrieves data from the metastore:
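A first table of this kind could be sketched as follows. The source name main.sales.orders is hypothetical, and passing dlt and spark as arguments is an assumption made to keep the example free of notebook globals.

```python
def declare_source_table(dlt, spark):
    """Register the pipeline's entry table, read from the metastore."""

    @dlt.table(comment="Raw orders as registered in the metastore.")
    def raw_orders():
        # The three-level name (catalog.schema.table) is an assumption; with
        # the Hive metastore a two-level name (schema.table) would be used.
        return spark.read.table("main.sales.orders")
```

Because no name parameter is given, the table takes the name of the function, raw_orders.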

Then, within the same pipeline, we can read from the initial table using the declaring function’s name:
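As a hedged illustration, the sketch below declares a downstream table that reads an upstream dataset assumed to be declared elsewhere in the same pipeline by a function called raw_orders. The table name, the amount column and the practice of passing dlt as an argument are all assumptions of this example.

```python
def declare_downstream_table(dlt):
    """Register a table that reads another dataset of the same pipeline."""

    @dlt.table(comment="Orders with a positive amount.")
    def valid_orders():
        # "raw_orders" is the name of the function that declared the
        # upstream table; dlt.read performs a static (batch) read of it.
        return dlt.read("raw_orders").where("amount > 0")
```

For a streaming read of the upstream dataset, dlt.read_stream("raw_orders") would take the place of dlt.read.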

Evidently, working with DLTs proves straightforward. Constructing pipelines to prep data for ML solutions can be achieved succinctly within a single notebook. While we’ve demonstrated the primary operations for loading and storing these tables, numerous other questions remain to be explored concerning these live versions of Delta Tables, such as: What are the DLT types? How do we monitor them? How do we select the target schema? We will try to answer these questions in the following sections.

2. DLT stands for Delta Live Tables

In short, Delta Live Tables (DLT) is the Databricks framework for managing and processing data workflows in both real-time and batch modes. It builds on Delta Lake, enabling efficient data ingestion, transformation, and analysis across various data sources and formats. Its features include streaming tables, views, and materialized views, which let users build scalable data pipelines for diverse use cases such as real-time analytics, machine learning, and operational monitoring. We will describe these features in the following subsections:

Tables and views

In Databricks, Delta Live Tables (DLTs) offer a versatile framework for data processing, built around three primary dataset types: streaming tables, views, and materialized views.

Streaming Tables: Streaming tables are tailored for handling streaming or incremental data processing tasks. They are engineered to efficiently manage growing datasets, ensuring each row is processed only once. This capability is pivotal for ingestion workloads requiring data freshness and low latency, making streaming tables well-suited for real-time data solutions.
Views: Views serve as intermediate representations within a Delta Live Tables pipeline. While views compute their results on demand and may benefit from caching optimizations, they do not materialize them. Instead, they optimize data access and processing within the pipeline itself. Databricks advocates using views to enforce data quality constraints and to augment datasets that feed multiple downstream queries.
Materialized Views: Materialized views are a more concrete form of view, in which the result is precomputed and refreshed according to a specified schedule. They handle any changes in the input data and persist the computed output to the metastore or catalog. Materialized views offer a practical solution for scenarios requiring precomputed, frequently accessed data, enhancing query performance and reducing computational overhead.

The image below shows a graphical representation of these three concepts:

Delta Live Table pipelines

Pipelines comprise materialized views and streaming tables, which are declared in Python or SQL source files. DLT intelligently infers dependencies between these tables, ensuring that updates occur in the correct sequence.
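The idea of ordering updates by dependency can be illustrated with a small topological sort. This is only a conceptual sketch using Python’s standard library, not a description of how DLT is implemented internally, and the dataset names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each dataset maps to the set of datasets it reads from (hypothetical names).
reads_from = {
    "raw_orders": set(),
    "valid_orders": {"raw_orders"},
    "daily_revenue": {"valid_orders"},
}

# static_order() yields every dataset after all of its upstream dependencies.
update_order = list(TopologicalSorter(reads_from).static_order())
print(update_order)
```

In DLT the equivalent of the reads_from mapping does not have to be written by hand: it follows from the reads performed inside each declared dataset.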

Delta Live Tables pipelines feature two primary categories of settings:

Dataset Declarations: These configurations define the notebooks or files containing Delta Live Tables syntax for declaring datasets. As demonstrated in the previous section, pipelines are managed through decorators and functions within notebooks.
Pipeline Infrastructure Settings: These configurations control the pipeline infrastructure, update processing, and table saving within the workspace. This aspect will be further explored in the subsequent section, where we’ll create a DLT pipeline from scratch and configure each necessary step.

While many configurations are optional, some require meticulous attention, particularly for production pipelines. These critical configurations include:

Target Schema Declaration: Specifying a target schema is essential for publishing data outside the pipeline, especially to the Hive metastore or Unity Catalog.
Data Access Permissions Configuration: Configuring data access permissions in the execution cluster ensures appropriate access to data sources and target storage locations.
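To make these settings more tangible, the sketch below writes a small pipeline definition as a Python dictionary and prints it as JSON. The field names follow the pipeline settings JSON as we understand it and should be checked against the official settings reference; every value (pipeline name, notebook path, cluster size, target schema) is hypothetical.

```python
import json

pipeline_settings = {
    "name": "orders_pipeline",
    # Dataset declarations: the notebooks or files holding the DLT code.
    "libraries": [{"notebook": {"path": "/Repos/demo/dlt/orders_pipeline"}}],
    # Infrastructure and update behaviour.
    "clusters": [{"label": "default", "num_workers": 2}],
    "continuous": False,  # False: triggered updates; True: continuous processing
    "development": True,  # development mode; switch off for production
    # Target schema: where the pipeline publishes its tables.
    "target": "sales_gold",
}

print(json.dumps(pipeline_settings, indent=2))
```

Data access permissions are not part of this sketch, since they depend on how the execution cluster and the storage locations are set up in each workspace.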

Limitations of DLTs

While Delta Live Tables (DLT) offers powerful capabilities, there are certain limitations to consider:

Target Schema Constraint: The target schema can only be set for the entire DLT pipeline. This restricts the ability to store output from intermediate steps in the pipeline to different schemas.
Exclusive Delta Table Usage: All tables created and updated by Delta Live Tables are automatically designated as Delta tables, limiting flexibility in utilizing alternative formats.
Single Operation Restriction: A table can be defined only once, so it can serve as the target of a single operation across all Delta Live Tables pipelines, potentially constraining complex pipeline configurations.
Identity Column Limitation: Identity columns cannot be used with tables targeted by “APPLY CHANGES INTO”, and they may be recomputed during updates of materialized views. For this reason, Databricks recommends using identity columns only with streaming tables in Delta Live Tables.
Pipeline Limitation: A workspace is currently limited to 100 concurrent pipeline updates, which may impact scalability for organizations managing large-scale data workflows.

3. What to expect, data quality?

In any well-defined data project, establishing clear constraints and rules is essential for achieving goals within the boundaries of data quality. Databricks provides a framework for defining this quality using the so-called “expectations”.
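As a first taste, expectations are attached to a dataset as decorators that pair a rule name with a SQL constraint. The sketch below is a minimal illustration under stated assumptions: the table, the rule names and the columns are hypothetical, the upstream dataset raw_orders is assumed to be declared elsewhere in the pipeline, and the dlt module is passed in as an argument to keep the example self-contained.

```python
def declare_checked_table(dlt):
    """Register a table guarded by three expectations.

    `dlt` is the Delta Live Tables module; rule and column names are
    hypothetical.
    """

    @dlt.table(comment="Orders that passed the quality rules.")
    # Keep violating rows, but record the violation in the pipeline metrics.
    @dlt.expect("non_negative_amount", "amount >= 0")
    # Drop rows that violate the constraint.
    @dlt.expect_or_drop("has_order_id", "order_id IS NOT NULL")
    # Stop the update as soon as a row violates the constraint.
    @dlt.expect_or_fail("known_currency", "currency IN ('EUR', 'USD')")
    def checked_orders():
        return dlt.read("raw_orders")
```

The three decorators differ only in what happens to a violating row: it is kept and counted, dropped, or treated as a reason to fail the update.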