
Platform-Agnostic Data Pipeline Automation with Jinja and Python

Written by SDG Group | Sep 26, 2025 12:56:44 PM

Declarative metadata-driven data pipeline development

Welcome to #TechStation, SDG Group's hub dedicated to the latest innovations in data and analytics! In this article, we demonstrate how to extend the use of templating systems like Jinja beyond their traditional boundaries to automate the creation of data pipelines. Through a practical example based on Python and Databricks' Delta Live Tables, you will see how a metadata-driven approach can standardize development, drastically reduce manual work, and adapt to any data platform.


In modern data engineering, tools like dbt and Dataform have popularized the use of templating engines to streamline data pipeline development workflows.

In particular, dbt relies on Jinja: a fast, expressive and flexible Python-based templating engine that allows developers to insert variables, control structures (such as loops and conditionals), and reusable macros directly into text templates. These features help simplify code management, promote reusability, and enhance readability.
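For readers less familiar with Jinja, here is a minimal, self-contained illustration of its variables and loops being rendered from Python (the table and column names are purely illustrative):

from jinja2 import Template

# A tiny template mixing a variable ({{ table_name }}) and a loop over columns
template = Template(
    "SELECT\n"
    "{% for column in columns %}"
    "    {{ column }}{{ ',' if not loop.last else '' }}\n"
    "{% endfor %}"
    "FROM {{ table_name }}"
)

print(template.render(table_name="customers", columns=["customer_id", "name"]))
# SELECT
#     customer_id,
#     name
# FROM customers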

When thinking about Jinja’s capabilities, a natural question arises: could its use in the data engineering domain be extended to other programming languages and tools?

The answer is yes, and this article illustrates it with a simple example. For this demonstration, the reference setting is Databricks’ Delta Live Tables (DLT) framework and the programming language is Python.



Problem statement

Let us now move to the problem statement. Populating the bronze layer of a medallion architecture can indeed be a very monotonous task for a data engineer.

Very often it comes down to creating entities as a selection of attributes that are sometimes renamed and cast according to requirements. In a situation like this, Jinja is the perfect ally: a few lines of Python code and a template file can quickly automate the development and save the team time that can be spent on less repetitive activities.

In practice, imagine receiving requirements that specify, for each entity to be added to the bronze layer from Parquet files in the data lake, a source location, an entity name and a list of attributes, potentially with their respective data types and aliases.

 

Solution

If the requirements are not already in a tabular format, it is easy to turn them into two Pandas DataFrames: entities, with source_location and entity_name columns, and attributes, with entity_name, attribute_name, data_type and attribute_alias columns.

Now that the requirements are stored in an accessible tabular format, it is time to build a Jinja template (saved as demo.jinja) containing the common Python backbone on which the DLT entities will be based. In this case, the structure is very straightforward: a DLT streaming table created as a selection of columns that, where required, are cast and renamed.

Here is the code for the demo.jinja file:

import dlt
from pyspark.sql.functions import *

@dlt.table(
    comment = "Bronze {{ entity_name }} table"
)
def {{ entity_name }}():
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("{{ source_location }}")
        .select(
{%- for column in columns %}
            col("{{ column.attribute_name }}"){% if column.data_type %}.cast("{{ column.data_type }}"){% endif %}{% if column.attribute_alias %}.alias("{{ column.attribute_alias }}"){% endif %}{{ "," if not loop.last else "" }}
{%- endfor %}
        )
    )
    return df

 

Now it is just a matter of combining the two ingredients: a simple Python script (saved as demo.py) iterates over the requirements in the entities and attributes DataFrames and renders them with the Jinja template.

Each rendering can then be written to an individual Python file, one per entity, defining the corresponding streaming table. Note that the table requirements are hardcoded here as Python dictionaries and read into DataFrames only to keep the example simple; in practice they can be read from any kind of source file, as sketched right after the demo.py listing below.

Here is the code for the demo.py file:

import pandas as pd
from jinja2 import Template

# Entities definition
entities = pd.DataFrame([
    {"entity_name": "customers", "source_location": "abfss://myContainer@myStorageAccount.dfs.core.windows.net/raw/customers/*/*.parquet"},
    {"entity_name": "orders", "source_location": "abfss://myContainer@myStorageAccount.dfs.core.windows.net/raw/orders/*/*.parquet"},
    {"entity_name": "products", "source_location": "abfss://myContainer@myStorageAccount.dfs.core.windows.net/raw/products/*/*.parquet"},
])

# Attributes with optional cast and alias
attributes = pd.DataFrame([
    {"entity_name": "customers", "attribute_name": "customer_id", "data_type": "int", "attribute_alias": None},
    {"entity_name": "customers", "attribute_name": "name", "data_type": "string", "attribute_alias": "full_name"},
    {"entity_name": "customers", "attribute_name": "signup_ts", "data_type": "timestamp", "attribute_alias": "signup_date"},
    {"entity_name": "orders", "attribute_name": "order_id", "data_type": "int", "attribute_alias": None},
    {"entity_name": "orders", "attribute_name": "customer_id", "data_type": "int", "attribute_alias": None},
    {"entity_name": "orders", "attribute_name": "order_total", "data_type": "double", "attribute_alias": "total_amount"},
    {"entity_name": "orders", "attribute_name": "order_date", "data_type": "date", "attribute_alias": None},
    {"entity_name": "products", "attribute_name": "product_id", "data_type": "int", "attribute_alias": None},
    {"entity_name": "products", "attribute_name": "name", "data_type": "string", "attribute_alias": "product_name"},
    {"entity_name": "products", "attribute_name": "price", "data_type": "double", "attribute_alias": None},
    {"entity_name": "products", "attribute_name": "category", "data_type": "string", "attribute_alias": None},
])

# Loading the Jinja template from the demo.jinja file
with open("demo.jinja") as demo_template_file:
    demo_template = Template(demo_template_file.read())

# Rendering code for each entity and saving it to a Python file
for _, entity in entities.iterrows():
    entity_attrs = attributes[attributes.entity_name == entity["entity_name"]].to_dict(orient="records")
    entity_code = demo_template.render(
        entity_name = entity["entity_name"],
        source_location = entity["source_location"],
        columns = entity_attrs
    )
    with open(f"{entity['entity_name']}.py", "w") as entity_file:
        entity_file.write(entity_code)
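
As mentioned above, the hardcoded dictionaries are only a simplification. The same two DataFrames could just as easily be loaded from external files; here is a minimal sketch assuming two hypothetical CSV files, entities.csv and attributes.csv, with the same column names (empty alias cells are mapped back to None so the template's optional checks keep working):

import pandas as pd

# Hypothetical metadata files with the same columns used in demo.py
entities = pd.read_csv("entities.csv")      # entity_name, source_location
attributes = pd.read_csv("attributes.csv")  # entity_name, attribute_name, data_type, attribute_alias

# Replace NaN (from empty CSV cells) with None so the Jinja "if" checks behave as before
attributes = attributes.astype(object).where(pd.notnull(attributes), None)

# The template loading and rendering loop from demo.py stay exactly the same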
 


Here is what the rendered code looks like for the customers.py file:

import dlt
from pyspark.sql.functions import *

@dlt.table(
    comment = "Bronze customers table"
)
def customers():
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://myContainer@myStorageAccount.dfs.core.windows.net/raw/customers/*/*.parquet")
        .select(
            col("customer_id").cast("int"),
            col("name").cast("string").alias("full_name"),
            col("signup_ts").cast("timestamp").alias("signup_date")
        )
    )
    return df

 

This approach not only reduces boilerplate, but also ensures consistency across the codebase and makes changes much easier to manage. If, for example, a specific pattern needs to be modified or extended across all bronze tables, it is enough to update the Jinja template once, with no more manual updates across dozens of Python scripts.
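
For instance, suppose every bronze table should also carry a technical load timestamp. With the demo.jinja template above, that is a single edit to the select block, propagated to every generated file on the next rendering run (the _loaded_at column name is purely illustrative; current_timestamp is already available through the template's wildcard import):

        .select(
{%- for column in columns %}
            col("{{ column.attribute_name }}"){% if column.data_type %}.cast("{{ column.data_type }}"){% endif %}{% if column.attribute_alias %}.alias("{{ column.attribute_alias }}"){% endif %},
{%- endfor %}
            current_timestamp().alias("_loaded_at")
        )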

Here’s a high-level description of how the process works:

  1. A Jinja template defines the backbone of a streaming table in Python.
  2. A Python script reads this template and dynamically injects entity-specific values.
  3. For each entity, the script filters the attributes DataFrame to extract only the relevant fields, then uses Jinja to render a complete Python script.
  4. The final output is a set of ready-to-deploy streaming table definitions, each in its own .py file.


Benefits of this approach

This modular setup fits perfectly within a CI/CD workflow. Whether running locally or in a cloud-hosted pipeline, the script can regenerate DLT definitions automatically based on updated requirements. This unlocks a very powerful pattern: declarative metadata-driven data pipeline development, where data engineers or business analysts can define the “what” (the entities and attributes) while the system takes care of the “how” (the repetitive Python logic).
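
As a sketch of how this could be wired into a CI/CD job, the rendering logic from demo.py could be wrapped in a small command-line entry point and invoked by the pipeline whenever the metadata changes (the script name, folders and arguments are assumptions for illustration only):

import argparse
from pathlib import Path

# Hypothetical CI entry point: regenerate all bronze table definitions
# from the metadata files whenever they change.
def main():
    parser = argparse.ArgumentParser(description="Regenerate bronze DLT definitions")
    parser.add_argument("--metadata-dir", default="metadata", help="folder containing the requirements files")
    parser.add_argument("--output-dir", default="bronze", help="folder receiving the generated .py files")
    args = parser.parse_args()

    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    # Load entities/attributes from args.metadata_dir and run the same
    # rendering loop shown in demo.py, writing one file per entity
    # into args.output_dir.

if __name__ == "__main__":
    main()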

More importantly, this concept is entirely platform-agnostic. While the demonstration here uses Databricks’ Delta Live Tables, the same logic can be adapted to other environments. Microsoft Fabric notebooks or Glue jobs on AWS, for example, could easily become targets just by modifying the template and the rendering logic.
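
As a rough illustration of such an adaptation, the same entities and attributes metadata could feed a different template that produces a plain PySpark batch job, of the kind that could run in a Fabric notebook or a Glue job, instead of a DLT streaming table. The target table naming and write mode below are purely illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.parquet("{{ source_location }}")
    .select(
{%- for column in columns %}
        col("{{ column.attribute_name }}"){% if column.data_type %}.cast("{{ column.data_type }}"){% endif %}{% if column.attribute_alias %}.alias("{{ column.attribute_alias }}"){% endif %}{{ "," if not loop.last else "" }}
{%- endfor %}
    )
)

df.write.mode("overwrite").saveAsTable("bronze_{{ entity_name }}")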

This proof of concept shows how data teams could focus on modeling, requirements and data quality rather than on manual script development and maintenance. Furthermore, all of this can be built with accessible tools (Python, Jinja), which makes the approach easy to learn, extend, and integrate into multiple data tools and platforms.

This simple idea could be extended and refined to support validations, dependency declarations, and layered transformations (silver, gold) by simply expanding the metadata model and enriching the templates.
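
As one possible illustration, a hypothetical is_required flag added to the attributes metadata could drive Delta Live Tables expectations directly from the template, so that not-null checks are generated together with each table definition (the flag and the expectation naming are assumptions, not part of the demo above):

@dlt.table(
    comment = "Bronze {{ entity_name }} table"
)
@dlt.expect_all({
{%- for column in columns if column.is_required %}
    "{{ column.attribute_name }}_not_null": "{{ column.attribute_name }} IS NOT NULL"{{ "," if not loop.last else "" }}
{%- endfor %}
})
def {{ entity_name }}():
    # body unchanged from demo.jinja
    ...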


Conclusion

To summarize, templating engines like Jinja are not just for web development or dbt: they are a general-purpose solution for reducing repetition and increasing consistency in code generation.

Extending Jinja’s use to the broader context of data pipeline automation can significantly improve a team’s productivity.

The demo presented here may be simple, but its implications are significant: with just a few lines of configuration and templating logic, one can produce structured, scalable data pipelines that are easier to maintain, faster to deploy, and more aligned with a declarative data infrastructure vision.

Do you want to accelerate the development of your data pipelines and reduce manual work? Contact us for a personalized consultation and discover how our tailor-made automation frameworks, based on tools like Jinja and Python, can standardize and scale your data factory, freeing up your team for higher-value activities.