Hands-on DataOps with Databricks, Terraform & GitHub Actions


A Step-by-Step Guide to Automating Databricks Deployments Using Infrastructure-as-Code


Why DataOps + DevOps for Databricks?


As teams scale their cloud-native data platforms, automation and reproducibility become essential. Manual provisioning and notebook execution just don’t cut it anymore. That’s where Infrastructure as Code (IaC) and CI/CD come in.


In this post, we’ll walk through what I presented in my latest webinar: a real-world automation pipeline that provisions Azure Databricks with Terraform, manages ETL notebooks and jobs as code, and schedules them with GitHub Actions.


Whether you're just getting started or already running Spark jobs in production, this guide will help you think like a platform engineer while working with data tools.


Architecture Overview


Here’s the high-level architecture we implemented:

Architecture of the Databricks Data Pipeline

Key components:

  • Terraform modules for reusable infrastructure

  • Azure for cloud resources (Databricks, Resource Groups, VNets)

  • GitHub Actions for automation

  • Databricks Jobs API for job orchestration

  • Fivetran for file ingestion (optional for real-time demo)



Modular Terraform Setup for Azure Databricks


We created two major layers:
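At the repository level, the split looks roughly like this (an illustrative layout, consistent with the module paths used below, not an exact listing of the repo):

modules/
  databricks_workspace/
  network/
infra/
  envs/
    dev/
      main.tf
apps/
  notebooks/
    nightly_task.py
  main.tf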


infra/: Core Infrastructure

  • Resource Group

  • Virtual Network

  • Azure Databricks Workspace

  • Network Security Groups

module "databricks_workspace" {
  source                          = "../../../modules/databricks_workspace"
  workspace_name                  = "${local.prefix}-workspace"
  resource_group_name             = var.resource_group_name
  region                          = var.region
  managed_resource_group_name     = "${local.prefix}-managed-rg"
  vnet_id                         = module.network.vnet_id
  ...
}
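One thing the snippet above doesn’t show is provider configuration, which both layers need. A minimal sketch might look like the following, assuming the workspace module exposes its URL and Azure resource ID as outputs (the output names here are placeholders for whatever your module defines):

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    databricks = {
      source = "databricks/databricks"
    }
  }
}

provider "azurerm" {
  features {}
}

# Authenticate the Databricks provider against the workspace created above.
# workspace_url / workspace_resource_id are assumed module outputs.
provider "databricks" {
  host                        = module.databricks_workspace.workspace_url
  azure_workspace_resource_id = module.databricks_workspace.workspace_resource_id
}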

apps/: Deploying Jobs, Notebooks, and Workflows


We created a simple Spark job as a Python script and uploaded it as a Databricks notebook:

resource "databricks_notebook" "nightly_job_notebook" {
  path     = "/Shared/nightly_task"
  language = "PYTHON"
  content_base64 = base64encode(file(var.notebook_file_path))
}
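The notebook source itself is just a .py file in the repo, wired in through a variable. A minimal declaration might look like this (the default path is only an example):

variable "notebook_file_path" {
  description = "Local path to the Python script uploaded as the notebook source"
  type        = string
  default     = "notebooks/nightly_task.py" # example path; adjust to your repo layout
}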

And the corresponding job:

resource "databricks_job" "nightly_serverless_job" {
  name = "Nightly Python Job - Serverless"
  notebook_task {
    notebook_path = databricks_notebook.nightly_job_notebook.path
  }
  schedule {
    quartz_cron_expression = "0 0 * * * ?"
    timezone_id = "UTC"
  }
  job_cluster {
    job_cluster_key = "serverless_cluster"
    new_cluster {
      spark_version = "13.3.x-scala2.12"
      runtime_engine = "PHOTON"
      num_workers = 1
    }
  }
}
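Optionally, you can surface the job’s URL after terraform apply so it’s easy to find in the workspace UI, assuming the provider version you’re on exposes the job’s url attribute (recent versions do):

output "nightly_job_url" {
  description = "Link to the nightly job in the Databricks workspace UI"
  value       = databricks_job.nightly_serverless_job.url
}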

GitHub Actions CI/CD for Terraform

We added a .github/workflows/terraform.yml pipeline:

name: Deploy Databricks Infra

on:
  push:
    paths:
      - 'apps/**'
      - 'infra/**'
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Azure service principal credentials read by the azurerm/databricks providers.
      # The secret names below are examples; use whatever your repository defines.
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_CLIENT_SECRET: ${{ secrets.ARM_CLIENT_SECRET }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve

This triggers a deployment automatically whenever anything under infra/ or apps/ changes, and lets you kick one off manually via workflow_dispatch.
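One detail the workflow glosses over: running terraform init in CI only makes sense with remote state. If you don’t have one yet, an Azure Storage backend is a common choice; a minimal sketch (all names below are placeholders) looks like this:

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"         # placeholder
    storage_account_name = "tfstatestorage001"  # placeholder; must be globally unique
    container_name       = "tfstate"
    key                  = "databricks-infra.tfstate"
  }
}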


Testing with Databricks Community Edition


To make the workshop accessible, we demonstrated how to:

  • Create a free Databricks Community Edition account

  • Run the same jobs and notebooks without Azure billing

  • Sync code from GitHub manually or using databricks-cli


What You’ll Walk Away With


By the end of this exercise, you’ll be able to:

  • Deploy Azure Databricks workspaces using Terraform

  • Organize your infrastructure and application layers cleanly

  • Manage Spark jobs, notebooks, and workflows as code

  • Automate it all using GitHub Actions


What’s Next?



In the upcoming sessions and course, we’ll dive into:


  • 🔐 Secure secret management (Key Vault + Databricks secrets)

  • 🧩 Advanced CI/CD pipelines

  • 🧬 Integrating Fivetran, dbt, and Unity Catalog

  • 🌍 Multi-environment (dev/staging/prod) strategies

