How to deploy Azure Databricks SQL objects (tables, views, etc.) into higher environments?

Jayesh
New Contributor III

We have built multiple tables and views under Databricks SQL and are unable to figure out how we can take this code and deploy it to our higher environments. We need some guidance, as we haven't been able to find any information in the documentation we've searched.

14 REPLIES

-werners-
Esteemed Contributor III

Can you explain what you mean by 'higher environments', and whether it is the data being written or the code that you want to put there?

Jayesh
New Contributor III

I mean QA, PreProd & production environments for "code" deployments.

-werners-
Esteemed Contributor III

https://docs.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops

You don't have to follow the whole CI/CD procedure. You could, for example, opt for only code promotion to QA/PRD using git + Databricks Repos.

In a small team this could even be done manually (commits using Repos on git and pull requests).

So basically git + Repos is the way.
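
For example, a release step in the pipeline can simply point the Repo checkout in the QA/PRD workspace at the release branch via the Repos API. Just a sketch - the workspace URL, repo ID, token and branch name are placeholders you would inject from the pipeline:

# Sketch: promote code by switching the Repo in the target workspace to a release branch/tag.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
REPO_ID = 123456                    # ID of the Repo in the QA/PRD workspace
TOKEN = "<pat-or-aad-token>"        # injected from the pipeline's secret store

resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "release"},     # or {"tag": "v1.2.3"}
)
resp.raise_for_status()
print(resp.json())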

Kaniz
Community Manager

Hi @Jayesh Mehta, can you please take a look at the article above? Please let us know if this helps.

Bepposbeste
New Contributor II

Dear Databricks colleagues, the above answers do not help - sorry to say that. We as enterprise customers are well aware of how Azure DevOps works and how to do a Databricks deployment in general.

What I am missing is a SQL Server database project equivalent that I can deploy.

How can I control which tables, views, stored procedures are deployed to the different stages?

How can I control that tables are created first, then views, then views that reference views, and so on?

Which engine generates a change script for me that executes an ALTER VIEW if necessary?

-werners-
Esteemed Contributor III

Code is written in notebooks, which can be deployed using git CI/CD.

Database artifacts in Spark (tables/views) are nothing more than semantic views on top of data lake files (Parquet, Delta Lake, ORC, etc.).

The creation of those tables can be done using a Spark script, which typically resides in a notebook (or a jar or a wheel).
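
For example, an environment setup notebook could contain nothing but idempotent statements, so it can be rerun on every deploy in every environment (table names and paths below are made up):

%python
# Sketch: (re)register tables and views on top of existing data lake files.
spark.sql("""
  CREATE TABLE IF NOT EXISTS gold.sales
  USING DELTA
  LOCATION '/mnt/datalake/gold/sales'
""")

spark.sql("""
  CREATE OR REPLACE VIEW gold.v_sales_per_day AS
  SELECT order_date, SUM(amount) AS total_amount
  FROM gold.sales
  GROUP BY order_date
""")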

Does that clear up anything?

Anonymous
Not applicable

Hey there @Jayesh Mehta​ and @Josef Prakljacic​ 

Hope everything is going great.

Checking in. Does @Werner Stinckens's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Cheers!

mmlime
New Contributor III

Hi,

I agree with @Josef Prakljacic. If Databricks would like to compete with SQL DWH/Synapse or Snowflake and target DWH users, it should provide some guidelines on how to manage "database" objects.

Yes, @Werner Stinckens, with an engineering workspace and PySpark-style notebooks you create your objects directly from write statements. But now, with Delta Lake and especially identity columns, you need to create your tables ahead of time, when you are setting up your environment.

And sorry, but it is not only about creating tables. Later in production, when tables are populated, you may want to rename a column or perform some other ALTER TABLE operation. And in the best case you would like to use a data definition language (DDL) to do that and test it in lower environments (a database project for SQL Server is a great example).

So let's discuss: how do you solve this in your production setup?

Use case:

I have a table `Customer` with 10,000 records already in production. I would like to rename the column `surname` to `lastName`.

I have a standard dev flow: DEV -> TEST -> STAGE -> PROD.

How do I prepare the change in DEV, test it in TEST, and propagate it to PROD?

Here is how I could start:

I have a customer.sql file with the table DDL.

I add an ALTER TABLE command to it.

I run this SQL at the beginning of the CD pipeline.

I test my pipelines after the change in TEST and, if that succeeds, deploy to PROD.

OK, fine, great. But now I have the ALTER TABLE command in my SQL file and it fails on the next CD pipeline run, because the `surname` column no longer exists.
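
(I could of course guard the ALTER so it only runs while the old column still exists - a sketch with the table from my example is below - but that quickly turns into hand-rolled migration tooling.)

%python
# Sketch: only run the rename when the old column is still there, so reruns don't fail.
# Note: renaming a column on a Delta table also requires column mapping
# ('delta.columnMapping.mode' = 'name') to be enabled on the table.
cols = [c.lower() for c in spark.table("Customer").columns]
if "surname" in cols:
    spark.sql("ALTER TABLE Customer RENAME COLUMN surname TO lastName")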

How did you solve this @Josef Prakljacic​ ?

Thanks.

-werners-
Esteemed Contributor III

mmlime
New Contributor III

Thanks, but we are not looking for the command to rename a column. We are looking for a process to propagate it from dev to test, test it, and release it to production, in the real world (with data and running pipelines), not a demo notebook.

-werners-
Esteemed Contributor III

Besides the CI/CD link I posted above, there is not a lot to be found.

A lot of companies deploy their scripts using jars. When using that approach you can apply software engineering practices to your scripts.

But that being said: Databricks is focusing heavily on notebooks (and Python) and slowly steering away from the 'old' Spark way of working.

It would be awesome to have a software-engineering-friendly notebook framework, something that is lacking right now.

There is DBX, and Delta Live Tables (Python/SQL only), but those are far from ideal.

Something that IS very valuable, though, are 'git for data' frameworks like LakeFS or Nessie: always use production data for dev/QA and commit to branches (which you can then merge or not).

Frede
New Contributor III

Hi @Michal Mlaka​ .

I'm very interested in this topic as well. How did you decide to proceed?

mmlime
New Contributor III

Hi there, to be honest, I'm not currently working on any Databricks projects, so I haven't had the opportunity to work on that. However, I noticed that there is a new feature in the Terraform provider where you can register schemas and tables in Unity Catalog. This is something I would consider for a proof of concept when I work on Databricks in future projects.

Frede
New Contributor III

Ok, thanks for the reply.

I've landed on a different solution. Even though Terraform could be a neat option, I would prefer to use it primarily for IaC-type operations.

What I've done (as opposed to having DDL in many different notebooks) is to have a single DDL notebook. This notebook reads queries from a control table in our lake (this can also reside in a dedicated DB to easily cover multiple DBX workspaces).

DDL statements are registered in this table and marked as 'pending' if they are to be executed with the next release. As part of the release process, they are then executed and marked as non-pending again. This ensures statements are only executed once (for example, an ALTER statement), while at the same time keeping your DDL history.

Here is the code. Hope it can help someone else:

%sql
-- Inserting the CREATE TABLE statement for the Calendar dimension into the control table
 
INSERT INTO
  Control.DDL (
    SchemaName,
    TableName,
    QueryString,
    QueryOrder,
    PendingIndicator,
    LastExecutedDateTime,
    DLCreatedDateTime
  )
 VALUES(
    "gold",
    "DimCalendar",
    """CREATE TABLE Gold.DimCalendar (
      DLKey BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 0 INCREMENT BY 1),
      CalendarDateID DATE NOT NULL,
      DayShortName STRING NOT NULL,
      DayLongName STRING NOT NULL,
      WeekDayNumber INTEGER NOT NULL,
      MonthShortName STRING NOT NULL,
      MonthLongName STRING NOT NULL,
      MonthDayNumber INTEGER NOT NULL,
      QuarterNumber INTEGER NOT NULL,
      QuarterShortName STRING NOT NULL,
      QuarterLongName STRING NOT NULL,
      YearShort INTEGER NOT NULL,
      YearLong INTEGER NOT NULL,
      YearDayNumber INTEGER NOT NULL,
      YearWeekISONumber INTEGER NOT NULL,
      YearMonthNumber INTEGER NOT NULL,
      OrdinalDayNumber INTEGER NOT NULL,
      OrdinalCalendarYearNumber INTEGER NOT NULL,
      DLCreatedDateTime TIMESTAMP NOT NULL,
      DLUpdatedDateTime TIMESTAMP NOT NULL
    ) USING DELTA LOCATION '/mnt/ContainerNameGold/DimCalendar'""",
    0,
    1,
    NULL,
    Current_Timestamp
  )
 
%python
### this is the logic to execute the pending DDLs
  
# Fetch all pending DDL statements in a deterministic order
ExecutionDF = spark.sql("SELECT * FROM Control.DDL WHERE PendingIndicator IS TRUE ORDER BY SchemaName, TableName, QueryOrder")
ExecutionList = ExecutionDF.collect()
 
for i in ExecutionList:
    # Run the registered DDL, then mark it as executed so it only ever runs once
    print(i["QueryString"])
    spark.sql(i["QueryString"])
    spark.sql(f"UPDATE Control.DDL SET PendingIndicator = 0, QueryOrder = 0, LastExecutedDateTime = CURRENT_TIMESTAMP WHERE ID = {i['ID']}")
 
%sql
-- One-time setup: creating the DDL control table
CREATE TABLE IF NOT EXISTS Control.DDL (
ID BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
SchemaName STRING NOT NULL,
TableName STRING NOT NULL,
QueryString STRING NOT NULL,
QueryOrder INT,
PendingIndicator BOOLEAN NOT NULL,
LastExecutedDateTime TIMESTAMP,
DLCreatedDateTime TIMESTAMP
) USING DELTA LOCATION '/mnt/ContainerName/Control/DDL'
 
 
