06-08-2022 07:55 AM
We have built multiple tables and views under Databricks SQL and are unable to figure out how to take this code and deploy it to our higher environments. We need some guidance, as we haven't been able to find any relevant information in the documentation we've searched.
06-09-2022 03:01 AM
Can you explain what you mean by 'higher environments', and whether it is the data being written or the code that you want to put there?
06-09-2022 04:19 AM
I mean QA, PreProd & production environments for "code" deployments.
06-09-2022 04:28 AM
https://docs.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops
You don't have to follow the whole CI/CD procedure. You could, for example, opt for only code promotion to QA/PRD using git + Databricks Repos.
In a small team this could even be done manually (commits using Repos on git and pull requests).
So basically git + Repos is the way.
06-09-2022 04:24 AM
Hi @Jayesh Mehta, can you please take a look at the article above? Please let us know if this helps.
07-21-2022 11:22 PM
Dear Databricks colleagues, the above answers do not help - sorry to say that. We as enterprise customers are well aware of how Azure DevOps works and how to do a Databricks deployment in general.
What I am missing is a SQL Server database project equivalent that I can deploy.
How can I control which tables, views, stored procedures are deployed to the different stages?
How can I control that tables are created first, then views, then views that reference other views, and so on?
Which engine generates a change script for me that executes an ALTER VIEW if necessary?
07-25-2022 12:16 AM
Code is written in notebooks, which can be deployed using git CI/CD.
Database artifacts in Spark (tables/views) are nothing more than semantic views on top of data lake files (Parquet, Delta Lake, ORC, etc.).
The creation of those tables can be done with a Spark script, which typically resides in a notebook (or a jar or a wheel), as sketched below.
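For example, a minimal sketch of such notebook cells (the schema, table and view names and the mount path are just placeholders):
%sql
-- Register a Delta table over files that already sit in the data lake
CREATE TABLE IF NOT EXISTS sales.orders (
  order_id BIGINT,
  order_date DATE,
  amount DECIMAL(18, 2)
) USING DELTA LOCATION '/mnt/datalake/sales/orders'
%sql
-- A view is just a named query on top of that table
CREATE OR REPLACE VIEW sales.orders_2022 AS
SELECT order_id, order_date, amount FROM sales.orders WHERE year(order_date) = 2022
Promoting that notebook through git (Repos + pull requests) then works like promoting any other code.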
Does that clear up anything?
08-16-2022 07:50 AM
Hey there @Jayesh Mehta and @Josef Prakljacic
Hope everything is going great.
Checking in. Does @Werner Stinckens's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? Else please let us know if you need more help.
We'd love to hear from you.
Cheers!
10-21-2022 10:16 AM
Hi,
I agree with @Josef Prakljacic. If Databricks would like to compete with SQL DWH/Synapse or Snowflake and target DWH users, it should provide some guidelines on how to manage "database" objects.
Yes, @Werner Stinckens, with an engineering workspace and PySpark-style notebooks you create your objects directly from write statements. But now, with Delta Lake and especially identity columns, you need to create your tables up front, when you set up your environment.
And sorry, but it is not only about creating tables. Later in production, when tables are populated, you may want to rename a column or perform some other ALTER TABLE operation. Ideally you would like to use data definition language (DDL) to do that and to test it in lower environments. (A SQL Server database project is a great example.)
So let's discuss: how do you solve this in your production environment?
Use case:
I have a table `Customer` with 10,000 records already in production. I would like to rename the column `surname` to `lastName`.
I have a standard dev flow: DEV -> TEST -> STAGE -> PROD.
How do I prepare the change in DEV, test it in TEST, and propagate it to PROD?
Here is how I could start:
I have a customer.sql file with the table DDL.
I add an ALTER TABLE command to it.
I run this SQL at the beginning of the CD pipeline.
I test my pipelines after the change in TEST and, if they succeed, deploy to PROD.
OK, fine, great. But now I have an ALTER TABLE command in my SQL file, and it fails on the next CD pipeline run, because the `surname` column no longer exists (sketched below).
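To make the failure concrete, the customer.sql in this scenario would look roughly like this (a sketch based on the example above; note that on Delta the rename itself also requires column mapping to be enabled):
-- customer.sql
-- Original DDL: safe to re-run in every environment
CREATE TABLE IF NOT EXISTS Customer (
  id BIGINT,
  surname STRING
) USING DELTA

-- Added for this release: succeeds exactly once, then breaks every later
-- pipeline run because `surname` no longer exists
ALTER TABLE Customer RENAME COLUMN surname TO lastName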
How did you solve this @Josef Prakljacic ?
Thanks.
10-24-2022 04:14 AM
How to change a column name on Delta Lake:
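For example (a sketch using the `Customer` table from the question above; Delta only allows a rename once column mapping is enabled on the table):
%sql
-- Enable column mapping so Delta can track renamed columns (this upgrades the table protocol)
ALTER TABLE Customer SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
)
%sql
-- The actual rename
ALTER TABLE Customer RENAME COLUMN surname TO lastName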
10-24-2022 05:10 AM
Thanks, but we are not looking for the command to rename a column. We are looking for a process for propagating it from dev to test, testing it, and releasing it to production, in the real world (with data, running pipelines, ...), not a demo notebook.
10-24-2022 05:18 AM
Besides the CI/CD link I posted above, there is not a lot to be found.
A lot of companies deploy their scripts using jars. When using that approach you can apply software engineering practices to your scripts.
But that being said: Databricks is focusing heavily on notebooks (and Python) and slowly steering away from the 'old' Spark way of working.
It would be awesome to have a software engineering framework for notebooks, something that is lacking right now.
There is DBX, and Delta Live Tables (Python/SQL only), but those are far from ideal.
Something that IS very valuable, though, are 'git for data' frameworks like LakeFS or Nessie: you can always use production data for dev/QA and commit to branches (which you can then merge or not).
02-08-2023 06:11 AM
Hi @Michal Mlaka .
I'm very interested in this topic as well. How did you decide to proceed?
02-15-2023 12:56 AM
Hi there, to be honest, I'm not currently working on any Databricks projects, so I haven't had the opportunity to work on that. However, I noticed that there is a new feature in the Terraform provider where you can register schemas and tables in Unity Catalog. This is something I would consider for a proof of concept when I work on Databricks in future projects.
02-15-2023 05:59 AM
Ok, thanks for the reply.
I've landed on a different solution. Even though Terraform could be a neat option, I would prefer to primarily use it for IaC-type operations.
What I've done (as opposed to having DDL in many different notebooks) is to have a single DDL notebook. This notebook reads queries from a Control Table in our lake (this table can also reside in a dedicated DB to easily cover multiple DBX workspaces).
DDLs are registered in this table and marked as 'pending' if they are to be executed for the next release. As part of the release process, they are executed and then marked as non-pending again. This ensures statements are only executed once (an ALTER statement, for example), while at the same time keeping your DDL history.
Here is the code. Hope it can help someone else:
%sql
-- Inserting the CREATE TABLE statement for the Calendar dimension into the control table
INSERT INTO
Control.DDL (
  SchemaName,
  TableName,
  QueryString,
  QueryOrder,
  PendingIndicator,
  LastExecutedDateTime,
  DLCreatedDateTime
)
VALUES(
  "gold",
  "DimCalendar",
  "CREATE TABLE Gold.DimCalendar (
    DLKey BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 0 INCREMENT BY 1),
    CalendarDateID DATE NOT NULL,
    DayShortName STRING NOT NULL,
    DayLongName STRING NOT NULL,
    WeekDayNumber INTEGER NOT NULL,
    MonthShortName STRING NOT NULL,
    MonthLongName STRING NOT NULL,
    MonthDayNumber INTEGER NOT NULL,
    QuarterNumber INTEGER NOT NULL,
    QuarterShortName STRING NOT NULL,
    QuarterLongName STRING NOT NULL,
    YearShort INTEGER NOT NULL,
    YearLong INTEGER NOT NULL,
    YearDayNumber INTEGER NOT NULL,
    YearWeekISONumber INTEGER NOT NULL,
    YearMonthNumber INTEGER NOT NULL,
    OrdinalDayNumber INTEGER NOT NULL,
    OrdinalCalendarYearNumber INTEGER NOT NULL,
    DLCreatedDateTime TIMESTAMP NOT NULL,
    DLUpdatedDateTime TIMESTAMP NOT NULL
  ) USING DELTA LOCATION '/mnt/ContainerNameGold/DimCalendar'",
  0,
  true,
  NULL,
  current_timestamp()
)
%python
# This is the logic that executes the pending DDLs, in a deterministic order
ExecutionDF = spark.sql("SELECT * FROM Control.DDL WHERE PendingIndicator IS TRUE ORDER BY SchemaName, TableName, QueryOrder")
ExecutionList = ExecutionDF.collect()
for i in ExecutionList:
    # Run the registered statement, then mark it as no longer pending so it is only ever executed once
    print(i["QueryString"])
    spark.sql(i["QueryString"])
    spark.sql(f"UPDATE Control.DDL SET PendingIndicator = false, QueryOrder = 0, LastExecutedDateTime = CURRENT_TIMESTAMP WHERE ID = {i['ID']}")
%sql
-- Creating the DDL Control table:
CREATE TABLE IF NOT EXISTS Control.DDL (
ID BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
SchemaName STRING NOT NULL,
TableName STRING NOT NULL,
QueryString STRING NOT NULL,
QueryOrder INT,
PendingIndicator BOOLEAN NOT NULL,
LastExecutedDateTime TIMESTAMP,
DLCreatedDateTime TIMESTAMP
) USING DELTA LOCATION '/mnt/ContainerName/Control/DDL'
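For the rename scenario discussed above, the release entry could be registered like this (a sketch reusing the Control.DDL columns from the code above, and assuming the Customer table lives in the gold schema with column mapping already enabled):
%sql
-- One-off rename registered as pending; the executor above runs it once,
-- stamps LastExecutedDateTime and flips PendingIndicator back to false,
-- so later CD pipeline runs skip it
INSERT INTO
Control.DDL (
  SchemaName,
  TableName,
  QueryString,
  QueryOrder,
  PendingIndicator,
  LastExecutedDateTime,
  DLCreatedDateTime
)
VALUES(
  "gold",
  "Customer",
  "ALTER TABLE Gold.Customer RENAME COLUMN surname TO lastName",
  0,
  true,
  NULL,
  current_timestamp()
)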