Delta Table storage best practices

Gim
Contributor

Hi!

We have a project where we do some data engineering for a client. I implemented scheduled batch processing with Databricks Auto Loader (a stream with the availableNow trigger), since they primarily have numerous file exports from several sources. We want to follow the medallion architecture and ELT.

The endpoint of the to-be-built data warehouse, however, will primarily be a MySQL database hosted in Azure. As of now, my Auto Loader output goes to Delta tables, but not in the default Hive metastore. Instead, I save them to containers in an ADLS storage account, organized by source.

From here, I intend to build out the rest of the silver and gold layer tables in separate containers.

  1. Based on my current approach and the intended use case, what barriers/challenges would I face? What would be good alternatives?
  2. What is the actual best practice for handling Delta tables and the locations from where they are stored?

Appreciate any inputs provided!


7 REPLIES

Ryan_Chynoweth
Honored Contributor III (Accepted Solution)

Hi Gim, please reach out to your or the client's Databricks account team; they should be able to assist further. I would also look into Unity Catalog, as it will make your data management much easier.

Having your bronze, silver, and gold layers in separate containers is fine. They could also live in the same container; the differences are not that big, but one container per zone is a good default.

(1) I would not recommend using a MySQL database for your DW. You are already storing all your data in Delta on ADLS, so you should use Databricks SQL as your SQL compute. If you use MySQL, you will replicate data and add unneeded complexity. You can do ETL, data warehousing, BI, streaming, and ML in Databricks.
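
As a rough sketch of what serving the warehouse directly from Delta could look like (the gold table and column names below are made up for illustration):

# hypothetical gold table and columns -- BI-style warehouse queries
# can run directly against Delta tables in Databricks SQL or a notebook
spark.sql("""
SELECT customer_region, SUM(order_amount) AS total_sales
FROM gold.sales_orders
GROUP BY customer_region
""").show()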

(2) When working with Databricks you should store ALL your business data in your ADLS storage account, just like you are doing. You can also create databases in Databricks with a LOCATION, which lets you register the tables in the Hive metastore while the data itself is written to ADLS. If you use Unity Catalog (mentioned above), this is handled for you automatically!

Here is how to create a database with a location and save a DataFrame as a table:

%python

# create the database at an explicit ADLS location
spark.sql("CREATE DATABASE IF NOT EXISTS my_db LOCATION 'abfss://mycontainer@mystorage.dfs.core.windows.net/path/to/db'")

# set the default database
spark.sql("USE my_db")

# read data into a DataFrame
df = spark.read.json("/path/to/file")

# write as a Delta table -- note that it will be saved to the default location of the database
df.write.saveAsTable("my_table")

If you already wrote a table to Delta in ADLS but want to register it in the Hive metastore:

# write data to an explicit location
df.write.format("delta").save("/path/to/table")

# register the existing Delta files as a table in the database
# (CREATE TABLE ... USING DELTA LOCATION registers the files in place without copying them)
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table
USING DELTA
LOCATION '/path/to/table'
""")

# read the data with SQL
spark.sql("SELECT * FROM my_table").show()

Gim
Contributor

Hi @Ryan Chynoweth (Databricks)

Apologies for not getting back as soon as I hoped. Appreciate the inputs!

(1) After further discussions and more clarity on the project requirements, I agree it is no longer necessary to store the DW in MySQL. The only reason it was considered in the first place was that it was the only RDBMS solution the client had. Now that they have Databricks, we might as well leverage the platform and its architecture. I also personally would not want to deal with the extra complexity of maintaining pipelines just to keep tables up to date in both MySQL and Databricks.

(2) I never realized I could both register tables in the Hive metastore AND store the Delta tables in ADLS. I always thought it had to be one or the other. I will definitely look into Unity Catalog and see if I can implement it soon. Can this be done within a single readStream and writeStream call? Here's what I have currently:

(spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("header", "true")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(adls_raw_file_path)

    .writeStream
    .format("delta")
    .outputMode("append")
    .queryName(query_name)
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .start(adls_delta_table_path)
)

Ryan_Chynoweth
Honored Contributor III

Hi @Gimwell Young,

Happy that it helped! With Unity Catalog you specify a default ADLS location for all your data, so this is handled for you. Please reach out to your Databricks account team to learn more.
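
For reference, here is a rough sketch of what that setup could look like, assuming Unity Catalog is already enabled on the workspace; the catalog, schema, and storage names below are placeholders:

# placeholder names -- requires a Unity Catalog-enabled workspace and
# privileges to create catalogs and schemas
spark.sql("""
CREATE CATALOG IF NOT EXISTS client_dw
MANAGED LOCATION 'abfss://lakehouse@mystorage.dfs.core.windows.net/managed'
""")

spark.sql("CREATE SCHEMA IF NOT EXISTS client_dw.silver")

# tables written into this schema are managed by Unity Catalog and their
# files land under the managed location above -- no per-table path needed
spark.sql("USE CATALOG client_dw")
spark.sql("USE SCHEMA silver")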

Creating and writing to a managed table would look something like the following. Note that there are many ways to do this; the only difference from your code is the "toTable" in the write stream.

df = (spark
   .readStream
   .format("cloudFiles") 
   .option("cloudFiles.format", source_format)
   .option('header', 'true')  
   .option("cloudFiles.schemaLocation", schema_path)
   .load(adls_raw_file_path)
)
 
(df.writeStream
    .format('delta')
    .outputMode('append')
    .queryName(query_name)
    .option('checkpointLocation', checkpoint_path)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("<TABLE NAME HERE>")
)

Gim
Contributor

Hi @Ryan Chynoweth​ 

Oh okay. I did have toTable initially for this stream call before setting it to an ADLS path as the target.

To clarify: for it to work, I set it back to toTable so the table is saved to the metastore by default, and the ADLS portion of our requirement is handled by Unity Catalog. Yes? How coupled would the tables and files be?

Ryan_Chynoweth
Honored Contributor III

That is correct. Unity Catalog designates a storage location for all data within a metastore, so when you save as a table the data is stored in an ADLS account.

This would create a managed table, which means the data and metadata are coupled, i.e. if you drop the table the data is also deleted.

If you create an external table they are decoupled, i.e. if you drop the table the data is still available in storage.
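
To illustrate the difference, a small sketch (the database, table, and path names below are placeholders):

# any DataFrame will do for the example
df = spark.range(5)

# external table: metadata in the metastore, data at a path you control
spark.sql("""
CREATE TABLE IF NOT EXISTS my_db.my_external_table
USING DELTA
LOCATION 'abfss://mycontainer@mystorage.dfs.core.windows.net/path/to/table'
""")

# dropping it only removes the metastore entry; the Delta files remain in storage
spark.sql("DROP TABLE my_db.my_external_table")

# managed table: no LOCATION clause, files live under the metastore's default location;
# dropping it deletes the metadata AND the underlying data
df.write.saveAsTable("my_db.my_managed_table")
spark.sql("DROP TABLE my_db.my_managed_table")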

I would recommend reaching out to yours or your customer's Databricks account team to learn more and help with the overall architecture.

shan_chandra
Honored Contributor III

@Gimwell Young - Please refer to the documentation below on Delta Lake best practices.

https://docs.databricks.com/delta/best-practices.html

Also, I would suggest going over Delta Live Tables and the quick demo here to implement the medallion architecture: https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables#intro
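
If that route is of interest, a minimal Delta Live Tables sketch of the bronze-to-silver hop could look roughly like the following (the paths and the order_id column are placeholders, not code from the linked demo):

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Bronze: raw files ingested with Auto Loader")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("abfss://raw@mystorage.dfs.core.windows.net/orders/")
    )

@dlt.table(comment="Silver: cleaned orders")
def silver_orders():
    # order_id is a hypothetical column, used only to show a simple cleanup step
    return dlt.read_stream("bronze_orders").where(col("order_id").isNotNull())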

jose_gonzalez
Moderator

Hi @Gimwell Young,

Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.
