Re: Delta Table storage best practices

Gim · 09-27-2022

Apologies for not getting back as soon as I hoped. I appreciate the input!

(1) After further discussion and more clarity on the project requirements, I agree that storing the DW in MySQL is no longer required. It was probably only considered in the first place because MySQL was the only RDBMS the client had. Now that they have Databricks, we might as well leverage its architecture and platform. Personally, I also would not want the extra complexity of maintaining pipelines just to keep tables up to date in both MySQL and Databricks.

(2) I never realized that I could both register tables in the Hive metastore AND store the Delta tables in ADLS. I always thought it had to be one or the other. I will definitely look into Unity Catalog and see if I can implement it soon. Can the registration and the write to ADLS be done within a single readStream and writeStream call? Here's what I have currently:

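# Parentheses let the method chain span multiple lines in Python.
(
    spark.readStream
    # Auto Loader source: incrementally picks up new files from ADLS
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("header", "true")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(adls_raw_file_path)
    # Append to a Delta table at the given ADLS path
    .writeStream
    .format("delta")
    .outputMode("append")
    .queryName(query_name)
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")
    # Process everything currently available, then stop
    .trigger(availableNow=True)
    .start(adls_delta_table_path)
)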
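
For reference, here is a minimal sketch of how I imagine registering the table and writing to ADLS in a single call could look. It is not a confirmed answer. It assumes Spark 3.1+ (where `DataStreamWriter.toTable` is available) and that setting the `path` option makes the metastore entry an external table pointing at the ADLS location. The function name `start_autoloader_to_table` and the `table_name` parameter are hypothetical and not part of my current pipeline; the other names are the same variables as in the snippet above.

```python
def start_autoloader_to_table(
    spark,
    source_format,
    schema_path,
    adls_raw_file_path,
    adls_delta_table_path,
    checkpoint_path,
    query_name,
    table_name,  # hypothetical, e.g. "my_db.my_table"
):
    """Start an Auto Loader stream that writes Delta files to ADLS and
    registers the result as a metastore table in the same writeStream call.

    Assumption: `.option("path", ...)` combined with `.toTable(...)` creates
    the table at that external location if it does not exist yet.
    """
    raw_stream = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", source_format)
        .option("header", "true")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(adls_raw_file_path)
    )

    return (
        raw_stream.writeStream
        .format("delta")
        .outputMode("append")
        .queryName(query_name)
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        # Data files land in ADLS rather than the metastore's default location
        .option("path", adls_delta_table_path)
        .trigger(availableNow=True)
        # toTable() starts the query and writes to the named table,
        # replacing the .start(adls_delta_table_path) call
        .toTable(table_name)
    )
```

The only changes from my current code would be the extra `path` option and swapping `.start(adls_delta_table_path)` for `.toTable(table_name)`. If the Delta files already exist in ADLS, I assume the alternative would be a one-off `CREATE TABLE ... USING DELTA LOCATION ...` statement pointing at the same path.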