<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to run spark sql file through Azure Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58360#M31110</link>
    <description>&lt;P&gt;We have a process that writes Spark SQL to files; in production it will generate thousands of these files.&lt;BR /&gt;The files are created in an ADLS Gen2 directory.&lt;/P&gt;&lt;P&gt;Sample Spark file:&lt;/P&gt;&lt;P&gt;---&lt;BR /&gt;import org.apache.spark.sql.functions.col&lt;BR /&gt;val df2023I = spark.sql("select rm.* from reu_master rm where rm.year = 2023 and rm.system_part='I'")&lt;BR /&gt;val criteria1_r1 = df2023I.filter(col("field_id") === "nknk" || col("field_id") === "gei")&lt;BR /&gt;criteria1_r1.write.mode("overwrite").save(path_to_adls_dir)&lt;/P&gt;&lt;P&gt;--------&lt;/P&gt;&lt;P&gt;We are exploring the best way to invoke these files from Azure Databricks. We would like to avoid reading each file into a variable through Python and passing that variable to the spark.sql statement.&lt;/P&gt;</description>
    <pubDate>Wed, 24 Jan 2024 19:41:57 GMT</pubDate>
    <dc:creator>amama</dc:creator>
    <dc:date>2024-01-24T19:41:57Z</dc:date>
    <item>
      <title>How to run spark sql file through Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58360#M31110</link>
      <description>&lt;P&gt;We have a process that writes Spark SQL to files; in production it will generate thousands of these files.&lt;BR /&gt;The files are created in an ADLS Gen2 directory.&lt;/P&gt;&lt;P&gt;Sample Spark file:&lt;/P&gt;&lt;P&gt;---&lt;BR /&gt;import org.apache.spark.sql.functions.col&lt;BR /&gt;val df2023I = spark.sql("select rm.* from reu_master rm where rm.year = 2023 and rm.system_part='I'")&lt;BR /&gt;val criteria1_r1 = df2023I.filter(col("field_id") === "nknk" || col("field_id") === "gei")&lt;BR /&gt;criteria1_r1.write.mode("overwrite").save(path_to_adls_dir)&lt;/P&gt;&lt;P&gt;--------&lt;/P&gt;&lt;P&gt;We are exploring the best way to invoke these files from Azure Databricks. We would like to avoid reading each file into a variable through Python and passing that variable to the spark.sql statement.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2024 19:41:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58360#M31110</guid>
      <dc:creator>amama</dc:creator>
      <dc:date>2024-01-24T19:41:57Z</dc:date>
    </item>
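    <!--
    The question above asks how to execute generated Spark SQL files from a notebook. As a point of reference, here is a minimal, hypothetical sketch of the read-and-execute baseline the poster would prefer to avoid, assuming the files contain plain SQL statements; the helper name and paths are illustrative, and spark/dbutils exist only inside a Databricks notebook.

    ```python
    def split_sql_statements(text: str) -> list[str]:
        """Split a SQL script on semicolons, ignoring semicolons that
        appear inside single-quoted string literals."""
        statements, current, in_quote = [], [], False
        for ch in text:
            if ch == "'":
                in_quote = not in_quote
                current.append(ch)
            elif ch == ";" and not in_quote:
                stmt = "".join(current).strip()
                if stmt:
                    statements.append(stmt)
                current = []
            else:
                current.append(ch)
        tail = "".join(current).strip()
        if tail:
            statements.append(tail)
        return statements

    # Inside a Databricks notebook one could then run, per file:
    # for stmt in split_sql_statements(dbutils.fs.head("/mnt/adls/scripts/job1.sql")):
    #     spark.sql(stmt)
    ```

    For files that embed Scala DataFrame code, as in the sample above, this baseline is not sufficient, which is why the replies point toward executing the scripts as notebook tasks instead.
    -->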
    <item>
      <title>Re: How to run spark sql file through Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58380#M31119</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98755"&gt;@amama&lt;/a&gt;&amp;nbsp;- you can mount the ADLS storage location in Databricks. Since these files contain Scala code, you can create a Workflow with tasks that execute the Scala scripts, passing the mount location as the input.&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jan 2024 04:10:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58380#M31119</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-01-25T04:10:15Z</dc:date>
    </item>
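    <!--
    The reply above suggests mounting the ADLS storage location in Databricks. A hedged sketch of what that mount could look like from a notebook, using the standard ABFS OAuth configuration keys; the storage account, container, secret scope, and placeholder values are illustrative and not from the thread.

    ```python
    # OAuth (service principal) configuration for an ADLS Gen2 mount.
    # In a real notebook, pull the secrets from a secret scope, e.g.:
    #   dbutils.secrets.get(scope="my-scope", key="sp-client-id")
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<service-principal-client-id>",
        "fs.azure.account.oauth2.client.secret": "<service-principal-secret>",
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # dbutils is only defined inside Databricks, so the mount call itself
    # is shown commented out:
    # dbutils.fs.mount(
    #     source="abfss://scripts@<storage-account>.dfs.core.windows.net/",
    #     mount_point="/mnt/adls",
    #     extra_configs=configs,
    # )
    ```

    Once mounted, the generated script files are visible to every cluster under /mnt/adls.
    -->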
    <item>
      <title>Re: How to run spark sql file through Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58624#M31217</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/616"&gt;@shan_chandra&lt;/a&gt;&amp;nbsp;- The workflow is implemented in Azure Data Factory; the MapReduce process we plan to replace with a Databricks notebook will be invoked by ADF.&lt;/P&gt;&lt;P&gt;Essentially, we would like to call all of these scripts (Pig-equivalent Spark scripts) from a notebook, and that notebook will be an activity in ADF.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jan 2024 18:43:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58624#M31217</guid>
      <dc:creator>amama</dc:creator>
      <dc:date>2024-01-29T18:43:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to run spark sql file through Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58638#M31222</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98755"&gt;@amama&lt;/a&gt;&amp;nbsp;- using the Databricks Notebook activity in ADF, invoke each script as an individual notebook by specifying its notebook path, and configure the Databricks linked service in ADF.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jan 2024 23:01:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-spark-sql-file-through-azure-databricks/m-p/58638#M31222</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-01-29T23:01:38Z</dc:date>
    </item>
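    <!--
    The reply above recommends invoking each script as its own notebook run. For comparison, a hedged sketch of a one-off run submitted directly through the Databricks Jobs API 2.1 runs/submit endpoint; the notebook path, cluster id, and "script_path" parameter name are illustrative assumptions, not from the thread.

    ```python
    # Hypothetical helper that builds a Jobs API 2.1 runs/submit payload to
    # run a driver notebook once per generated script. The driver notebook
    # would read its "script_path" parameter and execute that file.
    def build_run_payload(notebook_path: str, script_path: str, cluster_id: str) -> dict:
        return {
            "run_name": f"run {script_path}",
            "tasks": [
                {
                    "task_key": "run_script",
                    "existing_cluster_id": cluster_id,
                    "notebook_task": {
                        "notebook_path": notebook_path,
                        "base_parameters": {"script_path": script_path},
                    },
                }
            ],
        }

    # The payload would then be POSTed to
    # https://<workspace-url>/api/2.1/jobs/runs/submit with a bearer token.
    ```

    An ADF Notebook activity builds an equivalent request for you once the Databricks linked service is configured; the payload is shown here only to make the per-script invocation explicit.
    -->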
  </channel>
</rss>

