<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Running a jar on Databricks shared cluster using Airflow in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/99972#M40160</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112661"&gt;@ayush19&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;Here are some suggestions, but would need to check how are the parameters configured.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Use an Existing Cluster&lt;/STRONG&gt;: Instead of creating a new cluster each time, configure the DatabricksSubmitRunOperator to use an existing cluster. This can be done by specifying the existing_cluster_id parameter in the operator. This way, the cluster will not restart, and the jar file will not be reinstalled.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Cluster Configuration&lt;/STRONG&gt;: Ensure that the cluster configuration does not force instance replacement upon restart. According to the context, one way to achieve this is by disabling multi-AZ (Availability Zone) selection in the cluster configuration. This can help in reusing the same instances rather than creating new ones&lt;/P&gt;</description>
    <pubDate>Mon, 25 Nov 2024 15:53:07 GMT</pubDate>
    <dc:creator>Alberto_Umana</dc:creator>
    <dc:date>2024-11-25T15:53:07Z</dc:date>
    <item>
      <title>Running a jar on Databricks shared cluster using Airflow</title>
      <link>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/99928#M40145</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a requirement to run a jar already installed on a Databricks cluster. It needs to be orchestrated using Apache Airflow.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I followed the docs for the operator which can be used to do so:&amp;nbsp;&lt;A href="https://airflow.apache.org/docs/apache-airflow-providers-databricks/1.0.0/operators.html" target="_blank"&gt;https://airflow.apache.org/docs/apache-airflow-providers-databricks/1.0.0/operators.html&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The issue is that every time I run this DAG, the cluster restarts and the jar file is installed on the cluster again. The file is already stored in a Volume and installed on the cluster, yet the cluster restarts and the jar is re-installed.&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can I avoid this?&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2024 07:05:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/99928#M40145</guid>
      <dc:creator>ayush19</dc:creator>
      <dc:date>2024-11-25T07:05:16Z</dc:date>
    </item>
    <item>
      <title>Re: Running a jar on Databricks shared cluster using Airflow</title>
      <link>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/99972#M40160</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112661"&gt;@ayush19&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;Here are some suggestions, but would need to check how are the parameters configured.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Use an Existing Cluster&lt;/STRONG&gt;: Instead of creating a new cluster each time, configure the DatabricksSubmitRunOperator to use an existing cluster. This can be done by specifying the existing_cluster_id parameter in the operator. This way, the cluster will not restart, and the jar file will not be reinstalled.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Cluster Configuration&lt;/STRONG&gt;: Ensure that the cluster configuration does not force instance replacement upon restart. According to the context, one way to achieve this is by disabling multi-AZ (Availability Zone) selection in the cluster configuration. This can help in reusing the same instances rather than creating new ones&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2024 15:53:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/99972#M40160</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2024-11-25T15:53:07Z</dc:date>
    </item>
    <item>
      <title>Re: Running a jar on Databricks shared cluster using Airflow</title>
      <link>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/100043#M40176</link>
      <description>&lt;P&gt;Hi Alberto,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using an existing cluster for it, not creating a new cluster. It is an all-purpose cluster used by multiple people in different regions, so I'm not sure if I can disable multi-AZ. Is there a solution in which I can use an existing instance of the cluster?&amp;nbsp;&lt;BR /&gt;Also, could you please explain why exactly it is restarting? The jar file is already installed on the cluster, so why does it need to be installed again?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 09:18:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-a-jar-on-databricks-shared-cluster-using-airflow/m-p/100043#M40176</guid>
      <dc:creator>ayush19</dc:creator>
      <dc:date>2024-11-26T09:18:00Z</dc:date>
    </item>
  </channel>
</rss>

