Noted, @Retired_mod @RonDeFreitas.
I am currently using Databricks Runtime v12.2 (which is < v13.0). I followed this doc (Databricks Connect for Databricks Runtime 12.2 LTS and below), connected my local terminal to the Databricks cluster, and was able to execute sample Spark code from the terminal using my cluster's compute. In parallel, I was also able to execute code in a remote Jupyter notebook by following the docs.
I do have a couple of questions regarding this, though.
Current architecture of our system for context:
- We have Python scripts, living in a service, that are triggered via Airflow jobs. These scripts run on ECS (wrapped in Airflow's ECS operators). Their primary job is to import data from S3, do some processing, and dump the results back into S3. Today a lot of this computation is done in numpy/pandas/dask, and we want to move it to PySpark by leveraging the Databricks cluster we already have. A rough overview of our goal: build a connector that creates a Spark session, then rewrite the pandas/dask code in Spark so that the underlying compute is Databricks' Spark compute (a rough sketch follows this list).
- We are not inclined to go with the approach of using Databricks operators for now; the goal is to use Databricks Connect and leverage the cluster compute directly.
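To make the connector idea concrete, here is a minimal sketch of what one of these scripts might look like on the legacy Databricks Connect path (DBR 12.2 and below), where `databricks-connect==12.2.*` replaces local pyspark and `SparkSession.builder.getOrCreate()` transparently targets the remote cluster. The bucket paths, the `dropDuplicates()` step, and the function names are placeholders for our actual logic:

```python
# Minimal sketch, assuming databricks-connect==12.2.* is installed and configured
# locally (via `databricks-connect configure` or the DATABRICKS_* env vars) and
# the cluster has S3 access (e.g., through an instance profile).
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    # With legacy Databricks Connect, getOrCreate() transparently returns a
    # session that executes on the configured remote Databricks cluster.
    return SparkSession.builder.getOrCreate()


def run(input_path: str, output_path: str) -> None:
    spark = get_spark()
    df = spark.read.parquet(input_path)
    processed = df.dropDuplicates()  # stand-in for the real pandas/dask processing
    processed.write.mode("overwrite").parquet(output_path)


if __name__ == "__main__":
    run("s3://my-bucket/raw/", "s3://my-bucket/processed/")  # placeholder paths
```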
Question(s):
- Our current Databricks Runtime version is 12.2. I do see relevant info on how to leverage databricks-connect with a remote Spark session in the docs for Databricks Runtime v13.0+. Is an upgrade necessary? Just confirming. (See the contrasting sketch after this list for what I understand the newer API to be.)
- Does the Airflow version matter in this regard? While PoC'ing the Databricks operators, I learned that they require Airflow v2.5+, and our current Airflow version is v2.4.2. Since we are more inclined to use databricks-connect rather than the Databricks operators, the Airflow version shouldn't matter, should it? (A tiny DAG sketch at the end of this post shows my understanding.)
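For contrast, my understanding of the v13.0+ docs is that the newer databricks-connect client (13.0+, Spark Connect based) exposes a `DatabricksSession` builder instead of patching pyspark. A minimal sketch, with host, token, and cluster id as placeholders:

```python
# Sketch of the newer (v13.0+) Databricks Connect client, which is Spark Connect
# based and uses DatabricksSession rather than replacing the local pyspark install.
from databricks.connect import DatabricksSession

spark = (
    DatabricksSession.builder
    .remote(
        host="https://<workspace-instance>.cloud.databricks.com",  # placeholder
        token="<personal-access-token>",                           # placeholder
        cluster_id="<cluster-id>",                                 # placeholder
    )
    .getOrCreate()
)
spark.range(10).show()  # quick connectivity check
```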
Approach(s):
- As part of PoC'ing the approach, I set up the latest Airflow (v2.8.1) locally and followed the Databricks Connect docs, though I faced issues and realized it's probably due to the Databricks Runtime version 12.2 that we have. I will tweak my approach based on the clarification of the questions above.
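For completeness, this is roughly how I picture wiring the connector into Airflow once it works: the databricks-connect code is plain Python, so any operator that can run a Python callable (or the ECS task image containing it) can invoke it, independent of the Databricks provider's Airflow v2.5+ requirement. `etl_job` here is the hypothetical module holding the script sketched earlier:

```python
# Hypothetical minimal DAG illustrating why the Airflow version shouldn't gate
# Databricks Connect: the connector runs as ordinary Python inside the task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl_job import run  # hypothetical module containing the connector sketch above

with DAG(
    dag_id="s3_etl_via_databricks_connect",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_spark_etl",
        python_callable=run,
        op_args=["s3://my-bucket/raw/", "s3://my-bucket/processed/"],  # placeholders
    )
```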