
Connect my Spark code running in AWS ECS to a Databricks cluster

Surajv
New Contributor III

Hi team, 

I wanted to know if there is a way to connect a piece of my PySpark code running in ECS to a Databricks cluster and leverage the Databricks compute using Databricks Connect.

I see Databricks Connect is meant for connecting local IDE code to a Databricks cluster, but is there a way to connect code running in ECS with Databricks?

4 REPLIES

RonDeFreitas
New Contributor II

In addition to the answer from @Retired_mod, I would add that the result set coming back from a Databricks query may be too large to process in memory on your ECS container node. Spark often excels at asynchronous workloads, not immediate result sets.

If you could briefly explain your use case, it would help us make a better recommendation.
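To illustrate the in-memory concern (a rough sketch, not from the original reply; paths and column names are placeholders):

```python
# Hypothetical sketch: assumes a remote session created via databricks-connect
# and placeholder S3 paths / column names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")
result = df.groupBy("customer_id").count()

# Risky on a small ECS task: toPandas() pulls the whole result set into
# the client process's memory.
pdf = result.toPandas()

# Safer for large results: let the Databricks cluster write the output
# directly to S3, so the client only orchestrates the job.
result.write.mode("overwrite").parquet("s3://my-bucket/output/")
```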

Surajv
New Contributor III

Noted @Retired_mod @RonDeFreitas

I am currently using Databricks Runtime v12.2 (which is < v13.0). I followed this doc (Databricks Connect for Databricks Runtime 12.2 LTS and below), connected my local terminal to the Databricks cluster, and was able to execute sample Spark code from the terminal using the cluster's compute. In parallel, I was also able to execute code in a remote Jupyter notebook following the docs.
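For reference, the flow I followed looks roughly like this (a minimal sketch; the values supplied to `databricks-connect configure` are placeholders):

```python
# Sketch of the classic Databricks Connect flow (DBR 12.2 LTS and below).
# Assumes `pip install databricks-connect==12.2.*` and that
# `databricks-connect configure` has been run (workspace URL, token, cluster ID).
from pyspark.sql import SparkSession

# With ~/.databricks-connect in place, the builder returns a session that
# executes on the remote Databricks cluster instead of a local Spark.
spark = SparkSession.builder.getOrCreate()

df = spark.range(10)
print(df.count())  # the count is computed on the Databricks cluster
```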

Though I have a couple of questions regarding this.

Current architecture of our system for context: 

  • I have Python scripts, in a service, triggered via Airflow jobs. These scripts run on ECS (wrapped in Airflow's ECSOperators). The primary job of these scripts is to import data from S3, do some processing, and write the results back to S3. Today a lot of this computation is done in numpy/pandas/dask, and we want to move it to PySpark by leveraging the Databricks cluster that we have. A rough overview of our goal: build a connector that creates the Spark session, then rewrite the pandas/dask code with Spark (a rough sketch follows this list). The underlying compute would be the Databricks Spark cluster.
  • We are not inclined to go with the Databricks operators approach for now, hence the goal is to use Databricks Connect and leverage the cluster compute.
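As referenced above, a rough sketch of what such a connector could look like (module, column, and path names are placeholders, not our actual code):

```python
# Rough sketch of the connector idea. It hides session creation so the
# ECS-run scripts only deal with DataFrames.
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return a SparkSession backed by the Databricks cluster via databricks-connect."""
    return SparkSession.builder.getOrCreate()


def process(input_path: str, output_path: str) -> None:
    spark = get_spark()
    df = spark.read.parquet(input_path)            # replaces pandas.read_parquet
    out = df.groupBy("date").sum("amount")         # placeholder for the pandas/dask logic
    out.write.mode("overwrite").parquet(output_path)


if __name__ == "__main__":
    # Example invocation an Airflow-triggered ECS task might run.
    process("s3://bucket/raw/", "s3://bucket/processed/")
```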

Question(s): 

  • My current Databricks Runtime version is 12.2. I do see some relevant info regarding how to leverage databricks-connect with a remote Spark session in the docs for v13.0+ of the Databricks Runtime. Is an upgrade necessary, just to confirm?
  • Does the Airflow version matter in this regard? While PoC'ing the Databricks operators, I learned that they require Airflow v2.5+, and our current Airflow version is v2.4.2. Since we are more inclined towards using databricks-connect rather than the Databricks operators to use the compute, the Airflow version shouldn't matter, should it?

Approach(es):

  • As part of PoC'ing the approach, I set up the latest Airflow (v2.8.1) locally and followed the Databricks Connect docs, though I faced issues and realized it is probably due to the Databricks Runtime version 12.2 that we have. I will tweak my approach based on the clarification from this question (a sketch of the newer v13.0+ flow is below for reference).
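For reference, the v13.0+ databricks-connect flow described in those docs looks roughly like this (a sketch; host, token, and cluster ID are placeholders):

```python
# Sketch of the newer databricks-connect (DBR 13.0+), which uses Spark Connect.
# Requires `pip install "databricks-connect>=13.0"`; all values are placeholders.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
).getOrCreate()

print(spark.range(5).count())  # executes on the remote cluster via Spark Connect
```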
