
Connect my spark code running in AWS ECS to databricks cluster

Surajv
New Contributor III

Hi team, 

I wanted to know if there is a way to connect a piece of my pyspark code running in ECS to a Databricks cluster and leverage the Databricks compute using Databricks Connect?

I see Databricks Connect is for connecting local IDE code to a Databricks cluster, but do we have a way to connect code running in ECS with Databricks?

4 REPLIES

RonDeFreitas
New Contributor II

In addition to the answer from @Retired_mod, I would also add that the result set coming back from a Databricks query may be too large to process in-memory on your ECS container node. Spark often excels at asynchronous workloads, not immediate result sets.

If you could briefly explain your use-case it would help to make a better recommendation.
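
To illustrate the point, here is a hypothetical sketch (paths and column names are placeholders, not from your setup): collecting results pulls them into the calling process's memory, while writing them out keeps both the computation and the data on the Databricks side.

```python
# Hypothetical illustration -- paths and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/large-dataset/")

# Risky on a small ECS container: collect()/toPandas() pull the entire
# result set into the local process's memory.
# rows = df.collect()

# Safer: keep the heavy lifting and the output on the cluster side,
# and only bring back small aggregates if needed.
df.write.mode("overwrite").parquet("s3://my-bucket/processed/")
summary = df.groupBy("some_key").count().limit(100).toPandas()
print(summary)
```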

Surajv
New Contributor III

Noted @Retired_mod @RonDeFreitas

I am currently using Databricks Runtime v12.2 (which is < v13.0). I followed this doc (Databricks Connect for Databricks Runtime 12.2 LTS and below), connected my local terminal to the Databricks cluster, and was able to execute sample Spark code from the terminal utilising my cluster's compute. In parallel, I was also able to execute code from a remote Jupyter notebook by following the docs.
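
For reference, a minimal sketch of that flow (placeholder values, using the legacy databricks-connect client for DBR 12.2 LTS and below, not my exact setup):

```python
# Legacy Databricks Connect flow for DBR 12.2 LTS and below -- a sketch:
#
#   pip install -U "databricks-connect==12.2.*"
#   databricks-connect configure   # prompts for workspace URL, token, cluster ID, org ID, port
#   databricks-connect test        # verifies the connection to the cluster
#
# Once configured, a plain SparkSession in any Python process (terminal,
# Jupyter, etc.) is transparently backed by the remote Databricks cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000)   # this executes on the Databricks cluster
print(df.count())
```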

Though I have a couple of questions regarding this. 

Current architecture of our system for context: 

  • I have Python scripts, in a service, triggered via Airflow jobs. These scripts run on ECS (wrapped in Airflow's ECSOperator). The primary job of these scripts is to import data from S3, do some processing, and dump it back into S3. Today a lot of this computation is done in numpy/pandas/dask, and we want to move it to pyspark by leveraging the Databricks cluster that we have. A rough overview of our goal: create a connector that creates a Spark session, rewrite the pandas/dask code with Spark, and have Databricks Spark as the underlying compute (a rough sketch of such a connector follows this list).
  • We are not inclined to go with the approach of using Databricks operators for now; hence the goal is to use Databricks Connect and leverage the compute.
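
To make the idea concrete, here is a rough sketch of what such a connector might look like (function names, paths, and columns below are placeholders for illustration, not our actual code):

```python
# Rough sketch of the planned "connector" -- names, paths, and columns are
# placeholders for illustration only.
from pyspark.sql import SparkSession, functions as F


def get_spark() -> SparkSession:
    """Return a SparkSession backed by the remote Databricks cluster.

    With DBR 12.2 this relies on the legacy databricks-connect client being
    configured; with DBR 13+ it would build a DatabricksSession instead.
    """
    return SparkSession.builder.getOrCreate()


def run_job(input_path: str, output_path: str) -> None:
    spark = get_spark()

    # Read from S3 -- the Databricks cluster needs IAM access to the bucket.
    df = spark.read.parquet(input_path)

    # Example of replacing a pandas/dask transformation with Spark.
    result = df.groupBy("some_key").agg(F.sum("some_value").alias("total"))

    # Write the result back to S3; the computation runs on the cluster.
    result.write.mode("overwrite").parquet(output_path)


if __name__ == "__main__":
    run_job("s3://my-bucket/input/", "s3://my-bucket/output/")
```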

Question(s): 

  • My current Databricks Runtime version is 12.2. I do see some relevant info regarding how to leverage databricks-connect with a remote Spark session in the docs for Databricks Runtime v13.0+ (a sketch of that flow is included after these questions, for reference). Is an upgrade necessary? Just confirming. 
  • Does the Airflow version matter in this regard? While PoC'ing the Databricks operators, I learned that they require Airflow v2.5+, and our current Airflow version is v2.4.2. Since we are more inclined towards using databricks-connect and not the Databricks operators to use the compute, the Airflow version shouldn't matter, right?
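
For reference, my understanding is that the v13.0+ flow would look roughly like this (a sketch with placeholder workspace values, using the newer Spark Connect based databricks-connect package):

```python
# Sketch of the Databricks Connect flow for DBR 13.0+ (Spark Connect based).
# Host, token, and cluster ID are placeholders for your workspace values.
#   pip install -U databricks-connect
from databricks.connect import DatabricksSession

spark = (
    DatabricksSession.builder.remote(
        host="https://<your-workspace>.cloud.databricks.com",
        token="<personal-access-token>",
        cluster_id="<cluster-id>",
    ).getOrCreate()
)

# DataFrame operations run on the Databricks cluster, not in the caller.
print(spark.range(10).count())
```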

Approach(s):

  • As part of PoC'ing the approach, I set up the latest Airflow (v2.8.1) locally and followed the Databricks Connect docs, though I faced issues and realized it's probably due to the Databricks Runtime version 12.2 that we have. I will tweak my approach based on the clarification to the questions above.
