
Connect my Spark code running in AWS ECS to a Databricks cluster

Surajv
New Contributor III

Hi team, 

I wanted to know if there is a way to connect a piece of my PySpark code running in AWS ECS to a Databricks cluster and leverage Databricks compute using Databricks Connect.

I see that Databricks Connect is for connecting local IDE code to a Databricks cluster, but is there a way to connect code running in ECS to Databricks?

5 REPLIES

Kaniz_Fatma
Community Manager

Hi @Surajv, with Databricks Connect you can seamlessly connect your PySpark code running in ECS to a Databricks cluster. Databricks Connect not only lets you hook your preferred programming language up to a Databricks cluster, it also allows you to execute Spark commands from a variety of environments, including IDEs, notebooks, and custom applications.

 

Python code executes on your local machine, but DataFrame operations in PySpark run on the remote Databricks cluster, and the results are transmitted back to the local user. To set up the connection, supply your specific <workspace-instance-name>, <access-token-value>, and <cluster-id>. The <access-token-value> is a personal access token used to authenticate with Databricks.
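
As a minimal sketch (the placeholder values below are the same ones named above, not real credentials), the Databricks Connect 13.x client exposes DatabricksSession for building the remote session:

```python
# Sketch: connecting to a Databricks cluster with Databricks Connect 13.x.
# The placeholders below must be replaced with your own workspace values.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://<workspace-instance-name>",  # your workspace URL
    token="<access-token-value>",              # personal access token
    cluster_id="<cluster-id>",
).getOrCreate()

# DataFrame work now runs on the remote cluster; results come back to this process.
df = spark.range(10)
print(df.count())
```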

 

Before starting, verify compatibility between your Databricks Connect and Databricks Runtime versions; these instructions are tailored for Databricks Runtime 13.0 and above. Remember that although Databricks Connect allows you to write and run Spark code on your Databricks clusters, certain limitations may apply depending on your setup and requirements.
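
As a quick sanity check (a sketch assuming the `spark` session created above), you can confirm the client can reach the cluster and see which Spark version the runtime reports:

```python
# Sketch: verify connectivity and inspect the remote cluster's Spark version.
# Assumes the `spark` session from the snippet above.
print(spark.version)      # Spark version reported by the remote cluster
spark.range(1).collect()  # trivial round trip to confirm the connection works
```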

 

Keeping these considerations in mind will help you get the most out of this functionality.

 

RonDeFreitas
New Contributor II

In addition to the answer from @Kaniz_Fatma, I would also add that the result set coming back from a Databricks query may be too large to process in memory on your ECS container node. Spark often excels at asynchronous workloads, not immediate result sets.

If you could briefly explain your use case, it would help us make a better recommendation.
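
For instance (a minimal sketch with illustrative S3 paths, assuming a remote `spark` session as set up above), you can keep large results out of the container's memory by writing them back to S3 from the cluster rather than collecting them:

```python
# Sketch: avoid pulling a large result set into the ECS container's memory.
# The S3 paths below are hypothetical placeholders.
df = spark.read.parquet("s3://my-bucket/input/")  # read happens on the cluster

result = df.groupBy("some_column").count()        # transform happens on the cluster

# Write results back to S3 from the cluster instead of calling collect()/toPandas(),
# which would materialize everything in the container's memory.
result.write.mode("overwrite").parquet("s3://my-bucket/output/")
```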

Surajv
New Contributor III

Noted @Kaniz_Fatma @RonDeFreitas

I am currently using Databricks Runtime v12.2 (which is < v13.0). I followed this doc (Databricks Connect for Databricks Runtime 12.2 LTS and below), connected my local terminal to the Databricks cluster, and was able to execute sample Spark code from the terminal using the cluster's compute. In parallel, I was also able to execute code in a remote Jupyter notebook following the docs.
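
For reference, here is a minimal sketch of that legacy flow (assuming the connection profile was already created with `databricks-connect configure`):

```python
# Legacy Databricks Connect (DBR 12.2 and below): the client replaces pyspark,
# so a plain SparkSession picks up the cluster set via `databricks-connect configure`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.range(5).count())  # runs on the remote Databricks cluster
```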

Though I have a couple of questions regarding this.

Current architecture of our system for context: 

  • I have Python scripts, in a service, triggered via Airflow jobs. These scripts run on ECS (wrapped in Airflow's ECSOperator). The primary job of these scripts is to import data from S3, do some processing, and dump it back to S3. Today a lot of this computation is done in numpy/pandas/dask, and we want to move it to PySpark by leveraging the Databricks cluster we already have. A rough overview of our goal: create a connector that creates a Spark session, then rewrite the pandas/dask code in Spark, with Databricks-Spark as the underlying compute (a rough sketch follows this list).
  • We are not inclined to go with the approach of using Databricks operators for now, hence the goal is to use Databricks Connect and leverage the compute.
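
To make that concrete, a minimal sketch of the connector idea, assuming we upgrade to a 13.x client as discussed below; `get_databricks_session` and the S3 paths are hypothetical placeholders, not our real code:

```python
# Sketch: a thin "connector" that hands ECS scripts a remote Databricks session.
# The function name and S3 paths are hypothetical placeholders.
from databricks.connect import DatabricksSession

def get_databricks_session():
    # In practice the host/token/cluster-id would come from the service's config.
    return DatabricksSession.builder.remote(
        host="https://<workspace-instance-name>",
        token="<access-token-value>",
        cluster_id="<cluster-id>",
    ).getOrCreate()

def run_job(input_path: str, output_path: str) -> None:
    spark = get_databricks_session()
    df = spark.read.parquet(input_path)   # pandas/dask logic rewritten in Spark
    processed = df.dropDuplicates()
    processed.write.mode("overwrite").parquet(output_path)

# Example call from an ECS-hosted script:
# run_job("s3://my-bucket/input/", "s3://my-bucket/output/")
```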

Question(s): 

  • My current Databricks Runtime version is 12.2. I do see some relevant info on how to leverage databricks-connect with a remote Spark session in the docs for Databricks Runtime v13.0+. Is an upgrade necessary? Just confirming.
  • Does the Airflow version matter in this regard? When PoC'ing the Databricks operators, I learned they require Airflow v2.5+, and our current Airflow version is v2.4.2. Since we are more inclined towards using databricks-connect and not the Databricks operators, the Airflow version shouldn't matter, should it?

Approach(es):

  • As part of PoC'ing the approach, I set up the latest Airflow (v2.8.1) locally and followed the Databricks Connect docs, though I faced issues and realized it is probably due to the Databricks Runtime version 12.2 that we have. I will tweak my approach based on the clarification to the question above.