Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

uzairm
by New Contributor III
  • 4919 Views
  • 2 replies
  • 1 kudos

My whole code is running on the driver node. I want my code to run on the worker nodes so that the driver node's memory is not exhausted. Please suggest improvements to my code. My Spark job crashes frequently when the data pulled from S3 is huge.

I am running a process that has 4 steps. Querying S3 file paths from DynamoDB based on certain parameters given by the user (the function to do so is provided by the client; I just have to import it). Returns a list of files. Check if those file paths have already been qu...

Latest Reply
Vartika
Moderator
  • 1 kudos

Hi @uzair mustafa, Thank you for posting your question in our community! We are happy to assist you. Does @Suteja Kanuri's answer help? If it does, would you be happy to mark it as best? This will help other community members who may have similar ques...

1 More Replies
uzairm
by New Contributor III
  • 7433 Views
  • 2 replies
  • 2 kudos

Resolved! ThreadPoolExecutor in Databricks

I am using a threadpool executor and running notebooks in parallel. However, these parallel notebooks are not using executors at all and all the load is going towards the driver node resulting in running out of memory for the driver node and eventual...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @uzair mustafa, Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedbac...

1 More Replies
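
A minimal sketch of why this symptom can appear, assuming the heavy work inside the notebooks is plain Python rather than Spark operations: every thread created by ThreadPoolExecutor lives in the process that created the pool, which on a cluster is the driver. The function `heavy_python_step` is a hypothetical stand-in for notebook logic and is not taken from the original thread.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def heavy_python_step(task_id: int) -> dict:
    # Hypothetical stand-in for plain-Python work done inside a notebook.
    # It records which OS process and which thread executed it.
    return {
        "task_id": task_id,
        "pid": os.getpid(),
        "thread": threading.current_thread().name,
    }


def run_in_thread_pool(task_ids, max_workers: int = 4) -> list:
    # Threads share the memory of the process that created the pool,
    # so every task below runs inside that one process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(heavy_python_step, task_ids))


results = run_in_thread_pool(range(8))
pids = {r["pid"] for r in results}
print(f"{len(results)} tasks ran in {len(pids)} process(es)")
```

Only work expressed through Spark itself (DataFrame operations, RDD transformations) is scheduled on executors; the thread pool only controls how many such jobs the driver submits at once.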
nagini_sitarama
by New Contributor III
  • 2058 Views
  • 3 replies
  • 2 kudos

Error while optimizing the table. Failure of InSet.sql for UTF8String collection

Count of the table: 1125089 for October data, so I am optimizing the table: optimize table where batchday >= "2022-10-01" and batchday <= "2022-10-31". I am getting an error like: GC overhead limit exceeded at org.apache.spark.unsafe.types.UTF8St...

Latest Reply
Priyanka_Biswas
Valued Contributor
  • 2 kudos

Hi @Nagini Sitaraman, To understand the issue better, I would like to get some more information. Does the error occur on the driver side or the executor side? Can you please share the full error stack trace? You may need to check the Spark UI to find wher...

2 More Replies
draculla1208
by New Contributor
  • 995 Views
  • 0 replies
  • 0 kudos

Able to read .hdf files but not able to write to .hdf files from worker nodes and save to dbfs

I have a set of .hdf files that I want to distribute and read on worker nodes in a Databricks environment using PySpark. I am able to read the .hdf files on worker nodes and get the data from the files. The next requirement is that now each worker node ...

AmanSehgal
by Honored Contributor III
  • 10363 Views
  • 2 replies
  • 12 kudos

How do concurrent runs in a job map to the cluster configuration?

In Databricks jobs, there's a field for concurrent runs, which can be set to 1000. If I have a cluster with 4 worker nodes and 8 cores each, then at most how many concurrent jobs will I be able to execute? What will happen if I launch 100 instances of sam...

Latest Reply
Prabakar
Esteemed Contributor III
  • 12 kudos

@Aman Sehgal On an E2 workspace the limit is 1000 concurrent runs. If you trigger 100 runs at the same time, 100 clusters will be created and the runs will be executed. If you use the same cluster for 100 runs, then you might face a lot of failed jobs...

1 More Replies
sarosh
by New Contributor
  • 7130 Views
  • 3 replies
  • 1 kudos

ModuleNotFoundError / SerializationError when executing over databricks-connect

I am running into the following error when I run a model fitting process over databricks-connect. It looks like worker nodes are unable to access modules from the project's parent directory. Note that the program runs successfully up to this point; n...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @Sarosh Ahmad, Just a friendly follow-up. Do you still need help, or did the above responses help you find the solution? Please let us know.

2 More Replies
AjayHN
by New Contributor II
  • 2968 Views
  • 2 replies
  • 2 kudos

Resolved! Notebook failing in job-cluster but runs fine in all-purpose-cluster with the same configuration

I have a notebook with many joins and a few persist operations, which runs fine on an all-purpose cluster (with worker nodes - i3.xlarge and autoscale enabled), but the same notebook is failing on a job cluster with the same cluster definition (to be frank the ...

Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

Hi @Ajay Nanjundappa, Check the "Event log" tab and search for any spot termination events. It seems like all your nodes are spot instances. The error "FetchFailedException" is associated with spot instance terminations.

1 More Replies
HamzaJosh
by New Contributor II
  • 12071 Views
  • 7 replies
  • 3 kudos

I want to use databricks workers to run a function in parallel on the worker nodes

I have a function that makes API calls, and I want to run it in parallel on the workers of a Databricks cluster. I have tried with ThreadPoolExecutor() as executor: results = executor.map(getspeeddata, alist) to run m...

Latest Reply
HamzaJosh
New Contributor II
  • 3 kudos

You guys are not getting the point. I am making API calls in a function and want to store the results in a dataframe. I want multiple processes to run this task in parallel. How do I create a UDF and use it in a dataframe when the task is calling an ...

6 More Replies
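
One possible approach to the question above, offered as a sketch rather than the thread's accepted answer: distribute the inputs as an RDD and make the API calls inside `mapPartitions`, so each call runs in a worker-side Python process instead of a driver thread. The body of `getspeeddata` below is a hypothetical stub (the poster's real function is not shown), and the column names `id` and `speed`, the helper names, and the default of 8 partitions are assumptions for illustration.

```python
from typing import Iterable, Iterator, Tuple


def getspeeddata(item: str) -> dict:
    # Hypothetical stub standing in for the poster's API call.
    # A real version would issue an HTTP request here.
    return {"id": item, "speed": len(item)}


def call_api_for_partition(items: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Runs once per partition on whichever executor holds that partition.
    # A shared HTTP session could be created here and reused per partition.
    for item in items:
        result = getspeeddata(item)
        yield (result["id"], result["speed"])


def run_on_workers(spark, inputs, num_partitions: int = 8):
    # `spark` is an existing SparkSession (predefined in Databricks notebooks).
    # More partitions allow more API calls to be in flight at once.
    rdd = spark.sparkContext.parallelize(list(inputs), num_partitions)
    rows = rdd.mapPartitions(call_api_for_partition)
    return spark.createDataFrame(rows, schema="id string, speed long")
```

In a notebook this would be called as `df = run_on_workers(spark, alist)`. The work is only executed when an action such as `df.count()` or a write is triggered, and any rate limits on the API still apply across all partitions combined.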
User16826992666
by Valued Contributor
  • 1808 Views
  • 3 replies
  • 0 kudos

Is it possible to enable encryption in between worker nodes?

I have a security requirement to encrypt all data when it is in transit. I am wondering if there is a setting I can use to enable encryption of the data during shuffles between the worker nodes.

Latest Reply
amr
Valued Contributor
  • 0 kudos

Inter-node encryption is a requirement for HIPAA compliance. Reach out to your account management team and ask them for HIPAA-compliant shards.

2 More Replies
User16826987838
by Contributor
  • 1588 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Databricks recommends launching the cluster so that the Spark driver is on an on-demand instance, which allows saving the state of the cluster even after losing spot instance nodes. If you choose to use all spot instances including the driver, any ca...
