Can someone explain why the code below is throwing an error? My intuition tells me it's my Spark version (3.2.1), but I'd like confirmation:

import pyspark.pandas as ps

d = {'key': ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h'],
     'data': [1, 2, 3, 4, 5, 6, 7, 8]}
x = ps.DataFrame(d)
x[x['...
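For reference, a minimal runnable sketch of the same pattern; the exact filter expression is an assumption, since the original line is cut off. One thing worth checking on 3.2.1: pandas-on-Spark refuses to combine Series from different DataFrames unless compute.ops_on_diff_frames is enabled, which is a common source of errors like this.

import pyspark.pandas as ps

d = {'key': ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h'],
     'data': [1, 2, 3, 4, 5, 6, 7, 8]}
x = ps.DataFrame(d)

# A boolean mask built from the same frame works as-is:
print(x[x['key'] == 'a'])

# Masks built from a *different* frame raise an error unless this option
# is enabled first (assumption: this may be the error being hit here):
ps.set_option('compute.ops_on_diff_frames', True)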
I've run a dual multiprocessing and multithreading solution in Python before, using the multiprocessing and concurrent.futures modules. However, since the multiprocessing module only runs on the driver node, I have to instead use sc.parallelize...
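For anyone landing here, a minimal sketch of that approach (the task body and inputs below are placeholders, not from the original post): distributing the inputs with sc.parallelize runs the work on the executors rather than on the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def task(x):
    # placeholder for the real per-item work
    return x * x

# numSlices controls how many partitions (and hence parallel tasks) are used
results = sc.parallelize(range(100), numSlices=8).map(task).collect()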
For some reason, my databricks-connect setup failed and I haven't been able to resolve the issue. I am connecting to an enterprise server. I was getting the following errors, which (I believe) are now resolved. I defined the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON v...
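A hedged sketch of the environment-variable step, in case it helps someone else; the key detail is that both variables should point at the same interpreter and be set before the Spark session is created (using sys.executable here is an assumption, not from the original post):

import os
import sys

# Both must be set before any SparkSession is created,
# and should point at the same Python interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable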
Not sure if I'm missing something here, but running a task outside of a Python function runs much, much quicker than executing the same task inside a function. Is there something I'm missing about how Spark handles functions?

1) def task(x):
       y = dostuf...
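One common explanation worth ruling out (an assumption, since the snippet is cut off): Spark transformations are lazy, so a timing that stops before an action measures only plan construction, not the actual work. A sketch with placeholder transformations:

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)

def task(sdf):
    # placeholder transformations standing in for the truncated dostuf...
    return sdf.withColumn("double", F.col("id") * 2).filter(F.col("id") % 3 == 0)

start = time.time()
plan_only = task(df)          # lazy: returns almost instantly
print(f"define only: {time.time() - start:.4f}s")

start = time.time()
n = task(df).count()          # the action is where the real work happens
print(f"with action: {time.time() - start:.4f}s, rows={n}")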
I get the below error when trying to run multithreading - it fails towards the end of the run. My guess is that it's related to memory/worker config. I've seen some solutions involving modifying the number of workers or CPUs on the cluster - however, that's n...
Sure. My function queries an external database (JDBC) along with a Delta table. I'm not performing any expensive computations - just filtering, for the most part. When printing timestamps in the function, I notice that most of the time is being spent ...
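A sketch of that shape of function, with all connection details and table names as placeholders; one thing to check is whether the filter is actually pushed down to the JDBC source, since re-scanning the external table on every call is a common place for the time to go:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def lookup(key):
    # placeholder JDBC connection details
    jdbc_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "schema.some_table")
        .option("user", "user")
        .option("password", "password")
        .load()
        .filter(F.col("key") == key)  # simple filters like this are pushed down to the database
    )
    delta_df = spark.read.table("my_delta_table").filter(F.col("key") == key)
    return jdbc_df.join(delta_df, "key")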
Since I don't have permissions to change cluster configurations, the only solution that ended up working was setting the max thread count to about half of the actual max so I don't overload the containers. However, I'm open to any other optimization ideas!
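For completeness, a minimal sketch of that workaround (the task body and inputs are placeholders): cap the pool at roughly half the available CPUs.

import os
from concurrent.futures import ThreadPoolExecutor

# roughly half the available CPUs, never less than one thread
max_threads = max(1, (os.cpu_count() or 2) // 2)

def task(x):
    return x * x  # placeholder for the real per-item work

with ThreadPoolExecutor(max_workers=max_threads) as pool:
    results = list(pool.map(task, range(100)))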
Hi @Werner Stinckens, this is exactly what I was looking for. Thanks! Follow-up questions: 1) Do you need to set up an object-level storage connection on Databricks (i.e. to an S3 bucket or Azure Blob)? 2) Any folders in your /mnt path are external ob...
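For reference on 1), a hedged sketch of what such a mount looks like on Databricks; the bucket name is a placeholder, and this assumes credentials (e.g. an AWS instance profile) are already configured. dbutils is available in notebooks without an import.

# assumes AWS credentials (e.g. an instance profile) are already configured
dbutils.fs.mount(
    source="s3a://my-bucket",        # placeholder bucket
    mount_point="/mnt/my-bucket",
)
display(dbutils.fs.ls("/mnt/my-bucket"))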
Thanks to all of you for the clarification... it helps a lot. Unfortunately, I'm on an organization cluster, so I can't upgrade and don't have permission to create a new cluster, so I will look into koalas as an alternative to pyspark.pandas.