Can someone explain why the code below is throwing an error? My intuition tells me it's my Spark version (3.2.1), but I'd like confirmation:

import pyspark.pandas as ps

d = {'key': ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h'],
     'data': [1, 2, 3, 4, 5, 6, 7, 8]}
x = ps.DataFrame(d)
x[x['...
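For reference, a minimal runnable sketch of the same pattern; the exact filter expression is an assumption, since the original line is cut off. One thing worth checking on 3.2.1: pandas-on-Spark refuses to combine Series from different DataFrames unless compute.ops_on_diff_frames is enabled, which is a common source of errors like this.

import pyspark.pandas as ps

d = {'key': ['a', 'a', 'c', 'd', 'e', 'f', 'g', 'h'],
     'data': [1, 2, 3, 4, 5, 6, 7, 8]}
x = ps.DataFrame(d)

# A boolean mask built from the same frame works as-is:
print(x[x['key'] == 'a'])

# Masks built from a *different* frame raise an error unless this option
# is enabled first (assumption: this may be the error being hit here):
ps.set_option('compute.ops_on_diff_frames', True)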
I've run a dual multiprocessing and multithreading solution in Python before, using the multiprocessing and concurrent.futures modules. However, since the multiprocessing module only runs on the driver node, I have to instead use sc.parallelize...
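For anyone landing here, a minimal sketch of that approach (the task body and inputs below are placeholders, not from the original post): distributing the inputs with sc.parallelize runs the work on the executors rather than on the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def task(x):
    # placeholder for the real per-item work
    return x * x

# numSlices controls how many partitions (and hence parallel tasks) are used
results = sc.parallelize(range(100), numSlices=8).map(task).collect()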
For some reason, my databricks-connect setup failed and I haven't been able to resolve the issue. I am connecting to an enterprise server. I was getting the following errors, which (I believe) are now resolved. I defined the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON v...
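A hedged sketch of the environment-variable step, in case it helps someone else; the key detail is that both variables should point at the same interpreter and be set before the Spark session is created (using sys.executable here is an assumption, not from the original post):

import os
import sys

# Both must be set before any SparkSession is created,
# and should point at the same Python interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable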
Not sure if I'm missing something here, but running a task outside of a Python function runs much, much quicker than executing the same task inside a function. Is there something I'm missing about how Spark handles functions?

1) def task(x):
       y = dostuf...
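One common explanation worth ruling out (an assumption, since the snippet is cut off): Spark transformations are lazy, so a timing that stops before an action measures only plan construction, not the actual work. A sketch with placeholder transformations:

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)

def task(sdf):
    # placeholder transformations standing in for the truncated dostuf...
    return sdf.withColumn("double", F.col("id") * 2).filter(F.col("id") % 3 == 0)

start = time.time()
plan_only = task(df)          # lazy: returns almost instantly
print(f"define only: {time.time() - start:.4f}s")

start = time.time()
n = task(df).count()          # the action is where the real work happens
print(f"with action: {time.time() - start:.4f}s, rows={n}")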
I get the below error when trying to run multithreading - it fails towards the end of the run. My guess is that it's related to memory/worker config. I've seen some solutions involving modifying the number of workers or CPUs on the cluster - however, that's n...
Sure. My function queries an external database (JDBC) along with a Delta table. I'm not performing any expensive computations - just filtering, for the most part. When printing timestamps in the function, I notice that most of the time is being spent ...
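A sketch of that shape of function, with all connection details and table names as placeholders; one thing to check is whether the filter is actually pushed down to the JDBC source, since re-scanning the external table on every call is a common place for the time to go:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def lookup(key):
    # placeholder JDBC connection details
    jdbc_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "schema.some_table")
        .option("user", "user")
        .option("password", "password")
        .load()
        .filter(F.col("key") == key)  # simple filters like this are pushed down to the database
    )
    delta_df = spark.read.table("my_delta_table").filter(F.col("key") == key)
    return jdbc_df.join(delta_df, "key")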
Since I don't have permissions to change cluster configurations, the only solution that ended up working was setting the max thread count to about half of the actual max so I don't overload the containers. However, I'm open to any other optimization ideas!
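For completeness, a minimal sketch of that workaround (the task body and inputs are placeholders): cap the pool at roughly half the available CPUs.

import os
from concurrent.futures import ThreadPoolExecutor

# roughly half the available CPUs, never less than one thread
max_threads = max(1, (os.cpu_count() or 2) // 2)

def task(x):
    return x * x  # placeholder for the real per-item work

with ThreadPoolExecutor(max_workers=max_threads) as pool:
    results = list(pool.map(task, range(100)))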
Hi @Werner Stinckens, this is exactly what I was looking for. Thanks! Follow-up questions: 1) Do you need to set up an object-level storage connection on Databricks (i.e. to an S3 bucket or Azure Blob)? 2) Any folders in your /mnt path are external ob...
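For reference on 1), a hedged sketch of what such a mount looks like on Databricks; the bucket name is a placeholder, and this assumes credentials (e.g. an AWS instance profile) are already configured. dbutils is available in notebooks without an import.

# assumes AWS credentials (e.g. an instance profile) are already configured
dbutils.fs.mount(
    source="s3a://my-bucket",        # placeholder bucket
    mount_point="/mnt/my-bucket",
)
display(dbutils.fs.ls("/mnt/my-bucket"))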
Thanks to all of you for the clarification... it helps a lot. Unfortunately, I'm on an organization cluster, so I can't upgrade and don't have permission to create a new cluster, so I will look into koalas as an alternative to pyspark.pandas.