01-30-2024 08:17 AM
I have a DataFrame with about 2 million text rows (~1 GB). I repartition it into about 700 partitions, as that is the number of cores available on my cluster's executors. I then run the transformations that extract medical information and write the results as Parquet to S3. The process runs for 3 hours and then crashes. The driver crashed with the following error:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
I have tried the driver with both 128 GB and 256 GB of memory but end up with the same result. I have also used the persist option, with similar crashes.
01-30-2024 10:22 AM
Hi @desertstorm , The error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached." usually happens when the driver is under memory pressure. This means there is a piece of code that is executing on the driver and not on the executors. We need to identify and remove that piece of code.
Here are some general things to watch out for:
1. If your code has display() or collect() operations, you should remove them.
2. If your code has plain Python code running on the driver, you need to replace it with PySpark.
01-30-2024 03:09 PM
Hi @Lakshay Thanks so much for your reply. I have looked into most of those options and don't see any plain Python code; it's mostly pipeline.transform. Here is the code where it crashes. I feel it should not bring data to the driver, either for withColumn or for writing to Parquet, so I'm not sure what's wrong. Happy to share the file as well.
01-31-2024 03:18 AM
Hi @desertstorm , I think the issue is with the "Process rxnorm results" part of the code. You can try commenting out that part to confirm whether that is the cause.
01-30-2024 03:52 PM
Just wondering where the magic number "768" in your repartition is coming from? How big is your cluster? What about your driver instance?
01-30-2024 04:06 PM
That's the number of cores available on the executors. I have tried the driver with 256 GB as well as 128 GB, with the same results.
01-31-2024 11:28 AM
Can you try to split the data?
Do you have any collect() or other driver-heavy actions that could cause this error on the driver?