Dooley
Databricks Employee

One thing I would try is adding more workers and using memory-optimized instances with more memory to see if that fixes it. I know you experimented with this a bit, but more workers combined with larger memory might resolve it. In the meantime, let's look at the Spark UI to troubleshoot.

(1) In the Stages tab, open the stage of the job that produced the error. Under Summary Metrics, do you see data spill? If you don't see it there, you can check the SQL tab: find the SQL query associated with the job number that hit the error, click the query to see the DAG, then click the + in the boxes to expand the details and check whether any data spill is reported there.

(2) In the Spark UI, go to the JDBC/ODBC Server tab. Do you see data leaving?

(3) Also in the Stages tab, can you take a screenshot of what you see for this job, sorted by shuffle read?

(4) Do you see anything cached in storage under the "Storage" tab?
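If clicking through the UI is tedious, checks (1) and (3) can also be approximated from the stage data that Spark's monitoring REST API returns. The following is only a rough sketch: it assumes you have already fetched the stage list as JSON (how you reach that endpoint on your cluster is not covered here), and `summarize_stages` and `sample_stages` are hypothetical names with made-up numbers for illustration. The field names (`shuffleReadBytes`, `memoryBytesSpilled`, `diskBytesSpilled`) are the ones the stages endpoint reports.

```python
from typing import Any, Dict, List


def summarize_stages(stages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort stages by shuffle read (largest first) and flag any that spilled.

    `stages` is assumed to be the parsed JSON list from the Spark REST API
    stages endpoint; missing fields are treated as 0.
    """
    rows = []
    for stage in stages:
        mem_spill = stage.get("memoryBytesSpilled", 0)
        disk_spill = stage.get("diskBytesSpilled", 0)
        rows.append({
            "stageId": stage.get("stageId"),
            "shuffleReadBytes": stage.get("shuffleReadBytes", 0),
            "memoryBytesSpilled": mem_spill,
            "diskBytesSpilled": disk_spill,
            # A stage counts as spilling if either spill metric is non-zero.
            "spilled": mem_spill > 0 or disk_spill > 0,
        })
    rows.sort(key=lambda row: row["shuffleReadBytes"], reverse=True)
    return rows


# Made-up sample data, only to show the shape of the input.
sample_stages = [
    {"stageId": 3, "shuffleReadBytes": 1_000, "memoryBytesSpilled": 0, "diskBytesSpilled": 0},
    {"stageId": 7, "shuffleReadBytes": 9_000_000, "memoryBytesSpilled": 5_000_000, "diskBytesSpilled": 2_000_000},
    {"stageId": 5, "shuffleReadBytes": 40_000, "memoryBytesSpilled": 0, "diskBytesSpilled": 0},
]

for row in summarize_stages(sample_stages):
    print(row)
```

A stage that shows up at the top of this list and also has `spilled` set is usually the first one worth screenshotting.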

There is a reference to PSYoungGen. I believe the young generation didn't have enough free memory to allocate, possibly for a large object, so a GC was triggered by an allocation failure. Did this happen multiple times in a 10s interval?
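To answer the 10s question without counting by eye, something like the sketch below could help. It assumes the classic JDK 8 style GC log lines, where each event starts with the JVM uptime in seconds, for example `12.345: [GC (Allocation Failure) [PSYoungGen: ...`. Your log format may differ depending on JVM version and flags, so treat the regex as an assumption to adjust. The function names and the sample log are hypothetical.

```python
import re
from typing import List

# Assumed line shape: "<uptime>: [GC (Allocation Failure) [PSYoungGen: ..."
# An optional date stamp before the uptime is tolerated.
GC_LINE = re.compile(r"(\d+\.\d+): \[GC \(Allocation Failure\).*?\[PSYoungGen: ")


def allocation_failure_times(log_text: str) -> List[float]:
    """Return the JVM uptimes (seconds) of young-gen allocation-failure GCs."""
    times = []
    for line in log_text.splitlines():
        match = GC_LINE.search(line)
        if match:
            times.append(float(match.group(1)))
    return sorted(times)


def max_events_in_window(times: List[float], window: float = 10.0) -> int:
    """Largest number of events that fall within any `window`-second span."""
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Advance the left edge until the span fits inside the window.
        while t - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Made-up log excerpt, only to illustrate the expected input.
sample_log = """\
1.500: [GC (Allocation Failure) [PSYoungGen: 1024K->128K(2048K)] 4096K->3200K(8192K), 0.0100000 secs]
4.200: [GC (Allocation Failure) [PSYoungGen: 1024K->128K(2048K)] 4200K->3300K(8192K), 0.0120000 secs]
9.800: [GC (Allocation Failure) [PSYoungGen: 1024K->128K(2048K)] 4300K->3400K(8192K), 0.0110000 secs]
25.000: [GC (Allocation Failure) [PSYoungGen: 1024K->128K(2048K)] 4400K->3500K(8192K), 0.0090000 secs]
30.000: [Full GC (Ergonomics) [PSYoungGen: 128K->0K(2048K)] 3500K->3000K(8192K), 0.2000000 secs]
"""

times = allocation_failure_times(sample_log)
print(times, max_events_in_window(times, window=10.0))
```

If the count for a 10s window is consistently high on the failing executors, that would be consistent with memory pressure rather than a one-off allocation.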