I'm having trouble working on Databricks with data that we are not allowed to save off or persist in any way. The data comes from an API that returns a JSON response. We have a Scala package on our cluster that makes the API calls (almost 6k queries), saves the responses to a dataframe, then explodes that dataframe into a new dataframe that we can use to get the info we need. However, all of the data gets collected on the driver node, so when we try to run comparisons/validations/Spark code against the resulting dataframe, the work won't distribute (everything runs on the driver) and it eats up all of the driver's JVM heap memory.
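Roughly what the flow looks like (a minimal sketch, not our real package; `loadQueryList`, `callApi`, and the schema are hypothetical stand-ins):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the package's API layer
def loadQueryList(): Seq[String] = ???        // the ~6,000 queries
def callApi(query: String): String = ???      // one HTTP call, returns a JSON string

// Every call runs in plain driver-side Scala, so all ~6k responses sit on the driver
val responses: Seq[String] = loadQueryList().map(callApi)

// Illustrative schema: each response holds an array of records we need to explode
val schema = StructType(Seq(StructField("records", ArrayType(StringType))))

val rawDf = responses.toDF("json")            // local Seq -> DataFrame built on the driver
val explodedDf = rawDf
  .select(from_json(col("json"), schema).as("parsed"))
  .select(explode(col("parsed.records")).as("record"))
```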
The two errors we get are "OutOfMemoryError: Java heap space" and "The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."
We've got a 64 GB driver (with five 32 GB worker nodes) and have increased the max JVM heap to 25 GB, but because the work won't distribute to the workers, the driver keeps crashing with the OOM errors above.
Things run fine when we do save this data to parquet and run our Spark code/comparisons against the parquet (the work distributes fine), but per our agreement with the data owners we are only allowed to persist data like this for development purposes, so that is not a viable option for production.
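For reference, the development-only workaround looks roughly like this (continuing the sketch above; the path is illustrative):

```scala
// Dev-only: persisting the exploded data gives Spark a distributed source to read from
val devPath = "dbfs:/tmp/api_snapshot_dev"

explodedDf.write.mode("overwrite").parquet(devPath)

// Comparisons/validations against the parquet-backed dataframe distribute to the
// workers without any driver OOMs
val fromParquetDf = spark.read.parquet(devPath)
```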
Things we've tried:
-Removing all displays/prints/logs
-Caching the dataframe (and the prior/subsequent dataframes it is built from or feeds into) with .cache()/.count() and spark.sql("CACHE TABLE <tablename>") (see the sketch after this list)
-Increasing the size of the driver node to 64GB
-Increasing the JVM heap size to 25GB
-Removing unneeded columns from the dataframe
-Using .repartition(300) on the dataframe
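Roughly what the caching and repartitioning attempts looked like (the temp view name is illustrative):

```scala
// Cache plus count() to force materialization, and SQL-level caching of the same data
explodedDf.cache()
explodedDf.count()

explodedDf.createOrReplaceTempView("api_data")
spark.sql("CACHE TABLE api_data")

// Repartitioning attempt: spread the rows over 300 partitions
val repartitionedDf = explodedDf.repartition(300)
// the downstream comparisons still ran on the driver and hit the heap limit
```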
Any help is greatly appreciated.