Wojciech_BUK
Valued Contributor III

You are using quite small machines. Please keep in mind that a large share of each machine's memory is occupied by other processes, so Spark has less available than the nominal instance size:

https://kb.databricks.com/clusters/spark-shows-less-memory

It is not a good idea to broadcast huge DataFrames, as it can lead to the OOM exception you are getting: the Spark worker is not able to hold the batch coming from the stream, the big DataFrame, and the third DataFrame produced by the join all at the same time.
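To make the point above concrete, here is a rough back-of-the-envelope check in plain Python. It assumes Spark's default `spark.memory.fraction` of 0.6 and the 300 MB of heap that Spark reserves; the heap size and the DataFrame sizes are made-up numbers, and the helper names are hypothetical. Treat it as a sketch for reasoning about the budget, not as an exact model of Spark's memory manager.

```python
# Rough memory budget for a single executor.
# All sizes below are illustrative assumptions, not measurements.

RESERVED_MB = 300       # heap that Spark sets aside before splitting the rest
MEMORY_FRACTION = 0.6   # default value of spark.memory.fraction


def unified_memory_mb(executor_heap_mb, memory_fraction=MEMORY_FRACTION):
    """Approximate heap share available for execution and storage."""
    return (executor_heap_mb - RESERVED_MB) * memory_fraction


def join_fits(executor_heap_mb, broadcast_mb, batch_mb, join_output_mb):
    """True if the broadcast copy, the stream batch share and the join
    output held on this executor fit into the unified memory region."""
    needed = broadcast_mb + batch_mb + join_output_mb
    return needed <= unified_memory_mb(executor_heap_mb)


if __name__ == "__main__":
    heap_mb = 8 * 1024  # hypothetical 8 GB executor heap
    print("usable MB:", unified_memory_mb(heap_mb))
    # A 4000 MB broadcast plus batch and join output does not fit here.
    print("fits:", join_fits(heap_mb, broadcast_mb=4000, batch_mb=500,
                             join_output_mb=1000))
```

If the check fails, one option is to stop Spark from choosing a broadcast join automatically with `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")` and to drop any explicit `broadcast()` hint, so that the join falls back to a shuffle-based strategy.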

The easiest ways would be to:

- move the big DataFrame transformation (deduplication) into a separate process and read the already cleaned data. Do you always need the entire DataFrame to be cached?

- try to control your streaming batch size so that it fits in memory alongside the other cached DataFrames; there are options to do that

- make sure your data is partitioned and distributed for parallel execution

- size the cluster well. You pay per hour, so if you use a bigger cluster and it finishes much faster, you can end up paying less. If there are spills to disk, the job gets slower and you pay more. Estimate your DataFrame sizes, batch size and parallelism (partitions), and adjust cluster memory, cores and number of workers accordingly.
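Two of the points above can be sketched in a few lines of Python. The first part assumes the stream reads from a Delta table, where `maxFilesPerTrigger` and `maxBytesPerTrigger` are the rate-limit options; other sources use different option names (for example `maxOffsetsPerTrigger` for Kafka). The second part is only the cost arithmetic with invented prices and run times. The function names and the table path are hypothetical.

```python
def rate_limit_options(max_files_per_trigger=None, max_bytes_per_trigger=None):
    """Build reader options that cap how much one micro-batch reads."""
    options = {}
    if max_files_per_trigger is not None:
        options["maxFilesPerTrigger"] = str(max_files_per_trigger)
    if max_bytes_per_trigger is not None:
        options["maxBytesPerTrigger"] = str(max_bytes_per_trigger)
    return options


def read_limited_stream(spark, path, options):
    """Open a Delta stream with the given limits.

    `spark` is an existing SparkSession; `path` is a hypothetical table path.
    """
    return spark.readStream.format("delta").options(**options).load(path)


def run_cost(workers, price_per_worker_hour, hours):
    """Total cost of one run: workers x hourly price x duration."""
    return workers * price_per_worker_hour * hours


if __name__ == "__main__":
    # Cap each micro-batch at 20 files or roughly 512 MB (assumed values).
    print(rate_limit_options(20, 512 * 1024 * 1024))

    # Invented numbers: the small cluster spills to disk and needs 3 hours,
    # the cluster twice its size finishes in 1 hour.
    print("small cluster:", run_cost(4, 0.5, 3.0))
    print("big cluster:  ", run_cost(8, 0.5, 1.0))
```

On a cluster, the options would be passed as `read_limited_stream(spark, "/path/to/table", rate_limit_options(20))`. Whether the bigger cluster really comes out cheaper depends on how much faster it actually finishes, so it is worth measuring both runs.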