Hi @mjedy7
For caching in this scenario, you could try to leverage persist() and unpersist() on the big table / Spark DataFrame, see here:
https://medium.com/@eloutmadiabderrahim/persist-vs-unpersist-in-spark-485694f72452
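A minimal sketch of that pattern, assuming the big side is a Delta table named big_table (the table and variable names are just placeholders):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Read the big table once and keep it in memory, spilling to disk if needed
big_df = spark.read.table("big_table")
big_df.persist(StorageLevel.MEMORY_AND_DISK)
big_df.count()  # materialize the cache before it is reused

# ... join / upsert logic that reuses big_df goes here ...

# Free the executors' memory once the cached copy is no longer needed
big_df.unpersist()
```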
Try to reduce the amount of data in the big Spark DataFrame you will cache: read only the necessary columns, filter the data (if possible), precompute, etc. Run VACUUM and OPTIMIZE on your table regularly, and consider Z-ORDERing the data to help Spark with file skipping/pruning as well. For example:
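Something along these lines, as an example only (column names, filter, and retention period are assumptions to adapt to your case):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cache only what the join really needs: required columns plus a filter
slim_df = (
    spark.read.table("big_table")
    .select("key_col", "col_a", "col_b")                   # only the necessary columns
    .filter("event_date >= date_sub(current_date(), 30)")  # only the relevant rows, if possible
)

# Delta maintenance, run on a schedule rather than inside the streaming job
spark.sql("OPTIMIZE big_table ZORDER BY (key_col)")
spark.sql("VACUUM big_table RETAIN 168 HOURS")
```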
Broadcasting the small table might be a good idea.
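For instance, with an explicit broadcast hint (assuming the small side fits comfortably in memory; names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

small_df = spark.read.table("small_table")
big_df = spark.read.table("big_table")

# Broadcast hint: the small side is shipped to every executor,
# avoiding a shuffle of the big side
joined_df = big_df.join(broadcast(small_df), on="key_col", how="left")
```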
Setting maxBytesPerTrigger / maxFilesPerTrigger is for sure a good idea.
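Assuming the stream reads from a Delta source, the options could be set roughly like this (the values are only examples to tune for your workload):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rate-limit each micro-batch so the stream does not try to process everything at once
stream_df = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)    # cap on the number of files per micro-batch
    .option("maxBytesPerTrigger", "1g")   # soft cap on the amount of data per micro-batch
    .table("source_table")
)
```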
Make sure your upsert is performing well.
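If the upsert is a Delta MERGE inside foreachBatch (which I'm assuming here), a rough sketch could look like this, reusing stream_df from the previous snippet (keys, table name, and checkpoint path are placeholders):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Upsert each micro-batch into the target Delta table via MERGE
def upsert_to_delta(micro_batch_df, batch_id):
    target = DeltaTable.forName(spark, "target_table")
    (
        target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.key_col = s.key_col")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    stream_df.writeStream                                     # stream_df from the previous snippet
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "/path/to/checkpoint")      # placeholder path
    .start()
)
```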
While running the job, please use the Spark UI to validate performance:
- monitor the %CPU usage of each node and make sure your job utilizes all CPUs evenly,
- check the number of tasks being processed during the job execution to see whether you need to repartition/coalesce your input data or use AQE (see the sketch after this list).
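A rough sketch of those knobs (the partition count is just an example, and big_df refers to the earlier snippet):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let AQE coalesce small shuffle partitions (and handle skewed joins) automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Or control parallelism explicitly if the input data is badly partitioned
repartitioned_df = big_df.repartition(200, "key_col")  # 200 is only an example value
```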