Re: High cost of storage when using structured str...

PetePP · 08-31-2023

I had the same problem when I started with Databricks. As outlined above, the cause is the shuffle partitions setting (spark.sql.shuffle.partitions, 200 by default): each micro-batch write can produce as many files as there are shuffle partitions. So you are writing a low volume of data but getting charged for the number of write (and subsequent sequential read) operations rather than for the data itself. Lowering the number of shuffle partitions helps solve this. On top of that, consider setting spark.sql.streaming.noDataMicroBatches.enabled to false so that micro-batches with no new data are skipped.
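In case it helps, here is a rough sketch of how I would apply both settings before starting the stream. The helper names (low_volume_streaming_conf, apply_conf) and the value 8 are my own placeholders, not anything from Databricks; pick a partition count that matches your data volume and cluster size.

```python
from __future__ import annotations


def low_volume_streaming_conf(shuffle_partitions: int = 8) -> dict[str, str]:
    """Build the two settings discussed above as a plain dict.

    The default of 8 is an illustrative assumption, not a recommendation.
    """
    if shuffle_partitions < 1:
        raise ValueError("shuffle_partitions must be at least 1")
    return {
        # Fewer shuffle partitions means fewer files written per micro-batch.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # "false" asks the engine not to run micro-batches that have no new data.
        "spark.sql.streaming.noDataMicroBatches.enabled": "false",
    }


def apply_conf(spark, conf: dict[str, str]) -> None:
    """Apply each setting to a session through spark.conf.set(key, value)."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In a Databricks notebook the session is already available as spark, so the call would be apply_conf(spark, low_volume_streaming_conf(8)), run before the streaming query is started. As far as I know, a stateful query keeps the shuffle partition count stored in its checkpoint, so an existing stream may need a fresh checkpoint location before the lower value takes effect.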