Hi @pgruetter, You can use Delta Lake with Spark Structured Streaming to handle incremental loads automatically. When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.
The startingVersion option tells Spark Structured Streaming to read all changes committed in the Delta table starting from a given version. In your case, you set startingVersion to 2, which means the streaming job should only read versions 2 through 4 of the Delta table SLV.
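For reference, here is a minimal PySpark sketch of that setup. The target table name GLD and the checkpoint path are placeholders I'm assuming for illustration, not details from your pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the changes committed in version 2 and later of the Delta table SLV.
stream_df = (
    spark.readStream
        .option("startingVersion", "2")
        .table("SLV")
)

# Write the incremental changes to a placeholder target table "GLD".
query = (
    stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/slv_to_gld")  # placeholder path
        .trigger(availableNow=True)  # Spark 3.3+; processes what is available, then stops
        .toTable("GLD")
)
query.awaitTermination()
```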
However, the streaming job takes 10+ hours and copies all 25B rows again. This could mean that the startingVersion option is not being applied as expected. As a test, you can set startingVersion to 3 and check whether the streaming job then reads only version 4 of the Delta table SLV.
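Before re-running with a different starting version, it can also help to confirm which commits actually hold the initial bulk load and which hold the incremental changes. A small sketch, assuming the delta-spark Python package is installed:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# List the commits of SLV so you can see which version holds the initial load
# and which versions hold the incremental changes.
history_df = DeltaTable.forName(spark, "SLV").history()
history_df.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)
```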
It’s also possible that the streaming job is writing too many small files, which can hurt performance. You can try letting each micro-batch process more data, for example via the maxBytesPerTrigger option, so that each batch writes fewer, larger files.
In general, it is possible to do an initial load and then switch to streaming for the incremental changes.
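As a rough sketch, maxBytesPerTrigger is set as a reader option on the stream; the value below is only illustrative and acts as a soft limit on how much data each micro-batch processes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each micro-batch reads up to ~10 GB of data (a soft limit); larger batches
# generally write fewer, larger files on the output side.
stream_df = (
    spark.readStream
        .option("startingVersion", "2")
        .option("maxBytesPerTrigger", "10g")  # illustrative value
        .table("SLV")
)
```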
Source - https://docs.delta.io/latest/delta-streaming.html