I've got a process that loads multiple tables into the data lake in parallel. I'm writing my logs to a Delta table with DataFrameWriter in append mode. The problem is that every save takes a noticeable amount of time, which appears to be spent computing the snapshot for the Delta table. That's not a big deal for a small number of loads, but when the job processes many tables it has to save a large number of log entries, and that significantly inflates the overall runtime. As a result we can't fit a large database download into the 3-hour job window we have.
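Roughly, the structure is like the sketch below; the table list, path, and helper details are simplified placeholders rather than my real code:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

LOG_TABLE_PATH = "abfss://lake@account.dfs.core.windows.net/logs/load_log"  # placeholder path
tables_to_load = ["customers", "orders", "invoices"]  # in reality, a few hundred tables

def load_table(table_name: str) -> None:
    # ... extract the source table and write it to the lake ...
    rows_loaded = 0  # placeholder for the real row count

    # Append a one-row log entry for this load. Each append re-reads the Delta
    # log to build the table snapshot, which is where the time seems to go.
    log_df = spark.createDataFrame(
        [(table_name, "SUCCESS", rows_loaded)],
        "table_name string, status string, rows_loaded long",
    )
    log_df.write.format("delta").mode("append").save(LOG_TABLE_PATH)

# Loads run in parallel, so the log table sees many small concurrent appends.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(load_table, tables_to_load))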
I'm passing the following additional options to the DataFrameWriter:
additionaloptions = {
    "delta.appendOnly": "true",
    "delta.autoOptimize.autoCompact": "auto",
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.logRetentionDuration": "interval 1 days",
    "delta.deletedFileRetentionDuration": "interval 1 days",
    "delta.tuneFileSizesForRewrites": "true",
}
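They're passed straight into the writer on each append, along these lines (same log_df and LOG_TABLE_PATH as in the sketch above):

(
    log_df.write.format("delta")
    .options(**additionaloptions)  # table properties passed as writer options
    .mode("append")
    .save(LOG_TABLE_PATH)
)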
I'd prefer to log to tables because it makes the logs easy to query, but I'd be open to other logging solutions. It's important that the logs end up in an external data lake. Maybe I'm coming at this from the wrong direction altogether, so I'm open to ideas.