Data Engineering
How can I efficiently write to easily queryable logs?

BriGuy
New Contributor II

I've got a process running in parallel that loads multiple tables into the data lake. I'm writing my logs to a Delta table using DataFrameWriter in append mode. The problem is that every save takes a noticeable amount of time, which appears to be spent computing the snapshot of the Delta table. That's not a big deal for a small number of processes, but when the job does a lot of work it needs to save a large number of log entries. This significantly inflates the run time, and we can't fit a large database download into the three-hour job window we have.

I'm using the following additional options with the DataFrameWriter:

    

# Delta table properties are passed as strings, so the boolean
# values are written as "true" rather than Python True.
additionaloptions = {
    "delta.appendOnly": "true",
    "delta.autoOptimize.autoCompact": "auto",
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.logRetentionDuration": "interval 1 days",
    "delta.deletedFileRetentionDuration": "interval 1 days",
    "delta.tuneFileSizesForRewrites": "true",
}
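For reference, Delta table properties like these expect string values, so a mixed dict of Python booleans and strings can silently misbehave. A minimal sketch of a helper that normalizes such a dict before passing it to the writer (the helper name is my own, not a Databricks API):

```python
def normalize_delta_options(opts):
    """Convert booleans (and any other values) to the lowercase
    strings that Delta table properties expect."""
    out = {}
    for key, value in opts.items():
        if isinstance(value, bool):
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out

# Example: the original mixed-type dict becomes all-string.
options = normalize_delta_options({
    "delta.appendOnly": True,
    "delta.autoOptimize.autoCompact": "auto",
    "delta.logRetentionDuration": "interval 1 days",
})
# In a Spark session these would then be applied with
# df.write.format("delta").mode("append").options(**options).save(path)
```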
 
I'd prefer to log to tables because they make the logs easy to query, but I'd be open to other logging solutions. It's important that the logs are saved to an external data lake. Maybe I'm coming at this from the wrong direction altogether, so I'm open to ideas.
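Since each Delta commit pays the snapshot-computation cost, one way to cut the overhead is to buffer log entries in memory and append them in batches, so the cost is paid once per flush rather than once per log line. A minimal sketch of that idea; `flush_fn` is a stand-in for the actual `spark.createDataFrame(...).write` call and the class name is hypothetical:

```python
class BufferedLogger:
    """Accumulate log entries and commit them in batches, amortizing
    the per-commit Delta snapshot cost across many entries."""

    def __init__(self, flush_fn, max_entries=1000):
        self._flush_fn = flush_fn   # e.g. a function doing the Delta append
        self._max = max_entries
        self._buffer = []

    def log(self, entry):
        self._buffer.append(entry)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        # One commit for the whole batch instead of one per entry.
        if self._buffer:
            self._flush_fn(list(self._buffer))
            self._buffer.clear()

# Demo with a list standing in for the Delta write:
writes = []
logger = BufferedLogger(writes.append, max_entries=3)
for i in range(7):
    logger.log({"step": i})
logger.flush()  # flush the final partial batch
# writes now holds 3 batches (3 + 3 + 1 entries) instead of 7 commits
```

Flushing at the end of the run (or on a timer) keeps the logs queryable while reducing the number of Delta transactions dramatically.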
0 REPLIES