Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I efficiently write to easily queryable logs?

BriGuy
New Contributor II

I've got a parallel process loading multiple tables into the data lake, and I'm writing my logs to a Delta table with the DataFrameWriter in append mode. The problem is that every save takes a noticeable amount of time, most of which appears to be spent computing the snapshot of the Delta table. That's not a big deal for a small number of processes, but when the job does a lot of work it has to save a large number of log entries, which significantly inflates the run time; we can't fit a large database download into our 3-hour job window.
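To make that concrete, each task currently does a small append per log entry, roughly like this (names and schema are simplified):

    log_df = spark.createDataFrame(
        [(source_table, "LOADED", row_count)],
        "table_name string, status string, row_count long",
    )
    # Each of these appends is its own Delta commit, which is where the snapshot time goes.
    log_df.write.format("delta").mode("append").save(log_table_path)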

I'm using the following additional options with the DataFrameWriter:

    additionaloptions = {
        # Delta table properties, passed as writer options.
        "delta.appendOnly": "true",
        "delta.autoOptimize.autoCompact": "auto",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.logRetentionDuration": "interval 1 days",
        "delta.deletedFileRetentionDuration": "interval 1 days",
        "delta.tuneFileSizesForRewrites": "true",
    }
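They get applied to each write roughly like this (the path is a placeholder for our external data lake location):

    log_df.write \
        .format("delta") \
        .mode("append") \
        .options(**additionaloptions) \
        .save("abfss://logs@<storageaccount>.dfs.core.windows.net/etl/log")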
 
I'd prefer to log to tables because they're easy to query, but I'd be open to other logging solutions. It's important that the logs are saved to an external data lake. Maybe I'm approaching this from the wrong direction altogether, so I'm open to ideas.
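One direction I've considered (just a sketch; the schema, names, and path are made up): buffer the log rows in the driver during the run and flush them with a single append, so the Delta commit and snapshot cost is paid once per run instead of once per entry:

    from datetime import datetime, timezone

    log_buffer = []

    def log_event(table_name, status, message):
        # Cheap in-memory append; no Delta commit here.
        log_buffer.append((datetime.now(timezone.utc), table_name, status, message))

    def flush_logs(spark, target_path):
        # One append = one Delta commit for the whole run.
        if not log_buffer:
            return
        df = spark.createDataFrame(
            log_buffer,
            "ts timestamp, table_name string, status string, message string",
        )
        df.write.format("delta").mode("append").save(target_path)
        log_buffer.clear()

Would something like that be the right direction, or is there a better pattern for this?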