Data Engineering

How can I efficiently write to easily queryable logs?

BriGuy
New Contributor II

I've got a parallel process loading multiple tables into the data lake, and I'm writing my logs to a Delta table with the DataFrameWriter in append mode. The problem is that every save takes a noticeable amount of time, most of it apparently spent computing the snapshot of the Delta table. That's not a big deal for a small number of processes, but when the job does a lot of work it has to save a large number of log entries. This significantly inflates the run time, and we can't fit a large database download into the three-hour job window we have.
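For illustration, each load task logs more or less like this (the path and columns here are placeholders, not our real ones); every one of these small appends is a separate Delta commit:

from pyspark.sql import Row

# One small append per log entry: each commit has to read the Delta log
# and compute the latest table snapshot before it can write.
log_df = spark.createDataFrame([
    Row(table_name="some_table", status="loaded", row_count=12345),
])
log_df.write.format("delta").mode("append").save("/mnt/datalake/logs/etl_log")  # placeholder path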

I'm passing the following additional options to the DataFrameWriter:

    

additionaloptions = {
    "delta.appendOnly": "true",                          # the log table only ever receives appends
    "delta.autoOptimize.autoCompact": "auto",            # compact small files after writes
    "delta.autoOptimize.optimizeWrite": "true",          # coalesce data into fewer files on write
    "delta.logRetentionDuration": "interval 1 days",
    "delta.deletedFileRetentionDuration": "interval 1 days",
    "delta.tuneFileSizesForRewrites": "true",
}
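For completeness, the options get applied to the writer roughly like this (the path is again a placeholder):

log_df.write.format("delta").mode("append").options(**additionaloptions).save("/mnt/datalake/logs/etl_log")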
 
I'd prefer to log to tables since that makes the logs easy to query, but I'd be open to other logging solutions. It's important that the logs are saved to an external data lake. Maybe I'm coming at this from the wrong direction altogether, so I'm open to ideas.
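For example, one idea would be to buffer the entries and pay the snapshot cost once per run instead of once per entry, assuming the parallel loads run as threads on a single driver. A rough sketch (all names made up); would that be the right direction, or is there a better pattern?

# Buffer log rows during the run, then write them in one append at the end.
# list.append is atomic under CPython's GIL, so concurrent threads can share
# the buffer; separate jobs/processes would need a different shared sink.
log_buffer = []

def log_event(table_name, status, row_count):
    log_buffer.append((table_name, status, row_count))

# ... parallel loads call log_event(...) during the run ...

if log_buffer:
    (
        spark.createDataFrame(log_buffer, "table_name string, status string, row_count long")
            .write.format("delta")
            .mode("append")
            .options(**additionaloptions)
            .save("/mnt/datalake/logs/etl_log")  # placeholder path
    )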