raphaelblg
Databricks Employee

Hi @lprevost

I don't think there's a way to "checkpoint partitions" as you said. 

For the gzip files, your executor is probably running out of memory during decompression. Gzip is not a splittable format, so each file is read by a single task no matter how large it is. One of the few solutions that doesn't require changing your source files is to increase the executor memory.
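If it helps when sizing executor memory, here is a minimal sketch for checking how much a gzip file expands. It is plain Python, not Spark, and it assumes the file is reachable through a local filesystem path. The function name `gzip_sizes` and the chunk size are just illustrative choices.

```python
import gzip
import os


def gzip_sizes(path, chunk_bytes=1024 * 1024):
    """Return (compressed_bytes, uncompressed_bytes) for a gzip file.

    The file is decompressed in fixed-size chunks, so this check itself
    never holds more than `chunk_bytes` of decompressed data at once.
    """
    compressed = os.path.getsize(path)
    uncompressed = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_bytes)
            if not chunk:  # empty bytes means end of stream
                break
            uncompressed += len(chunk)
    return compressed, uncompressed
```

A large ratio between the two numbers suggests that a single task has to work through far more data than the file size on disk implies.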

To enable parallel processing of gzip files, this library might be of interest, although, based on the way it works, I don't think it will address any memory issues: https://github.com/nielsbasjes/splittablegzip
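For reference, a rough sketch of how the library is usually wired into a PySpark read. This is based on my recollection of the project's Spark usage notes, so please treat the codec class name and the reader option as assumptions and verify them against the repository. It also assumes the library is already attached to the cluster and that the data is CSV with a header row. `split_settings` and `read_gzip_csv` are hypothetical helper names.

```python
# Codec class name as I recall it from the splittablegzip project (assumption).
SPLITTABLE_GZIP_CODEC = "nl.basjes.hadoop.io.compress.SplittableGzipCodec"


def split_settings(target_split_mb=128):
    """Build the Spark conf entries and reader options for split gzip reads."""
    return {
        # Upper bound on the size of each input split, in bytes.
        "conf": {
            "spark.sql.files.maxPartitionBytes": str(target_split_mb * 1024 * 1024),
        },
        # Ask the reader to use the splittable codec for .gz files.
        "reader_options": {"io.compression.codecs": SPLITTABLE_GZIP_CODEC},
    }


def read_gzip_csv(spark, path, target_split_mb=128):
    """Read gzip-compressed CSV files using the settings above.

    `spark` is an existing SparkSession; CSV with a header is an assumption.
    """
    settings = split_settings(target_split_mb)
    for key, value in settings["conf"].items():
        spark.conf.set(key, value)
    reader = spark.read
    for key, value in settings["reader_options"].items():
        reader = reader.option(key, value)
    return reader.csv(path, header=True)
```

In a notebook this would be called as `df = read_gzip_csv(spark, "<your path>")`. The split size is a tuning knob, and smaller splits mean more tasks per file.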

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
