rlgarris
Databricks Employee

Hi Mzaradzki -

In Spark 1.5 we will be adding a feature that improves metadata caching for Parquet specifically, so it should greatly improve performance for your use case above.

One option to improve performance in Databricks is to use the dbutils.fs.cacheFiles function to cache your Parquet files on the SSDs attached to the workers in your cluster.
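As a rough sketch of that caching step, a small helper like the one below could be run from a Python notebook. The helper name, the example path, and the call shape (one path string per dbutils.fs.cacheFiles call) are assumptions for illustration, so check dbutils.fs.help() in your workspace for the exact signature before relying on it.

```python
def cache_parquet_paths(dbutils, paths):
    """Request SSD caching for each path and return {path: result}.

    `dbutils` is the object Databricks notebooks provide; it is passed in
    explicitly so the helper has no hidden globals.
    """
    results = {}
    for path in paths:
        # Assumed call shape: one path string per call.
        results[path] = dbutils.fs.cacheFiles(path)
    return results

# In a notebook (hypothetical path):
# cache_parquet_paths(dbutils, ["/mnt/data/events.parquet"])
```

Caching is per cluster, so it would likely need to be repeated after the cluster restarts.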

Cheers,

Richard