Parquet file merging or other optimisation tips

xxMathieuxxZara
New Contributor

Hi,

I need some guidelines for a performance issue with Parquet files:

I am loading a set of Parquet files using: df = sqlContext.parquetFile( folder_path )

My Parquet folder has 6 subdivision (partition) keys.

It was initially OK with a first sample of data organized this way, so I started pushing more, and performance is slowing down very quickly as I do so.

Because of the way data arrives every day, the above folder partitioning is "natural", BUT it leads to small files, which I have read is a common explanation for this kind of bottleneck.

Shall I merge several of the sub folders in a second phase? If so, what function (Python API) shall I use for this?

6 REPLIES

User16826991422
Contributor

Hi Mzaradzki -

In Spark 1.5 we will be adding a feature to improve metadata caching in Parquet specifically, so it should greatly improve performance for your use case above.

One option to improve performance in Databricks is to use the dbutils.fs.cacheFiles function to move your Parquet files to the SSDs attached to the workers in your cluster.

Cheers,

Richard

Hi Richard,

Will this actually parallelize reading the footers? Or just help for Spark-generated Parquet files? WRT the serialized footer reading, I haven't noticed large gains from caching the files on the SSDs.

Cheers,

Ken
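
One knob that is sometimes worth checking when footer reads dominate (a sketch only, not a confirmed fix for the case above; folder_path stands in for the Parquet folder from the original question) is Parquet schema merging, since merging schemas makes Spark inspect the footer of each part file. It can be disabled explicitly on the reader:

# Disable schema merging so Spark does not need to read every part file's footer
df = sqlContext.read.option("mergeSchema", "false").parquet(folder_path)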

vida
Contributor II

Hi,

There are a couple of SQL optimizations I recommend for you to consider.

1) Making use of partitions for your table may help if you frequently only access data from certain days at a time. There's a notebook in the Databricks Guide called "Partitioned Tables" with more details (a short partitioning sketch follows the snippet below).

2) If your files are really small, it is true that you may get better performance by consolidating those files into a smaller number. You can do that easily in Spark with a command like this:

sqlContext.parquetFile( SOME_INPUT_FILEPATTERN )
          .coalesce(SOME_SMALLER_NUMBER_OF_DESIRED_PARTITIONS)
          .write.parquet(SOME_OUTPUT_DIRECTORY)
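
As a small illustration of point 1 (a sketch only; the df, the event_date column, and the /mnt/events path are made-up placeholders), writing the data partitioned by a date column lets Spark skip whole directories when you filter on that column:

# Write the data partitioned by a date column: each day lands in its own sub-folder
df.write.partitionBy("event_date").parquet("/mnt/events")

# A filter on the partition column only reads the matching sub-folders
daily = sqlContext.read.parquet("/mnt/events").where("event_date = '2015-08-01'")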

User16301467532
New Contributor II

Having a large number of small files or folders can significantly deteriorate the performance of loading the data. The best way is to keep the folders/files merged so that each file is around 64 MB in size. There are different ways to achieve this: your writer process can either buffer the data in memory and write only after reaching a certain size, or, as a second phase, you can read the temp directory, consolidate the files together, and write them out to a different location. If you want to do the latter, you can read each of your input directories as a DataFrame, union them, repartition to the number of files you want, and dump it back. A code snippet in Scala would be:

import scala.collection.mutable.MutableList
import org.apache.spark.sql.DataFrame

// Collect one DataFrame per input directory
val dfSeq = MutableList[DataFrame]()
sourceDirsToConsolidate.foreach { dir => dfSeq += sqlContext.parquetFile(dir) }

// Union them all and write the result back out as the desired number of files
val masterDf = dfSeq.reduce((df1, df2) => df1.unionAll(df2))
masterDf.coalesce(numOutputFiles).write.mode(saveMode).parquet(destDir)

The DataFrame API is the same in Python, so you should be able to easily convert this to Python.
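
A rough Python equivalent (a sketch only; source_dirs, num_output_files, save_mode and dest_dir are placeholders you would define) could look like this:

from functools import reduce

# Read each source directory into its own DataFrame
dfs = [sqlContext.parquetFile(d) for d in source_dirs]

# Union them all, then write them back out as a smaller number of files
master_df = reduce(lambda df1, df2: df1.unionAll(df2), dfs)
master_df.coalesce(num_output_files).write.mode(save_mode).parquet(dest_dir)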

Hi Prakash,

I am trying to transfer Parquet files from on-prem Hadoop to S3. I am able to move normal HDFS files, but when it comes to Parquet it is not working properly.

Do you have any clue how we can transfer Parquet files from HDFS to S3?

Appreciate your response.

Thanks

Ishan
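
One minimal sketch for this kind of transfer (a sketch only; it assumes the cluster can reach both filesystems, that S3 credentials are already configured for the s3a connector, and that both paths below are placeholders) is simply to read the Parquet data from HDFS with Spark and write it back out to S3:

# Read the Parquet data from the on-prem HDFS location (placeholder path)
df = sqlContext.read.parquet("hdfs://namenode:8020/data/my_table")

# Write it out to S3 as Parquet (placeholder bucket/path)
df.write.mode("overwrite").parquet("s3a://my-bucket/data/my_table")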

Anonymous
Not applicable

I have multiple small Parquet files in all partitions (this is legacy data) and I want to merge the files in each partition directory into a single file. How can we achieve this?
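
One way to sketch this per-partition compaction (an illustration only; partition_dirs and the legacy_root / compacted_root path fragments are placeholders) is to read each partition directory, coalesce it to a single output file, and write it to a mirrored location:

# partition_dirs is a list of the existing partition directories (placeholder)
for part_dir in partition_dirs:
    df = sqlContext.read.parquet(part_dir)
    # Keep the same relative partition path under a new root, one file per partition
    out_dir = part_dir.replace("/legacy_root/", "/compacted_root/")
    df.coalesce(1).write.mode("overwrite").parquet(out_dir)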
