Delta addresses the problem of a large number of small files with the operations below, which are available for any Delta table. Optimized writes improve the write operation by adding an additional shuffle step and reducing the number of output files. By default…
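A minimal sketch of how those operations can be applied; the table name `events` is a placeholder, and the property values shown assume a Delta table on Databricks:

    # Enable optimized writes and auto compaction on a Delta table (table name is hypothetical)
    spark.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact' = 'true'
        )
    """)

    # Compact existing small files into larger ones
    spark.sql("OPTIMIZE events")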
As part of my batch processing I archive a large number of small files received from the source system each day using the dbutils.fs.mv command. This takes hours, as dbutils.fs.mv moves the files one at a time. How can I speed this up?
@Dean Lovelace You can use multithreading. See the example here: https://nealanalytics.com/blog/databricks-spark-jobs-optimization-techniques-multi-threading/
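A minimal sketch of that approach, assuming the files sit in a single source directory; the paths and worker count are placeholders, and dbutils is the global available in Databricks notebooks:

    from concurrent.futures import ThreadPoolExecutor

    src_dir = "dbfs:/mnt/landing/incoming"     # hypothetical source path
    archive_dir = "dbfs:/mnt/landing/archive"  # hypothetical archive path

    files = dbutils.fs.ls(src_dir)

    def move_file(f):
        # f is a FileInfo with .path and .name attributes
        dbutils.fs.mv(f.path, archive_dir + "/" + f.name)

    # dbutils.fs.mv is I/O-bound, so a thread pool gives a large speedup
    # over a serial loop; list() forces completion and surfaces any errors
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(move_file, files))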
We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in multiple small files, each 800 KB-1.5 MB in size. Because of this the job is split into many tasks and takes a long time to complete…
What Jose said. If you cannot use Delta, or do not want to: coalesce and repartition/partitioning are the way to control the file size, as sketched below. There is no one ideal file size; it all depends on the use case, available cluster size, data flow downstream…
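A rough sketch of controlling the output file count this way; the table names, join key, output path, and the target of 64 partitions are all illustrative placeholders, not recommendations:

    # Placeholder inputs standing in for the 500 GB join described above
    large_df = spark.table("large_table")
    other_df = spark.table("other_table")
    result_df = large_df.join(other_df, "id")

    # repartition(n) does a full shuffle and yields n evenly sized output files;
    # coalesce(n) avoids the shuffle but can produce skewed files
    (result_df
        .repartition(64)
        .write
        .mode("overwrite")
        .parquet("dbfs:/mnt/output/joined"))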
I have read and heard that having too many small files can cause performance problems when reading large data sets. But how do I know if that is an issue I am facing?
The Databricks SQL endpoint has a Query History section which provides additional information to debug and tune queries. One such metric, under execution details, is the number of files read. For ETL/data science workloads, you could use the Spark UI of the…
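Outside the UI, if the data is in a Delta table, DESCRIBE DETAIL gives a quick read on the same question; a small sketch, where the table name `events` is a placeholder:

    # Compare file count to total size for a Delta table
    detail = spark.sql("DESCRIBE DETAIL events").select("numFiles", "sizeInBytes").first()

    avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
    print(f"{detail['numFiles']} files, avg {avg_file_mb:.1f} MB per file")
    # An average file size far below ~128 MB across a large table
    # is a sign you are facing the small-files problem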