
What does "Determining location of DBIO file fragments..." mean, and how do I speed it up?

Ajay-Pandey
Esteemed Contributor III

Determining location of DBIO file fragments. This operation can take some time.

What does this mean, and how do I prevent Spark from performing this apparently expensive operation every time? It happens even when all the underlying tables are Delta tables.

Ajay Kumar Pandey
1 ACCEPTED SOLUTION

LandanG
Databricks Employee

Hey @Ajay Pandey,

That message is related to Delta caching. If a cluster is constantly scaling up or down, it can occasionally lose pieces of the Delta cache; "Determining location of DBIO file fragments" is the operation that works out on which executors the files were cached.

This can be helped by trying a newer DBR such as 11.3 or 12.x. You could also try turning off the cache by setting the configuration below in the notebook and observing the behaviour:

# Disable the Databricks disk (Delta) cache for the current session
spark.conf.set("spark.databricks.io.cache.enabled", "false")

You could also try optimizing the table(s):

%sql OPTIMIZE <table_name>
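
For reference, a minimal notebook sketch combining both suggestions (the table name my_table is a placeholder, not from the thread):

# Check the current state of the Databricks disk (Delta) cache
print(spark.conf.get("spark.databricks.io.cache.enabled", "not set"))

# Disable it for this session, then re-run the slow query to compare
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Compact small files so there are fewer cache fragments to locate
spark.sql("OPTIMIZE my_table")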


6 REPLIES


Ajay-Pandey
Esteemed Contributor III

Thanks

Ajay Kumar Pandey

AdrianLobacz
Contributor

That is a message about the Delta cache. It determines which data is cached on which executors so that tasks can be routed for the best cache locality. Optimizing your table more frequently, so there are fewer files, will improve this.

You can try:

%sql OPTIMIZE <table_name>
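
As a quick way to see the effect, a minimal sketch (again assuming a hypothetical Delta table named my_table; for Delta tables, DESCRIBE DETAIL reports a numFiles column):

# File count before compaction
before = spark.sql("DESCRIBE DETAIL my_table").select("numFiles").first()[0]

# Compact small files into larger ones
spark.sql("OPTIMIZE my_table")

# File count after compaction; fewer files means less fragment bookkeeping
after = spark.sql("DESCRIBE DETAIL my_table").select("numFiles").first()[0]
print(f"numFiles: {before} -> {after}")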

Ajay-Pandey
Esteemed Contributor III

Thanks

Ajay Kumar Pandey

Christianben9
New Contributor II

"Determining location of DBIO file fragments" is a message that may be displayed during the boot process of a computer running the NetApp Data ONTAP operating system. It indicates that the system is identifying and locating the DBIO (Data Block Input/Output) file fragments on the storage system. This process is necessary to ensure that all data on the system is accessible and in a consistent state.

The time it takes to complete this process can depend on several factors, such as the number of disks in the system, the amount of data stored on the disks, and the performance of the disks themselves. However, there are a few things you can do to potentially speed up this process:

  1. Increase the number of spare disks: Adding more spare disks to the system can help to speed up the process, as the system can use these spare disks to rebuild data faster.
  2. Check for disk errors: Make sure that all the disks are functioning properly and there are no errors on them.
  3. Check for firmware updates: Make sure that the firmware of the storage system and the disks is up to date.
  4. Check for performance bottlenecks: Check for any performance bottlenecks on the storage system, such as high CPU or memory usage, and address them if necessary.
  5. Check for any other software issues: Ensure that the software is running smoothly and not having any issues.

Keep in mind that this process is an important step in ensuring data integrity; it should not be skipped or rushed. Be patient and let the process finish.

Ajay-Pandey
Esteemed Contributor III

Thanks

Ajay Kumar Pandey
