Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Writing files using multithreading to dbfs

daan_dw
New Contributor III

Hello,

I am reading XML files from AWS S3 and storing them on dbfs:/ using multithreaded code. The code itself seems fine: for roughly the first 100,000 files it works without issues, and I can see the data arriving on DBFS.

However, it always ends up throwing the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/tmp_bolt/OUTBOUND_e6930c2a-a885-11ed-8d7c-00163e37acad_InformativePeakPowers_202302/OUTBOUND_e6930c2a-a885-11ed-8d7c-00163e37acad_InformativePeakPowers_202302/relfile.rels'
 
Note that the FileNotFoundError names a different directory each time. Also, the higher the thread count, the fewer files are processed before I get this error.

Any idea what is causing this?


1 REPLY

SP_6721
Contributor

Hi @daan_dw 

I think this issue mainly comes from using multithreading to handle XML files while interacting with both S3 and DBFS. When the thread count gets too high, it likely causes race conditions.

To avoid this:

  • Try reducing the number of threads.
  • Make sure each thread writes to a unique directory to prevent any overlap.
  • If the target path is a mount, call dbutils.fs.refreshMounts() periodically to keep the cluster's mount metadata up to date and avoid latency-related errors.
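The first two suggestions could be sketched roughly as follows. This is only a minimal illustration, not the original poster's code: `fetch_xml`, `write_one`, and `copy_all` are hypothetical names, `fetch_xml` is a stand-in for the real S3 download, and the base directory would be a `/dbfs/...` path on a cluster (a local temp directory works for trying it out). The worker count is capped, and every task creates its own directory before writing into it.

```python
import os
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_xml(key: str) -> bytes:
    """Hypothetical stand-in for the S3 download of one XML file."""
    return f"<doc key='{key}'/>".encode("utf-8")


def write_one(key: str, base_dir: str) -> str:
    """Write one file into a directory that no other task shares."""
    # The random suffix keeps directories unique even if keys repeat.
    target_dir = os.path.join(base_dir, f"{key}_{uuid.uuid4().hex}")
    # Create the directory right before writing; exist_ok avoids an
    # error if the path is already there.
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, "relfile.rels")
    with open(path, "wb") as fh:
        fh.write(fetch_xml(key))
    return path


def copy_all(keys, base_dir, max_workers=8):
    """Run the writes on a bounded thread pool and collect failures."""
    written, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(write_one, k, base_dir): k for k in keys}
        for fut in as_completed(futures):
            try:
                written.append(fut.result())
            except OSError as exc:
                # Keep the key so the failed file can be retried later.
                failed.append((futures[fut], exc))
    return written, failed
```

Collecting failures instead of letting the first exception stop the run makes it easier to see whether the errors cluster around particular paths or only appear at higher `max_workers` values.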
