
how to parallel n number of process in databricks

jitenjha11 — Tue, 19 Aug 2025 07:32:07 GMT

Requirement: I have a volume into which txt files arrive from MQ at random, in varying numbers. In my workspace I have a Python script. I have also created a job that is triggered automatically when a new file arrives in the volume.

What I need is something in the middle that runs the Python script n times when n files arrive in the volume. There is only one Python script, but it should be called once per file, so n times for n files. I do not want to use multithreading or multiprocessing inside the Python script to do this. Is there any other way to do it?

I am attaching a flow chart to help explain my requirement.

Re: how to parallel n number of process in databricks

MujtabaNoori — Tue, 19 Aug 2025 08:10:36 GMT

Hi @jitenjha11,

You have a couple of options to handle this scenario:

Batch Processing:
Once the n text files have arrived in the volume, you can read them in batches, process the required data, and then move the processed files to an archive directory.

Iterative Processing:
Alternatively, you can loop through the directory in the volume using dbutils commands, read each text file one by one, and process them sequentially.
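As a rough sketch of the iterative option with archiving: this assumes the volume is reachable as a regular file path (for example under /Volumes/...), so plain Python file APIs are used instead of dbutils. `process_file` is a hypothetical stand-in for the existing script logic, and the directory arguments are placeholders.

```python
import shutil
from pathlib import Path


def process_file(path: Path) -> int:
    """Hypothetical stand-in for the real script: count the lines in one file."""
    return len(path.read_text().splitlines())


def process_and_archive(landing_dir: str, archive_dir: str) -> dict:
    """Process every .txt file in landing_dir one by one, then move it to archive_dir."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    results = {}
    for path in sorted(Path(landing_dir).glob("*.txt")):
        results[path.name] = process_file(path)
        # Move only after processing succeeded, so a failed file is picked up again next run.
        shutil.move(str(path), str(archive / path.name))
    return results
```

This runs the files sequentially, so it does not give parallelism by itself; it only shows the process-then-archive pattern.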

However, I'd recommend using Auto Loader here. It's more reliable since it automatically handles file detection and provides fault tolerance using checkpointing. Auto Loader reads files in micro-batches and tracks which files it has already ingested, ensuring that each file is processed exactly once. On top of each micro-batch, you can run your script, for example by using the foreachBatch() function.

If you have questions, let me know.
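A minimal sketch of the Auto Loader plus foreachBatch() idea, not a definitive implementation. It assumes a Databricks runtime where the `cloudFiles` source and the `_metadata.file_path` column are available; `start_stream`, `make_batch_handler`, `source_dir`, `checkpoint_dir`, and `process_file` are hypothetical names. If your runtime asks for `cloudFiles.schemaLocation`, add that option as well.

```python
def file_paths_in_batch(batch_df) -> list:
    """Return the distinct source file paths contained in one micro-batch."""
    rows = batch_df.select("file_path").distinct().collect()
    return sorted(row["file_path"] for row in rows)


def make_batch_handler(process_file):
    """Build a foreachBatch callback that calls process_file once per new file."""
    def handle_batch(batch_df, batch_id):
        for path in file_paths_in_batch(batch_df):
            process_file(path)
    return handle_batch


def start_stream(spark, source_dir, checkpoint_dir, process_file):
    """Start an Auto Loader stream over source_dir (paths are placeholders)."""
    files = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")
        .load(source_dir)
        # Keep only the path of the file each row came from.
        .selectExpr("_metadata.file_path AS file_path")
    )
    return (
        files.writeStream
        .foreachBatch(make_batch_handler(process_file))
        .option("checkpointLocation", checkpoint_dir)
        # Process everything that is currently available, then stop.
        .trigger(availableNow=True)
        .start()
    )
```

The checkpoint is what prevents a file from being picked up twice across runs. Inside one micro-batch the files are still handled one after another on the driver, so this gives reliable once-per-file calls rather than parallel runs.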

Re: how to parallel n number of process in databricks

jitenjha11 — Tue, 19 Aug 2025 11:21:56 GMT

Kindly share an example of Auto Loader that calls the Python script n times for n txt files.

Re: how to parallel n number of process in databricks

BR_DatabricksAI — Thu, 21 Aug 2025 07:35:32 GMT

Hello @jitenjha11: You can do it in the same manner highlighted by @MujtabaNoori, but you have to call the process twice.
Sharing sample reference code below:

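1st process: iterate through the directories and ingest the files in each one with Auto Loader.

    directories = ["/mnt/data/src1", "/mnt/data/src2"]

    for directory in directories:
        # Read files using Auto Loader
        # (cloudFiles.schemaLocation is needed when the schema is inferred)
        df = spark.readStream.format("cloudFiles") \
            .option("cloudFiles.format", "csv") \
            .option("cloudFiles.schemaLocation", "<<Location>>") \
            .load(directory)

        # Process the data (e.g., write to a Delta table).
        # Each stream needs its own checkpoint location.
        df.writeStream.format("delta") \
            .option("checkpointLocation", "<<Location>>") \
            .start("/mnt/delta/<<location>>")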

2nd process: call the external Python script once per directory.

    import subprocess

    directories = ["/mnt/data/src1", "/mnt/data/src2"]

    for directory in directories:
        # Call the external Python script with the directory as its argument
        subprocess.run(["python", "process_data.py", directory])
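Note that subprocess.run() waits for each call to finish, so the loop above runs the script sequentially. A possible variant, sketched here under the assumption that the script takes a single path argument, starts one child process per file and then waits for all of them. `run_in_parallel` is a hypothetical helper name, and every child process runs on the driver node, so a large n can overload it.

```python
import subprocess
import sys


def run_in_parallel(script: str, file_paths: list) -> dict:
    """Start one child process per file, then wait for all of them to finish."""
    # Popen returns immediately, so all children run at the same time.
    procs = {
        path: subprocess.Popen([sys.executable, script, path])
        for path in file_paths
    }
    # Collect the exit code of each child, keyed by the file it handled.
    return {path: proc.wait() for path, proc in procs.items()}
```

The script itself stays single-threaded; only the launcher starts several copies of it. A non-zero exit code in the returned mapping marks a file whose run failed.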

I would recommend using a workflow, which gives you the flexibility to run the process in a For Each loop: when new files arrive, you pass each new file name as a parameter and call the second notebook.
Please go through the link below; it might help.
For Each In Databricks Workflows. One For Each, Each For All! | by René Luijk | Medium
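A rough sketch of how the For Each approach could be wired up, under stated assumptions: a first task named `list_files` publishes the file list as a task value, and a For Each task runs the script once per entry. The task keys, the `files` key, the volume and workspace paths, and the concurrency of 10 are all hypothetical, and the job definition should be checked against the Jobs API reference before use.

```python
def txt_paths(entries) -> list:
    """Keep the paths of .txt files from a dbutils.fs.ls(...) style listing."""
    return sorted(e.path for e in entries if e.name.endswith(".txt"))


# Task 1 (notebook), with a placeholder volume path:
#   files = txt_paths(dbutils.fs.ls("/Volumes/<catalog>/<schema>/<volume>"))
#   dbutils.jobs.taskValues.set(key="files", value=files)

# Task 2: For Each task that fans out over the list published by task 1.
for_each_task = {
    "task_key": "process_each_file",
    "depends_on": [{"task_key": "list_files"}],
    "for_each_task": {
        "inputs": "{{tasks.list_files.values.files}}",
        "concurrency": 10,  # how many iterations may run at the same time
        "task": {
            "task_key": "process_one_file",
            "spark_python_task": {
                "python_file": "/Workspace/<path>/process_data.py",
                "parameters": ["{{input}}"],  # the current file path
            },
        },
    },
}


def main(argv: list) -> str:
    """Entry point of the hypothetical process_data.py: one run handles one file."""
    file_path = argv[1]
    # ... the existing single-file logic goes here ...
    return file_path
```

With this layout the script stays single-threaded and handles exactly one file per run; the parallelism comes from the job running several iterations at once.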