
How to run n number of processes in parallel in Databricks

jitenjha11
New Contributor II

Requirement: I have a volume into which random txt files arrive from MQ, each containing random numbers. In my workspace I have a Python script. I have also created a job that triggers automatically whenever a new file lands in the volume.

My requirement is that I need something in the middle that will run n times when n files arrive in the volume: there is only one Python script, but it should be called n times, once for each file. I do not want the Python script to use multithreading or multiprocessing to do this work. Is there any other way to do it?

I am attaching a flow chart to help explain my requirement.

3 REPLIES

MujtabaNoori
New Contributor III

Hi @jitenjha11,

You have a couple of options to handle this scenario:

Batch Processing:
Once the n number of text files arrive in the volume, you can read them in batches, process the required data, and then move the processed files to an archive directory.

Iterative Processing:
Alternatively, you can loop through the volume directory using dbutils commands, read each text file one by one, and process them sequentially (a minimal sketch follows below).
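
A minimal sketch of that iterative approach, assuming a hypothetical volume path and a process_file() helper that wraps your existing script logic (both are placeholders, not names from this thread):

input_dir = "/Volumes/main/default/incoming"    # hypothetical volume path
archive_dir = "/Volumes/main/default/archive"   # hypothetical archive path

for f in dbutils.fs.ls(input_dir):
    if f.name.endswith(".txt"):
        process_file(f.path)                    # placeholder for your existing script logic
        # move the processed file so it is not picked up again
        dbutils.fs.mv(f.path, archive_dir + "/" + f.name)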

However, I'd recommend using Auto Loader here. It's more reliable since it automatically handles file detection and provides fault tolerance through checkpointing. Auto Loader reads files in micro-batches, ensuring that each file is processed exactly once. On top of each batch, you can run your script, for example by using the foreachBatch() function (a rough sketch follows below).
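
A rough sketch of that pattern, assuming the txt files land in a hypothetical volume path and that your existing logic is wrapped in a process_file(path) function; the paths, options, and that helper are placeholders to adapt:

input_path = "/Volumes/main/default/incoming"          # hypothetical volume path
checkpoint_path = "/Volumes/main/default/_checkpoint"  # hypothetical checkpoint path

def handle_batch(batch_df, batch_id):
    # call the script once for each newly detected file in this micro-batch
    for row in batch_df.select("path").distinct().collect():
        process_file(row.path)                         # placeholder for your existing script

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")
    .load(input_path)
    # _metadata.file_path requires a recent Databricks Runtime
    .selectExpr("_metadata.file_path AS path", "value")
)

(stream.writeStream
    .foreachBatch(handle_batch)
    .option("checkpointLocation", checkpoint_path)
    .start())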

If you have questions, let me know.

jitenjha11
New Contributor II

Kindly share an example of Auto Loader that will call the Python script n times for n txt files.

 

BR_DatabricksAI
Contributor III

Hello @jitenjha11: You can do it in the same manner @MujtabaNoori highlighted, but you have to split it into two processes.
Sharing sample reference code below:

1st process: iterating through the files in each directory using Auto Loader.

directories = ["/mnt/data/src1", "/mnt/data/src2"]

for directory in directories:
    # Read files using Auto Loader
    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .load(directory))

    # Process the data (e.g., write to a Delta table);
    # each stream needs its own checkpoint location
    (df.writeStream.format("delta")
       .option("checkpointLocation", "<<Location>>")
       .start("/mnt/delta/<<location>>"))

2nd process: calling the external Python script once per directory.

import subprocess

directories = ["/mnt/data/src1", "/mnt/data/src2"]

for directory in directories:
    # Call the external Python script with the directory as an argument
    subprocess.run(["python", "process_data.py", directory])

I would recommend using a workflow, which gives you the flexibility to run the process in a For Each loop: when new files arrive, you pass the new file name as a parameter and call the second notebook (see the sketch after the link).
Please go through the link below; it might help.
For Each In Databricks Workflows. One For Each, Each For All! | by René Luijk | Medium
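
If you go the workflow route and call a second notebook per file, a hedged sketch of the hand-off is below; the notebook path and the "file_path" parameter name are assumptions, not names from this thread:

new_files = [f.path
             for f in dbutils.fs.ls("/Volumes/main/default/incoming")   # hypothetical volume path
             if f.name.endswith(".txt")]

for file_path in new_files:
    # run the processing notebook once per file (600-second timeout)
    dbutils.notebook.run("/Workspace/Users/<you>/process_file_nb", 600,
                         {"file_path": file_path})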

 

BR