10-25-2024 08:41 AM
Hello All,
I am trying to fetch data from different sources for tables driven by a metadata table. For each table listed in the metadata table, the data is fetched from its source using the JDBC connector. A scheduled job is responsible for fetching the data for each table. Now that a large number of new tables has been added, I want faster and more efficient ingestion through parallel processing. I tried the Maximum concurrent runs setting in the workflow and expected 6 parallel runs with concurrent runs=6, but only one run shows up. Does this happen at executor level? What is the expected outcome of the Maximum concurrent runs option?
10-30-2024 12:15 AM
So, you use a loop to go through the metadata table and then retrieve and ingest the data using JDBC?
If so, then concurrent runs won't be helpful. Maximum concurrent runs sets how many runs of the same job can run side by side. For you, this would probably mean ingesting the same data 6 times if you were to run the job 6 times.
If you want to retrieve and ingest those tables concurrently, you can either:
If it's easy for you to describe the process as a DAG (directed acyclic graph), I'd say that using Databricks tasks is pretty straightforward. You could also try out https://docs.databricks.com/en/jobs/for-each.html, but I'm not sure how the concurrency works with that one.
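As a rough illustration of ingesting the tables concurrently inside a single run, here is a minimal sketch using Python's standard thread pool. This is a swapped-in technique, not something confirmed by the original setup: `ingest_tables_concurrently` and `ingest_fn` are hypothetical names, and `ingest_fn` stands in for whatever function reads one table over JDBC and writes it to the target.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_tables_concurrently(tables, ingest_fn, max_workers=6):
    """Run ingest_fn(table) for every table, at most max_workers at a time.

    Returns (succeeded, failed): succeeded maps table -> return value,
    failed maps table -> the exception raised for that table.
    """
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per table and remember which future belongs to which table.
        futures = {pool.submit(ingest_fn, table): table for table in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                succeeded[table] = future.result()
            except Exception as exc:
                # One failing table should not stop the others.
                failed[table] = exc
    return succeeded, failed
```

In a notebook, `ingest_fn` would typically wrap a JDBC read and a write for one table name taken from the metadata table, and `max_workers` would be tuned to what the source databases and the cluster can handle.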
10-26-2024 02:02 PM
Hi,
It seems the run is getting queued. It might be due to the following settings (except the 3rd):
10-29-2024 10:51 AM
Hi Angad,
No, the runs are not getting queued. As this property is set at job level, I was expecting the runs to either execute concurrently or get queued, but we always see only 1 run of the workflow even when concurrent runs is set to 6.
10-29-2024 10:16 PM
The Maximum concurrent runs parameter allows multiple runs of the same workflow to be executed in parallel. Since you've switched the queue parameter on, any runs beyond 6 will be queued. This only applies if the same workflow is triggered multiple times.
We can help you better if you provide more details on your workflow setup and how it is triggered. Is it 1 workflow or multiple workflows?
You've mentioned that only 1 workflow is running. And you've also mentioned there is a scheduled job for each table. Is it the same job/workflow for all tables or different ones for each? Since you have scheduled your job at a certain time, how is it getting triggered multiple times?
If you've scheduled multiple jobs all using the same notebook and different parameters, the Maximum concurrent runs parameter will not help you.
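To illustrate when Maximum concurrent runs does matter: the same job has to be triggered several times, for example once per table. The sketch below only builds the request bodies for the Jobs API run-now endpoint and does not send them. It assumes the job defines a job parameter called `table_name`, which is a hypothetical name, and `build_run_now_payloads` is a hypothetical helper.

```python
def build_run_now_payloads(job_id, table_names):
    """Build one run-now request body per table for the same job.

    Assumes the job declares a job parameter named "table_name" (hypothetical).
    Each body would be POSTed separately to the Jobs API run-now endpoint;
    Maximum concurrent runs then caps how many of these runs execute at once.
    """
    return [
        {"job_id": job_id, "job_parameters": {"table_name": name}}
        for name in table_names
    ]
```

With Maximum concurrent runs set to 6 and queueing enabled, triggering more than 6 of these runs at once should leave the extra runs queued rather than skipped.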