Data Engineering

Thousands of ETL pipelines with long execution times and small dataset sizes

GijsM
New Contributor
Hi,
I work for a small company that mostly focuses on small retail and e-commerce customers. We provide data analysis and automated data connections between their platforms.
Most of our datasets are things like order data and Google Ads click data, and they are quite small: most range from a few megabytes up to a gigabyte.
Problem 1 is that we have a LOT of daily pipelines (around 2,000 pipeline runs each day), which all need custom configuration. These are not very standard pipelines, because many of the data sources are 'custom' ones specific to our customers' needs. We pull these sources ourselves using in-house developed connectors to, for example, REST APIs and external databases. We also do merges on data that can only be retrieved from an API: for example, merging order data (which carries a customer ID) with customer data that has to be retrieved one customer at a time from their retail platform's service.
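To make that concrete, a typical extract-and-merge step looks roughly like the sketch below (the endpoint and field names are illustrative, not our real connector code):

```python
import requests
import pandas as pd

BASE_URL = "https://example-retail-platform.com/api"  # illustrative endpoint


def fetch_orders(session: requests.Session) -> pd.DataFrame:
    """Pull the full order list in one call (paging omitted for brevity)."""
    resp = session.get(f"{BASE_URL}/orders")
    resp.raise_for_status()
    return pd.DataFrame(resp.json())


def fetch_customer(session: requests.Session, customer_id: str) -> dict:
    """The platform only exposes customers one by one, so every ID is a separate request."""
    resp = session.get(f"{BASE_URL}/customers/{customer_id}")
    resp.raise_for_status()
    return resp.json()


def build_enriched_orders() -> pd.DataFrame:
    with requests.Session() as session:
        orders = fetch_orders(session)
        # One HTTP round trip per unique customer: this is where most of the runtime goes.
        customers = pd.DataFrame(
            [fetch_customer(session, cid) for cid in orders["customer_id"].unique()]
        )
        return orders.merge(customers, on="customer_id", how="left")
```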
Problem 2 is that, because of the limitations of these APIs, most pipelines take quite long to load the entire dataset, some up to an hour. Because of this, our server sits idle for a long time, waiting for new data to come in.
In a nutshell, we have a large number of ETL pipelines that need to be orchestrated to run on a schedule: daily, weekly, and sometimes hourly. The pipelines have long runtimes and very little memory/CPU usage.
Up until this point we have used our in-house developed DAG application: the data engineers built the application and its backend, and the data scientists use it to create new pipelines. When new data operations are required, we develop and release them to the data scientists as nodes in a graphical click-and-drag interface, where the nodes of the DAG represent data operations. Our backend then handles the scheduling and execution of these pipelines and runs them in parallel on our servers.
This system is quite old and predates most ETL standards. We want to move on because we want to start working with bigger datasets, and building the click-and-drag pipelines is a long and tedious process in an interface written in 2010. We also want a more flexible way of writing ETL pipelines, for example with Python notebooks and a data lake, so we can work and innovate faster in modern data analysis areas like machine learning.
That is why we decided to move to an advanced data platform, namely Azure Databricks. While it seems like the perfect platform for our future workloads, migrating our current pipelines is proving to be quite a problem: running a single one is extremely inefficient on the Databricks architecture. Spinning up a cluster takes a long time because of all the libraries needed for our API connectors. Loading the data then takes even longer, 20 minutes up to an hour. Memory briefly spikes to about 200 MB of the available 14 GB, and the server performs the actual transformations within a few seconds. The whole run can take half an hour, of which only a few seconds are used effectively. With 2,000 daily tasks, the cost of this will grow to a ridiculous size compared to our old system.
We know that there are ways to run jobs in parallel on all-purpose clusters in Databricks; however, there doesn't seem to be a good system in place for load balancing. The platform is not built for lots of tiny, mostly-waiting tasks at once, but rather for huge, compute-optimized tasks.
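For context, one of our tasks is almost entirely waiting on network I/O, so on our current servers we simply keep many of them in flight at once with threads; a rough stand-in looks like this (the sleep replaces 20-60 minutes of API calls):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(pipeline_id: str) -> str:
    """Stand-in for one ETL task: nearly all wall-clock time is spent waiting
    on slow external APIs, with only seconds of real compute at the end."""
    time.sleep(5)                    # placeholder for the long API extraction
    return f"{pipeline_id}: done"    # placeholder for the quick transform + write


def run_all(pipeline_ids: list[str], max_workers: int = 50) -> list[str]:
    # Because the tasks are I/O-bound, one small machine can keep dozens of
    # them running concurrently; CPU and memory are barely touched.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_pipeline, pipeline_ids))


if __name__ == "__main__":
    print(len(run_all([f"pipeline-{i}" for i in range(200)])))
```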
What would be the best and simplest way for us to move on? We are looking at things like using Azure Functions or containers to load our data and write it to the data lake, then using Databricks for the actual transformations. This could work, but it would result in some serious architectural headaches. How would you tackle this problem? Move away from Databricks entirely?
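Roughly, the split we have in mind looks like the sketch below on the extract side: a small function or container job does the slow API pull and lands the raw result in the lake, and a scheduled Databricks job later picks it up for the transform step. Everything here (endpoint, container name, environment variable) is an illustrative assumption, not a working design:

```python
import json
import os
from datetime import datetime, timezone

import requests
from azure.storage.blob import BlobServiceClient


def land_raw_orders(customer: str) -> str:
    """Slow extract step, run outside Databricks (e.g. in an Azure Function or container)."""
    raw = requests.get(
        f"https://example-platform.com/api/{customer}/orders", timeout=3600
    ).json()

    # Drop the raw payload in the landing zone of the data lake.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["DATALAKE_CONNECTION_STRING"]
    )
    path = f"raw/{customer}/orders/{datetime.now(timezone.utc):%Y-%m-%d}.json"
    blob_service.get_blob_client(container="landing", blob=path).upload_blob(
        json.dumps(raw), overwrite=True
    )
    return path  # the Databricks transform job reads from this path
```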
 
1 REPLY

brockb
New Contributor III

Hi,

Thanks for the information. There is a lot to unpack, and some assumptions need to be made without fully understanding the details, so here are a few thoughts:

  • If the cluster start times are longer because of the libraries you're installing, can alternative installation methods be considered? As an example, instead of pip installing from a remote provider at cluster startup, consider persisting a wheel file to object storage and then installing it locally on the cluster (see the sketch after this list).
  • For the ETL jobs that use Spark, the Databricks platform does a nice job of managing auto-scaling cluster resources. If some of the jobs in question are not using Spark at all, it may be worth reviewing your cloud provider's compute/auto-scaling capabilities.
  • If you are a Databricks customer, I would recommend talking through these details with your account's Solution Architect, since they may require some architecture deep-dives.
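As a rough illustration of the wheel-based install, assuming the wheel has already been built and uploaded to DBFS-backed storage (the path and package name are placeholders):

```python
# Databricks notebook cell: install a prebuilt wheel that was copied to storage
# ahead of time, instead of resolving and downloading the connector dependencies
# from PyPI at every cluster start. Path and wheel name below are placeholders.
%pip install /dbfs/FileStore/libraries/custom_connectors-1.0.0-py3-none-any.whl
```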

Thanks.
