ogdendc
New Contributor III

Lower costs.  Increase productivity.  Produce higher quality work.  You just need to beware the for-loop.  Why?

Whether you're scripting in Python, R, Scala, or anything else, for-loops are a great tool in any programmer's toolbox.  However...

there is an insidious type of for-loop

When it comes to working with Big Data, there is an insidious type of for-loop; a time-consuming and expensive type of for-loop.  The goal of this blog is to raise awareness, while also providing some equally effective, much faster, and therefore much less costly alternatives.  At first glance, it looks like a very innocent block of code; but if you have something like this in your program:

[Screenshot: example for-loop code]

…you probably have the potential for some huge reductions in run time.
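For illustration, here is a minimal sketch of the kind of loop in question; the dataframe, column names, and per-subset logic below are hypothetical stand-ins, not the code from the screenshot:

```python
import pandas as pd

# Hypothetical data: several rows per group
pdf = pd.DataFrame({
    "group_id": ["a", "a", "b", "b", "c"],
    "value":    [1.0, 2.0, 3.0, 4.0, 5.0],
})

# The pattern to watch out for: carve out one subset of rows at a time
# and process each subset serially, one iteration after another.
results = []
for group_id in pdf["group_id"].unique():
    subset = pdf[pdf["group_id"] == group_id]
    results.append({"group_id": group_id,
                    "total": subset["value"].sum()})  # stand-in for the real per-subset work

summary = pd.DataFrame(results)
```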

For anyone with a coding background, writing “loops” is a common practice.  A loop is a process for iterating and repeating a block of code.  Such a programming tool has many applications.  I grew up using Fortran (yes, Fortran) and can’t count the number of loops I wrote in college.  That is to say, writing loops was burned into my psyche before I started programming professionally, and I cannot shake my compulsion to code with loops.  During my career, I’ve done a lot of coding to process a lot of data; and that’s involved a lot of loops along the way.  Over time, I’ve learned new programming and scripting languages and find myself frequently using for-loops.  In my job, I get to interact with many customers, and sometimes that leads to me seeing their code.  What I see is a lot of for-loops out there in the world.  A lot.

parallel processing is paramount

Here’s the problem:  for-loops are a serial (not parallel) process.  Why does that matter?  In our brave new world of bigger and bigger data, it’s safe to say that parallel processing is paramount.  If you’re working in the world of small data, then serial processing probably works fine.  Or does it?  Some of the customers I work with are writing code to process what could be considered small data, and yet are often using tools built specifically to handle Big Data; and they are perplexed as to why their for-loop is running for hours rather than minutes.  I’ve worked with several customers who, when faced with this scenario, try to solve it by simply throwing more powerful compute (clusters of machines with processors and memory) at their long-running for-loops, and are perplexed when more computing power doesn’t actually solve the problem.

To illustrate the issue, suppose you have a block of code that you want to repeat one thousand times using a for-loop, and each iteration runs for one minute.  One thousand consecutive (serial) minutes is a bit less than seventeen hours.  Now suppose that you want to decrease this run time by throwing more powerful compute at this process; and suppose that more powerful compute is so much more powerful that it cuts the one-minute run time down to ten seconds (per iteration).  That’s a significant reduction.  That would turn seventeen hours into something closer to three hours.  But what if three hours is still too long?  How then to get the run time (in this hypothetical example) to something measured in minutes rather than hours?  The answer could be to abandon the for-loop and embrace the power of parallel processing.
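The back-of-the-envelope arithmetic behind those numbers:

```python
iterations = 1_000

print(iterations * 60 / 3600)   # 60 s per iteration, run serially  -> ~16.7 hours
print(iterations * 10 / 3600)   # 10 s per iteration, still serial  -> ~2.8 hours
# Run the iterations in parallel and the wall-clock time approaches
# the time of a single iteration.
```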

For data engineers and data scientists wrestling with large volumes of data, a very common tool to bring to the party is Apache Spark®.  One superpower of Spark is its intrinsic ability to facilitate the distribution of compute across a cluster of machines.  But Spark doesn’t take your serial for-loop and automatically turn it into distributed parallel processing (I’m probably foreshadowing new functionality available in some future version of Spark).  As it stands today, if you throw a for-loop at Spark, you still have a serially processed iteration of code.

Let’s go back to the hypothetical for-loop, described above, that iterates a thousand times with each iteration running for one minute.  Rather than throwing more powerful compute at it, what if we instead figured out how to rewrite the code so that it could run those cycles in parallel?  In the extreme, all thousand cycles could run in parallel, resulting in a total run time of something closer to one minute (rather than seventeen hours).  This may sound extreme, but this magnitude of improvement…this difference between parallel vs serial processing…is just that huge.

What is the insidious type of for-loop?  One that iterates through subsets of rows in a dataframe, and independently processes each subset.  For example, suppose one column in a dataframe is ‘geography’, indicating various locations for a retail company.  A common use of a for-loop would be to iterate through each geography and process the data for each geography separately.  There are many applications for such an approach.  For example, we may want to build demand forecasting models that are specific to each geography.  The details for how to efficiently produce such fine-grained forecasts can be found in this Databricks solution accelerator: link.

pandas does not prevent parallel processing

There is a common objection I’ve heard to the idea of converting existing non-parallelized processing into something more “sparkified”:  when the customer or colleague is using Pandas, and knows that Pandas is not a distributed-computing package, they have little appetite for rewriting their existing Pandas code in Spark (sans Pandas).  Rest assured, using Pandas does not stand in the way of parallelizing your process.  This is demonstrated in the accompanying code (https://github.com/ogdendc/beware-the-for-loop).
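To make that concrete, here is a rough sketch (not the exact code from the accompanying repo) of how existing pandas logic can be handed to Spark via groupBy(...).applyInPandas, so each group is processed in parallel while the per-group logic stays pure pandas; the data and column names are hypothetical:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Spark DataFrame with a grouping column
sdf = spark.createDataFrame(
    [("east", 1.0), ("east", 2.0), ("west", 3.0), ("west", 4.0)],
    ["geography", "value"],
)

# The per-group logic stays plain pandas: it receives one group's rows
# as a pandas DataFrame and returns a pandas DataFrame.
def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "geography": [pdf["geography"].iloc[0]],
        "total":     [pdf["value"].sum()],
    })

# Spark runs summarize() once per geography, distributing the groups
# across the cluster instead of looping over them serially.
result = sdf.groupBy("geography").applyInPandas(
    summarize, schema="geography string, total double"
)
result.show()
```

The point is that summarize() is plain pandas; Spark simply takes care of distributing the groups.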

The accompanying code demonstrates the insidious type of for-loop (one that iterates through subsets of data and independently processes each subset), while also demonstrating much faster alternative approaches, thanks to the parallel processing power of Spark.  The four methods compared are: an iterative for-loop method, a groupBy.applyInPandas approach, the ThreadPoolExecutor method from concurrent.futures, and a PySpark (no Pandas) approach.  The following chart depicts a comparison of run times for the four methods evaluated:

[Chart: run-time comparison of the four methods]
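For readers who have not used it, here is a minimal sketch of the ThreadPoolExecutor idea (method 3 above), again with hypothetical names: the per-subset work is submitted as concurrent tasks instead of being run one after another.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical data and per-group logic
pdf = pd.DataFrame({
    "geography": ["east", "east", "west", "west"],
    "value":     [1.0, 2.0, 3.0, 4.0],
})

def summarize(subset: pd.DataFrame) -> dict:
    # stand-in for the real per-group work
    return {"geography": subset["geography"].iloc[0],
            "total": subset["value"].sum()}

groups = [group for _, group in pdf.groupby("geography")]

# Run the per-group work concurrently instead of in a serial loop
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(summarize, groups))

summary = pd.DataFrame(results)
print(summary)
```

Note that threads pay off mainly when each task spends its time waiting on something external (for example, launching Spark work or doing I/O); for CPU-bound pure-Python work, Python's GIL limits the gain.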

 lower costs while increasing productivity

Why is faster processing so important?  This may seem like a rhetorical question, but one part of the answer is often overlooked.  The obvious answer is cost: when you’re running your code and paying for compute, why pay for hours when you could pay for only minutes or seconds?  If you have a production job that runs every night, and there’s a cost associated with each run, then the benefit of significantly reducing run time is clear.  But there’s another answer, and that is increased employee productivity.  Perhaps the biggest benefit of faster parallel processing is how it speeds up the humans developing the code in the first place.  Whether you’re writing code to curate data and produce informed results, or training machine learning models to add value to business decisions, development is iterative: iterate, learn, improve, then iterate again.  The more cycles we can run in a day while developing the logic and the code, the faster the approach improves.  In other words, faster run times mean not only lower costs but also increased employee productivity, which often leads to a higher quality work product.

Lower costs.  Increased productivity.  Higher quality work product. 
You just need to beware the for-loop

for-loops are a great tool

I love for-loops and will continue to use them on a regular basis.  For-loops are a great tool.  There are many applications of for-loops that are not of the “insidious” variety.  But if you’re using a for-loop to iterate through subsets of your data, and processing each subset independently, then your approach could probably benefit from some “sparkification” as shown in the accompanying code.

At Databricks, we are all about enabling our customers with lightning-fast data & AI applications.  Whether that’s through tips-and-tricks in a blog like this, or via our world-record-setting query performance, we are ready to engage with your team to lower costs and increase productivity.  For more information about this article, please contact the author via Databricks Community.  For more information about Databricks, please contact a Databricks representative via https://www.databricks.com/company/contact.

 

by David C. Ogden 𝚫 Solutions Architect 𝚫 Databricks

Special thanks to my reviewers:  Rafi Kurlansik 𝚫 Lead Product Specialist 𝚫 Databricks
Sumit Saraswat 𝚫 Solutions Architect 𝚫 Databricks
Disclaimer:  Opinions, ideas, and recommendations shared herein are those of the author alone; and should not be construed as an endorsement from Databricks.

 

7 Comments
dplante
Contributor II

Great post! Will definitely share with coworkers!

The key takeaway is using the right tool for the job and using the tool correctly - when you don’t, it will be very expensive.

It would be great at runtime to analyze the longer job runs and give warnings that the code is inefficient 

ogdendc
New Contributor III

Thank you @dplante.  Maybe your idea for having an automatic efficiency analysis of your code will be a new feature from LakehouseIQ someday.  Wouldn't that be cool!

NhanNguyen
Contributor II

Hi @ogdendc ,

Thanks for your post, but in case I actually need to loop over subsets of data because of a data quality check on each file, which way can I replace the for-loop?

Thanks,

ogdendc
New Contributor III

Hi @NhanNguyen .  If you already have some pandas code written to do the data quality check, then I'd say that you'd lean toward method 2.  But if you are starting from scratch and comfortable coding in PySpark (without Pandas) then I'd say you'd go with method 4.  When you say "check of each file", you'd need to have some group-by variable that designated each file.  Does this answer your question?

I_Ku
New Contributor II

Hey, @ogdendc,

This is a super interesting post. I came across this article while I was flabbergasted by the bad results of my tests on ThreadPoolExecutor. But I cannot imagine how I can replace iterating over a list of dataset names and running a notebook with each name as a parameter with any of the other methods you talk about here. Even if there is no other solution for this, can you publish how you implemented your case above with the four approaches (the code itself)? This could help me and other readers find better solutions for our problems.

ogdendc
New Contributor III

@I_Ku Hello!  Thank you for the input and question.  Hey...I think you may have missed the link in the post to the code.  For reference, here it is:  https://github.com/ogdendc/beware-the-for-loop <-- not sure if this code will help you with your current use case, but I hope so!  Please let me know how it turns out.

NhanNguyen
Contributor II

Hi @ogdendc ,

The group-by method is quite good; it also came to mind first. Thanks a lot for sharing.
