11-14-2024 09:11 PM
Here is my code for converting the value of one DataFrame column from a datetime to a formatted string:
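import time
from datetime import datetime

# Fetch the first row's value for the first column (this triggers a Spark job)
col_value = df.select(df.columns[0]).first()[0]
start_time = time.time()
# Only the string formatting below is inside the timed region
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
    if isinstance(col_value, datetime) \
    else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)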
The elapsed_time value is 0.00011396408081054688 seconds, which makes sense: the conversion should cost very little.
However, after I put this code inside a Python loop, things turned strange. Here is the code:
for col in df.columns:
    col_value = df.select(col).first()[0]
    start_time = time.time()
    col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
        if isinstance(col_value, datetime) \
        else col_value
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(elapsed_time)
After running this code, I found that elapsed_time had increased to 5 seconds!
Then I removed the time conversion logic and measured again, this time also timing the whole loop:
loop_start_time = time.time()
for col in col_list:
    start_time = time.time()
    # Nothing was done here
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(col, elapsed_time)
loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")
It looks like each iteration, even with no logic in it, still costs more than 5 seconds, yet the whole loop costs only 0.0026 seconds.
Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?
11-15-2024 08:09 AM
The behavior you're observing is likely due to a combination of factors related to how Python executes code and how time is measured. Let's break down the issues and provide some recommendations for more accurate timing:
For this you can try:
a month ago
Thank you for your advice. I found some learning material on performance profiling in Python. cProfile gives correct results for me.
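For readers landing here later, a minimal sketch of what profiling this kind of conversion with cProfile could look like. The helper names format_value and convert_all and the sample data are made up for illustration and are not taken from the original job; Spark is left out so the snippet runs on plain Python:

```python
import cProfile
import io
import pstats
from datetime import datetime


def format_value(value):
    """Hypothetical helper mirroring the conversion from the question."""
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return value


def convert_all(values):
    """Stand-in for the per-column loop; no Spark involved here."""
    return [format_value(v) for v in values]


# Made-up sample data: a mix of datetimes and non-datetime values.
sample = [datetime(2024, 11, 14, 21, 11), "text", 42] * 1000

# Profile only the region between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
converted = convert_all(sample)
profiler.disable()

# Write the report into a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The report lists call counts and per-function times, which makes it easier to see where time actually goes than comparing two time.time() readings by hand.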
a month ago
You've asked what the correct way to measure performance is. Whilst the above is correct, it won't necessarily help you for production ETL. When jobs start running into hours, you need to start using the Spark UI and looking at cluster metrics to understand how your work is being distributed across the nodes.
a month ago
How many columns do you have?