Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the correct way to measure the performance of a Databricks notebook?

guangyi
Contributor III

Here is my code for converting the value of one data frame column to a formatted date-time string:

 

 

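import time
from datetime import datetime

# Read the first row's value from the first column (first() runs a Spark job)
col_value = df.select(df.columns[0]).first()[0]

# Time only the conversion of a datetime value to a formatted string
start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
           if isinstance(col_value, datetime) \
           else col_value
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)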

 

 

The elapsed_time value is 0.00011396408081054688, which makes sense, since the conversion should take very little time.
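A single measurement of such a short interval can be noisy. As a rough sketch, the standard-library `timeit` module can repeat the conversion many times and report an average per call. The sample `datetime` value and the helper name `convert` are assumptions for illustration, not part of the original code.

```python
import timeit
from datetime import datetime

# Hypothetical stand-in for the value read from the data frame.
col_value = datetime(2024, 1, 1, 12, 30, 0)

def convert(value):
    """Format datetime values as strings; pass other values through unchanged."""
    if isinstance(value, datetime):
        return datetime.strftime(value, "%Y-%m-%d %H:%M:%S")
    return value

# Run the conversion many times and report the average cost of one call.
runs = 10_000
total = timeit.timeit(lambda: convert(col_value), number=runs)
per_call = total / runs
print(f"{per_call:.9f} seconds per call")
```

Averaging over many runs smooths out timer jitter, which a single start/stop pair around a microsecond-scale statement cannot do.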

However, after I put this code inside a Python loop, things turned strange. Here is the code:

 

 

for col in df.columns:
  # first() is a Spark action, so this line runs a job; it is not part of the timing below
  col_value = df.select(col).first()[0]

  # Time only the string conversion
  start_time = time.time()
  col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
              if isinstance(col_value, datetime) \
              else col_value
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(elapsed_time)

 

 

After running this code, I found that elapsed_time had increased to 5 seconds!

Then I removed the time conversion logic and measured again, this time also timing the whole loop:

 

 

loop_start_time = time.time()
# col_list holds the column names (assumed here to be the same as df.columns)
for col in col_list:
  start_time = time.time()
  # Nothing was done here
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(col, elapsed_time)

loop_end_time = time.time()
loop_elapsed_time = loop_end_time - loop_start_time
print(f"Loop time cost {loop_elapsed_time} seconds")

 

 

It looks like each iteration, even with no logic in it, takes more than 5 seconds as well, but the whole loop takes only 0.0026 seconds.
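One possible explanation, which the post does not confirm, is a misread of the printed value. Python prints floats smaller than 0.0001 in scientific notation, so an elapsed time of about 5 microseconds is displayed as something like `5e-06` and can be mistaken for 5 seconds. The sketch below uses an illustrative value rather than one taken from the post, and `format_seconds` is a hypothetical helper.

```python
import time

def format_seconds(elapsed):
    """Render a duration in fixed-point notation so that tiny values are
    never shown in scientific notation."""
    return f"{elapsed:.9f} s"

# Time an empty block, as in the loop above.
start = time.perf_counter()
end = time.perf_counter()
elapsed = end - start

print(elapsed)                  # tiny values may appear in scientific notation
print(format_seconds(elapsed))  # always fixed-point

tiny = 5e-06                    # illustrative value, not taken from the post
print(tiny)                     # prints 5e-06
print(format_seconds(tiny))     # prints 0.000005000 s
```

If the per-iteration values in the notebook output end in `e-06`, they would be microseconds, which would be consistent with the whole loop taking only 0.0026 seconds.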

Why did this happen? Did I miss something? What is the correct way to measure the cost of each Python statement and function?
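As one possible approach, and not the only one, a small context manager built on `time.perf_counter()` can time individual blocks and the enclosing loop with the same clock. The names `timed`, `timings`, and `columns` are hypothetical, and `time.sleep` stands in for the real per-column work.

```python
import time
from contextlib import contextmanager

# Collected timings, keyed by a label chosen by the caller (hypothetical helper).
timings = {}

@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block with a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)

# Placeholder column names; in the post these would come from df.columns.
columns = ["a", "b", "c"]

with timed("whole loop"):
    for name in columns:
        with timed(f"column {name}"):
            time.sleep(0.01)  # stand-in for the real per-column work

# Print every timing in fixed-point notation.
for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

In a Spark notebook, transformations are lazy, so a timed block reflects Spark's cost only if it contains an action such as `first()`. Wrapping the fetch and the conversion in separate `timed` blocks would show which of the two dominates.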

