Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

More than expected number of Jobs created in Databricks

DBEnthusiast
New Contributor III

Hi Databricks Gurus !

I am trying to run a very simple snippet:

data_emp=[["1","sarvan","1"],["2","John","2"],["3","Jose","1"]]

emp_columns=["EmpId","Name","Dept"]

df=spark.createDataFrame(data=data_emp, schema=emp_columns)

df.show()

 

--------

Based on my general understanding, Databricks should create at most 2 jobs:

One to read the data (that is how it works for files; I don't know if it applies here)

One for show()

But it somehow creates 3 jobs.

Can someone explain why this is the behavior?

 

 

 

 

1 ACCEPTED SOLUTION


Kaniz_Fatma
Community Manager

Hi @DBEnthusiast! Let’s dive into the behaviour you’re observing with your simple Databricks snippet.

 

The code you provided creates a DataFrame named df with the createDataFrame method and then displays its first rows with the show() method. However, you’ve noticed that this results in three jobs instead of the expected two.

 

Here’s what’s happening:

 

Job 1: DataFrame Creation

  • When you execute spark.createDataFrame(data=data_emp, schema=emp_columns), Spark builds the DataFrame from your local list according to the specified schema. For a small local list this step is largely lazy, but Databricks may still run a job here while processing the input data.

Job 2: Show Operation

  • The df.show() call triggers another job. This job is responsible for fetching, formatting, and displaying the first few rows of the DataFrame in the console output.

Additional Job (Unexpected)

  • The third job you’re observing is unexpected. It’s likely related to internal optimizations or resource management within Databricks. While it’s not explicitly part of your code, Databricks may perform additional tasks behind the scenes to optimize execution or manage resources efficiently.

Why the Extra Job?

  • Databricks is a distributed computing platform, and its behaviour can be influenced by various factors such as data partitioning, caching, and execution plans.
  • When you call show(), Spark internally runs a take() over the first rows: it scans one partition first and launches additional jobs if that partition does not yield enough rows. This incremental scan, together with serialization, formatting, and distributed execution across worker nodes, can lead to an extra job being created.

Recommendations:

  • While the exact reason for the third job might not be immediately apparent, it’s generally not a cause for concern. Databricks handles many optimizations transparently.
  • If you’re curious about the specifics, you can explore the query execution plan using the explain() method on your DataFrame. This will provide insights into the underlying execution steps.
  • Remember that Databricks is designed to handle complex distributed workloads efficiently, and sometimes, the internal behaviour may not align precisely with our expectations based on a high-level understanding.

Feel free to explore further or ask any additional questions! 😊


3 REPLIES


Thank You @Kaniz_Fatma !!

I suspected the same, and your response helped me reach that conclusion.

Kaniz_Fatma
Community Manager

I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.
