Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

More than expected number of Jobs created in Databricks

DBEnthusiast
New Contributor III

Hi Databricks Gurus !

I am trying to run a very simple snippet:

data_emp=[["1","sarvan","1"],["2","John","2"],["3","Jose","1"]]

emp_columns=["EmpId","Name","Dept"]

df=spark.createDataFrame(data=data_emp, schema=emp_columns)

df.show()

 

--------

Based on my general understanding, Databricks should create at most 2 jobs:

One to read the data (that is how it works for files; I don't know if it applies here)

One for show()

But it somehow creates 3 jobs.

Can someone explain why this is the behavior?

 

 

 

 

1 ACCEPTED SOLUTION


Kaniz_Fatma
Community Manager

Hi @DBEnthusiast! Let’s dive into the behaviour you’re observing with your simple Databricks snippet.

 

The code you provided creates a DataFrame named df with the createDataFrame method and then displays its first rows with the show() method. However, you’ve noticed that this results in three jobs instead of the expected two.

 

Here’s what’s happening:

 

Job 1: DataFrame Creation

  • When you execute spark.createDataFrame(data=data_emp, schema=emp_columns), Spark builds the DataFrame from your local list according to the specified schema. For a small local list this step is largely lazy, but Databricks may still run a job here while processing the input data.

Job 2: Show Operation

  • The df.show() call triggers another job. This job is responsible for fetching, formatting, and displaying the first few rows of the DataFrame in the console output.

Additional Job (Unexpected)

  • The third job you’re observing is unexpected. It’s likely related to internal optimizations or resource management within Databricks. While it’s not explicitly part of your code, Databricks may perform additional tasks behind the scenes to optimize execution or manage resources efficiently.

Why the Extra Job?

  • Databricks is a distributed computing platform, and its behaviour can be influenced by various factors such as data partitioning, caching, and execution plans.
  • When you call show(), Spark internally runs a take() over the first rows: it scans one partition first and launches additional jobs if that partition does not yield enough rows. This incremental scan, together with serialization, formatting, and distributed execution across worker nodes, can lead to an extra job being created.

Recommendations:

  • While the exact reason for the third job might not be immediately apparent, it’s generally not a cause for concern. Databricks handles many optimizations transparently.
  • If you’re curious about the specifics, you can explore the query execution plan using the explain() method on your DataFrame. This will provide insights into the underlying execution steps.
  • Remember that Databricks is designed to handle complex distributed workloads efficiently, and sometimes, the internal behaviour may not align precisely with our expectations based on a high-level understanding.

Feel free to explore further or ask any additional questions! 😊


3 REPLIES


Thank You @Kaniz_Fatma !!

I suspected the same, and your response helped me reach that conclusion.

Kaniz_Fatma
Community Manager

I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.
