Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

PySpark or Scala?

William_Scardua
Valued Contributor

Hi guys,

Many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other? Or is it better to standardize on a single language?

Thanks

2 REPLIES

XP
Databricks Employee

That is a complex question, but I'll do my best to break it down.

Part of the beauty of Databricks is that it is a platform for your entire data community. In the old days there were real differences between Scala, Python, SQL, R, and Java. Nowadays, each language is well supported, and for the most part there is parity and harmony across the APIs.

Databricks is truly polyglot, so it's reasonable for some teams or individuals to use Scala, some to use SQL, and some to use Python, R, or Java. You could even use multiple languages in a single pipeline or notebook. That being said, it's often best for your team to standardize on a single language and style. This will make your code easier to write, test, and reason about, improving overall developer quality of life.

Here are some additional things you should consider when choosing a language:

  • What language(s) can your team(s) best support?
  • What language(s) is my code base in today?
  • How complex is the logic in my pipelines? Python and Scala are easier to write tests for.
  • Are there any 3rd party libraries I need/want to use and support?
  • Are there any external systems I need to integrate with that have better support in a particular language?
  • Some code may be easier to express in Python or Scala over SQL, while other code may be easier to reason about and easier to write in SQL.
  • Depending on your use case, you will likely find more community support and reference code for one language or another. For generic pipeline development and data management, there is more support in Python or SQL. It's important to note that any advantage here is being significantly flattened by AI coding assistants.
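To illustrate the testability point above: keeping transformation logic in plain functions makes unit testing straightforward in Python, with no cluster required. This is a minimal sketch; the function name and sample records are hypothetical, and in a real pipeline the same logic would typically be applied through a DataFrame transformation.

```python
def normalize_revenue(rows):
    """Drop records with missing revenue and convert cents to dollars.

    A pure function over plain Python data, so it can be unit-tested
    without a Spark session.
    """
    return [
        {**row, "revenue": row["revenue"] / 100}
        for row in rows
        if row.get("revenue") is not None
    ]


# Plain unit test -- runs anywhere Python runs.
sample = [
    {"id": 1, "revenue": 1250},
    {"id": 2, "revenue": None},
]
result = normalize_revenue(sample)
assert result == [{"id": 1, "revenue": 12.5}]
```

Factoring logic out of the pipeline like this is one reason Python and Scala codebases tend to be easier to test than SQL-only ones.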

tl;dr: Use the language that works best for you and your data community. For me, that's often Python or SQL.

hari-prasad
Valued Contributor II

Hi @William_Scardua,

It is advisable to consider using Python (PySpark) due to Spark's comprehensive API support for Python. Furthermore, Databricks currently supports Delta Live Tables (DLT) with Python but does not support Scala at this time. Additionally, you can extend PySpark with various data-quality libraries written in Python, such as Great Expectations.
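The idea behind data-quality libraries like Great Expectations is declarative, rule-based checks over your data. As a rough illustration of that idea only (plain Python, not the Great Expectations API; the function name and values are hypothetical):

```python
def expect_values_between(values, low, high):
    """Return a simple validation result for a range check.

    A toy sketch in the spirit of rule-based data-quality checks;
    real libraries add suites of expectations, reporting, and
    integration with Spark DataFrames.
    """
    failures = [v for v in values if not (low <= v <= high)]
    return {
        "success": not failures,
        "unexpected_count": len(failures),
        "unexpected_values": failures,
    }


ages = [23, 41, 37, 130, 29]
report = expect_values_between(ages, 0, 120)
assert report["success"] is False
assert report["unexpected_values"] == [130]
```

Because such libraries are Python-native, they slot directly into PySpark pipelines, which is part of the argument for Python made above.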

Regards,
Hari Prasad

