PySpark or Scala?
11-27-2023 04:00 PM
Hi guys,
Many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other? Or is it better to standardize on a single language?
Thanks
01-17-2025 09:22 AM
That is a complex question, but I'll do my best to break it down.
Part of the beauty of Databricks is that it is a platform for your entire data community. In the old days there were real differences between Scala, Python, SQL, R, and Java. Nowadays, each language is well supported, and for the most part there is parity and harmony across the APIs.
Databricks is truly polyglot, so it's reasonable for some teams/individuals to use Scala, some to use SQL, and some to use Python, R, or Java. You could even use multiple languages in a single pipeline or notebook. That being said, it's often best for your team to standardize around a single language and style. This will make your code easier to reason about, test, and write, improving overall developer quality of life.
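As a sketch of what "multiple languages in a single notebook" looks like, a Databricks notebook can switch languages per cell with magic commands (the table and column names below are hypothetical):

```
# Cell 1 (notebook default language: Python)
df = spark.table("sales").filter("amount > 0")
df.createOrReplaceTempView("clean_sales")

-- Cell 2 (switched to SQL with the %sql magic)
%sql
SELECT region, SUM(amount) AS total
FROM clean_sales
GROUP BY region
```

This flexibility is handy, but as noted above, a team that hops between languages cell-by-cell pays for it in testability and readability, which is why standardizing is usually worth it.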
Here are some additional things you should consider when choosing a language:
- What language(s) can your team(s) best support?
- What language(s) is my code base in today?
- How complex is the logic in my pipelines? Python and Scala are easier to write tests for.
- Are there any 3rd party libraries I need/want to use and support?
- Are there any external systems I need to integrate with that have better support in a particular language?
- Some code may be easier to express in Python or Scala over SQL, while other code may be easier to reason about and easier to write in SQL.
- Depending on your use case, you will likely find more community support and reference code for one language or another. For more generic pipeline development and data management there will be more support in Python or SQL. It's important to note that any advantage here is being significantly flattened by AI coding assistants.
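To make the testability and expressiveness points above concrete, here is a minimal plain-Python illustration (using the standard library's sqlite3 as a stand-in for Spark SQL; the table and data are made up) of the same aggregation written declaratively in SQL and imperatively in Python:

```python
import sqlite3

# Hypothetical sales data; in Databricks this would be a Spark table/DataFrame.
rows = [("east", 10), ("west", 5), ("east", 7)]

# SQL flavour: declarative and easy to reason about for set-based logic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Python flavour: imperative, and trivial to unit test in isolation.
py_totals = {}
for region, amount in rows:
    py_totals[region] = py_totals.get(region, 0) + amount

# Both express the same logic; which reads better depends on the logic's shape.
assert sql_totals == py_totals
```

For a simple group-by like this, SQL is arguably clearer; for branching, reusable, or heavily parameterized logic, the Python version is easier to break into functions and cover with unit tests.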
tl;dr: Use the language that works best for you and your data community. For me that's often Python or SQL.
01-17-2025 10:13 AM - edited 01-17-2025 10:13 AM
Hi @William_Scardua,
It is advisable to consider using Python (or PySpark) due to Spark's comprehensive API support for Python. Furthermore, Databricks currently supports Delta Live Tables (DLT) with Python, but does not support Scala at this time. Additionally, you can extend PySpark with various data quality libraries written in Python, such as Great Expectations.
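As a rough illustration of the kind of check that data quality libraries like Great Expectations formalize, here is a plain-Python sketch with hypothetical field names and rules (this is not the library's actual API, just the underlying idea):

```python
# Hypothetical, minimal data-quality checks in plain Python. Libraries like
# Great Expectations offer a richer, declarative version of this same idea.

def check_rows(rows, required_fields, non_negative_fields):
    """Return a list of human-readable violations found in `rows`."""
    violations = []
    for i, row in enumerate(rows):
        # Rule 1: required fields must be present and non-null.
        for field in required_fields:
            if row.get(field) is None:
                violations.append(f"row {i}: missing '{field}'")
        # Rule 2: numeric fields must not be negative.
        for field in non_negative_fields:
            value = row.get(field)
            if value is not None and value < 0:
                violations.append(f"row {i}: negative '{field}' ({value})")
    return violations

rows = [
    {"id": 1, "amount": 100},
    {"id": None, "amount": -5},
]
problems = check_rows(rows, required_fields=["id"],
                      non_negative_fields=["amount"])
```

Because the checks are just Python functions, they slot naturally into a PySpark pipeline (e.g. run against a sample of a DataFrame's collected rows) and into your unit test suite.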
Regards,
Hari Prasad

