Data Engineering

Best language to use

Ryan_Chynoweth
Honored Contributor III

Databricks supports SQL, Scala, Python, and R. Is there a most performant language to use on Databricks? I know SQL well but would like to get into one of the other languages and don't know which to focus on.

1 ACCEPTED SOLUTION


Mooune_DBU
Valued Contributor

If you want to learn a new language for data engineering, data science, or machine learning in general, the choice between R and Python is really up to you and depends on which syntax you find more intuitive. Keep in mind that, based on public stats, Python is the more popular of the two; R was originally designed by statisticians for statisticians.

When it comes to Spark for analyzing large amounts of data: at a lower level, Spark is written in Scala and runs on the JVM. The Python and R APIs are just convenient ways to talk to the Spark engine: DataFrame and SQL operations go through the same Catalyst optimizer and are executed by the same engine whichever language you write them in, so the language-specific overhead is minor to negligible. (The main exception is custom code that runs outside the engine, such as Python UDFs, which does add overhead.) So if you're used to high-level interpreted languages and don't want to worry about what's happening at a lower level, Python or R is the way to go. While I personally like the Python API, I do encourage you to learn the basics of Scala for multi-threading: if you have a lot of existing SQL-based workloads and code, minimal Scala knowledge is enough to run those queries concurrently and make better use of performance and resources.
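To make the multi-threading idea concrete, here is a minimal sketch of submitting several independent SQL statements from separate threads. It is written in Python with a plain thread pool rather than in Scala, so it illustrates the general pattern and is not the exact Scala approach recommended above. The helper name `run_queries_concurrently`, the query names, and the stand-in runner are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def run_queries_concurrently(
    run_query: Callable[[str], T],
    queries: Dict[str, str],
    max_workers: int = 4,
) -> Dict[str, T]:
    """Run each named SQL statement in its own thread and collect the results.

    `run_query` is whatever executes one statement (an assumption of this
    sketch); `queries` maps a label to the SQL text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the statements can overlap.
        futures = {name: pool.submit(run_query, sql) for name, sql in queries.items()}
        # Then wait for each one; result() re-raises any error from the worker.
        return {name: future.result() for name, future in futures.items()}


# Stand-in runner so the sketch executes without a cluster.
def fake_runner(sql: str) -> str:
    return "ran: " + sql


demo_results = run_queries_concurrently(
    fake_runner,
    {"first": "SELECT 1", "second": "SELECT 2"},
    max_workers=2,
)
print(demo_results)
```

In a Databricks notebook, where a `spark` session is already defined, the runner could be something like `lambda q: spark.sql(q).collect()`. The action (`collect`, a write, and so on) matters because a plain `spark.sql` call on a SELECT only builds the query lazily. Whether running statements side by side actually helps depends on how much spare capacity the cluster has.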

To conclude, when it comes to using Databricks there's no real language winner. The value and beauty of the platform is the ability to mix all of them and get the best out of each to maximize the efficiency of your code (e.g. leverage Scala multi-threading on top of SQL queries).

Hope this helps


3 REPLIES

Kristof
New Contributor III

I recently had a conversation with a Databricks architect, and he made me realise that Databricks customers mostly use Python and are investing in this language. Look at Delta Live Tables: it is written in Scala, but users can only use it with Python or SQL.

A second example is Databricks Connect: it used to support Scala, but now it's Python only. These examples speak for themselves. I would start with Python.

Anonymous
Not applicable

It totally depends on you. By the way, you can choose Python and SQL.