06-28-2022 09:11 AM
Accepted Solutions
06-28-2022 12:45 PM
If you're using the DataFrame API, it all gets run in the JVM, just like SQL queries. The exception is Python UDFs, which have to transfer data to Python land to execute.
06-28-2022 11:16 AM
In short, no, there's no meaningful difference. There does need to be a translation step, as you read somewhere, so it could add a small amount of time to the workload, but performance doesn't degrade enough to matter.
06-28-2022 09:50 PM
Can you point me to any resource where I could look into this?
I'm just wondering: is the Python code converted to SQL in the end?
Or, as the other person mentioned, is it converted to Scala?
06-28-2022 12:47 PM
The Python API has an extra layer at runtime, which uses a local socket to transfer data. So there may be some performance gap due to that extra layer, but it should not be large in most scenarios.
06-28-2022 12:52 PM
There can be quite a bit of performance difference, depending on where you are running.
06-28-2022 09:47 PM
Hi, just did it. Thanks.
06-28-2022 02:13 PM
There should not be a difference between one and the other. In the end, all code has to be translated to machine language in order to run on a computer. The translation process may be harder in some cases than in others, but that can be true for Python in some cases and for SQL in others.
My recommendation is to use each language for the use cases it fits best.
SQL as a first option, and when you have to process a lot of data in a structured format.
Python when you have complexity that SQL does not support.
Regards
06-28-2022 02:29 PM
Python is the choice for ML/AI workloads, while SQL is the choice for data-based MDM modeling. Performance is pretty much similar under certain assumptions. For performance benchmarking, DB optimization must be one of those assumptions.
06-28-2022 02:35 PM
Relatively similar performance for simple use cases. For higher-end tasks, Python's the better bet.
06-28-2022 02:37 PM
I think it all gets converted to Scala in the end. Shouldn’t be different.
06-28-2022 09:49 PM
Can you provide a source?
06-29-2022 10:17 PM
To add to the point about UDFs: consider using HOFs (higher-order functions) first whenever possible, as there is a significant performance benefit, as seen here.

