Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi Team, We have a few prod tables created in an S3 bucket that have grown very large. These tables receive real-time data continuously from round-the-clock Databricks workflows, and we would like to run the optimization commands (OPTIMIZE, ZORD...
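A minimal sketch of what those maintenance commands might look like from a notebook, assuming a Delta table named events_prod and a frequently filtered column device_id (both names are hypothetical placeholders):

# Databricks notebooks provide a ready-made `spark` SparkSession.
# Compact the small files produced by continuous writes.
spark.sql("OPTIMIZE events_prod")
# Optionally co-locate data on a commonly filtered column while compacting.
spark.sql("OPTIMIZE events_prod ZORDER BY (device_id)")
# Remove files no longer referenced by the table (default retention applies).
spark.sql("VACUUM events_prod")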
Hi @Sriram Kumar, we haven't heard from you since the last response from @Suteja Kanuri. Kindly share the information with us, and in return, we will provide you with the necessary solution. Thanks and Regards
Howdy - I recently took a table FACT_TENDER and rebuilt it as a medallion-style table to test performance, since I suspected medallion would be quicker. Key differences: both tables use bronze data; the original has all logic in one long notebook; MERGE INTO t...
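For readers unfamiliar with the pattern, here is a sketch of the kind of incremental MERGE INTO step a medallion pipeline typically runs; the table and column names (silver.fact_tender, bronze_updates, tender_id) are hypothetical, not taken from the post:

# Upsert the latest bronze rows into the silver table.
spark.sql("""
    MERGE INTO silver.fact_tender AS t
    USING bronze_updates AS s
    ON t.tender_id = s.tender_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")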
Have gone through the documentation, still cannot understand it. How is Bloom filter indexing a column different from Z-ordering a column? Can somebody explain to me what exactly happens when these two techniques are applied?
Hey @Daniel Sahal,
1. A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. Refer to this code snippet to create a Bloom filter index:
CREATE BLOOMFILTER INDEX
ON [TAB...
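To make the truncated snippet concrete, here is a sketch showing both techniques side by side; the table name events and the columns device_id and event_date are hypothetical, and the OPTIONS values are illustrative rather than recommendations:

# Bloom filter index: builds a per-file probabilistic filter, useful for
# point lookups on high-cardinality columns such as arbitrary text IDs.
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 50000000))
""")
# Z-ordering: physically co-locates related rows so per-file min/max
# statistics can skip files on filters over the chosen column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")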
We are migrating a job from on-prem to Databricks. We are trying to optimize the jobs but couldn't use bucketing, because by default Databricks stores all tables as Delta tables and it throws an error that bucketing is not supported for Delta. Is there anyw...
Hi @Arun Balaji, bucketing is not supported for Delta tables, as you have noticed. For optimization and best practices with Delta tables, check these:
https://docs.databricks.com/optimizations/index.html
https://docs.databricks.com/delta/best-prac...
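If bucketing itself is a hard requirement, one possible workaround (a sketch, not a recommendation, and names like df, customer_id, sales_bucketed, and sales_delta are hypothetical) is to write that one table in Parquet, where Spark's bucketBy is supported; otherwise, Z-ordering the join or filter columns is the Delta-native substitute:

# bucketBy only works with saveAsTable and a non-Delta format such as Parquet.
(df.write.format("parquet")
   .bucketBy(8, "customer_id")
   .sortBy("customer_id")
   .mode("overwrite")
   .saveAsTable("sales_bucketed"))
# The Delta-native alternative: co-locate data on the join key instead.
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")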
Recommendations for performance tuning best practices on Databricks: we recommend also checking out this article from my colleague @Franco Patano on best practices for performance tuning on Databricks. Performance tuning your workloads is an important...
Hello, we are new to Databricks and we would like to know if our working method is good. Currently, we are working like this: spark.sql("CREATE TABLE Temp AS SELECT avg(***), sum(***) FROM aaa LEFT JOIN bbb WHERE *** >= ***") With this method, are we us...
Spark will handle the map/reduce for you. So as long as you use Spark-provided functions, be it in Scala, Python, or SQL (or even R), you will be using distributed processing. You just care about what you want as a result. And afterwards, when you are more...
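As an illustration of the same idea with the DataFrame API (the table names aaa/bbb and columns id/amount stand in for the masked values above), the sketch below builds a lazy plan that Spark executes in parallel across the cluster:

from pyspark.sql import functions as F

# Equivalent of the spark.sql() aggregation above; each step is planned
# and executed across the cluster's executors, not on the driver alone.
result = (
    spark.table("aaa")
    .join(spark.table("bbb"), on="id", how="left")
    .where(F.col("amount") >= 100)
    .agg(F.avg("amount").alias("avg_amount"),
         F.sum("amount").alias("sum_amount"))
)
result.show()  # triggers the distributed job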
You can also tune the JVM's GC parameters directly, if you mean the pauses are too long. Set "spark.executor.extraJavaOptions", but it does require knowing a thing or two about which settings match which performance goal.
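A sketch of setting executor GC options at session creation (on Databricks this would normally go in the cluster's Spark config instead); the G1GC flags shown are illustrative assumptions, not tuned recommendations:

from pyspark.sql import SparkSession

# extraJavaOptions must be in place before the executor JVMs launch,
# so set it at cluster/session creation time, not at runtime.
spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc")
    .getOrCreate()
)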