Databricks Community

Chris_Shehu · ‎03-21-2022

I'm trying to find the best strategy for handling big data sets. In this case I have something that is 450 million records. I'm pulling the data from SQL Server very quickly but when I try to push the data to the Delta Table OR a Azure Container the compute resource locks up and never completes. I end up canceling the process after an hour. Looking at the logs it looks like the compute resource keeps hitting memory issues.

Hubert-Dudek · ‎03-22-2022

look for data skews; some partitions can be very big, some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions()) specially sql can divide it unequally to partitions (there are settings in conenctor lowerBound etc). You could try to have number of partitions as workers cores multiply by X (so they will be processed step by step in queue),
increase shuffle size spark.sql.shuffle.partitions default is 200 try bigger, you should calculate it as data size divided by size of partition,
increase size of driver to be 2 times bigger than executor (but to get optimal size please analyze load - in databricks on cluster tab look to Metrics there is Ganglia or even better integrate datadog with cluster),
check wide transformations, ones which need to shuffle data between partitions, group them together to do one shuffle only,
if you need to filter data if possible do it after read from sql so it will be predicative push so it will add where in sql query,
make sure that everything run in distributed way, specially udf, you need to use vectorized pandas udfs so they will run on executors. Don't use collect etc.
Regarding infrastructure use more workers and check that your ADLS is connected via private link. Monitor save progress in folder. You can also use premium ADLS which is faster.
sometimes I process big data as stream as it is easier with big data sets, in that scenario you would need kafka (can be confluent cloud) between SQL and Databricks

View solution in original post

Atanu · ‎03-21-2022

@Christopher Shehu if you are seeing clusters are hitting memory limit, you may try increasing the cluster size.

Other points to consider:

Avoid memory intensive operations like:
collect()
- operator, which brings a large amount of data to the driver.
- Conversion of a large DataFrame to Pandas

Please find more details here -

https://kb.databricks.com/jobs/driver-unavailable.html

You may consider reading this too -

https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception

Hubert-Dudek · ‎03-22-2022

look for data skews; some partitions can be very big, some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions()) specially sql can divide it unequally to partitions (there are settings in conenctor lowerBound etc). You could try to have number of partitions as workers cores multiply by X (so they will be processed step by step in queue),
increase shuffle size spark.sql.shuffle.partitions default is 200 try bigger, you should calculate it as data size divided by size of partition,
increase size of driver to be 2 times bigger than executor (but to get optimal size please analyze load - in databricks on cluster tab look to Metrics there is Ganglia or even better integrate datadog with cluster),
check wide transformations, ones which need to shuffle data between partitions, group them together to do one shuffle only,
if you need to filter data if possible do it after read from sql so it will be predicative push so it will add where in sql query,
make sure that everything run in distributed way, specially udf, you need to use vectorized pandas udfs so they will run on executors. Don't use collect etc.
Regarding infrastructure use more workers and check that your ADLS is connected via private link. Monitor save progress in folder. You can also use premium ADLS which is faster.
sometimes I process big data as stream as it is easier with big data sets, in that scenario you would need kafka (can be confluent cloud) between SQL and Databricks

Chris_Shehu · ‎03-22-2022

This is helpful I think I need to look closer at the process and see what needs to be done. The Azure Databricks documentation on pyspark partitioning is lacking.

Anonymous · ‎03-25-2022

Cherish your data. “Keep your raw data raw: don't manipulate it without having a copy,” says Teal. Visualize the information. Show your workflow. Use version control. Record metadata. Automate, automate, automate. Make computing time count. Capture your environment.

LiveTheOrangeLife.com

Wilynan · ‎08-11-2023

I think you should consult experts in Big Data for advice on this issue

Databricks Community

What is the best way to handle big data sets?

Join Us as a Local Community Builder!

Big Book of Data Engineering - Get how-tos, code snippets and real-world examples

Level Up with Databricks Specialist Sessions

🌟 Community Pulse: Your Weekly Roundup! November 07 – 13, 2025

⭐ Setup Spark with Hadoop Anywhere : A DBR aligned local Spark+HDFS+Hive stack on Docker⭐