cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

What is the best way to handle big data sets?

Chris_Shehu
Valued Contributor III

I'm trying to find the best strategy for handling big data sets. In this case I have something that is 450 million records. I'm pulling the data from SQL Server very quickly but when I try to push the data to the Delta Table OR a Azure Container the compute resource locks up and never completes. I end up canceling the process after an hour. Looking at the logs it looks like the compute resource keeps hitting memory issues.

1 ACCEPTED SOLUTION

Accepted Solutions

Hubert-Dudek
Esteemed Contributor III
  • look for data skews; some partitions can be very big, some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions()) specially sql can divide it unequally to partitions (there are settings in conenctor lowerBound etc). You could try to have number of partitions as workers cores multiply by X (so they will be processed step by step in queue),
  • increase shuffle size spark.sql.shuffle.partitions default is 200 try bigger, you should calculate it as data size divided by size of partition,
  • increase size of driver to be 2 times bigger than executor (but to get optimal size please analyze load - in databricks on cluster tab look to Metrics there is Ganglia or even better integrate datadog with cluster),
  • check wide transformations, ones which need to shuffle data between partitions, group them together to do one shuffle only,
  • if you need to filter data if possible do it after read from sql so it will be predicative push so it will add where in sql query,
  • make sure that everything run in distributed way, specially udf, you need to use vectorized pandas udfs so they will run on executors. Don't use collect etc.
  • Regarding infrastructure use more workers and check that your ADLS is connected via private link. Monitor save progress in folder. You can also use premium ADLS which is faster.
  • sometimes I process big data as stream as it is easier with big data sets, in that scenario you would need kafka (can be confluent cloud) between SQL and Databricks

View solution in original post

5 REPLIES 5

Atanu
Esteemed Contributor
Esteemed Contributor

@Christopher Shehu​  if you are seeing clusters are hitting memory limit, you may try increasing the cluster size.

Other points to consider:

  • Avoid memory intensive operations like:
  • collect()
    •  operator, which brings a large amount of data to the driver.
    • Conversion of a large DataFrame to Pandas

Please find more details here -

https://kb.databricks.com/jobs/driver-unavailable.html

You may consider reading this too -

https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception

Hubert-Dudek
Esteemed Contributor III
  • look for data skews; some partitions can be very big, some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions()) specially sql can divide it unequally to partitions (there are settings in conenctor lowerBound etc). You could try to have number of partitions as workers cores multiply by X (so they will be processed step by step in queue),
  • increase shuffle size spark.sql.shuffle.partitions default is 200 try bigger, you should calculate it as data size divided by size of partition,
  • increase size of driver to be 2 times bigger than executor (but to get optimal size please analyze load - in databricks on cluster tab look to Metrics there is Ganglia or even better integrate datadog with cluster),
  • check wide transformations, ones which need to shuffle data between partitions, group them together to do one shuffle only,
  • if you need to filter data if possible do it after read from sql so it will be predicative push so it will add where in sql query,
  • make sure that everything run in distributed way, specially udf, you need to use vectorized pandas udfs so they will run on executors. Don't use collect etc.
  • Regarding infrastructure use more workers and check that your ADLS is connected via private link. Monitor save progress in folder. You can also use premium ADLS which is faster.
  • sometimes I process big data as stream as it is easier with big data sets, in that scenario you would need kafka (can be confluent cloud) between SQL and Databricks

Chris_Shehu
Valued Contributor III

This is helpful I think I need to look closer at the process and see what needs to be done. The Azure Databricks documentation on pyspark partitioning is lacking.

Anonymous
Not applicable

Cherish your data. “Keep your raw data raw: don't manipulate it without having a copy,” says Teal. Visualize the information. Show your workflow. Use version control. Record metadata. Automate, automate, automate. Make computing time count. Capture your environment.

LiveTheOrangeLife.com

Wilynan
New Contributor II

I think you should consult experts in Big Data for advice on this issue

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.