<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What is the best way to handle big data sets? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24976#M17385</link>
    <description>&lt;UL&gt;&lt;LI&gt;Look for data skew: some partitions can be very big and some small because of incorrect partitioning. You can use the Spark UI to spot this, but also debug your code a bit (call getNumPartitions()); JDBC reads from SQL in particular can divide data unequally across partitions (the connector has settings such as lowerBound and upperBound). You could try setting the number of partitions to the workers' cores multiplied by X, so tasks are processed step by step in a queue.&lt;/LI&gt;&lt;LI&gt;Increase the shuffle partition count: &lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt; defaults to 200; try a bigger value, calculated as data size divided by the target partition size.&lt;/LI&gt;&lt;LI&gt;Increase the driver size to about twice the executor size (but to find the optimal size, analyze the load: in Databricks, on the cluster tab, look at Metrics, where there is Ganglia, or better yet integrate Datadog with the cluster).&lt;/LI&gt;&lt;LI&gt;Check wide transformations, the ones which need to shuffle data between partitions, and group them together so only one shuffle is needed.&lt;/LI&gt;&lt;LI&gt;If you need to filter data, do it if possible right after the read from SQL, so predicate pushdown adds a WHERE clause to the SQL query.&lt;/LI&gt;&lt;LI&gt;Make sure that everything runs in a distributed way, especially UDFs; use vectorized pandas UDFs so they run on the executors, and don't use collect() etc.&lt;/LI&gt;&lt;LI&gt;Regarding infrastructure, use more workers and check that your ADLS is connected via Private Link. Monitor save progress in the target folder. You can also use premium ADLS, which is faster.&lt;/LI&gt;&lt;LI&gt;Sometimes I process big data as a stream, as it is easier with big data sets; in that scenario you would need Kafka (e.g. Confluent Cloud) between SQL and Databricks.&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Tue, 22 Mar 2022 10:49:45 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-03-22T10:49:45Z</dc:date>
    <item>
      <title>What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24974#M17383</link>
      <description>&lt;P&gt;I'm trying to find the best strategy for handling big data sets. In this case I have something that is 450 million records. I'm pulling the data from SQL Server very quickly, but when I try to push the data to a Delta table or an Azure container, the compute resource locks up and never completes. I end up canceling the process after an hour. Looking at the logs, it looks like the compute resource keeps hitting memory issues.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 04:59:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24974#M17383</guid>
      <dc:creator>Chris_Shehu</dc:creator>
      <dc:date>2022-03-22T04:59:31Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24975#M17384</link>
      <description>&lt;P&gt;@Christopher Shehu​&amp;nbsp;if your clusters are hitting the memory limit, you may try increasing the cluster size.&lt;/P&gt;&lt;P&gt;Other points to consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Avoid memory-intensive operations such as:&lt;UL&gt;&lt;LI&gt;the collect() operator, which brings a large amount of data to the driver,&lt;/LI&gt;&lt;LI&gt;conversion of a large DataFrame to pandas.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Please find more details here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://kb.databricks.com/jobs/driver-unavailable.html" target="_blank"&gt;https://kb.databricks.com/jobs/driver-unavailable.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You may consider reading this too:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 05:39:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24975#M17384</guid>
      <dc:creator>Atanu</dc:creator>
      <dc:date>2022-03-22T05:39:00Z</dc:date>
    </item>
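A back-of-the-envelope check makes the advice above concrete: collecting 450 million rows to the driver cannot fit in a typical driver's memory, which is why collect() and toPandas() lock the cluster up. A minimal sketch; the ~200 bytes per row figure is an illustrative assumption, not a value from the thread:

```python
def estimated_collect_size_gb(row_count: int, avg_row_bytes: int) -> float:
    """Rough in-memory size if a DataFrame were pulled to the driver with collect()."""
    return row_count * avg_row_bytes / 1024 ** 3

# 450 million rows at an assumed ~200 bytes per row
size_gb = estimated_collect_size_gb(450_000_000, 200)
print(f"~{size_gb:.0f} GB")  # far larger than a typical driver heap
```

Writing the DataFrame out with distributed writers (df.write) avoids this entirely, since no single node ever holds the full dataset.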
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24976#M17385</link>
      <description>&lt;UL&gt;&lt;LI&gt;Look for data skew: some partitions can be very big and some small because of incorrect partitioning. You can use the Spark UI to spot this, but also debug your code a bit (call getNumPartitions()); JDBC reads from SQL in particular can divide data unequally across partitions (the connector has settings such as lowerBound and upperBound). You could try setting the number of partitions to the workers' cores multiplied by X, so tasks are processed step by step in a queue.&lt;/LI&gt;&lt;LI&gt;Increase the shuffle partition count: &lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt; defaults to 200; try a bigger value, calculated as data size divided by the target partition size.&lt;/LI&gt;&lt;LI&gt;Increase the driver size to about twice the executor size (but to find the optimal size, analyze the load: in Databricks, on the cluster tab, look at Metrics, where there is Ganglia, or better yet integrate Datadog with the cluster).&lt;/LI&gt;&lt;LI&gt;Check wide transformations, the ones which need to shuffle data between partitions, and group them together so only one shuffle is needed.&lt;/LI&gt;&lt;LI&gt;If you need to filter data, do it if possible right after the read from SQL, so predicate pushdown adds a WHERE clause to the SQL query.&lt;/LI&gt;&lt;LI&gt;Make sure that everything runs in a distributed way, especially UDFs; use vectorized pandas UDFs so they run on the executors, and don't use collect() etc.&lt;/LI&gt;&lt;LI&gt;Regarding infrastructure, use more workers and check that your ADLS is connected via Private Link. Monitor save progress in the target folder. You can also use premium ADLS, which is faster.&lt;/LI&gt;&lt;LI&gt;Sometimes I process big data as a stream, as it is easier with big data sets; in that scenario you would need Kafka (e.g. Confluent Cloud) between SQL and Databricks.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 22 Mar 2022 10:49:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24976#M17385</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-22T10:49:45Z</dc:date>
    </item>
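The shuffle-partition sizing rule in the reply above (data size divided by target partition size) can be sketched as a small helper. The 128 MB default target is a commonly used rule of thumb, not a value from the thread:

```python
import math

def recommended_shuffle_partitions(data_size_bytes: int,
                                   target_partition_bytes: int = 128 * 1024 ** 2) -> int:
    """Suggested spark.sql.shuffle.partitions: data size / target partition size."""
    return max(1, math.ceil(data_size_bytes / target_partition_bytes))

# e.g. ~90 GB of shuffle data -> 720 partitions of ~128 MB each
n = recommended_shuffle_partitions(90 * 1024 ** 3)
# then: spark.conf.set("spark.sql.shuffle.partitions", n)
```

Compared with the default of 200, a larger value keeps each shuffle partition small enough to fit comfortably in executor memory.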
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24977#M17386</link>
      <description>&lt;P&gt;This is helpful. I think I need to look closer at the process and see what needs to be done. The Azure Databricks documentation on PySpark partitioning is lacking.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 13:43:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24977#M17386</guid>
      <dc:creator>Chris_Shehu</dc:creator>
      <dc:date>2022-03-22T13:43:46Z</dc:date>
    </item>
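Putting the earlier partitioning advice together for this specific case (SQL Server to Delta): a sketch of a partitioned JDBC read, assuming a numeric `id` column whose values span the 450 million rows; the host, database, table name, and bounds are hypothetical placeholders, not details from the thread:

```python
# Options for a partitioned JDBC read from SQL Server.
# lowerBound/upperBound should come from min(id)/max(id) in the source table.
jdbc_options = {
    "url": "jdbc:sqlserver://<host>:1433;databaseName=<db>",
    "dbtable": "dbo.big_table",
    "partitionColumn": "id",      # must be numeric, date, or timestamp
    "lowerBound": "1",
    "upperBound": "450000000",
    "numPartitions": "64",        # parallel read slices; ~ worker cores * X
}

def copy_to_delta(spark, target_path: str):
    """Read SQL Server in parallel slices and write straight to Delta,
    never collecting anything to the driver."""
    df = spark.read.format("jdbc").options(**jdbc_options).load()
    df.write.format("delta").mode("overwrite").save(target_path)
```

Without partitionColumn and bounds, the JDBC source reads the whole table through a single partition, which matches the single-node memory pressure described in the original question.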
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24978#M17387</link>
      <description>&lt;P&gt;Cherish your data. “Keep your raw data raw: don't manipulate it without having a copy,” says Teal. Visualize the information. Show your workflow. Use version control. Record metadata. Automate, automate, automate. Make computing time count. Capture your environment.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Mar 2022 12:04:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/24978#M17387</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-03-25T12:04:26Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to handle big data sets?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/39655#M27048</link>
      <description>&lt;P&gt;I think you should consult Big Data experts for advice on this issue.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Aug 2023 13:41:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-handle-big-data-sets/m-p/39655#M27048</guid>
      <dc:creator>Wilynan</dc:creator>
      <dc:date>2023-08-11T13:41:05Z</dc:date>
    </item>
  </channel>
</rss>

