03-21-2022 09:59 PM
I'm trying to find the best strategy for handling big data sets. In this case I have about 450 million records. I'm pulling the data from SQL Server very quickly, but when I try to push the data to the Delta table or an Azure container, the compute resource locks up and never completes; I end up canceling the process after an hour. Looking at the logs, the compute resource keeps hitting memory issues.
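For illustration only (not the exact code), the pipeline is roughly this shape; the server, table, credentials, and target path below are placeholders and `spark` is the notebook's session:

```python
# Illustrative only: placeholder server/table/credentials/path.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.big_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Spark is lazy, so the read above returns immediately; the actual transfer
# only happens at the write. With no partitioning options the whole table
# comes through a single connection/task, which is where memory pressure
# tends to show up on tables of this size.
df.write.format("delta").mode("overwrite").save("/mnt/datalake/big_table_delta")
```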
03-21-2022 10:39 PM
@Christopher Shehu if you are seeing the cluster hit its memory limit, you may try increasing the cluster size.
Other points to consider:
Please find more details here -
https://kb.databricks.com/jobs/driver-unavailable.html
You may consider reading this too -
https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception
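Beyond cluster size, splitting the JDBC read across partitions usually matters most for a table this large. A sketch below, assuming the source table has a numeric key column (here `id`); the column name, bounds, partition count, and connection details are illustrative:

```python
# Sketch of a partitioned JDBC read; key column, bounds, partition count,
# and connection details are illustrative. `spark` is the notebook session.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.big_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("partitionColumn", "id")      # numeric, date, or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "450000000")
    .option("numPartitions", "64")        # ~7M rows per partition/connection
    .option("fetchsize", "10000")         # rows fetched per round trip
    .load()
)
```

The rows should then flow from the executors straight into Delta; pulling results back to the driver (for example with collect() or toPandas()) is what typically triggers the maxResultSize error described in the second link.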
03-22-2022 06:43 AM
This is helpful. I think I need to look more closely at the process and see what needs to be done. The Azure Databricks documentation on PySpark partitioning is lacking.
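For reference, a sketch of what the write side can look like once the read is partitioned (continuing from a partitioned DataFrame `df` like the one sketched above; the partition count and target path are illustrative):

```python
# Sketch only: repartition before the Delta write so work and output files
# are spread evenly across tasks; 200 and the path are illustrative values.
(
    df.repartition(200)
      .write.format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")
      .save("/mnt/datalake/big_table_delta")
)

# If the table has a natural date or key column, adding .partitionBy("<column>")
# before .save() lays the Delta table out for downstream partition pruning.
```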
03-25-2022 05:04 AM
Cherish your data. "Keep your raw data raw: don't manipulate it without having a copy," says Teal. Visualize the information. Show your workflow. Use version control. Record metadata. Automate, automate, automate. Make computing time count. Capture your environment.