03-21-2022 09:59 PM
I'm trying to find the best strategy for handling big data sets. In this case I have a table with 450 million records. I can pull the data from SQL Server very quickly, but when I try to write it to a Delta table or an Azure container, the compute resource locks up and never completes. I end up canceling the process after an hour. According to the logs, the compute resource keeps hitting memory issues.
03-22-2022 03:49 AM
03-21-2022 10:39 PM
@Christopher Shehu if you are seeing the cluster hit its memory limit, you may try increasing the cluster size.
Other points to consider:
Please find more details here -
https://kb.databricks.com/jobs/driver-unavailable.html
You may consider reading this too -
https://docs.microsoft.com/en-us/azure/databricks/kb/jobs/job-fails-maxresultsize-exception
03-22-2022 03:49 AM
03-22-2022 06:43 AM
This is helpful. I think I need to look closer at the process and see what needs to be done. The Azure Databricks documentation on PySpark partitioning is lacking.
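For anyone landing here with the same problem, this is a rough sketch of a range-partitioned JDBC read followed by a Delta write, which may keep a large table from arriving as one oversized partition. Nothing in it comes from the original job: the helper names, the table and column names, the bounds, and the figure of 2 million rows per partition are all illustrative assumptions. The option keys (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC data source options.

```python
import math


def jdbc_partition_options(url, table, column, lower, upper, total_rows,
                           rows_per_partition=2_000_000, fetch_size=10_000):
    """Build options for a range-partitioned Spark JDBC read.

    rows_per_partition is an arbitrary starting point (an assumption),
    not a recommended value; tune it against executor memory.
    """
    num_partitions = max(1, math.ceil(total_rows / rows_per_partition))
    return {
        "url": url,
        "dbtable": table,
        # Must be a numeric, date, or timestamp column, ideally evenly spread.
        "partitionColumn": column,
        # The bounds only decide the partition stride; they do not filter rows.
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip by the JDBC driver.
        "fetchsize": str(fetch_size),
    }


def copy_to_delta(spark, options, target_path):
    """Read the table in parallel and append it to a Delta location.

    `spark` is an existing SparkSession; credentials (user/password)
    would have to be added to `options` by the caller.
    """
    df = spark.read.format("jdbc").options(**options).load()
    df.write.format("delta").mode("append").save(target_path)


# Hypothetical values for a table of 450 million rows keyed by `id`.
opts = jdbc_partition_options(
    url="jdbc:sqlserver://example-host:1433;databaseName=example_db",
    table="dbo.big_table",
    column="id",
    lower=1,
    upper=450_000_000,
    total_rows=450_000_000,
)
print(opts["numPartitions"])
```

With these assumed numbers the read is split into 225 partitions. In a notebook the copy would then be `copy_to_delta(spark, opts, "<delta path>")`. The rows should also stay on the executors: calling `collect()` or `toPandas()` on a table this size pulls everything to the driver, which is the kind of failure the linked KB articles describe.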
03-25-2022 05:04 AM
Cherish your data. “Keep your raw data raw: don't manipulate it without having a copy,” says Teal. Visualize the information. Show your workflow. Use version control. Record metadata. Automate, automate, automate. Make computing time count. Capture your environment.
08-11-2023 06:41 AM