05-06-2022 02:15 AM
Hi,
I need to ingest 60 million JSON files from S3, and I have created a Delta Live Tables pipeline that uses Auto Loader to load the data into a Delta table. However, the input rate in my DLT pipeline is always around 8 records/second, no matter how many workers I add. I'm using all the default settings and have set the DLT pipeline to production mode. Is there any config I need to add to increase the input rate for my DLT pipeline?
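For context, a setup like the one described might be sketched as below. The option names are real Auto Loader settings, but the values, table name, and S3 path are hypothetical; with 60 million small files, the per-trigger file cap and directory-listing mode are the usual throughput levers, so this is a sketch of what the defaults look like when made explicit, not a confirmed fix.

```python
# Hypothetical sketch of the Auto Loader options behind a DLT ingest
# like the one described. Values and the S3 path are illustrative.
autoloader_options = {
    "cloudFiles.format": "json",
    # Default per-micro-batch cap is 1000 files; with a 60M-file
    # backlog this cap (together with listing overhead) bounds the
    # observed input rate regardless of worker count.
    "cloudFiles.maxFilesPerTrigger": "1000",
}

# Inside a DLT pipeline this would be used roughly like:
#
# import dlt
#
# @dlt.table
# def raw_events():
#     return (
#         spark.readStream.format("cloudFiles")
#         .options(**autoloader_options)
#         .load("s3://my-bucket/events/")  # hypothetical path
#     )
```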
Thanks,
05-06-2022 09:01 AM
Please consider the following:
05-07-2022 06:11 AM
Hi Hubert,
Thank you very much for your answer. I have tried your suggestions and have some follow-up questions, which I've posted inline with your answer here.
If you can help with these follow-up questions, that would be greatly appreciated. I'm very new to Spark/Databricks and the data-analysis field in general, so I'm trying my best to learn here.
Thanks
05-07-2022 09:38 AM
Regarding the private VPC: yes, that's the link https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html
Regarding the region: it's easiest to handle at setup time. When you set up Databricks you choose a region and availability zone, and the same applies to each S3 bucket; just make sure they match on both sides, for example us-west-2.
05-07-2022 09:38 PM
Thanks, I'll follow the guide to set up the VPC.
Regarding the S3 transfer acceleration, do you know how to connect to it from Databricks without getting the "IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)" error?
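One workaround sometimes suggested for that error is pointing the S3A connector at the Transfer Acceleration endpoint through Spark's Hadoop configuration rather than through the SDK builder. Whether this avoids the `AmazonS3ClientBuilder` exception depends on the hadoop-aws version in use, so treat the settings below as an assumption to verify, not a confirmed fix; the bucket must also have acceleration enabled on the AWS side.

```python
# Hypothetical sketch: routing S3A traffic through the S3 Transfer
# Acceleration endpoint via cluster Spark config. These keys are real
# S3A/Spark settings, but whether they bypass the SDK error is an
# assumption that depends on the hadoop-aws version.
s3a_accel_conf = {
    # Global accelerate endpoint; the bucket name is resolved via
    # virtual-hosted-style addressing.
    "spark.hadoop.fs.s3a.endpoint": "s3-accelerate.amazonaws.com",
    # Acceleration requires virtual-hosted-style requests, so
    # path-style access must stay off (the S3A default).
    "spark.hadoop.fs.s3a.path.style.access": "false",
}

# These would typically be set in the cluster's Spark config UI, or
# programmatically before reading:
#
# for k, v in s3a_accel_conf.items():
#     spark.conf.set(k, v)
```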