05-06-2022 02:15 AM
Hi,
I need to ingest 60 million JSON files from S3, and I have created a Delta Live Tables pipeline that uses Auto Loader to ingest the data into a Delta table. However, the input rate of my DLT pipeline stays around 8 records/second no matter how many workers I add. I'm using all the default settings and have set the pipeline to production mode. Is there any config I need to add to increase the input rate for my DLT pipeline?
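The pipeline is essentially the default Auto Loader pattern; a rough sketch of what it looks like (the table name and S3 path here are placeholders, not the real ones):

```python
import dlt

# Minimal DLT + Auto Loader pipeline with all default settings.
# "raw_events" and the S3 path are placeholder names, not the actual ones.
@dlt.table(comment="Raw JSON files ingested from S3 with Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")  # `spark` is provided by the DLT runtime
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/path/to/json/")
    )
```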
Thanks,
- Labels: Auto-loader, Delta, Delta Live Tables, DLT
Accepted Solutions
05-06-2022 09:01 AM
Please consider the following:
- consider making the driver 2 times bigger than the workers,
- check that the S3 bucket is in the same region and that traffic goes through the private gateway (local IPs),
- enable S3 transfer acceleration,
- for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
- increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options (see the sketch after this list),
- analyze parallelism using the Spark UI (every CPU core should process 1 task at a time, so if you have 64 cores, 64 files should be processed simultaneously).
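For the two batch-size options, a minimal sketch of where they would go in the DLT table definition (the table name, S3 path, and option values are illustrative only, not tuned recommendations):

```python
import dlt

# Same Auto Loader reader as in the question, with larger micro-batches.
# "raw_events", the S3 path, and the option values are placeholders.
@dlt.table(comment="Raw JSON ingested with Auto Loader, larger micro-batches")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Allow each micro-batch to pull more files/bytes than the defaults;
        # size these to your cluster and average file size.
        .option("cloudFiles.maxFilesPerTrigger", 10000)
        .option("cloudFiles.maxBytesPerTrigger", "10g")
        .load("s3://my-bucket/path/to/json/")
    )
```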
05-07-2022 06:11 AM
Hi Hubert,
Thank you very much for your answer. I have tried your suggestions and have some follow-up questions, which I post inline with your answer here.
- consider making the driver 2 times bigger than the workers,
  - I have tried this but didn't see a clear improvement.
- check that the S3 bucket is in the same region and communicating via the private gateway (local IPs),
  - I'm not sure how to do this. Is this guide what you mean: https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html?
- enable S3 transfer acceleration,
  - I have enabled transfer acceleration on my S3 bucket; however, when I tried to call the acceleration endpoint from my pipeline I got this error: "IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)". I have googled but couldn't find how to set this option in Databricks.
- for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
  - I have done this from the initial setup.
- increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options
  - I have set maxBytes to "10gb" and maxFiles to 10,000. This is the change that made a clear difference in performance, as it increased the input rate from 8 to around 20 records/s. I'm not sure what the maximum values I can set here are, though.
- analyze parallelism using the Spark UI (every CPU core should process 1 task at a time, so if you have 64 cores, 64 files should be processed simultaneously)
  - I'm not very familiar with the Spark UI, so I'm not sure where to look to analyze this (see the sketch after this post).
If you can help with these follow-up questions, that would be greatly appreciated. I'm very new to Spark/Databricks and the data analysis field in general, so I'm trying my best to learn here.
Thanks
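For anyone following along, one way to sanity-check the parallelism point from the earlier answer is to compare the cluster's total task slots against the number of concurrently running tasks on the Spark UI Stages page; a minimal sketch (assumes a notebook attached to a cluster with the same worker spec as the pipeline, where `spark` is the Databricks-provided session):

```python
# Rough check: how many tasks could run at once on this cluster?
total_slots = spark.sparkContext.defaultParallelism
print(f"Approximate task slots across the cluster: {total_slots}")

# While the stream runs, open the Spark UI -> Stages tab and look at how many
# tasks are in the RUNNING state. If that number stays well below total_slots,
# each micro-batch is not getting enough files, and raising
# cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger should help.
```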
05-07-2022 09:38 AM
Regarding the private VPC: yes, that's the link, https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html
Regarding the region: when you set up Databricks you choose a region and availability zone, and the same applies to each S3 bucket. Just make sure it is the same for both, for example us-west-2.
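If it helps, a quick way to check the bucket side from a notebook (a sketch assuming boto3 and working AWS credentials are available; the bucket name is a placeholder):

```python
import boto3

# Returns the bucket's region; an empty/None LocationConstraint means us-east-1.
s3 = boto3.client("s3")
resp = s3.get_bucket_location(Bucket="my-bucket")  # placeholder bucket name
print(resp.get("LocationConstraint") or "us-east-1")
```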
05-07-2022 09:38 PM
Thanks, I'll follow the guide to set up the VPC.
Regarding the S3 transfer acceleration, do you know how to connect to it from Databricks without getting the "IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)" error?

