Increase input rate in Delta Live Tables

tomnguyen_195
New Contributor III

Hi,

I need to ingest 60 million JSON files from S3, and I have created a Delta Live Tables pipeline that ingests this data into a Delta table with Auto Loader. However, the input rate in my DLT pipeline is always around 8 records/second, no matter how many workers I add. I'm using all the default settings and have set the pipeline to production mode. Is there any config I need to add to increase the input rate for my DLT pipeline?

Thanks,

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

Please consider the following:

  • consider making the driver two times bigger than the workers,
  • check that S3 is in the same region and that it communicates via a private gateway (local IPs),
  • enable S3 transfer acceleration,
  • for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
  • increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options (see the sketch after this list),
  • analyze parallelism using the Spark UI (every CPU core should process one task at a time, so if you have 64 cores, 64 files should be processed simultaneously).
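To illustrate the Auto Loader points above, here is a minimal sketch of a DLT table that reads JSON with Auto Loader and raises the per-trigger limits. The source path, table name, and limit values are illustrative assumptions, not settings taken from this thread:

    import dlt

    @dlt.table
    def raw_events():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            # Let each micro-batch pull in more data than the defaults allow.
            .option("cloudFiles.maxBytesPerTrigger", "10g")  # assumed value
            .option("cloudFiles.maxFilesPerTrigger", 10000)  # assumed value
            .load("s3://my-bucket/events/")  # hypothetical path
        )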


6 REPLIES


Hi Hubert,

Thank you very much for your answer. I have tried your suggestions and have some follow-up questions, which I've posted inline with your answer here.

  • consider making the driver two times bigger than the workers,
    • I have tried this but didn't see a clear improvement.
  • check that S3 is in the same region and that it communicates via a private gateway (local IPs),
  • enable S3 transfer acceleration,
    • I have enabled transfer acceleration on my S3 bucket; however, when I tried to call the acceleration endpoint from my pipeline I got this error: "IllegalStateException: To enable accelerate mode please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)". I have googled but couldn't find how to set this option in Databricks.
  • for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
    • I have done this from the initial setup.
  • increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options
    • I have set these to "10gb" for maxBytes and 10,000 for maxFiles. This is the change that made a clear difference in performance, as it increased the input rate from 8 to around 20 records/s. I'm not sure what the maximum is that I can set here, though.
  • analyze parallelism using the Spark UI (every CPU core should process one task at a time, so if you have 64 cores, 64 files should be processed simultaneously)
    • I'm not very familiar with the Spark UI, so I'm not sure where to look to analyze what you mentioned above (a quick check is sketched just after this list).
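On that last point, a rough, hedged sketch for sanity-checking the cluster's total parallelism from a notebook (in the Spark UI itself, the same information lives under the Executors tab, and running tasks appear under Stages):

    # defaultParallelism usually corresponds to the total number of
    # executor cores available on the cluster.
    total_cores = spark.sparkContext.defaultParallelism
    print(f"Roughly {total_cores} tasks can run at the same time")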

If you can help with these follow-up questions, that would be greatly appreciated. I'm very new to Spark/Databricks and the data analysis field in general, so I'm trying my best to learn here.

Thanks

Hubert-Dudek
Esteemed Contributor III

Regarding the private VPC, yes, that's the link: https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html

Regarding the region: when you set up Databricks, you choose a region and availability zone, and the same goes for each S3 bucket. Just make sure it is the same for both, for example us-west-2 (see the check below).
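As a quick, hedged way to confirm a bucket's region from a notebook using boto3 (the bucket name is a placeholder; note this API reports us-east-1 as None):

    import boto3

    # Hypothetical bucket name; replace with the real one.
    resp = boto3.client("s3").get_bucket_location(Bucket="my-bucket")
    print(resp["LocationConstraint"] or "us-east-1")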

Thanks, I'll follow the guide to set up the VPC.

Regarding S3 transfer acceleration, do you know how to connect to it from Databricks without getting the "IllegalStateException: To enable accelerate mode please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)" error?

Hi @thanh nguyen, here is an excellent explanation of the issue you faced.

Please have a look.

Kaniz
Community Manager

Hi @thanh nguyen, we haven't heard from you since my last response, and I was checking back to see if you have found a resolution yet. If you have a solution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
