Increase input rate in Delta Live Tables

tomnguyen_195
New Contributor III

Hi,

I need to ingest 60 million JSON files from S3, and I have created a Delta Live Tables pipeline that ingests this data into a Delta table with Auto Loader. However, the input rate in my DLT pipeline is always around 8 records/second, no matter how many workers I add. I'm using all the default settings and have set the pipeline to production mode. Is there any config I need to add to increase the input rate for my DLT pipeline?

Thanks,

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

Please consider the following:

  • consider making the driver two times bigger than the workers,
  • check that S3 is in the same region and that it communicates via a private gateway (local IPs),
  • enable S3 transfer acceleration,
  • for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
  • increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options (see the sketch after this list),
  • analyze parallelism using the Spark UI (every CPU core should process one task at a time, so if you have 64 cores, 64 files should be processed simultaneously).
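To illustrate the Auto Loader points above, here is a minimal sketch of a DLT table that reads JSON with Auto Loader and raises the per-trigger limits. The source path, table name, and limit values are illustrative assumptions, not settings taken from this thread:

    import dlt

    @dlt.table
    def raw_events():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            # Let each micro-batch pull in more data than the defaults allow.
            .option("cloudFiles.maxBytesPerTrigger", "10g")  # assumed value
            .option("cloudFiles.maxFilesPerTrigger", 10000)  # assumed value
            .load("s3://my-bucket/events/")  # hypothetical path
        )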


6 REPLIES


Hi Hubert,

Thank you very much for your answer. I have tried your suggestions and have some follow-up questions, which I've posted inline with your answer here.

  • consider making the driver two times bigger than the workers,
    • I have tried this but didn't see a clear improvement.
  • check that S3 is in the same region and that it communicates via a private gateway (local IPs),
  • enable S3 transfer acceleration,
    • I have enabled transfer acceleration on my S3 bucket; however, when I tried to call the acceleration endpoint from my pipeline I got this error: "IllegalStateException: To enable accelerate mode please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)". I have googled but couldn't find how to set this option in Databricks.
  • for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
    • I have done this from the initial setup.
  • increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options
    • I have set these to "10gb" for maxBytes and 10,000 for maxFiles. This is the change that made a clear difference in performance, as it increased the input rate from 8 to around 20 records/s. I'm not sure what the maximum is that I can set here, though.
  • analyze parallelism using the Spark UI (every CPU core should process one task at a time, so if you have 64 cores, 64 files should be processed simultaneously)
    • I'm not very familiar with the Spark UI, so I'm not sure where to look to analyze what you mentioned above (a quick check is sketched just after this list).
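On that last point, a rough, hedged sketch for sanity-checking the cluster's total parallelism from a notebook (in the Spark UI itself, the same information lives under the Executors tab, and running tasks appear under Stages):

    # defaultParallelism usually corresponds to the total number of
    # executor cores available on the cluster.
    total_cores = spark.sparkContext.defaultParallelism
    print(f"Roughly {total_cores} tasks can run at the same time")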

If you can help with these follow-up questions, that would be greatly appreciated. I'm very new to Spark/Databricks and the data analysis field in general, so I'm trying my best to learn here.

Thanks

Hubert-Dudek
Esteemed Contributor III

Regarding the private VPC, yes, that's the link: https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html

Regarding the region: when you set up Databricks, you choose a region and availability zone, and the same goes for each S3 bucket. Just make sure it is the same for both, for example us-west-2 (see the check below).
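As a quick, hedged way to confirm a bucket's region from a notebook using boto3 (the bucket name is a placeholder; note this API reports us-east-1 as None):

    import boto3

    # Hypothetical bucket name; replace with the real one.
    resp = boto3.client("s3").get_bucket_location(Bucket="my-bucket")
    print(resp["LocationConstraint"] or "us-east-1")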

Thanks, I'll follow the guide to set up the VPC.

Regarding S3 transfer acceleration, do you know how to connect to it from Databricks without getting the "IllegalStateException: To enable accelerate mode please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)" error?

Hi @thanh nguyen, here is an excellent explanation of the issue you faced.

Please have a look.

Kaniz
Community Manager

Hi @thanh nguyen, we haven't heard from you since my last response, and I was checking back to see if you have found a resolution yet. If you have a solution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
