Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Increase input rate in Delta Live Tables

tomnguyen_195
New Contributor III

Hi,

I need to ingest 60 million JSON files from S3, and I have created a Delta Live Tables pipeline that uses Auto Loader to ingest the data into a Delta table. However, the input rate in my DLT pipeline stays around 8 records/second no matter how many workers I add. I'm using all the default settings and have set the pipeline to production mode. Is there any config that I need to add to increase the input rate for my DLT pipeline?

Thanks,

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

Please consider the following:

  • consider making the driver about twice as big as each worker,
  • check that the S3 bucket is in the same region and that it communicates via the private gateway (local IPs),
  • enable S3 transfer acceleration,
  • for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
  • increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options (see the sketch after this list),
  • analyze parallelism using the Spark UI (every CPU core processes one task at a time, so if you have 64 cores, 64 files should be processed simultaneously).

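As a concrete illustration of the last two points, here is a minimal sketch of what the DLT Auto Loader source could look like with the trigger limits raised. The table name, S3 path, and option values are placeholder assumptions, not taken from the original post:

```python
import dlt

@dlt.table(name="raw_events")  # hypothetical table name
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Raise the per-micro-batch limits so each trigger pulls more work;
        # the values below are illustrative and should be tuned against the
        # cluster size.
        .option("cloudFiles.maxFilesPerTrigger", "10000")
        .option("cloudFiles.maxBytesPerTrigger", "10g")
        .load("s3://my-bucket/landing/")  # placeholder path
    )
```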

4 REPLIES

tomnguyen_195
New Contributor III

Hi Hubert,

Thank you very much for your answer. I have tried your suggestions and have some follow-up questions, which I've posted inline with your answer here.

  • consider making the driver about twice as big as each worker,
    • I tried this but didn't see a clear improvement.
  • check that the S3 bucket is in the same region and that it communicates via the private gateway (local IPs),
  • enable S3 transfer acceleration,
    • I have enabled transfer acceleration on my S3 bucket; however, when I tried to call the acceleration endpoint from my pipeline I got this error: "IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)". I googled but couldn't find how to set this option in Databricks.
  • for ingestion, use Auto Loader as described here: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-data-sources.html
    • I have done this from the initial setup.
  • increase the cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger Auto Loader options
    • I have set this to "10gb" for maxBytes and 10,000 for maxFiles. This is the change that made a clear difference in performance: it increased the input rate from 8 to around 20 records/s. I'm not sure what the maximum values I can set here are, though.
  • analyze parallelism using the Spark UI (every CPU core processes one task at a time, so if you have 64 cores, 64 files should be processed simultaneously)
    • I'm not very familiar with the Spark UI, so I'm not sure where to look to analyze this (a starting point is sketched after this list).
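For the Spark UI question, a rough starting point is to compare the cluster's total task slots with the number of concurrently running tasks shown under Jobs, then the active job's Stages, in the Spark UI. A minimal sketch, runnable from a notebook attached to the same cluster:

```python
# One task runs per CPU core at a time, so this is roughly how many files
# Auto Loader can process in parallel across the cluster.
print("total task slots:", spark.sparkContext.defaultParallelism)
```

If far fewer tasks run concurrently than this number, the stream is not saturating the cluster.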

If you can help with these follow-up questions, that would be greatly appreciated. I'm very new to Spark/Databricks and the data analysis field in general, so I'm trying my best to learn here.

Thanks

Hubert-Dudek
Esteemed Contributor III

Regarding the private VPC, yes, that's the link: https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html

Regarding the region, it's easy to check: when you set up Databricks you choose a region and availability zone, and the same goes for each S3 bucket. Just make sure it is the same for both, for example us-west-2 (see the sketch below for a quick way to check a bucket's region).
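A minimal sketch of that check using boto3; the bucket name is a placeholder:

```python
import boto3

s3 = boto3.client("s3")
# LocationConstraint comes back as None for buckets in us-east-1
resp = s3.get_bucket_location(Bucket="my-bucket")  # placeholder bucket name
print(resp["LocationConstraint"] or "us-east-1")
```

Compare the output with the region shown in your Databricks workspace settings.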

tomnguyen_195
New Contributor III

Thanks, I'll follow the guide to set up the VPC.

Regarding the S3 transfer acceleration, do you know how to connect to it from Databricks without getting the "IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)" error?
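One thing that might be worth trying, stated purely as an assumption: the S3A connector has no direct counterpart to withAccelerateModeEnabled, but its endpoint is configurable, so pointing it at the accelerate endpoint could help. A sketch:

```python
# Untested assumption: point the S3A connector at the transfer-acceleration
# endpoint. fs.s3a.endpoint and fs.s3a.path.style.access are standard
# hadoop-aws settings, but whether this avoids the IllegalStateException
# depends on the hadoop-aws / AWS SDK version in the Databricks runtime.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "s3-accelerate.amazonaws.com")
hconf.set("fs.s3a.path.style.access", "false")  # acceleration requires virtual-hosted-style URLs
```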
