05-06-2022 02:15 AM
Hi,
I need to ingest 60 million JSON files from S3, and I have created a Delta Live Tables pipeline that uses Auto Loader to load the data into a Delta table. However, the input rate in my DLT pipeline is always around 8 records/second, no matter how many workers I add. I'm using all the default settings and have set the DLT pipeline to production mode. Is there any config I need to add to increase the input rate for my DLT pipeline?
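For context, a setup like the one described might be sketched as below. The option names are real Auto Loader settings, but the values, table name, and S3 path are hypothetical; with 60 million small files, the per-trigger file cap and directory-listing mode are the usual throughput levers, so this is a sketch of what the defaults look like when made explicit, not a confirmed fix.

```python
# Hypothetical sketch of the Auto Loader options behind a DLT ingest
# like the one described. Values and the S3 path are illustrative.
autoloader_options = {
    "cloudFiles.format": "json",
    # Default per-micro-batch cap is 1000 files; with a 60M-file
    # backlog this cap (together with listing overhead) bounds the
    # observed input rate regardless of worker count.
    "cloudFiles.maxFilesPerTrigger": "1000",
}

# Inside a DLT pipeline this would be used roughly like:
#
# import dlt
#
# @dlt.table
# def raw_events():
#     return (
#         spark.readStream.format("cloudFiles")
#         .options(**autoloader_options)
#         .load("s3://my-bucket/events/")  # hypothetical path
#     )
```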
Thanks,
05-06-2022 09:01 AM
Please consider the following:
05-07-2022 06:11 AM
Hi Hubert,
Thank you very much for your answer. I have tried your suggestions and have some follow-up questions, which I've posted inline with your answer here.
If you can help with these follow-up questions, that would be greatly appreciated. I'm very new to Spark/Databricks and the data-analysis field in general, so I'm trying my best to learn here.
Thanks
05-07-2022 09:38 AM
Regarding the private VPC: yes, that's the link https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html
Regarding the region: it's easiest to handle at setup time. When you set up Databricks you choose a region and availability zone, and the same applies to each S3 bucket; just make sure they match on both sides, for example us-west-2.
05-07-2022 09:38 PM
Thanks, I'll follow the guide to set up the VPC.
Regarding the S3 transfer acceleration, do you know how to connect to it from Databricks without getting the "IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)" error?
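One workaround sometimes suggested for that error is pointing the S3A connector at the Transfer Acceleration endpoint through Spark's Hadoop configuration rather than through the SDK builder. Whether this avoids the `AmazonS3ClientBuilder` exception depends on the hadoop-aws version in use, so treat the settings below as an assumption to verify, not a confirmed fix; the bucket must also have acceleration enabled on the AWS side.

```python
# Hypothetical sketch: routing S3A traffic through the S3 Transfer
# Acceleration endpoint via cluster Spark config. These keys are real
# S3A/Spark settings, but whether they bypass the SDK error is an
# assumption that depends on the hadoop-aws version.
s3a_accel_conf = {
    # Global accelerate endpoint; the bucket name is resolved via
    # virtual-hosted-style addressing.
    "spark.hadoop.fs.s3a.endpoint": "s3-accelerate.amazonaws.com",
    # Acceleration requires virtual-hosted-style requests, so
    # path-style access must stay off (the S3A default).
    "spark.hadoop.fs.s3a.path.style.access": "false",
}

# These would typically be set in the cluster's Spark config UI, or
# programmatically before reading:
#
# for k, v in s3a_accel_conf.items():
#     spark.conf.set(k, v)
```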