Databricks

Tico23 · ‎03-05-2023

For learning purposes, I successfully loaded 3 small files (2 KB each) from AWS S3 using Auto Loader. A few hours later I got an "AWS Free tier limit alert", although I hadn't used the AWS account for a while.

Does this streaming service on Databricks that runs all the time consume requests even if no files/data are uploaded?

Is this normal or did I overlook some hidden configuration?

daniel_sahal · ‎03-05-2023

@Alexander Mora Araya

It has to check somehow whether there's a new file on the storage, so yes - it will consume requests as long as it runs continuously, even when no new files arrive.
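
To put rough numbers on this, here is a back-of-the-envelope sketch. Everything in it is an assumption rather than measured Auto Loader behavior: the polling interval, the one-request-per-poll figure, and the free tier allowance (taken here as 2,000 PUT/COPY/POST/LIST requests per month, so check your current AWS terms). The function names are made up for illustration.

```python
# Rough estimate of how quickly a continuously polling stream could use up
# a request allowance. All numbers are illustrative assumptions.

ASSUMED_FREE_TIER_LIST_REQUESTS = 2_000  # assumption: verify against current AWS terms


def estimated_requests(poll_interval_s: float, hours: float, requests_per_poll: int = 1) -> int:
    """Requests issued over `hours` if the stream polls every `poll_interval_s` seconds."""
    polls = int(hours * 3600 / poll_interval_s)
    return polls * requests_per_poll


def hours_until_limit(poll_interval_s: float,
                      limit: int = ASSUMED_FREE_TIER_LIST_REQUESTS,
                      requests_per_poll: int = 1) -> float:
    """Hours of continuous polling before `limit` requests have been issued."""
    return limit * poll_interval_s / (requests_per_poll * 3600)


if __name__ == "__main__":
    # Hypothetical stream that lists the bucket once every 10 seconds.
    print(estimated_requests(poll_interval_s=10, hours=1))
    print(round(hours_until_limit(poll_interval_s=10), 2))
```

Under these assumptions, an idle stream polling every 10 seconds would pass 2,000 requests in under six hours, which is in line with an alert arriving a few hours after starting the stream.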

Debayan · ‎03-06-2023

Hi, Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), ADLS Gen1 (adl://), and Databricks File System (DBFS, dbfs:/). Auto Loader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader has support for both Python and SQL in Delta Live Tables.
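
As a minimal sketch of the cloudFiles source described above, the function below reads a directory with Auto Loader and writes it to a table. It uses the `availableNow` trigger, which processes the files that are currently available and then stops, so the stream does not keep polling the bucket between runs. The function name, paths, and table name are hypothetical, `spark` is assumed to be the session provided by a Databricks notebook, and the default file format is a guess since the thread does not say which format was used.

```python
def ingest_available_files(spark, source_path, schema_path, checkpoint_path,
                           target_table, file_format="json"):
    """Load the files currently in `source_path` with Auto Loader, then stop.

    All arguments are placeholders; `spark` is the active SparkSession.
    """
    stream = (
        spark.readStream.format("cloudFiles")                  # Auto Loader source
        .option("cloudFiles.format", file_format)              # e.g. json, csv, parquet
        .option("cloudFiles.schemaLocation", schema_path)      # where the inferred schema is tracked
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)         # remembers which files were ingested
        .trigger(availableNow=True)                            # process the backlog, then shut down
        .toTable(target_table)
    )
```

Called on a schedule (for example from a job), this trades continuous, near real-time ingestion for a bounded number of storage requests per run. Whether that trade-off suits your case depends on how fresh the data needs to be.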

You can use Auto Loader to process billions of files to migrate or backfill a table. Auto Loader scales to support near real-time ingestion of millions of files per hour.

Could you please check again whether the cloud storage is actually receiving any files?

Please refer to: https://docs.databricks.com/ingestion/auto-loader/index.html

Please let us know if this helps.

Also, please tag @Debayan in your next response so that I get notified. Thank you!

Tico23 · ‎03-06-2023

@Debayan Mukherjee

Thanks for this explanation. Everything worked fine when I tested it, as I mentioned above. The only thing is that it continuously makes requests to S3 to check whether new data needs to be pulled. Am I wrong here?

AmazonS3 with Autoloader consume "too many" requests or maybe not!