03-05-2023 04:41 AM
After successfully loading 3 small files (2 KB each) from AWS S3 using Auto Loader for learning purposes, I received an "AWS Free Tier limit alert" a few hours later, even though I hadn't used the AWS account for a while.
Does a Databricks streaming job that runs all the time consume S3 requests even if no new files/data are uploaded?
Is this normal or did I overlook some hidden configuration?
03-05-2023 10:56 PM
@Alexander Mora Araya
It still needs to check whether new files have arrived in the storage, so yes - it will keep consuming requests if it runs continuously.
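If you don't need the stream running around the clock, you can run it as a triggered job instead, so it only issues S3 requests while the job is active. A minimal sketch, assuming a hypothetical bucket and paths (spark is the notebook's session):

# Sketch: Auto Loader as a triggered (non-continuous) stream.
# All bucket names and paths are hypothetical placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/demo")
      .load("s3://my-bucket/landing/"))

(df.writeStream
   .option("checkpointLocation", "s3://my-bucket/_checkpoints/demo")
   .trigger(availableNow=True)   # process everything available, then stop
   .toTable("demo_bronze"))

With availableNow, the query processes whatever files are already in the directory and then shuts down, so nothing keeps polling S3 between runs.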
03-06-2023 08:25 AM
Hi, Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), ADLS Gen1 (adl://), and Databricks File System (DBFS, dbfs:/). Auto Loader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader has support for both Python and SQL in Delta Live Tables.
You can use Auto Loader to process billions of files to migrate or backfill a table. Auto Loader scales to support near real-time ingestion of millions of files per hour.
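If the concern is the cost of continuous directory listing, note that Auto Loader also supports file notification mode, which relies on S3 event notifications (SQS/SNS) instead of repeatedly listing the bucket. A minimal sketch, with hypothetical paths:

# Sketch: file notification mode cuts down LIST requests by consuming
# S3 event notifications rather than polling the directory.
# Paths are hypothetical placeholders; Databricks needs permission to
# create the SQS queue and SNS topic (see the docs page linked below).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/notif")
      .load("s3://my-bucket/landing/"))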
Could you please verify whether the cloud storage is receiving any new files?
Please refer: https://docs.databricks.com/ingestion/auto-loader/index.html
Please let us know if this helps.
Also, please tag @Debayan in your next response, which will notify me. Thank you!
03-06-2023 08:42 AM
@Debayan Mukherjee
Thanks for this explanation. Everything worked fine when I tested it, as I mentioned above. The only thing is that it continuously makes requests to S3 to check whether new data needs to be pulled. Am I wrong here?