Best AWS S3 Bucket Configuration for Auto Loader with Support for Glacier and Future Use Cases

kpendergast
Contributor

As the title states, I would like to hear how others have set up an AWS S3 bucket to source data with Auto Loader while still supporting the ability to archive files into Glacier objects after a certain period of time.

We currently have about 20 million JSON files (average file size 400 KB) in an S3 bucket, all written to the root of the bucket, with about 280k new files added per day. I need to move the old files that have already been loaded by Auto Loader into a Glacier archive directory.

We've tried moving them out of the bucket into a new bucket, but we run out of memory using DataSync in AWS and have had a limit-increase support ticket open for almost two weeks. This feels like too much of a hassle for something that should be fairly straightforward.

Adding an archive directory under the root in this case would cause issues, with Auto Loader trying to read those files and AWS charging us for accessing Glacier objects.

The other option I can think of would be to have the new files written to one directory and the archived files in Glacier written to a separate directory that is not a subdirectory of the first. It would look like this:

../new_data/

../archive/

Auto Loader would then be pointed only at "/new_data/" and would see only those files. Checkpoints would need to be reset, but I think we would accomplish our goal. Also, is this approach too simple, or should the directory have more context added, for example "/new_data/use_case_name/", to support other data that may be added at a future date for other use cases?
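For reference, here is roughly what pointing Auto Loader at only the new-data prefix could look like (PySpark). The bucket name, schema location, checkpoint location, and target path below are just placeholders, not our real paths:

# Minimal Auto Loader sketch reading only from the /new_data/ prefix.
# Bucket name, schema/checkpoint locations, and output path are placeholders.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Let Auto Loader track the JSON schema instead of scanning the bucket manually.
        .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/new_data/")
        .load("s3://example-bucket/new_data/")   # only this prefix is listed
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://example-bucket/_checkpoints/new_data/")
        .start("s3://example-bucket/delta/new_data/")
)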

Thanks in advance!

2 REPLIES

Prabakar
Esteemed Contributor III

@Ken Pendergast To set up Databricks with Auto Loader, please follow the document below.

https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html

Fetching data from Glacier is not supported. However, you can try one of the following configurations and it might help.

  • Set the ignoreMissingFiles configuration in the notebook before starting the stream read:
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

We should be able to ignore archived files with this.

OR

  • Set the badRecordsPath option on readStream (we would recommend setting the first one):
val df = spark.readStream
  .format("cloudFiles")
  .option("badRecordsPath", "/tmp/badRecordsPath")
  // ...remaining cloudFiles options...
  .load()

Thanks for the info. We have Auto Loader running daily just fine, and all of the files are in Delta. My question is more on the AWS side, as there is no clear best practice for sourcing data from S3 and then moving the files to Glacier without moving them to another bucket. The S3 bucket with Delta and other workspace files is completely separate from this.

I think the two-directory approach would work, but I don't have an easy way to test it.

Removing the files from the source that Auto Loader reads from is a must, as the job takes proportionally longer to run as the number of files grows. I also build the schema by reading over the files in the S3 source bucket; with the 20 million files present, this requires far more compute resources than it would with a couple million. Each JSON file contains 260 fields nested at a very deep level. I have changes being made to the files soon and need the bucket simplified, so that new data is written where Auto Loader can read it and AWS can move it into Glacier objects after a couple of days as a backup in case we need the files again.
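For the AWS side, here is a sketch of what the lifecycle rule could look like, assuming the already-processed files are landed under an archive/ prefix in the same bucket; the bucket name, prefix, and two-day window are placeholders, not a confirmed setup:

# Sketch (boto3): transition objects under archive/ to Glacier after 2 days.
# Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 2, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)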
