Need some help with an issue loading a subdirectory from an S3 bucket using Auto Loader. For example:
s3://path1/path2/databases*/paths/
At that databases* level there are various versions of the database, for example:
path1/path2/database_v1/sub_path/*.parquet
path1/path2/database_v2/sub_path/*.parquet
path1/path2/database_v3/sub_path/*.parquet
What's happening? Somehow Auto Loader takes "databases*" as a literal directory name. When it does not find that path, it falls back one level and starts listing from there:
"Listing s3://path1..."
And obviously it gets stuck in that listing, because between path1 and sub_path/*.parquet there are a lot of different schemas to explore.
Already tried "cloudFiles.recursiveFileLookup": "true".
Also tried passing a list of paths, but Databricks does not support passing a list of directories.
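The only alternative I can think of is one Auto Loader stream per version directory, unioned together, roughly like the sketch below (database_versions, the bucket, and the per-version schema locations are placeholders, not my real config):

import functools

# Sketch only: one Auto Loader stream per database version, then a union.
# database_versions and the schema-location bucket are placeholders.
database_versions = ["database_v1", "database_v2", "database_v3"]

streams = [
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # assuming each stream gets its own schema location
    .option("cloudFiles.schemaLocation", f"s3://some_bucket/schemas/{v}")
    .load(f"s3://path1/path2/{v}/sub_path/")
    for v in database_versions
]

readstream_dataframe_autoloader = functools.reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    streams,
)

But that means hard-coding the list of versions, which is exactly what the wildcard was supposed to avoid.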
Code:
autoloader_options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.schemaLocation": f'{defs["schema_checkpoint_name"]}',
}
# Auto Loader
readstream_dataframe_autoloader = (
    spark.readStream
    .format("cloudFiles")
    .options(**autoloader_options)
    .load("s3://path1/path2/databases*/sub_path/bank_fee")
)
# Without Auto Loader this works perfectly, but the project requires the Auto Loader feature.
df_transaction = (
    spark.readStream
    .format("parquet")
    .option("rowsPerSecond", 100)
    .schema(<someschema>)
    .load("s3://path1/path2/databases*/paths/")
)