Spark streaming auto loader wildcard not working

Jozhua — Fri, 22 Sep 2023 17:56:00 GMT

Need some help with an issue loading a subdirectory from an S3 bucket using Auto Loader. For example:

S3://path1/path2/databases*/paths/

There are several versions of the databases directory. For example:

path1/path2/database_v1/sub_path/*.parquet

path1/path2/database_v2/sub_path/*.parquet

path1/path2/database_v3/sub_path/*.parquet
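One thing worth checking before blaming Auto Loader is whether the glob itself matches the directory names. Below is a minimal sketch in plain Python, assuming the names are exactly as written above (they may just be anonymized placeholders). It uses the standard library's fnmatch, which only approximates the Hadoop glob syntax Spark uses, but a trailing * within a single path segment should behave the same way. The helper name matching is hypothetical.

```python
from fnmatch import fnmatchcase

# Directory names exactly as written in the post (possibly anonymized).
directories = ["database_v1", "database_v2", "database_v3"]


def matching(pattern, names):
    """Return the names that the glob pattern matches, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Pattern as used in the load path (note the "s" before the wildcard).
with_s = matching("databases*", directories)
# Pattern as quoted in the problem description.
without_s = matching("database*", directories)

print("databases* ->", with_s)
print("database*  ->", without_s)
```

If the real directory names follow the database_v1 form, a pattern beginning with "databases" cannot match any of them, so the glob would resolve to nothing regardless of how it is read.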

What's happening? Somehow it treats "database*" as a literal directory name. When it does not find that path, it moves one level up:

"Listing s3://path1..."

And obviously it stays stuck in that listing, because between path1 and sub_path/*.parquet there are a lot of different schemas to explore.

Already tried "cloudFiles.recursiveFileLookup": "true"

Also tried to pass a list, but Databricks does not support a list of directories.

Code:

    autoloader_options = {
        "cloudFiles.format": "parquet",
        "cloudFiles.schemaLocation": f'{defs["schema_checkpoint_name"]}',
    }

    # AutoLoader
    readstream_dataframe_autoloader = (
        spark.readStream
        .format("cloudFiles")
        .options(**autoloader_options)
        .load('S3://path1/path2/databases*/sub_path/bank_fee ')
    )

    # Without Auto Loader this works perfectly, but the project requires the Auto Loader feature.
    df_transaction = (
        spark.readStream
        .format("parquet")
        .option("rowsPerSecond", 100)
        .schema(<someschema>)
        .load("S3://path1/path2/databases*/paths/")
    )
