For reference, for anybody struggling with the same issue: all online examples using Auto Loader are written as one block statement of the form:
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # The schema location directory keeps track of your data schema over time
    .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
    .load("<path_to_source_data>")
    .writeStream
    .option("checkpointLocation", "<path_to_checkpoint>")
    .start("<path_to_target>")
)
The solution was to split this into three parts, as follows:
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # The schema location directory keeps track of your data schema over time
    .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
    .load("<path_to_source_data>"))
for c in df.columns:
    # Rewrite each column name so it contains no spaces, parentheses or slashes
    new_name = c.replace(" ", "_").replace("(", "%28").replace(")", "%29").replace("/", "%2F")
    df = df.withColumnRenamed(c, new_name)
(df.writeStream
    .option("checkpointLocation", "<path_to_checkpoint>")
    .start("<path_to_target>"))