I am currently taking the Data Engineering with Databricks course and have run into an error in one of the labs. I hit a similar error when I tried the same thing with my own data. In the lab, we use Auto Loader to read a stream of CSV files stored in DBFS. The lab's answer is:
# ANSWER
customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", customers_checkpoint_path)
    .load("/databricks-datasets/retail-org/customers/")
    .createOrReplaceTempView("customers_raw_temp"))
This results in an error message:
java.lang.UnsupportedOperationException: Schema inference is not supported for format: csv. Please specify the schema.
It seems that a predefined schema is required when the format is CSV. When I tried this with my own data in Databricks, I had to define a schema first and then pass it to the stream reader:
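from pyspark.sql.types import StructType, StructField, StringType

# Every column is read as a nullable string
schema = StructType([
    StructField("Test1", StringType(), True),
    StructField("Test2", StringType(), True),
    StructField("Test3", StringType(), True)])

# source_format and data_source are my own variables (not shown here)
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("header", "True")
    .schema(schema)
    .load(data_source))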
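Since all of my columns are nullable strings, I could presumably generate the schema from the header row rather than typing it out by hand. Below is a minimal sketch of that idea. `ddl_from_header` is a hypothetical helper name of my own, and it relies on my understanding that `.schema()` also accepts a DDL-formatted string; I have not verified this with Auto Loader specifically.

```python
import csv
import io


def ddl_from_header(header_line: str, delimiter: str = ",") -> str:
    """Build an all-STRING DDL schema string from a single CSV header line."""
    # csv.reader copes with quoted column names that contain the delimiter
    columns = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    # Backtick-quote each name; double any backtick embedded in a name
    return ", ".join(
        "`{}` STRING".format(name.strip().replace("`", "``")) for name in columns
    )


# Header row of my test files (assumption: same three columns as above)
schema_ddl = ddl_from_header("Test1,Test2,Test3")
print(schema_ddl)
```

The resulting string would take the place of the StructType in `.schema(...)`. This only saves the typing; the schema is still supplied explicitly and no column types are inferred.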
Is defining the schema up front the best solution for this error, or is there a way for Auto Loader to infer the schema itself, as the solution to the Databricks lab suggests?