Data Engineering with Databricks Module 6.3L Error: Autoload CSV

Dave_Nithio · 10-12-2022

I am currently taking the Data Engineering with Databricks course and have run into an error. I tried the same thing with my own data and got a similar error. In the lab, we use Auto Loader to read a Spark stream of CSV files stored in DBFS. The answer for this lab is:

# ANSWER
customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"
 
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", customers_checkpoint_path)
      .load("/databricks-datasets/retail-org/customers/")
      .createOrReplaceTempView("customers_raw_temp"))

This results in an error message:

java.lang.UnsupportedOperationException: Schema inference is not supported for format: csv. Please specify the schema.

It seems that with CSV, a predefined schema is required. When I tried this with my own Databricks data, I had to create a schema first and then pass it to the stream:

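from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema: three nullable string columns
schema = StructType([
  StructField("Test1", StringType(), True),
  StructField("Test2", StringType(), True),
  StructField("Test3", StringType(), True)])

# source_format and data_source are defined elsewhere in my notebook
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", source_format)
      .option("header", "True")
      .schema(schema)
      .load(data_source))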
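
For files with more columns, writing each StructField by hand gets tedious. Below is a rough sketch of one way to generate an all-string schema from a CSV header line instead. The helper name `ddl_schema_from_header` is hypothetical (not part of Spark or the course), and treating every column as a string is an assumption that mirrors the schema above.

```python
import csv
import io


def ddl_schema_from_header(header_line: str, column_type: str = "STRING") -> str:
    """Build a DDL schema string such as "`a` STRING, `b` STRING" from a CSV header line.

    Hypothetical helper for illustration; assumes every column has the same type.
    """
    # csv.reader copes with quoted column names that contain commas
    names = next(csv.reader(io.StringIO(header_line)))
    # Backtick-quote each name (doubling any embedded backticks) so that
    # spaces or punctuation in a column name do not break the DDL string
    quoted = ["`" + name.strip().replace("`", "``") + "`" for name in names]
    return ", ".join(f"{q} {column_type}" for q in quoted)


schema_ddl = ddl_schema_from_header("Test1,Test2,Test3")
print(schema_ddl)
```

Since `.schema()` on a stream reader accepts a DDL-formatted string as well as a StructType, `schema_ddl` could be passed in place of the StructType above. This still defines the schema up front; it does not make Auto Loader infer it.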

Is this the best solution for this error, or is there a way for Auto Loader to infer the schema, as the solution to the Databricks lab seems to assume?
