Data Engineering with Databricks Module 6.3L Error: Autoload CSV

Dave_Nithio — Wed, 12 Oct 2022 20:30:21 GMT

I am currently taking the Data Engineering with Databricks course and have run into an error. I hit a similar error when I tried the same thing with my own data. In the lab, we use Auto Loader to read a stream of CSV files stored in DBFS. The answer for this lab is:

# ANSWER
customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"
 
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", customers_checkpoint_path)
      .load("/databricks-datasets/retail-org/customers/")
      .createOrReplaceTempView("customers_raw_temp"))

This results in an error message:

java.lang.UnsupportedOperationException: Schema inference is not supported for format: csv. Please specify the schema.

It seems that a predefined schema is required when the format is CSV. I tried this with my own Databricks data and had to create a schema first, then pass that schema to the stream:

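from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema: three nullable string columns
schema = StructType([
  StructField("Test1", StringType(), True),
  StructField("Test2", StringType(), True),
  StructField("Test3", StringType(), True)])
 
# source_format and data_source are set earlier in the notebook (not shown here).
# The parentheses let the method chain span multiple lines.
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", source_format)
      .option("header", "True")
      .schema(schema)
      .load(data_source))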

Is this the best solution for this error, or is there a way for Auto Loader to infer the schema, as shown in the solution to the Databricks lab?

Re: Data Engineering with Databricks Module 6.3L Error: Autoload CSV

Dave_Nithio — Wed, 12 Oct 2022 20:46:57 GMT

After a bit more research, it looks like I was using a cluster with an outdated Databricks Runtime (DBR). I updated to 11.1 and no longer get the error.

Re: Data Engineering with Databricks Module 6.3L Error: Autoload CSV

Anonymous — Wed, 12 Oct 2022 23:02:32 GMT

As a small aside, you don't need the third argument in the StructField calls: nullable defaults to True.

Re: Data Engineering with Databricks Module 6.3L Error: Autoload CSV

Hubert-Dudek — Sun, 16 Oct 2022 19:04:05 GMT

Yes, this was improved recently 🙂
