Data Engineering with Databricks Module 6.3L Error: Autoload CSV

Dave_Nithio
Contributor

I am currently taking the Data Engineering with Databricks course and have run into an error in lab 6.3L. I hit a similar error when I tried the same thing with my own data. In the lab, we use Auto Loader to read a stream of CSV files stored in DBFS. The answer given for this lab is:

# ANSWER
customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"
 
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", customers_checkpoint_path)
      .load("/databricks-datasets/retail-org/customers/")
      .createOrReplaceTempView("customers_raw_temp"))

This results in an error message:

java.lang.UnsupportedOperationException: Schema inference is not supported for format: csv. Please specify the schema.

It seems that when using CSV, a predefined schema is required. With my own Databricks data, I had to create a schema first and then pass that schema to the stream:

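from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema: every column is read as a nullable string
schema = StructType([
  StructField("Test1", StringType(), True),
  StructField("Test2", StringType(), True),
  StructField("Test3", StringType(), True)])

# source_format and data_source are assumed to be set earlier in the notebook.
# The parentheses let the method chain span multiple lines.
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", source_format)
      .option("header", "True")
      .schema(schema)
      .load(data_source))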

Is this the best solution for this error, or is there a way for Auto Loader to infer the schema, as the solution to the Databricks lab implies?

Accepted Solution

Dave_Nithio
Contributor

After a bit more research, it looks like I was using a cluster with an outdated Databricks Runtime (DBR). I updated to 11.1 and no longer received the error.
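For anyone who wants to check this before starting the stream, here is a minimal sketch of a runtime version check. It assumes the notebook runs on a Databricks cluster where the `DATABRICKS_RUNTIME_VERSION` environment variable is set. The helper names and the 11.1 threshold are illustrative only: 11.1 is just the version that worked here, not a documented minimum for CSV schema inference.

```python
import os
import re

# Assumption: 11.1 is simply the runtime reported as working in this thread,
# not an official minimum for Auto Loader CSV schema inference.
ASSUMED_WORKING_DBR = (11, 1)


def parse_dbr_version(version):
    """Return (major, minor) from strings such as '11.1' or '10.4.x-scala2.12'.

    Returns None when the string is missing or does not start with 'N.N'.
    """
    match = re.match(r"\s*(\d+)\.(\d+)", version or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def dbr_at_least(version, minimum=ASSUMED_WORKING_DBR):
    """True if the parsed runtime version is at least `minimum`."""
    parsed = parse_dbr_version(version)
    return parsed is not None and parsed >= minimum


# The variable is unset outside Databricks, in which case the check is False.
runtime = os.environ.get("DATABRICKS_RUNTIME_VERSION")
if dbr_at_least(runtime):
    print(f"DBR {runtime}: at or above the version that worked in this thread")
else:
    print(f"DBR {runtime!r}: older than 11.1 or unknown, consider an explicit schema")
```

If the check fails, passing an explicit schema with `.schema(...)` as in the question remains a reasonable fallback.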

Hubert-Dudek
Esteemed Contributor III

Yes, this was improved recently 🙂

Anonymous
Not applicable

As a small aside, you don't need the third argument in the StructField calls, since nullable defaults to True.
