10-12-2022 01:30 PM
I am currently taking the Data Engineering with Databricks course and have run into an error. I have also attempted this with my own data and hit a similar error. In the lab, we are using Auto Loader to read a Spark stream of CSV files saved in DBFS. The answer for this lab is:
# ANSWER
customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", customers_checkpoint_path)
    .load("/databricks-datasets/retail-org/customers/")
    .createOrReplaceTempView("customers_raw_temp"))

This results in an error message:
java.lang.UnsupportedOperationException: Schema inference is not supported for format: csv. Please specify the schema.
It seems that when using CSV, a predefined schema is required. I attempted this with my personal Databricks data and had to create a schema first, then add that schema to my stream:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Test1", StringType(), True),
    StructField("Test2", StringType(), True),
    StructField("Test3", StringType(), True)])

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("header", "True")
    .schema(schema)
    .load(data_source))

Is this the best solution for this error, or is there a way for Auto Loader to infer the schema as shown in the solution to the Databricks lab?
10-12-2022 01:46 PM
After a bit more research, it looks like I was using a cluster with an outdated DBR. I updated to 11.1 and no longer received the error.
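If anyone else wants to confirm which runtime their notebook is attached to before retrying, a quick sketch; it assumes the DATABRICKS_RUNTIME_VERSION environment variable that Databricks sets in notebook sessions.

import os

# Prints something like "11.1" on a cluster running DBR 11.1
# (key name is an assumption based on the variable Databricks exposes in notebooks).
print(os.environ.get("DATABRICKS_RUNTIME_VERSION"))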
10-16-2022 12:04 PM
Yes, it was improved recently 🙂
10-12-2022 04:02 PM
As a small aside, you don't need the third argument in the StructFields; nullable already defaults to True.
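In other words, something like this should behave identically, since StructField's nullable parameter defaults to True:

from pyspark.sql.types import StructType, StructField, StringType

# Equivalent to passing True explicitly as the third argument.
schema = StructType([
    StructField("Test1", StringType()),
    StructField("Test2", StringType()),
    StructField("Test3", StringType())])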