Hi!
I am pulling data from Blob storage into Databricks using Auto Loader. This process works well for almost 10 resources, but for one specific resource I am getting a java.lang.NullPointerException.
It looks like the issue happens when I connect to the Blob storage, yet when I read the same resource with a plain batch read, spark.read.parquet("/mnt/path/to/files/*.parquet"), it works fine.
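For reference, here is a minimal version of that batch read (same placeholder path as above) which completes without any error:

# Plain batch read of the same mounted location; this works fine
batch_df = spark.read.parquet("/mnt/path/to/files/*.parquet")
batch_df.printSchema()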
So the issue only appears when I run Structured Streaming with the "cloudFiles" format.
Below is the code I am using:
from pyspark.sql.functions import col, current_timestamp, lit

downtimeuptime_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/hist_data_delta/hist_data_delta.db/checkpoints/table_name_data_hmc")
    .option("cloudFiles.schemaEvolutionMode", None)
    .load("/mnt/source_data_bu/table_name_data/")
    .select(
        "*",
        lit(_bu).alias("_bu"),  # _bu is a variable defined earlier in the notebook
        col("_metadata.file_path").alias("_source_file"),
        current_timestamp().alias("_processing_time"),
    )
)
Error description:
Py4JJavaError: An error occurred while calling o2702.load.
: java.lang.NullPointerException
	at com.databricks.sql.cloudfiles.options.CloudFilesOptionsBase.$anonfun$userProvidedEvolutionMode$1(CloudFilesOptionsBase.scala:162)
	at scala.Option.map(Option.scala:230)
	at com.databricks.sql.cloudfiles.options.CloudFilesOptionsBase.<init>(CloudFilesOptionsBase.scala:162)
	at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceOptions.<init>(CloudFilesSourceOptions.scala:45)
	at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceProvider.sourceSchema(CloudFilesSourceProvider.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:266)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:150)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:150)
	at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:40)
	at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:223)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:267)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)