Databricks

espenol · ‎12-18-2022

Hey, so our notebooks that read a bunch of JSON files from storage typically use input_file_name() when moving from raw to bronze, but after upgrading to Unity Catalog we get an error message:

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] input_file_name is not supported in Unity Catalog.;

Why is this? Are we following a bad practice by wanting to have filenames for tracing data through the different storage layers? Does Unity Catalog perhaps do this automatically in some way? Just barely started testing Unity Catalog, so struggling a bit to grasp what the differences are. I thought it was merely a tool that did some stuff automatically (lineage etc) and gave us a simple metastore to interact with.

Cedric · ‎12-19-2022

Hi @Espen Solvang,

Thanks for reaching out to us. Python UDFs/UDAFs and Pandas UDFs are currently not supported on Shared Unity Catalog clusters. Please change the cluster access mode to "Single User" instead. That should support input_file_name.

espenol · ‎12-19-2022

Thanks a lot for your response. I'll give it a try and get back to you.

I'm not sure I understand how input_file_name() is a UDF. I didn't write it myself; it's imported from pyspark. I guess what you are saying is that it is still a UDF, so please correct me if I'm wrong.

najmead · ‎01-29-2023

I can't answer the question of why input_file_name() doesn't work with the unity catalog, but I did manage to find a workaround, using the file metadata.

You can query the _metadata column, which gives you a struct with the file path, name, size and modification time. Something like this should work:

select _metadata['file_name'], *
from my_catalog.my_schema.my_table

harraz · ‎07-18-2023

This won't work if you are creating a table for the first time from a stream, as with the code below on its first run. I need a way to capture the name of the file going into the stream.

from pyspark.sql.functions import current_timestamp, input_file_name

# Configure Auto Loader
streaming_query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", raw_checkpoint_path)
    .option("sep", "|")
    .option("inferSchema", "true")
    .option("lineSep", "\r\n")  # Specify the Windows-style EOL character (CRLF)
    .option("pathGlobfilter", file_pattern)
    .load(f"{file_path}")
    # Add the source file name and a processing timestamp to every row
    .select("*", input_file_name().alias("source_file"), current_timestamp().alias("processing_time"))
    .writeStream
    .option("checkpointLocation", raw_checkpoint_path)
    .trigger(availableNow=True)
    .toTable(raw_table_name))

Magnus · ‎05-12-2023

Hi @Cedric Law Hing Ping,

Are there any plans to support input_file_name in Unity Catalog? I'm using Unity Catalog in Delta Live Tables (DLT), which is in preview, and would like to let DLT handle what cluster is used and still be able to use input_file_name for traceability.

harraz · ‎07-18-2023

I have to say that I ran into these undocumented restrictions multiple times with the shared instance and it's annoying.

datahero · ‎11-16-2023

I had a similar issue after the Unity Catalog upgrade and found that the following solution works, based on the documentation:

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/input_file_name

.selectExpr("*", "_metadata as source_metadata")
# .withColumn('file_path', input_file_name())

JasonThomas · ‎12-22-2023

.withColumn("RECORD_FILE_NAME", col("_metadata.file_name"))

Will work for spark.read to get the file name, or:

.withColumn("RECORD_FILE_NAME", col("_metadata.file_path"))

To get the whole file path

mgiglia · ‎03-03-2024

This worked perfectly for me. The error message mentioned _metadata.file_path as an alternative to input_file_name, but it wasn't clear how to reference it. Thanks for making it clear that it's technically a column that's available. I'm going to explore what else is available in _metadata. This should be marked as the solution, in my opinion. Thanks again.

input_file_name() not supported in Unity Catalog
