Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

input_file_name() not supported in Unity Catalog

espenol
New Contributor III

Hey, so our notebooks that read a bunch of JSON files from storage typically use input_file_name() when moving data from raw to bronze, but after upgrading to Unity Catalog we get this error message:

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] input_file_name is not supported in Unity Catalog.;

Why is this? Are we following a bad practice by wanting to have filenames for tracing data through the different storage layers? Does Unity Catalog perhaps do this automatically in some way? Just barely started testing Unity Catalog, so struggling a bit to grasp what the differences are. I thought it was merely a tool that did some stuff automatically (lineage etc) and gave us a simple metastore to interact with.

9 REPLIES 9

Cedric
Valued Contributor

Hi @Espen Solvang​,

Thanks for reaching out to us. Python UDFs/UDAFs and pandas UDFs are currently not supported on Shared Unity Catalog clusters. Instead, please change the cluster access mode to "Single User". This should support input_file_name.

espenol
New Contributor III

Thanks a lot for your response. I'll give it a try and get back to you.

I'm not sure I understand how input_file_name() is a UDF: I didn't write it myself, it's imported from pyspark. I guess what you are saying is that it still counts as a UDF; please correct me if I'm wrong.

I can't answer the question of why input_file_name() doesn't work with Unity Catalog, but I did manage to find a workaround using the file metadata.

You can query the _metadata column, which is a struct containing the file path, name, size and modification time. So something like this should work:

select _metadata['file_name'], *
from my_catalog.my_schema.my_table

harraz
New Contributor III

This won't work if you are creating a table for the first time from a stream, for example with the code below on its first run. I need a way to capture the file name as it goes into the stream.

from pyspark.sql.functions import current_timestamp, input_file_name

# Configure Auto Loader
streaming_query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", raw_checkpoint_path)
    .option("sep", "|")
    .option("inferSchema", "true")
    .option("lineSep", "\r\n")  # Windows-style EOL (CRLF)
    .option("pathGlobfilter", file_pattern)
    .load(file_path)
    # input_file_name() is the call that fails on shared Unity Catalog clusters
    .select("*", input_file_name().alias("source_file"), current_timestamp().alias("processing_time"))
    .writeStream
    .option("checkpointLocation", raw_checkpoint_path)
    .trigger(availableNow=True)
    .toTable(raw_table_name)
)

Hi @Cedric Law Hing Ping​,

Are there any plans to support input_file_name in Unity Catalog? I'm using Unity Catalog in Delta Live Tables (DLT), which is in preview, and would like to let DLT handle what cluster is used and still be able to use input_file_name for traceability.

harraz
New Contributor III

I have to say that I have run into these undocumented restrictions multiple times with shared clusters, and it's annoying.

 

datahero
New Contributor II

I had a similar issue after the Unity Catalog upgrade and found that the following solution works, based on the documentation:

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/input_file_name

 .selectExpr("*", "_metadata as source_metadata")
  # .withColumn('file_path', input_file_name())

 

JasonThomas
New Contributor III
.withColumn("RECORD_FILE_NAME", col("_metadata.file_name"))

This works with spark.read to get the file name. Alternatively:

.withColumn("RECORD_FILE_NAME", col("_metadata.file_path"))

gets the whole file path.

This worked perfectly for me. The error message mentioned _metadata.file_path as an alternative to input_file_name, but it wasn't clear how to reference it. Thanks for making it clear that it's technically a column that's available. I'm going to explore what else is available in _metadata. This should be marked as the solution, in my opinion. Thanks again.
