12-18-2022 11:54 PM
Hey, so our notebooks that read a bunch of JSON files from storage typically use input_file_name() when moving from raw to bronze, but after upgrading to Unity Catalog we get an error message:
AnalysisException: [UC_COMMAND_NOT_SUPPORTED] input_file_name is not supported in Unity Catalog.;
Why is this? Are we following a bad practice by wanting to have filenames for tracing data through the different storage layers? Does Unity Catalog perhaps do this automatically in some way? Just barely started testing Unity Catalog, so struggling a bit to grasp what the differences are. I thought it was merely a tool that did some stuff automatically (lineage etc) and gave us a simple metastore to interact with.
12-19-2022 04:01 AM
Hi @Espen Solvang,
Thanks for reaching out to us. Python UDFs/UDAFs and Pandas UDFs are currently not supported on Unity Catalog clusters in Shared access mode. Instead, please change the access mode to "Single User". This should support input_file_name.
12-19-2022 11:04 PM
Thanks a lot for your response. I'll give it a try and get back to you.
I'm not sure I understand how input_file_name() is a UDF. I didn't write it myself, it's imported from pyspark. I guess what you are saying is that it is still a UDF; please correct me if I'm wrong.
01-29-2023 10:22 PM
I can't answer the question of why input_file_name() doesn't work with Unity Catalog, but I did manage to find a workaround using the file metadata.
You can basically query the hidden _metadata column, which is a struct holding the file path, name, size and modification time. So something like this should work:
select _metadata['file_name'], *
from my_catalog.my_schema.my_table
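For anyone who wants more than the file name, here is a minimal sketch of a helper that builds the same kind of query for several _metadata fields at once. The helper name, the source_ alias prefix and the field whitelist are my own choices for illustration, not anything Databricks provides; the four field names are the ones described in the post above.

```python
# Fields of the hidden _metadata struct mentioned above (path, name, size, modification time).
METADATA_FIELDS = ("file_path", "file_name", "file_size", "file_modification_time")


def metadata_query(table, fields=("file_name",)):
    """Build a SELECT that returns the requested _metadata fields plus all data columns.

    `table` is a fully qualified table name; `fields` must come from METADATA_FIELDS.
    The `source_` alias prefix is an arbitrary naming choice for this sketch.
    """
    unknown = [f for f in fields if f not in METADATA_FIELDS]
    if unknown:
        raise ValueError(f"unknown _metadata field(s): {unknown}")
    cols = ", ".join(f"_metadata.{f} AS source_{f}" for f in fields)
    return f"SELECT {cols}, * FROM {table}"


query = metadata_query("my_catalog.my_schema.my_table", ["file_path", "file_name"])
print(query)
# In a notebook, where `spark` is predefined, this could then be run as: spark.sql(query)
```

The whitelist is only there so a typo in a field name fails early in Python rather than as an analysis error on the cluster.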
07-18-2023 04:43 PM - edited 07-18-2023 04:44 PM
This won't work if you are creating a table for the first time from the stream, for example with the code below when it runs for the first time. I need a way to capture the file name going into the stream.
05-12-2023 03:50 AM
Hi @Cedric Law Hing Ping,
Are there any plans to support input_file_name in Unity Catalog? I'm using Unity Catalog in Delta Live Tables (DLT), which is in preview, and would like to let DLT handle what cluster is used and still be able to use input_file_name for traceability.
07-18-2023 04:20 PM
I have to say that I ran into these undocumented restrictions multiple times with the shared instance and it's annoying.
11-16-2023 08:21 PM
I had a similar issue with the Unity Catalog upgrade and found the following solution working, based on the documentation:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/input_file_name
.selectExpr("*", "_metadata as source_metadata")
# .withColumn('file_path', input_file_name())
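To show where that fragment might sit in a raw-to-bronze step, here is a minimal sketch that wraps it in a small function. The function name, the source_file column name and the path in the usage comment are hypothetical; the sketch assumes the DataFrame comes from a file-based reader (for example JSON files), since that is where the _metadata column is exposed. It selects only _metadata.file_path, the field the error message points to, instead of the whole struct.

```python
def add_source_file(df, column_name="source_file"):
    """Return df with the source file path appended as `column_name`.

    Assumes df was read from a file-based source, so the hidden
    _metadata column is available on it.
    """
    # _metadata is hidden: it only shows up in the result when selected explicitly.
    return df.selectExpr("*", f"_metadata.file_path AS {column_name}")


# Hypothetical usage in a notebook, where `spark` is predefined and the path is a placeholder:
# raw = spark.read.format("json").load("/path/to/raw/")
# bronze = add_source_file(raw)
```

Because the function only calls selectExpr on whatever DataFrame it is given, the same call should in principle also apply to a DataFrame created with spark.readStream, but I have not verified that on every cluster mode.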
12-22-2023 12:03 PM
03-03-2024 12:13 AM
This worked perfectly for me. The error message mentioned _metadata.file_path as an alternative to input_file_name, but it wasn't clear how to reference it. Thanks for making it clear that it's technically a column that's available. I'm going to explore what else is available in _metadata. This should be marked as the solution in my opinion. Thanks again.