Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to add a column to a new table containing the original source filenames in Databricks

joeyslaptop
New Contributor II

If this isn't the right spot to post this, please move it or refer me to the right area.

I recently learned about the "_metadata.file_name" column. It's not quite what I need.

I'm creating a new table in Databricks and want to add a USR_File_Name column containing the filenames of the uploaded/imported files. For example, I'm loading a table from a set of files with a naming scheme like "HT00114_DXLS_PROD_20240102_00001.txt". I need to keep each filename as a field value in my table.

I also want to create a USR_File_Create_Date column populated with the '20240102' portion of the filename.

Can anyone direct me to the right info for this?

“Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.” — Albert Einstein
5 REPLIES

Debayan
Databricks Employee

Hi, could you please elaborate a bit more on what you're expecting here?

joeyslaptop
New Contributor II

I have files with names like "HT00114_DXLS_PROD_20240102_00001.txt", where the 2024xxxx portion is the date the file was created. I want to load the file from my hard drive into a table in my sandbox-type space on our Databricks data lake, as a SQL DB table. I have a "Data Ingestion" option available, but no choices within it (that I can see) to add the file name as a column (field value). I'm wondering how I can add other details like this. In an SSIS package, I'd be able to specify formulas and values for additional fields to accompany my dataset on upload. I'm hoping I can accomplish the same thing here.


Shan_Chandra

@joeyslaptop - While loading the file, and before creating the Delta table, you can use the input_file_name function to add the file name as a column. Below is sample code. Please try it and let us know.

from pyspark.sql.functions import input_file_name

# Read the source files into a DataFrame
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("<file-directory path>")
# Add a column holding the full path of the file each row was loaded from
df_with_filename = df.withColumn("filename", input_file_name())
display(df_with_filename)
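
Building on that sample: if you also need the USR_File_Create_Date column from the original question, a regular expression can pull the date out of the file name. This is a minimal sketch assuming the "HT00114_DXLS_PROD_20240102_00001.txt" naming scheme (an 8-digit yyyyMMdd date between underscores); the column names are taken from the question, and the load path is a placeholder.

from pyspark.sql.functions import input_file_name, regexp_extract, to_date

df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("<file-directory path>")

# Full path of the source file for each row
df = df.withColumn("USR_File_Name", input_file_name())

# Pull the 8-digit date (e.g. 20240102) out of the file name and cast it to a date
df = df.withColumn(
    "USR_File_Create_Date",
    to_date(regexp_extract("USR_File_Name", r"_(\d{8})_", 1), "yyyyMMdd"),
)
display(df)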

 

Hi, Shan_Chandra.

Thanks for the suggestion. The method I'm using to import my file is Data Ingestion > Data Sources > From Local Files > Create or Modify Table.

It doesn't provide a place for code or a way to specify column values beyond what's found in my CSV file. What method should I be using to import data? Maybe my permissions are limited? I don't know yet.

I'm coming from MS SQL, where I could create an SSIS package via the import wizard and then modify the SSIS code in the wizard or in the saved SSIS file. I'm still new to Databricks, and so far my only experience is querying and importing data via the automated tool.


@joeyslaptop - could you please try the following?

1. Click "Upload to DBFS".

2. Upload the file from your local machine to a DBFS location.

3. Copy the location where the file was uploaded. Then, on the DBFS tab, choose "Create a table using Notebook"; this opens a pre-built notebook with sample code that lets you load the file directly into a DataFrame.

4. Edit that notebook to add the input_file_name() method using the sample code snippet shared above (a sketch of the edited notebook follows below).

Hope this helps.
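
For reference, here is a minimal sketch of what the edited notebook from step 4 might look like end to end, assuming a CSV uploaded under a hypothetical DBFS path and the file-naming scheme from the question; the path and table name are illustrative. It uses the _metadata.file_name column mentioned in the original post (available when reading from files on recent runtimes), which yields just the file name rather than the full path returned by input_file_name().

from pyspark.sql.functions import col, regexp_extract, to_date

# Hypothetical DBFS location; replace with the path copied in step 3
source_path = "dbfs:/FileStore/tables/"

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(source_path)
)

# Tag each row with its source file name and the date embedded in it
df = (
    df.withColumn("USR_File_Name", col("_metadata.file_name"))
      .withColumn(
          "USR_File_Create_Date",
          to_date(regexp_extract(col("_metadata.file_name"), r"_(\d{8})_", 1), "yyyyMMdd"),
      )
)

# Save as a Delta table so it can be queried with SQL (table name is illustrative)
df.write.format("delta").mode("overwrite").saveAsTable("sandbox.usr_file_loads")

Once written, the new columns behave like any other field, e.g. SELECT USR_File_Name, USR_File_Create_Date FROM sandbox.usr_file_loads.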
