topic Re: How to add a column to a new table containing the original source filenames in DataBricks. in Data Engineering

How to add a column to a new table containing the original source filenames in DataBricks.

joeyslaptop — Thu, 18 Jan 2024 04:27:50 GMT

If this isn't the right spot to post this, please move it or refer me to the right area.

I recently learned about the "_metadata.file_name". It's not quite what I need.

I'm creating a new table in DataBricks and want to add a USR_File_Name column containing the filenames of the uploaded/imported files. For example, I'm loading a table with a bunch of files with a naming scheme like "HT00114_DXLS_PROD_20240102_00001.txt". I need to keep the filename as a field value in my table.

I also want to create a USR_File_Create_Date column where I add the '20240102' from the filename as a value.

Can anyone direct me to the right info for this?

Re: How to add a column to a new table containing the original source filenames in DataBricks.

Debayan — Thu, 18 Jan 2024 19:15:35 GMT

Hi, Could you please elaborate more on the expectation here?

Re: How to add a column to a new table containing the original source filenames in DataBricks.

joeyslaptop — Mon, 22 Jan 2024 18:28:38 GMT

I have files with file names like "HT00114_DXLS_PROD_20240102_00001.txt" where the 2024xxxx is the date the file was created. I want to load the file from my hard drive to a table on my sandbox-type space on our DataBricks data lake as a SQL DB table. I have a "Data Ingestion" option available, but no choices within that (that I can see) to add the file name as a column (field value). I'm wondering how I can accomplish adding other details. In an SSIS package, I'd be able to specify other formulas and values for additional fields to accompany my dataset on the upload. I'm hoping I can accomplish the same thing here.

Re: How to add a column to a new table containing the original source filenames in DataBricks.

shan_chandra — Mon, 22 Jan 2024 23:09:05 GMT

@joeyslaptop - While loading the file and before creating the delta table. you can use input_file_name function and add file name as a column. Below is a sample code. Please try and let us know.

from pyspark.sql.functions import input_file_name df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("<file-directory path>") df_with_filename = df.withColumn("filename", input_file_name()) display(df_with_filename)

Re: How to add a column to a new table containing the original source filenames in DataBricks.

joeyslaptop — Thu, 25 Jan 2024 19:12:47 GMT

Hi, Shan_Chandra.

Thanks for the suggestion. The method I'm using to import my file is the Data Ingestion>Data Sources> From Local Files> Create or Modify Table.

It doesn't provide a place for code or to specify column values outside of what's found in my csv file. What method should I be using to import data? Maybe my permissions are limited? I don't know yet.

I'm coming from MS SQL where I could create the SSIS package via the import wizard, and then modify the SSIS code in the wizard or in the saved SSIS file. I'm still new to DataBricks and only have experience so far of querying and of importing data via the automated tool.

Re: How to add a column to a new table containing the original source filenames in DataBricks.

shan_chandra — Thu, 25 Jan 2024 19:55:06 GMT

@joeyslaptop - could you please try the following?

1. click on upload to DBFS

click on upload to DBFS

2. Upload the file from local to a DBFS location

3. copy the location where the file is uploaded. Now, click on the DBFS tab, to create a table using Notebook will open up a pre-built notebook with sample code that allow you to load the file directly in to a dataframe.

4. Edit the below notebook to introduce input_file_name() method using the sample code snippet shared.

Hope this helps.

Re: How to add a column to a new table containing the original source filenames in DataBricks.

just-Vlad — Wed, 16 Jul 2025 15:41:48 GMT

How to do the same for a bunch of files at once (i.e., wildcard or recursive) without an iteration loop? Thanks.