01-17-2024 08:27 PM
If this isn't the right spot to post this, please move it or refer me to the right area.
I recently learned about the "_metadata.file_name". It's not quite what I need.
I'm creating a new table in Databricks and want to add a USR_File_Name column containing the filenames of the uploaded/imported files. For example, I'm loading a table from a bunch of files with a naming scheme like "HT00114_DXLS_PROD_20240102_00001.txt". I need to keep the filename as a field value in my table.
I also want to create a USR_File_Create_Date column where I add the '20240102' from the filename as a value.
Can anyone direct me to the right info for this?
01-18-2024 11:15 AM
Hi, Could you please elaborate more on the expectation here?
01-22-2024 10:28 AM
I have files with names like "HT00114_DXLS_PROD_20240102_00001.txt", where the 2024xxxx is the date the file was created. I want to load the file from my hard drive into a table in my sandbox-type space on our Databricks data lake, as a SQL DB table. I have a "Data Ingestion" option available, but no choices within it (that I can see) to add the file name as a column (field value). I'm wondering how I can accomplish adding other details. In an SSIS package, I'd be able to specify formulas and values for additional fields to accompany my dataset on the upload. I'm hoping I can accomplish the same thing here.
01-22-2024 03:09 PM
@joeyslaptop - While loading the file, and before creating the Delta table, you can use the input_file_name() function to add the file name as a column. Below is a sample code. Please try it and let us know.
from pyspark.sql.functions import input_file_name
df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("<file-directory path>")
df_with_filename = df.withColumn("filename", input_file_name())
display(df_with_filename)
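The snippet above covers the USR_File_Name column but not the USR_File_Create_Date column also asked about. The date portion can be pulled out of the filename with a regular expression; below is a sketch that assumes the filenames always follow the "HT00114_DXLS_PROD_20240102_00001.txt" pattern (the regex itself and the column names in the comments are illustrative assumptions, not a confirmed Databricks feature):

```python
import re

# Assumed naming scheme: <prefix>_<YYYYMMDD>_<sequence>.txt
# e.g. "HT00114_DXLS_PROD_20240102_00001.txt" -> "20240102"
FILE_DATE_PATTERN = r"_(\d{8})_\d+\.txt$"

def extract_create_date(filename: str) -> str:
    """Return the YYYYMMDD portion of the filename, or "" if it doesn't match."""
    match = re.search(FILE_DATE_PATTERN, filename)
    return match.group(1) if match else ""

# The same pattern can be applied on the Spark side with regexp_extract,
# e.g. (sketch, building on the df from the earlier snippet):
#
#   from pyspark.sql.functions import input_file_name, regexp_extract, to_date
#   df_with_cols = (
#       df.withColumn("USR_File_Name", input_file_name())
#         .withColumn("USR_File_Create_Date",
#                     to_date(regexp_extract(input_file_name(),
#                                            r"_(\d{8})_\d+\.txt$", 1),
#                             "yyyyMMdd"))
#   )

print(extract_create_date("HT00114_DXLS_PROD_20240102_00001.txt"))
```

Because input_file_name() returns the full path, anchoring the pattern at the end of the string keeps it working whether or not a directory prefix is present.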
01-25-2024 11:08 AM - edited 01-25-2024 11:12 AM
Hi, Shan_Chandra.
Thanks for the suggestion. The method I'm using to import my file is Data Ingestion > Data Sources > From Local Files > Create or Modify Table.
It doesn't provide a place for code or to specify column values outside of what's found in my csv file. What method should I be using to import data? Maybe my permissions are limited? I don't know yet.
I'm coming from MS SQL, where I could create an SSIS package via the import wizard and then modify the SSIS code in the wizard or in the saved SSIS file. I'm still new to Databricks and so far only have experience querying and importing data via the automated tool.
01-25-2024 11:55 AM
@joeyslaptop - could you please try the following?
1. Click on "Upload to DBFS".
2. Upload the file from local to a DBFS location.
3. Copy the location where the file is uploaded. Then, on the DBFS tab, choose the option to create a table using a notebook; this opens a pre-built notebook with sample code that allows you to load the file directly into a DataFrame.
4. Edit that notebook to introduce the input_file_name() function, using the sample code snippet shared earlier.
Hope this helps.