Extract datetime value from the file name
12-24-2024 02:47 AM
I have a filename like the one below, and I want to extract the datetime value and convert it to a datetime data type.
This_is_new_file_2024_12_06T11_00_49_AM.csv
Here I want to extract only '2024_12_06T11_00_49' and convert it to a datetime value in a new field. I tried substr with yyyyMMddHHmmss and it's not working; it produces only null values instead of the datetime value.
Any help?
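For context, here is a minimal sketch of the kind of attempt described above (the fixed character offsets and the to_timestamp call are assumptions, not the exact code used): the extracted text still contains underscores and a 'T', so a yyyyMMddHHmmss format cannot parse it and the result is null.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("SubstrAttempt").getOrCreate()
df = spark.createDataFrame(
    [("This_is_new_file_2024_12_06T11_00_49_AM.csv",)], ["filename"]
)
# Hypothetical offsets: position 18, length 19 picks out '2024_12_06T11_00_49'.
# The format below does not account for the underscores and 'T', so parsing
# fails and to_timestamp returns NULL.
attempt = df.withColumn(
    "datetime",
    F.to_timestamp(F.substring("filename", 18, 19), "yyyyMMddHHmmss"),
)
attempt.show(truncate=False)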
12-24-2024 04:16 AM
Hello David, you can use something like:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_timestamp
# Initialize Spark session
spark = SparkSession.builder.appName("ExtractDatetime").getOrCreate()
# Sample data
data = [("This_is_new_file_2024_12_06T11_00_49_AM.csv",)]
df = spark.createDataFrame(data, ["filename"])
# Extract the datetime string using regexp_extract
datetime_pattern = r"(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2})"
df = df.withColumn("datetime_str", regexp_extract("filename", datetime_pattern, 1))
# Convert the extracted string to a datetime data type
df = df.withColumn("datetime", to_timestamp("datetime_str", "yyyy_MM_dd'T'HH_mm_ss"))
# Show the result
df.show(truncate=False)
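For the sample row, this should produce datetime_str = 2024_12_06T11_00_49 and datetime = 2024-12-06 11:00:49, i.e. a proper timestamp column instead of null.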
12-24-2024 05:04 AM
@Walter_C thanks for your help. Can't we do it with the split_part or substr function to extract the datetime value and then apply the datetime format?
12-24-2024 05:49 AM
I got the same result as you when using split, i.e. a null result. Another option I can suggest to get only the datetime is:
import re
from datetime import datetime
filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"
match = re.search(r"\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2}", filename)
datetime_string = match.group(0) if match else None
if datetime_string:
    datetime_object = datetime.strptime(datetime_string, "%Y_%m_%dT%H_%M_%S")
    print(datetime_object)
else:
    print("Datetime not found in the filename")
12-24-2024 06:16 AM
@Walter_C So it's not possible to achieve this with SQL functions and a datetime format? The reason I ask is that I want to incorporate this solution into an existing SELECT statement.
12-24-2024 06:37 AM - edited 12-24-2024 06:39 AM
@David_Billa how about
import pyspark.sql.functions as F
df_with_datetime = df.withColumn(
    'extracted_datetime',
    F.to_timestamp(
        F.concat(
            *[F.split_part(F.col('file_name'), F.lit("_"), F.lit(i)) for i in range(-6, -1)]
        ),
        'yyyyMMdd\'T\'HHmmss'
    )
)
display(df_with_datetime)
or
SELECT *,
  to_timestamp(
    concat_ws('',
      split_part(file_name, '_', -6),
      split_part(file_name, '_', -5),
      split_part(file_name, '_', -4),
      split_part(file_name, '_', -3),
      split_part(file_name, '_', -2)
    ),
    'yyyyMMdd\'T\'HHmmss'
  ) AS extracted_datetime
FROM df
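For the sample filename, split_part with a negative index counts fields from the right: -6 through -2 give '2024', '12', '06T11', '00' and '49', which concatenate to '20241206T110049' and parse with the yyyyMMdd'T'HHmmss format (index -1 would be 'AM.csv', so it is skipped).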
12-27-2024 06:20 AM
I think it's not really possible, though I am quite new to Databricks. Here are the types that one can use: Data types | Databricks on AWS
There is a date type and a timestamp type, but it doesn't look like there is anything in between. (You could of course save a string representation of the datetime...)
12-24-2024 06:42 AM
Unfortunately, I am not able to make it work with SQL functions.

