Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Extract datetime value from the file name

David_Billa
New Contributor III

I have a filename as below, and I want to extract the datetime value and convert it to a datetime data type.

This_is_new_file_2024_12_06T11_00_49_AM.csv

Here I want to extract only '2024_12_06T11_00_49' and convert it to a datetime value in a new field. I tried substr with the format yyyyMMddHHmmss, but it's not working: it produces only null values instead of the datetime value.

Any help?
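A likely cause of the nulls, for anyone hitting the same thing: the format string has to mirror the literal separators in the text, and yyyyMMddHHmmss has none while the extracted text contains underscores and a literal 'T'. A minimal pure-Python sketch of the same mismatch (strptime raises where Spark's to_timestamp would instead return NULL):

```python
from datetime import datetime

s = "2024_12_06T11_00_49"

# A separator-free pattern cannot match text containing underscores
# and a literal 'T', so parsing fails (Spark would return NULL here)
try:
    datetime.strptime(s, "%Y%m%d%H%M%S")
except ValueError as e:
    print("no match:", e)

# A pattern that mirrors the separators parses cleanly
print(datetime.strptime(s, "%Y_%m_%dT%H_%M_%S"))  # 2024-12-06 11:00:49
```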


7 REPLIES

Walter_C
Databricks Employee

Hello David, you can use something like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_timestamp

# Initialize Spark session
spark = SparkSession.builder.appName("ExtractDatetime").getOrCreate()

# Sample data
data = [("This_is_new_file_2024_12_06T11_00_49_AM.csv",)]
df = spark.createDataFrame(data, ["filename"])

# Extract the datetime string using regexp_extract
datetime_pattern = r"(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2})"
df = df.withColumn("datetime_str", regexp_extract("filename", datetime_pattern, 1))

# Convert the extracted string to a datetime data type
df = df.withColumn("datetime", to_timestamp("datetime_str", "yyyy_MM_dd'T'HH_mm_ss"))

# Show the result
df.show(truncate=False)

David_Billa
New Contributor III

@Walter_C thanks for your help. Can't we do it with split_part or substr function to extract the datetime value and then apply the datetime format?

Walter_C
Databricks Employee

I got the same result as you when using split: a null result. Another option I can suggest, to get only the datetime, is:

import re
from datetime import datetime

filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"

match = re.search(r"\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2}", filename)
datetime_string = match.group(0) if match else None

if datetime_string:
    datetime_object = datetime.strptime(datetime_string, "%Y_%m_%dT%H_%M_%S")
    print(datetime_object)
else:
    print("Datetime not found in the filename")

David_Billa
New Contributor III

@Walter_C So it's not possible to achieve this result with SQL functions and a datetime format? The reason I ask is that I want to incorporate the solution into an existing SELECT statement.

ck1
New Contributor II

@David_Billa how about:

import pyspark.sql.functions as F

df_with_datetime = df.withColumn(
    'extracted_datetime',
    F.to_timestamp(
        # Join the five '_'-separated tokens before the trailing AM/PM
        # part: '2024', '12', '06T11', '00', '49' -> '20241206T110049'
        F.concat(
            *[F.split_part(F.col('file_name'), F.lit("_"), F.lit(i)) for i in range(-6, -1)]
        ),
        'yyyyMMdd\'T\'HHmmss'
    )
)

display(df_with_datetime)

or

SELECT *,
       to_timestamp(
           concat_ws('', 
               split_part(file_name, '_', -6),
               split_part(file_name, '_', -5),
               split_part(file_name, '_', -4),
               split_part(file_name, '_', -3),
               split_part(file_name, '_', -2)
           ),
           'yyyyMMdd\'T\'HHmmss'
       ) AS extracted_datetime
FROM df

David_Billa
New Contributor III

@ck1 looks good now. I see that the datetime value comes out like 2024-12-06T11:00:49.000+00:00.

Can't we ignore the .000+00:00 part?

Any help here @Walter_C 

Walter_C
Databricks Employee

Unfortunately I am not able to make it work with SQL functions.
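One hedged note for readers with the same trailing-suffix question: the .000+00:00 is just the default timestamp rendering, and if a plain string is acceptable, Spark SQL's built-in date_format (e.g. date_format(extracted_datetime, 'yyyy-MM-dd HH:mm:ss')) would drop it, at the cost of returning a string rather than a timestamp. A pure-Python sketch of the same idea, using the thread's example filename:

```python
from datetime import datetime

filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"

# Join the five underscore-separated tokens before the trailing AM/PM
# token, mirroring ck1's split_part(-6..-2) trick
stamp = "".join(filename.split("_")[-6:-1])   # '20241206T110049'

dt = datetime.strptime(stamp, "%Y%m%dT%H%M%S")

# Formatting as a string drops the fractional seconds and UTC offset
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2024-12-06 11:00:49
```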
