Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Extract datetime value from the file name

David_Billa
New Contributor III

I have a filename as below, and I want to extract the datetime value and convert it to a datetime data type.

This_is_new_file_2024_12_06T11_00_49_AM.csv

Here I want to extract only '2024_12_06T11_00_49' and convert it to a datetime value in a new field. I tried substr with the format yyyyMMddHHmmss, but it's not working: it produces only null values instead of the datetime value.

Any help?
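A likely cause of the nulls, for anyone hitting the same thing: the format string has to mirror the literal separators in the text, and yyyyMMddHHmmss has none while the extracted text contains underscores and a literal 'T'. A minimal pure-Python sketch of the same mismatch (strptime raises where Spark's to_timestamp would instead return NULL):

```python
from datetime import datetime

s = "2024_12_06T11_00_49"

# A separator-free pattern cannot match text containing underscores
# and a literal 'T', so parsing fails (Spark would return NULL here)
try:
    datetime.strptime(s, "%Y%m%d%H%M%S")
except ValueError as e:
    print("no match:", e)

# A pattern that mirrors the separators parses cleanly
print(datetime.strptime(s, "%Y_%m_%dT%H_%M_%S"))  # 2024-12-06 11:00:49
```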


7 REPLIES

Walter_C
Databricks Employee

Hello David, you can use something like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_timestamp

# Initialize Spark session
spark = SparkSession.builder.appName("ExtractDatetime").getOrCreate()

# Sample data
data = [("This_is_new_file_2024_12_06T11_00_49_AM.csv",)]
df = spark.createDataFrame(data, ["filename"])

# Extract the datetime string using regexp_extract
datetime_pattern = r"(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2})"
df = df.withColumn("datetime_str", regexp_extract("filename", datetime_pattern, 1))

# Convert the extracted string to a datetime data type
df = df.withColumn("datetime", to_timestamp("datetime_str", "yyyy_MM_dd'T'HH_mm_ss"))

# Show the result
df.show(truncate=False)

David_Billa
New Contributor III

@Walter_C thanks for your help. Can't we do it with split_part or substr function to extract the datetime value and then apply the datetime format?

Walter_C
Databricks Employee

I got the same result as you when using split: a null result. Another option I can suggest, to get only the datetime, is:

import re
from datetime import datetime

filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"

match = re.search(r"\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2}", filename)
datetime_string = match.group(0) if match else None

if datetime_string:
    datetime_object = datetime.strptime(datetime_string, "%Y_%m_%dT%H_%M_%S")
    print(datetime_object)
else:
    print("Datetime not found in the filename")

David_Billa
New Contributor III

@Walter_C So it's not possible to achieve this result with SQL functions and a datetime format? The reason I ask is that I want to incorporate the solution into an existing SELECT statement.

ck1
New Contributor II

@David_Billa how about:

import pyspark.sql.functions as F

df_with_datetime = df.withColumn(
    'extracted_datetime',
    F.to_timestamp(
        # Join the five '_'-separated tokens before the trailing AM/PM
        # part: '2024', '12', '06T11', '00', '49' -> '20241206T110049'
        F.concat(
            *[F.split_part(F.col('file_name'), F.lit("_"), F.lit(i)) for i in range(-6, -1)]
        ),
        'yyyyMMdd\'T\'HHmmss'
    )
)

display(df_with_datetime)

or

SELECT *,
       to_timestamp(
           concat_ws('', 
               split_part(file_name, '_', -6),
               split_part(file_name, '_', -5),
               split_part(file_name, '_', -4),
               split_part(file_name, '_', -3),
               split_part(file_name, '_', -2)
           ),
           'yyyyMMdd\'T\'HHmmss'
       ) AS extracted_datetime
FROM df

David_Billa
New Contributor III

@ck1 looks good now. I see that the datetime value comes out like 2024-12-06T11:00:49.000+00:00.

Can't we ignore the .000+00:00 part?

Any help here @Walter_C 

Walter_C
Databricks Employee

Unfortunately I am not able to make it work with SQL functions.
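One hedged note for readers with the same trailing-suffix question: the .000+00:00 is just the default timestamp rendering, and if a plain string is acceptable, Spark SQL's built-in date_format (e.g. date_format(extracted_datetime, 'yyyy-MM-dd HH:mm:ss')) would drop it, at the cost of returning a string rather than a timestamp. A pure-Python sketch of the same idea, using the thread's example filename:

```python
from datetime import datetime

filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"

# Join the five underscore-separated tokens before the trailing AM/PM
# token, mirroring ck1's split_part(-6..-2) trick
stamp = "".join(filename.split("_")[-6:-1])   # '20241206T110049'

dt = datetime.strptime(stamp, "%Y%m%dT%H%M%S")

# Formatting as a string drops the fractional seconds and UTC offset
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2024-12-06 11:00:49
```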
