12-24-2024 02:47 AM
I have a filename as below, and I want to extract the datetime value and convert it to a datetime data type.
This_is_new_file_2024_12_06T11_00_49_AM.csv
Here I want to extract only '2024_12_06T11_00_49' and convert it to a datetime value in a new field. I tried substr with the format yyyyMMddHHmmss, but it's not working; it produces only null values instead of the datetime value.
Any help?
12-24-2024 04:16 AM
Hello David, you can use something like:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_timestamp
# Initialize Spark session
spark = SparkSession.builder.appName("ExtractDatetime").getOrCreate()
# Sample data
data = [("This_is_new_file_2024_12_06T11_00_49_AM.csv",)]
df = spark.createDataFrame(data, ["filename"])
# Extract the datetime string using regexp_extract
datetime_pattern = r"(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2})"
df = df.withColumn("datetime_str", regexp_extract("filename", datetime_pattern, 1))
# Convert the extracted string to a datetime data type
df = df.withColumn("datetime", to_timestamp("datetime_str", "yyyy_MM_dd'T'HH_mm_ss"))
# Show the result
df.show(truncate=False)
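One caveat worth flagging: the pattern above drops the trailing _AM/_PM marker, and HH reads the hour as 24-hour time. That happens to be fine for this 11 AM file, but if the hour in the name is on a 12-hour clock, a PM file would parse incorrectly. A possible variant that also captures the marker (a sketch, assuming the filenames always end with _AM or _PM before .csv):
from pyspark.sql.functions import concat_ws, regexp_extract, to_timestamp

# Capture the datetime and the AM/PM marker as separate groups
pattern = r"(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2})_(AM|PM)"
df = df.withColumn("datetime_str", regexp_extract("filename", pattern, 1)) \
       .withColumn("ampm", regexp_extract("filename", pattern, 2))

# hh (1-12) together with the 'a' marker handles both AM and PM correctly
df = df.withColumn(
    "datetime",
    to_timestamp(concat_ws("_", "datetime_str", "ampm"), "yyyy_MM_dd'T'hh_mm_ss_a")
)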
12-24-2024 05:04 AM
@Walter_C thanks for your help. Can't we do it with the split_part or substr functions to extract the datetime value and then apply the datetime format?
12-24-2024 05:49 AM
I got the same result as you when using split, as I am also getting a null result. Another option I can suggest to get only the datetime is:
import re
from datetime import datetime

filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"

# Pull the yyyy_MM_ddTHH_mm_ss portion out of the name
match = re.search(r"\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2}", filename)
datetime_string = match.group(0) if match else None

if datetime_string:
    # Parse with a strptime format that mirrors the underscore layout
    datetime_object = datetime.strptime(datetime_string, "%Y_%m_%dT%H_%M_%S")
    print(datetime_object)
else:
    print("Datetime not found in the filename")
12-24-2024 06:16 AM
@Walter_C So it is not possible to achieve this result with SQL functions and a datetime format? The reason I ask is that I want to incorporate this solution into an existing SELECT statement.
12-24-2024 06:37 AM - edited 12-24-2024 06:39 AM
@David_Billa how about:
import pyspark.sql.functions as F

df_with_datetime = df.withColumn(
    'extracted_datetime',
    F.to_timestamp(
        # Parts -6 through -2, counted from the end of the name, are
        # year, month, day+'T'+hour, minute, and second; concatenated
        # they form e.g. '20241206T110049'
        F.concat(
            *[F.split_part(F.col('file_name'), F.lit("_"), F.lit(i)) for i in range(-6, -1)]
        ),
        "yyyyMMdd'T'HHmmss"
    )
)
display(df_with_datetime)
or
SELECT *,
       to_timestamp(
         -- parts -6 through -2, counted from the end: year, month,
         -- day+'T'+hour, minute, second -> e.g. '20241206T110049'
         concat_ws('',
           split_part(file_name, '_', -6),
           split_part(file_name, '_', -5),
           split_part(file_name, '_', -4),
           split_part(file_name, '_', -3),
           split_part(file_name, '_', -2)
         ),
         'yyyyMMdd\'T\'HHmmss'
       ) AS extracted_datetime
FROM df
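One note on the SQL version: FROM df only works if the DataFrame has been registered as a table or view first. A minimal sketch of wiring that up from Python (the view name df is just for illustration):
# Register the DataFrame as a temporary view so SQL can reference it as 'df'
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT *,
           to_timestamp(
             concat_ws('',
               split_part(file_name, '_', -6),
               split_part(file_name, '_', -5),
               split_part(file_name, '_', -4),
               split_part(file_name, '_', -3),
               split_part(file_name, '_', -2)
             ),
             "yyyyMMdd'T'HHmmss"
           ) AS extracted_datetime
    FROM df
""").show(truncate=False)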
12-27-2024 06:20 AM
I think it's not really possible, though I am quite new to Databricks. Here are the types that one can use: Data types | Databricks on AWS
There is a date type and a timestamp type, but it doesn't look like there is anything in between. (You could of course store a string representation of the datetime...)
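For what it's worth, the distinction is easy to see from PySpark (a small illustration, not tied to the filenames above):
from pyspark.sql.functions import lit, to_date, to_timestamp

# DATE keeps only the day; TIMESTAMP keeps the day plus the time of day
spark.range(1).select(
    to_date(lit("2024-12-06")).alias("as_date"),
    to_timestamp(lit("2024-12-06 11:00:49")).alias("as_timestamp"),
).show()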
12-24-2024 06:42 AM
Unfortunately, I am not able to make it work with SQL functions.