15 hours ago
I've the filename as below and I want to extract the datetime values and convert to datetime data type.
This_is_new_file_2024_12_06T11_00_49_AM.csv
Here I want to extract only '2024_12_06T11_00_49' and convert to datetime value in new field. I tried Substr with yyyyMMddHHmmss and it's not working. It's producing only null values instead of the datetime value.
Any help?
13 hours ago
Hello David, you can use something like:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_timestamp
# Initialize Spark session
spark = SparkSession.builder.appName("ExtractDatetime").getOrCreate()
# Sample data
data = [("This_is_new_file_2024_12_06T11_00_49_AM.csv",)]
df = spark.createDataFrame(data, ["filename"])
# Extract the datetime string using regexp_extract
datetime_pattern = r"(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2})"
df = df.withColumn("datetime_str", regexp_extract("filename", datetime_pattern, 1))
# Convert the extracted string to a datetime data type
df = df.withColumn("datetime", to_timestamp("datetime_str", "yyyy_MM_dd'T'HH_mm_ss"))
# Show the result
df.show(truncate=False)
12 hours ago
@Walter_C thanks for your help. Can't we do it with split_part or substr function to extract the datetime value and then apply the datetime format?
12 hours ago
I got same result as you when using split as I am getting Null result, another option I can suggest to only get the datetime is:
import re
from datetime import datetime
filename = "This_is_new_file_2024_12_06T11_00_49_AM.csv"
match = re.search(r"\d{4}_\d{2}_\d{2}T\d{2}_\d{2}_\d{2}", filename)
datetime_string = match.group(0) if match else None
if datetime_string:
datetime_object = datetime.strptime(datetime_string, "%Y_%m_%dT%H_%M_%S")
print(datetime_object)
else:
print("Datetime not found in the filename")
11 hours ago
@Walter_C So not possible to achieve this result with SQL functions and datetime format? Reason is I want to incorporate this solution in the existing SELECT statement
11 hours ago - last edited 11 hours ago
@David_Billa how about
import pyspark.sql.functions as F
df_with_datetime = df.withColumn(
'extracted_datetime',
F.to_timestamp(
F.concat(
*[F.split_part(F.col('file_name'), F.lit("_"), F.lit(i)) for i in range(-6, -1)]
),
'yyyyMMdd\'T\'HHmmss'
)
)
display(df_with_datetime)
or
SELECT *,
to_timestamp(
concat_ws('',
split_part(file_name, '_', -6),
split_part(file_name, '_', -5),
split_part(file_name, '_', -4),
split_part(file_name, '_', -3),
split_part(file_name, '_', -2)
),
'yyyyMMdd\'T\'HHmmss'
) AS extracted_datetime
FROM df
7 hours ago
11 hours ago
Unfortunately I am not able to make it work with SQL functions
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group