02-03-2023 07:34 AM
Let's say I have a DataFrame with a timestamp column and an offset column in milliseconds, in timestamp and long format respectively.
E.g.
from datetime import datetime

df = spark.createDataFrame(
    [
        (datetime(2021, 1, 1), 1500),
        (datetime(2021, 1, 2), 1200),
    ],
    ["timestamp", "offsetmillis"],
)
Now I want to add these offsets to the datetime, so that I get:
2021-01-01T00:00:01.500 and 2021-01-02T00:00:01.200
If I add these directly I get an error about type mismatch, which does make sense:
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(timestamp + offsetmillis)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("TIMESTAMP" and "BIGINT")
However I'm not sure how I can best cast this to a duration or interval.
02-03-2023 11:02 AM
Hi @Ivo Merchiers, if you are just trying to create a timestamp with milliseconds, you can construct it directly by providing the value in the datetime itself, as below.
However, if your use case is to add milliseconds to an existing timestamp value, then you have to convert the timestamp to milliseconds before adding the offset to it.
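For the first case, a minimal sketch of constructing the millisecond value directly: `datetime` takes microseconds as its last argument, so 1500 ms is expressed as 1 500 000 µs.

```python
from datetime import datetime

# 1500 ms expressed directly as 1_500_000 microseconds
ts = datetime(2021, 1, 1, 0, 0, 1, 500000)
print(ts.isoformat(timespec="milliseconds"))  # 2021-01-01T00:00:01.500
```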
02-05-2023 11:56 PM
Hi @Lakshay Goel,
I've just added the `spark.createDataFrame` command here as an example; the real data comes from existing tables, so I can't set the values during the Python initialisation.
I want to add some milliseconds (in integer/long/whatever format) to a timestamp (which should already have millisecond precision) in PySpark.
How would I go about doing the second approach you proposed?
02-06-2023 05:26 AM
Hi @Ivo Merchiers,
Here is how I did it. As you mentioned, I take a timestamp with milliseconds as input in the "ts" column and the offset to be added in the "offSetMillis" column. First, I convert the "ts" column to milliseconds, then add "offSetMillis" to it, and finally convert the new value back to a timestamp in the "new_ts" column.
03-01-2023 11:41 PM
Although @Lakshay Goel's solution works, we've been using an alternative approach that we find a bit more readable:
from pyspark.sql import Column, functions as f

def make_dt_interval_sec(col: Column):
    # Build a day-time interval from a seconds value via the SQL function
    return f.expr(f"make_dt_interval(0, 0, 0, {col._jc.toString()})")

df.withColumn(
    "new_ts",
    f.col("timestamp") + make_dt_interval_sec(f.col("offsetmillis") / 1000),
)
I'm not sure if there is any performance difference between both methods.