02-03-2023 07:34 AM
Let's say I have a DataFrame with a timestamp column (timestamp type) and an offset column in milliseconds (long type).
E.g.
from datetime import datetime
df = spark.createDataFrame(
    [
        (datetime(2021, 1, 1), 1500, ),
        (datetime(2021, 1, 2), 1200, )
    ],
    ["timestamp", "offsetmillis", ],
)
Now I want to add these offsets to the timestamps, so that I get:
2021-01-01T00:00:01.500 and 2021-01-02T00:00:01.200
If I add these directly I get an error about type mismatch, which does make sense:
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(timestamp + offsetmillis)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("TIMESTAMP" and "BIGINT")
However, I'm not sure how best to cast this to a duration or interval.
02-03-2023 11:02 AM
Hi @Ivo Merchiers, if you are just trying to create a date with milliseconds, you can create it directly by providing the value in datetime as below.
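A minimal sketch of what that could look like in plain Python (no Spark involved; the variable names are only illustrative):

```python
from datetime import datetime, timedelta

# Pass the sub-second part directly as microseconds (7th positional argument):
# 500_000 microseconds == 500 milliseconds.
ts_direct = datetime(2021, 1, 1, 0, 0, 1, 500_000)

# Equivalent: start from midnight and add the offset as a timedelta.
ts_added = datetime(2021, 1, 1) + timedelta(milliseconds=1500)

print(ts_direct.isoformat(timespec="milliseconds"))
print(ts_added.isoformat(timespec="milliseconds"))
```

Both values represent the same instant, one and a half seconds after midnight on 2021-01-01.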
However, if your use case is to add milliseconds to the date value, then you have to convert the date to milliseconds before adding the offset to it.
02-05-2023 11:56 PM
Hi @Lakshay Goel,
I've just added the `spark.createDataFrame` command here as an example. The real data comes from existing tables, so I can't do it in the Python initialisation.
I want to add some milliseconds (in integer/long/whatever format) to a timestamp (which should already have millisecond precision) in PySpark.
How would I go about doing the second approach you proposed?
02-06-2023 05:26 AM
Hi @Ivo Merchiers,
Here is how I did it. As you mentioned, I am taking a date with milliseconds as input in the "ts" column and the offset to be added in the "offSetMillis" column. First, I converted the "ts" column to milliseconds, then added "offSetMillis" to it, and finally converted the new value back to a timestamp in the "new_ts" column.
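The three steps above can be sketched roughly as follows. This is only an illustration, not necessarily the exact code used: it assumes Spark 3.1 or later, where the SQL functions `unix_millis` and `timestamp_millis` are available, and the helper name `add_millis_sql` is hypothetical.

```python
def add_millis_sql(ts_col: str, offset_col: str) -> str:
    """Build a Spark SQL expression that adds a millisecond offset to a timestamp.

    unix_millis() turns the timestamp into epoch milliseconds (dropping anything
    finer than a millisecond), the offset is added as a plain integer, and
    timestamp_millis() converts the sum back to a timestamp.
    """
    return f"timestamp_millis(unix_millis({ts_col}) + {offset_col})"


# Column names taken from the description above.
new_ts_expr = add_millis_sql("ts", "offSetMillis")
print(new_ts_expr)
```

With `from pyspark.sql import functions as F`, the expression would then be applied as `df.withColumn("new_ts", F.expr(new_ts_expr))`.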
03-01-2023 11:41 PM
Although @Lakshay Goel's solution works, we've been using an alternative approach that we found a bit more readable:
from pyspark.sql import Column, functions as f
 
 
def make_dt_interval_sec(col: Column):
    return f.expr(f"make_dt_interval(0,0,0,{col._jc.toString()})")
 
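# start_col is the name of the output column (defined elsewhere in our code).
# Note: this subtracts the offset; use + to add it, as asked in the question.
df.withColumn(
    start_col,
    f.col("timestamp") - make_dt_interval_sec(f.col("offsetmillis") / 1000),
)
I'm not sure if there is any performance difference between the two methods.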
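A variant of the same idea that avoids the private `_jc` attribute, as a minimal sketch: it assumes Spark 3.2 or later (for `make_dt_interval`) and builds the SQL expression from column names instead of Column objects. The helper name `add_millis_interval_sql` is hypothetical.

```python
def add_millis_interval_sql(ts_col: str, offset_col: str) -> str:
    """Build a Spark SQL expression that adds a millisecond offset as an interval.

    make_dt_interval(days, hours, mins, secs) creates a day-time interval;
    only the seconds argument is used here, as offset / 1000.
    """
    return f"{ts_col} + make_dt_interval(0, 0, 0, {offset_col} / 1000)"


# Column names from the example DataFrame in the question.
shifted_expr = add_millis_interval_sql("timestamp", "offsetmillis")
print(shifted_expr)
```

It would be applied as `df.withColumn("shifted", f.expr(shifted_expr))`, where `"shifted"` is just an example output column name.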