Databricks Community

Pálmi · 06-18-2024

I'm reading data from the default endpoint of an IoT hub in Azure using the Kafka connector in Databricks. Most data items are straightforward, but I haven't been able to properly decode the device id and the timestamp.

For example, the header entry {"key": "iothub-enqueuedtime", "value": "gwAAAZAMsGjg"} should be a recent timestamp. Any ideas on how to decode this using PySpark?

Pálmi · 06-20-2024

Hi @Retired_mod, thanks for your reply. The iothub-enqueuedtime value does not cast directly into a timestamp, but a Unix timestamp in milliseconds is somewhere in there:

from pyspark.sql.functions import col, explode, expr

df = spark.read.format("delta").table("iot_ps2")
# df.display()

# Inspect the raw headers column
df2 = df.select("headers")
df2.display()

# Explode the array of structs into individual rows
df_exploded = df.withColumn("json_item", explode(col("headers")))

# Filter rows to get only the 'iothub-enqueuedtime' key
df_filtered = df_exploded.filter(col("json_item.key") == "iothub-enqueuedtime")

# Keep the key/value pair, then add string and hex views of the value
df3 = df_filtered.select("json_item.key", "json_item.value")
df3 = df3.withColumn("str_value", expr("cast(value as STRING)"))
df3 = df3.withColumn("hex", expr("hex(str_value)"))
df3.display()

Looking at the hex output, the 6 rightmost bytes, "01 90 36 4C 1B 5C", can be read as a Unix timestamp in milliseconds. That leaves 3 unknown bytes.
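The arithmetic behind that observation can be checked in plain Python. This is only a sketch: it assumes the six bytes are a big-endian unsigned integer counting milliseconds since the Unix epoch, and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# The six rightmost bytes quoted above, read as one big-endian integer.
tail = bytes.fromhex("01 90 36 4C 1B 5C")
millis = int.from_bytes(tail, byteorder="big")

# Add the milliseconds to the Unix epoch; timedelta avoids float rounding.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
enqueued = epoch + timedelta(milliseconds=millis)

print(millis)
print(enqueued.isoformat())
```

Under that assumption the value lands on 20 June 2024 (UTC), the same day as this post, which supports reading the bytes as a millisecond timestamp.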

I'm hoping that a more straightforward way is available.
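One possible reading of the three extra bytes, not confirmed anywhere in this thread: in AMQP 1.0 a timestamp is encoded as the type code 0x83 followed by an 8-byte big-endian count of milliseconds since the Unix epoch. If the header value follows that encoding, the first byte would be the type code and the next two would be the zero high-order bytes of the 64-bit integer. The sketch below applies that assumption to the example value from the first post, treating "gwAAAZAMsGjg" as the Base64 rendering of the binary header value. The helper `decode_enqueued_time` is a hypothetical name, not part of any library.

```python
import base64
from datetime import datetime, timedelta, timezone

AMQP_TIMESTAMP_CODE = 0x83  # AMQP 1.0 type code for a timestamp value
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_enqueued_time(raw: bytes) -> datetime:
    """Hypothetical helper: decode a 9-byte AMQP-style timestamp to UTC.

    Assumes byte 0 is the 0x83 type code and bytes 1..8 hold a signed
    big-endian count of milliseconds since the Unix epoch.
    """
    if len(raw) != 9 or raw[0] != AMQP_TIMESTAMP_CODE:
        raise ValueError(f"unexpected encoding: {raw.hex()}")
    millis = int.from_bytes(raw[1:], byteorder="big", signed=True)
    return EPOCH + timedelta(milliseconds=millis)


# Example header value from the first post (assumed to be Base64 text).
raw = base64.b64decode("gwAAAZAMsGjg")
enqueued = decode_enqueued_time(raw)

print(raw.hex(" "))          # raw bytes as space-separated hex
print(enqueued.isoformat())  # decoded UTC timestamp
```

If the assumption holds for your data, a function like this could be wrapped with `pyspark.sql.functions.udf` and a `TimestampType` return type and applied to the binary `value` field directly, without the intermediate cast to string.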