Best way to parse Google Analytics data in Databricks notebook

AnaMocanu — Tue, 23 Apr 2024 02:27:27 GMT

I managed to extract the Google Analytics data via Lakehouse Federation and the BigQuery connection, but the values in the events table come back in a strange JSON format:

{"v":[{"v":{"f":[{"v":"ga_session_number"},{"v":{"f":[{"v":null},{"v":"2"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"blabla"},{"v":{"f":[{"v":null},{"v":"1"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"ga_session_id"},{"v":{"f":[{"v":null},{"v":"XXXX"},{"v":null},{"v":null}]}}]}}]}

Does anyone have a good technique for parsing this data, or do I need to parse all these columns manually?

Many thanks!

Ana

Re: Best way to parse Google Analytics data in Databricks notebook

daniel_sahal — Tue, 23 Apr 2024 05:05:47 GMT

@AnaMocanu
I was using this function, with a few modifications on my end:
https://gist.github.com/shreyasms17/96f74e45d862f8f1dce0532442cc95b2

Maybe this will be helpful for you 🙂

Re: Best way to parse Google Analytics data in Databricks notebook

AnaMocanu — Tue, 23 Apr 2024 20:53:10 GMT

Thank you @daniel_sahal

I decided to parse the data directly from the JSON format, as I don't need many columns and the elements I need will stay at the same positions in the list.

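For example, to pick the first element in the list:

from pyspark.sql.functions import col, get_json_object

# "$.v.f[0].v" reads the first field of the struct; no alias is needed because withColumn already names the column
df = df.withColumn("device_category", get_json_object(col("device"), "$.v.f[0].v"))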
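
For the key/value arrays like the event params sample in the first post, fixed positions are less convenient, so here is a minimal sketch of a plain-Python helper that flattens that "v"/"f" encoding into a dictionary. The function name parse_event_params is made up for illustration, and it assumes each entry holds a key followed by a struct of candidate values of which at most one is non-null (as in the sample); it is not taken from the gist linked above.

```python
import json

def parse_event_params(raw):
    """Flatten a BigQuery "v"/"f" encoded key/value array into {key: value}.

    Assumption: every entry looks like {"v": {"f": [{"v": key}, {"v": {"f": [...]}}]}}
    and at most one of the candidate values is non-null.
    """
    if raw is None:
        return {}
    parsed = json.loads(raw)
    result = {}
    for entry in parsed.get("v", []):
        fields = entry["v"]["f"]              # [key cell, value-struct cell]
        key = fields[0]["v"]
        value_struct = fields[1]["v"]
        candidates = value_struct["f"] if value_struct else []
        # Take the first non-null candidate, or None if all are null
        result[key] = next((c["v"] for c in candidates if c["v"] is not None), None)
    return result

# Sample value copied from the first post in this thread
sample = '{"v":[{"v":{"f":[{"v":"ga_session_number"},{"v":{"f":[{"v":null},{"v":"2"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"blabla"},{"v":{"f":[{"v":null},{"v":"1"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"ga_session_id"},{"v":{"f":[{"v":null},{"v":"XXXX"},{"v":null},{"v":null}]}}]}}]}'

params = parse_event_params(sample)
print(params)
# {'ga_session_number': '2', 'blabla': '1', 'ga_session_id': 'XXXX'}
```

In a notebook this could probably be wrapped with pyspark.sql.functions.udf and a MapType(StringType(), StringType()) return type to apply it to a whole column, although for only a handful of fields get_json_object as shown above is likely simpler and faster.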